We are still trying to fit a function to our data, but rather than having a pre-specified number of parameters, as in linear regression and neural networks, the number of parameters scales with the size of the training data. [...] In nearest neighbor, the training data itself acts as the set of parameters, so the number of parameters naturally grows with the number of training points.

1.4 Kernel functions as local function approximations

One way to think of the kernel functions used in kernel regression is as functions that help locally approximate the target function around a given input location. [...] If a point x is close to z, then the kernel's output reaches its maximum value of one. [...] This radius, width/2, is a hyperparameter that lets us change the width of the kernel. [...]

k(x, z) = 1 if ∥x − z∥₂ ≤ width/2, and k(x, z) = 0 otherwise

1.5 Kernel regression

1.5.1 Naive version

Our first, naive attempt at fitting a function to the training data is to 1) center a kernel function at each training input x(i), 2) scale each kernel's height by the corresponding training output y(i), and 3) sum all of these scaled kernel functions together. [...] The kernel matrix contains the output values of the kernel function for all N² combinations of training inputs x(i) and x(j). [...] The (i, j) entry of the kernel matrix is:

K_{i,j} = k(x(i), x(j))

The corrected process for kernel regression is as follows:

1. [...]

1.6.1 Kernel trick

For a given feature transform ϕ(x), a kernel function is a function that takes two points, x and z, and returns the value ϕ(x)⊤ϕ(z). [...] The purpose of a kernel function is to compute ϕ(x)⊤ϕ(z) without ever explicitly computing the feature transforms ϕ(x) and ϕ(z), which can be prohibitively expensive in both memory and computation time. [...] The so-called kernel trick is to reformulate an optimization problem involving feature-transformed input points ϕ(x) so that ϕ(x) never appears alone and only appears inside a dot product with another feature-transformed point, ϕ(x)⊤ϕ(z).
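To make the box kernel and the naive summation from Section 1.5.1 concrete, here is a minimal NumPy sketch; the names box_kernel and naive_kernel_regression, the width value, and the toy data are illustrative choices, not taken from the notes.

    import numpy as np

    def box_kernel(x, z, width=1.0):
        # 1 if x lies within a radius of width/2 of z (Euclidean distance), else 0
        return 1.0 if np.linalg.norm(x - z) <= width / 2 else 0.0

    def naive_kernel_regression(x_query, X_train, y_train, width=1.0):
        # Naive fit: sum of kernels centered at each x(i), each scaled by y(i)
        return sum(y_i * box_kernel(x_query, x_i, width)
                   for x_i, y_i in zip(X_train, y_train))

    # Toy one-dimensional data
    X_train = np.array([[0.0], [1.0], [2.0]])
    y_train = np.array([1.0, 2.0, 0.5])
    print(naive_kernel_regression(np.array([0.9]), X_train, y_train))  # 2.0: only the kernel centered at x = 1.0 is active

Where the kernels of neighboring training points overlap, their scaled contributions simply add up, which is one way to see why this naive version needs the corrected procedure described in the notes.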
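The kernel matrix can likewise be built directly from its definition K_{i,j} = k(x(i), x(j)); the helper below is a sketch, and the function name kernel_matrix is mine rather than the notes'.

    import numpy as np

    def kernel_matrix(X_train, kernel):
        # K[i, j] = k(x(i), x(j)) for all N^2 pairs of training inputs
        N = len(X_train)
        K = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                K[i, j] = kernel(X_train[i], X_train[j])
        return K

    # Box kernel from the previous sketch with width fixed at 1.0
    box = lambda x, z: 1.0 if np.linalg.norm(x - z) <= 0.5 else 0.0

    X_train = np.array([[0.0], [1.0], [2.0]])
    K = kernel_matrix(X_train, box)  # 3 x 3; the identity here, since no two distinct points are within 0.5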
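Finally, the kernel trick itself can be illustrated with the standard quadratic (polynomial) kernel k(x, z) = (x⊤z)²; this particular kernel is a common textbook example chosen here for illustration, not necessarily the one used elsewhere in the notes. Its explicit feature map ϕ(x) consists of all d² pairwise products x_i·x_j, yet the kernel evaluates ϕ(x)⊤ϕ(z) with a single d-dimensional dot product.

    import numpy as np
    from itertools import product

    def phi(x):
        # Explicit quadratic feature map: all d^2 pairwise products x_i * x_j
        d = len(x)
        return np.array([x[i] * x[j] for i, j in product(range(d), repeat=2)])

    def quadratic_kernel(x, z):
        # Computes phi(x)^T phi(z) without ever forming phi: (x^T z)^2
        return float(np.dot(x, z)) ** 2

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([0.5, -1.0, 2.0])
    print(np.dot(phi(x), phi(z)))   # explicit feature transform: 20.25
    print(quadratic_kernel(x, z))   # kernel, no transform needed: 20.25

For d-dimensional inputs the explicit route costs O(d²) memory and time per pair of points, while the kernel costs O(d), which is the saving the kernel trick exploits.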