Feature Map of the Gaussian RBF Kernel



The Radial Basis Function (RBF) is a Positive-definite kernel. Quoting Wikipedia, PD kernels

“occur naturally in Fourier analysis, probability theory, operator theory, complex function-theory, moment problems, integral equations, boundary-value problems for partial differential equations, machine learning, embedding problem, information theory, and other areas.”

The RBF is defined as follows:

$$ K_\sigma(x, y) = e^{-\frac{|x-y|_2^2}{2\sigma^2}} $$

Note that it is proportional to the standard normal distribution, but instead of having an area of 1, it has a maximum value of 1 at the center. This can be helpful to provide a measure of similarity between $x$ and $y$. For $y=0$, the 1-dimensional version looks like this:

One particularly nice and interesting property of Kernel functions is that, although they can be nonlinear, they are equivalent to a linear operator in a higher-dimensional space. An intuition is provided at the mentioned Wikipedia page:

Image: Kernel trick illustration. Here the kernel is given by $\phi((x_1, x_2)) = (x_1, x_2, x_1^2 + x_2^2)$

This propety gives place to the so-called Kernel Trick, which (quoting Wikipedia again), enables kernel functions to

“operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space”.

In other words, computing $K(x, y)$ is equivalent to extracting many features from our vectors $x$ and $y$, resulting in the high-dimensional vectors $\phi(x), \phi(y)$, and then computing $\langle \phi(x), \phi(y) \rangle$:

$$ K_\sigma(x, y) = \langle \phi(x), \phi(y) \rangle $$

This equivalence can be advantageous because usually $K$ has low computational demands, being much quicker and efficient than computing and aggregating all features in $\phi$. In some cases even, like the RBF discussed here, the space of $\phi$ has infinitely many dimensions, and therefore it is simply impossible to compute other than via kernel trick.

Note that, although kernels operate implicitly in such a higher-dimensional linear space, we don’t neccessarily know the explicit expression for the features in such space, i.e. even if we know $x$, we don’t know how $\phi_i(x)$ may look like. This article addresses exactly that for the Gaussian RBF kernel, providing an explicit expression and derivation via direct proof for the features of the $\phi$ vectors.

A more succint version of this proof can be found on this other RBF Wikipedia article. I thought this more exhaustive derivation could be helpful in order to understand the specific principles that lead to the final expression. Also, this proof explicitly incorporates $\sigma$ as a parameter, and the source for the proof linked by Wikipedia (but not the wiki itself) contains a small error in the derivation, which I hope to have fixed here. For further discussion, I’ve also uploaded a copy of this proof to StackOverflow.


RBF Kernel Trick

Given the Gaussian RBF Kernel $K_\sigma$ between two $n$-dimensional vectors ($x$ and another), for each $j$ from 0 to infinity and for every combination of $n$ indices (labeled as $k$) that add up to $j$, the feature vector $\phi(x)$ has a feature that looks like this:

$$ \phi_{\sigma, j, k}(x) = c_{\sigma, j}(x) \cdot f_{j, k}(x) $$


$ \begin{aligned} c_{\sigma, j}(x) &= \frac{K_\sigma(x, 0)}{\sigma^j \sqrt{j!}}\\ f_{j, k}(x) &= \begin{pmatrix} j \\ k_1,k_2, \dots, k_n \end{pmatrix}^{\frac{1}{2}} \prod_{d=1}^n{x_d^{k_d}} \end{aligned} $


For notational convenience, we will use the constant $\epsilon$ defined below. We also define the corresponding exponential Taylor expansion, and the Multinomial Theorem:

$ \begin{aligned} \epsilon := &e^{\frac{1}{\sigma^2}}\\ \epsilon^x = &\sum_{j=0}^{\infty}\left\{ \frac{x^j}{\sigma^{2j} \cdot j!} \right\}\\ (x_1 + \dots + x_n)^j = &\sum_{k_1 +\dots+k_n=j}\left\{ \begin{pmatrix} j \\ k_1, \dots, k_n \end{pmatrix} \prod_{d=1}^n{x_d^{k_d}} \right\}\\ \end{aligned} $

They are sufficient to carry out the proof. I won’t cover them in detail, see sources below for further details:

Direct Proof:

First, we decompose the squared euclidean distance into its components, and perform the Taylor expansion for the $xy$ component:

$$ \begin{aligned} K(x,y)= &e^{-\frac{|x-y|_2^2}{2\sigma^2}} =\epsilon^{\langle x, y \rangle} \cdot\epsilon^{-\frac{|x|_2^2}{2}} \cdot \epsilon^{-\frac{|y|_2^2}{2}}\\ = &\sum _{j=0}^{\infty}\left\{ \frac{\langle x, y \rangle^j}{\sigma^{2j} \cdot j!} \right\} \cdot\epsilon^{-\frac{|x|_2^2}{2}} \cdot \epsilon^{-\frac{|y|_2^2}{2}} \end{aligned} $$

For further convenience, we refactor the expression (using $c$ for more compact notation):

$$ \begin{aligned} K(x,y) = &\sum_{j=0}^{\infty}\left\{\frac{\epsilon^{-\frac{|x|_2^2}{2}}}{\sigma^j \cdot \sqrt{j!}} \cdot \frac{\epsilon^{-\frac{|y|_2^2}{2}}}{\sigma^j \cdot \sqrt{j!}} \cdot \langle x, y \rangle^j \right\}\\ = &\sum _{j=0}^{\infty} \left\{ c _{\sigma, j}(x) \cdot c _{\sigma, j}(y) \cdot \langle x, y \rangle^j \right\}\\ \end{aligned} $$

And with help of the multinomial theorem, we can express the power of the dot product as follows (using $f$ for more compact notation):

$$ \begin{aligned} &\langle x, y \rangle^j = \left(\sum_{d=1}^n x_d y_d \right)^j\\ = &\sum_{k_1+\dots+k_n=j}\left\{ \begin{pmatrix} j \\ k_1,k_2, \dots, k_n \end{pmatrix} \prod_{d=1}^n{(x_dy_d)^{k_d}} \right\}\\ = &\sum_{k_1+\dots+k_n=j}\left\{ \begin{pmatrix} j \\ k_1,\dots, k_n \end{pmatrix}^{\frac{1}{2}} \prod_{d=1}^n{x_d^{k_d}} \cdot
\begin{pmatrix} j \\ k_1, \dots, k_n \end{pmatrix}^{\frac{1}{2}} \prod_{d=1}^n{y_d^{k_d}} \right\}\\ =: &\sum_{k_1+\dots+k_n=j}\left\{f_{j,k}(x) \cdot f_{j, k}(y) \right\}\\ \end{aligned} $$

Now replacing in $K$ will allow us to end the proof:

$$ \begin{aligned} K(x,y) = &\sum_{j=0}^{\infty}\left\{ c_{\sigma, j}(x) \cdot c_{\sigma, j}(y) \cdot \sum_{k_1+\dots=j}\left\{f_{j,k}(x) \cdot f_{j, k}(y) \right\} \right\}\\ = &\sum_{j=0}^{\infty} \sum_{k_1+\dots+k_n=j}\left\{ c_{\sigma, j}(x) f_{j,k}(x) \cdot c_{\sigma, j}(y) f_{j, k}(y) \right\}\\ = &\langle \phi(x), \phi(y) \rangle\\ &\square \end{aligned} $$

Where each $\phi$ is a vector with one entry for every combination of $n$ indices (labeled as $k$) that add up to $j$, and this for each $j$ from 0 to infinity.

TAGS: algebra, kernel trick, machine learning, proof