Feature map of the Gaussian RBF Kernel



The Gaussian RBF function is a Positive-definite kernel. These kind of functions find many applications in many different fields. Quoting Wikipedia,

“They occur naturally in Fourier analysis, probability theory, operator theory, complex function-theory, moment problems, integral equations, boundary-value problems for partial differential equations, machine learning, embedding problem, information theory, and other areas.”

Here is the definition:

$$ K_\sigma(x, y) = e^{-\frac{|x-y|_2^2}{2\sigma^2}} $$

The 1-dimensional Gaussian RBF looks like this: Note that it is proportional to the standard normal distribution, but instead of having an area of 1, it has a maximum value of 1 at the center. This can be helpful to provide a measure of similarity between $x$ and $y$.

One particularly nice and interesting property of Kernel functions is that, although they can be nonlinear, they are equivalent to a linear operator in a higher-dimensional space. An intuition is provided at the mentioned Wikipedia page:

Image: Kernel trick illustration. Here the kernel is given by $\phi((x_1, x_2)) = (x_1, x_2, x_1^2 + x_2^2)$

This propety gives place to the so-called Kernel Trick, which (quoting Wikipedia again), enables kernel functions to

“operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space”.

In other words, computing $K(x, y)$ is equivalent to extracting many features from our vectors $x$ and $y$, resulting in the high-dimensional vectors $\phi(x), \phi(y)$, and then computing $\langle \phi(x), \phi(y) \rangle$:

$$ K_\sigma(x, y) = \langle \phi(x), \phi(y) \rangle $$

This equivalence can be advantageous because usually $K$ has low computational demands, being much quicker and efficient than computing and aggregating all features in $\phi$. In some cases even, like the RBF discussed here, the space of $\phi$ has infinitely many dimensions, and therefore it is simply impossible to compute other than via kernel trick.

Note that, although kernels operate implicitly in such a higher-dimensional linear space, we don’t neccessarily know the explicit expression for the features in such space, i.e. even if we know $x$, we don’t know how $\phi_i(x)$ may look like. This article addresses exactly that for the Gaussian RBF kernel, providing an explicit expression and derivation via direct proof for the features of the $\phi$ vectors.

Although a succint version of this proof can be found on this other RBF Wikipedia article, I thought it would be very educational to flesh it out in order to understand the specific principles that lead to the final expression. Also, this proof explicitly incorporates $\sigma$ as a parameter, and the source for the proof linked by Wikipedia (but not the wiki itself) contains a small error in the derivation. For these reasons I felt that it was justified to upload my version. A copy of this proof is also here: https://stats.stackexchange.com/a/478712/142281


Given the Gaussian RBF Kernel $K_\sigma$ between two $n$-dimensional vectors ($x$ and another), for each $j$ from 0 to infinity and for every combination of $n$ indices (labeled as $k$) that add up to $j$, the feature vector $\phi(x)$ has a feature that looks like this:

$$ \phi_{\sigma, j, k}(x) = c_{\sigma, j}(x) \cdot f_{j, k}(x) $$


$ \begin{aligned} c_{\sigma, j}(x) &= \frac{K_\sigma(x, 0)}{\sigma^j \sqrt{j!}}\\
f_{j, k}(x) &= \begin{pmatrix} j \\ k_1,k_2, \dots, k_n \end{pmatrix}^{\frac{1}{2}} \prod_{d=1}^n{x_d^{k_d}} \end{aligned} $


For notational convenience, we will use the constant $\epsilon$ defined below. We also define the corresponding exponential Taylor expansion, and the Multinomial Theorem:

$ \begin{aligned} \epsilon := &e^{\frac{1}{\sigma^2}}\\
\epsilon^x = &\sum_{j=0}^{\infty}\left\{ \frac{x^j}{\sigma^{2j} \cdot j!} \right\}\\
(x_1 + x_2 + \dots + x_n)^j = &\sum_{k_1+k_2+\dots+k_n=j}\left\{ \begin{pmatrix} j \\ k_1,k_2, \dots, k_n \end{pmatrix} \prod_{d=1}^n{x_d^{k_d}} \right\}\\
\end{aligned} $

They are sufficient to carry out the proof. All of them are basic building blocks so I won’t cover them in detail, but sources are ubiquitous, e.g.:

Direct Proof:

First, we decompose the squared euclidean distance into its components, and perform the Taylor expansion for the $xy$ component:

$$ \begin{aligned} K(x,y)= &e^{-\frac{|x-y|_2^2}{2\sigma^2}} =\epsilon^{\langle x, y \rangle} \cdot\epsilon^{-\frac{|x|_2^2}{2}} \cdot \epsilon^{-\frac{|y|2^2}{2}}\\
= &\sum
{j=0}^{\infty}\left\{ \frac{\langle x, y \rangle^j}{\sigma^{2j} \cdot j!} \right\} \cdot\epsilon^{-\frac{|x|_2^2}{2}} \cdot \epsilon^{-\frac{|y|_2^2}{2}} \end{aligned} $$

For further convenience, we refactor the expression (using $c$ for more compact notation):

$$ \begin{aligned} K(x,y) = &\sum_{j=0}^{\infty}\left\{\frac{\epsilon^{-\frac{|x|_2^2}{2}}}{\sigma^j \cdot \sqrt{j!}} \cdot \frac{\epsilon^{-\frac{|y|_2^2}{2}}}{\sigma^j \cdot \sqrt{j!}} \cdot \langle x, y \rangle^j \right\}\\
= &\sum_{j=0}^{\infty}\left\{ c_{\sigma, j}(x) \cdot c_{\sigma, j}(y) \cdot \langle x, y \rangle^j \right\}\\
\end{aligned} $$

And with help of the multinomial theorem, we can express the power of the dot product as follows (using $f$ for more compact notation):

$$ \begin{aligned} \langle x, y \rangle^j = &\left(\sum_{d=1}^n x_d y_d \right)^j\\
= &\sum_{k_1+k_2+\dots+k_n=j}\left\{ \begin{pmatrix} j \\ k_1,k_2, \dots, k_n \end{pmatrix} \prod_{d=1}^n{(x_dy_d)^{k_d}} \right\}\\
= &\sum_{k_1+k_2+\dots+k_n=j}\left\{ \begin{pmatrix} j \\ k_1,\dots, k_n \end{pmatrix}^{\frac{1}{2}} \prod_{d=1}^n{x_d^{k_d}} \cdot
\begin{pmatrix} j \\ k_1, \dots, k_n \end{pmatrix}^{\frac{1}{2}} \prod_{d=1}^n{y_d^{k_d}} \right\}\\
=: &\sum_{k_1+k_2+\dots+k_n=j}\left\{f_{j,k}(x) \cdot f_{j, k}(y) \right\}\\
\end{aligned} $$

Now replacing in $K$ will allow us to end the proof:

$$ \begin{aligned} K(x,y) = &\sum_{j=0}^{\infty}\left\{ c_{\sigma, j}(x) \cdot c_{\sigma, j}(y) \cdot \sum_{k_1+k_2+\dots+k_n=j}\left\{f_{j,k}(x) \cdot f_{j, k}(y) \right\} \right\}\\
= &\sum_{j=0}^{\infty} \sum_{k_1+k_2+\dots+k_n=j}\left\{ c_{\sigma, j}(x) f_{j,k}(x) \cdot c_{\sigma, j}(y) f_{j, k}(y) \right\}\\
= &\langle \phi(x), \phi(y) \rangle\\
&\square \end{aligned} $$

Where each $\phi$ is a vector with one entry for every combination of $n$ indices (labeled as $k$) that add up to $j$, and this for each $j$ from 0 to infinity.