Positive definiteness of squared exponential kernel matrix

$\newcommand{\R}{\mathbb R}$$\newcommand{\e}{\varepsilon}$$\newcommand{\H}{\mathcal H}$$\newcommand{\0}{\mathbf 0}$In this post I’m going to prove that the squared exponential kernel is positive definite. I’ll finish with an application to interpolation. Let $k : \R^p\times \R^p \to \R$ be defined by
$$
k(x,x') = \exp(-\gamma\|x-x'\|^2)
$$
for some fixed $\gamma > 0$. This is the squared exponential kernel.

Suppose I have collected distinct data points $x_1,\dots,x_n\in\R^p$. Define functions $h_i : \R^p\to\R$ by $h_i(x) = \exp\left(-2\gamma \|x - x_i\|^2\right)$. Throughout I’ll be integrating with respect to $\lambda$, the Lebesgue measure on $\R^p$.

Result 1: $k(x_i,x_j) \propto \int_{\R^p} h_ih_j\,\text d\lambda$

Pf:
$$
\int_{\R^p} h_i(x)h_j(x)\,\text d\lambda(x) = \int\exp\left(-2\gamma\|x-x_i\|^2 - 2\gamma\|x-x_j\|^2\right)\,\text d\lambda(x).
$$
Factoring out $-2\gamma$ inside the exponential, I can expand and complete the square to get
$$
\begin{aligned}\|x-x_i\|^2 + \|x-x_j\|^2 &= 2x^Tx - 2 x^T(x_i + x_j) + (x_i^Tx_i + x_j^Tx_j) \\&
= 2\|x – m_{ij}\|^2 + \frac 12 \|x_i – x_j\|^2\end{aligned}
$$
with $m_{ij} = \frac 12 (x_i+x_j)$ so
$$
\int h_i(x)h_j(x)\,\text d\lambda(x) = \exp\left(-\gamma\|x_i-x_j\|^2\right)\int \exp\left(-4\gamma\|x-m_{ij}\|^2\right)\,\text d\lambda(x).
$$
That last integral is the kernel of a Gaussian density and evaluates to $\left(\frac{\pi}{4\gamma}\right)^{p/2}$ so
$$
\int h_i(x)h_j(x)\,\text d\lambda(x) \propto \exp\left(-\gamma\|x_i-x_j\|^2\right)
$$
as desired.

$\square$
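As a quick numerical sanity check of Result 1, here is a minimal sketch in Python (assuming numpy and scipy are available) for the one-dimensional case $p = 1$; the particular values of $\gamma$, $x_i$, and $x_j$ are arbitrary.

```python
import numpy as np
from scipy.integrate import quad

gamma = 0.7
xi, xj = -0.3, 1.2

# h_i(x) = exp(-2*gamma*(x - x_i)^2) as defined above
def h(x, c):
    return np.exp(-2 * gamma * (x - c) ** 2)

# left-hand side: integral of h_i * h_j over the real line
lhs, _ = quad(lambda x: h(x, xi) * h(x, xj), -np.inf, np.inf)

# right-hand side: the constant (pi / (4*gamma))^{1/2} times k(x_i, x_j)
rhs = (np.pi / (4 * gamma)) ** 0.5 * np.exp(-gamma * (xi - xj) ** 2)

print(np.isclose(lhs, rhs))  # True
```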

I can use Result 1 to rewrite a quadratic form with the kernel matrix $K \in \R^{n\times n}$, defined by $K_{ij} = k(x_i,x_j)$, in a useful way. Letting $v \in \R^n$ I have
$$
\begin{aligned}v^T K v &= \sum_{ij}v_iv_jK_{ij} \propto \sum_{ij} v_iv_j \int h_ih_j\,\text d\lambda \\&
=\int \sum_{ij} v_iv_jh_ih_j\,\text d\lambda \\&
= \int \left(\sum_{i=1}^n v_i h_i\right)^2\,\text d\lambda.\end{aligned}
$$
The constant of proportionality from Result 1 is positive, so this already shows that $K$ is positive semidefinite since the integral cannot be negative, but I want to further show that it equals zero only if $v = \0$.
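The same identity can be checked numerically for $p = 1$; here's a sketch (again assuming numpy and scipy, with arbitrary points and an arbitrary $v$), using the constant $\sqrt{\pi/(4\gamma)}$ from Result 1 as the factor of proportionality.

```python
import numpy as np
from scipy.integrate import quad

gamma = 0.7
xs = np.array([-1.0, 0.2, 0.9, 2.5])   # distinct x_i
v = np.array([0.5, -1.3, 2.0, 0.7])    # an arbitrary vector v

# K_ij = exp(-gamma * (x_i - x_j)^2)
K = np.exp(-gamma * (xs[:, None] - xs[None, :]) ** 2)
quad_form = v @ K @ v

# integrand (sum_i v_i h_i(x))^2 with h_i(x) = exp(-2*gamma*(x - x_i)^2)
def integrand(x):
    return np.sum(v * np.exp(-2 * gamma * (x - xs) ** 2)) ** 2

integral, _ = quad(integrand, -np.inf, np.inf)

# the integral equals the quadratic form times the positive constant sqrt(pi / (4*gamma))
print(np.isclose(integral, (np.pi / (4 * gamma)) ** 0.5 * quad_form))  # True
```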

The first thing I’ll do is show that $\sum_i v_i h_i$ can’t be the zero function unless $v = \0$. Then I’ll show that if the integral of a non-negative continuous function is zero then the function is itself everywhere zero.

Result 2: The functions $h_1,\dots,h_n$ are linearly independent so long as the $x_i$ are distinct.

Pf: For this section I’ll subsume the factor of $2$ into $\gamma$ and use $h_i(x) = \exp(-\gamma\|x-x_i\|^2)$ for simplicity.

Suppose for $v \in \R^n$ that
$$
\sum_{i=1}^n v_i h_i(x) = 0
$$
for all $x \in \R^p$. Expanding the norm in $h_i$ I have
$$
\begin{aligned}e^{-\gamma\|x\|^2}\sum_{i} v_i e^{-\gamma\|x_i\|^2}\exp(2\gamma x_i^Tx) &= 0 \\&
\implies \sum_i w_i\exp(2\gamma x_i^Tx) = 0\end{aligned}
$$
with $w_i := v_i e^{-\gamma\|x_i\|^2}$, after dividing through by $e^{-\gamma\|x\|^2} > 0$. The map $v \mapsto w$ is a bijection (each $e^{-\gamma\|x_i\|^2}$ is positive) so if I show $w = \0$ then I’ll know $v = \0$ too.

This holds for any $x \in \R^p$, so I’m going to choose a particular $z \in \R^p$ and evaluate the equation at $x = (j-1)z / (2\gamma)$ for $j=1,2,\dots,n$. Pretending I already have my $z$, each exponential becomes
$$
\exp[2\gamma(j-1)z^Tx_i/(2\gamma)] = \exp[(j-1)z^Tx_i] = \exp(z^Tx_i)^{j-1}
$$
so if I let $\alpha_i = \exp(z^Tx_i)$ I now have
$$
\sum_i w_i\alpha_i^{j-1} = 0, \hspace{5mm} j=1,2,\dots,n
$$
Over all $j$ this says that a linear combination of the columns of a Vandermonde matrix (or the transpose of one, depending on how it’s defined) with nodes $\alpha_1,\dots,\alpha_n$ equals zero. Vandermonde matrices are full rank if and only if the $\alpha_i$ are distinct (I proved this here), so I just need to show that I can choose a $z$ making the $\alpha_i$ distinct, and then $w = \0$ forces $v = \0$ and I’m done.

I need to find a $z \in \R^p$ such that $z^Tx_1, \dots, z^Tx_n$ are all distinct. I’ll approach this problem by finding a $z\in\R^p$ such that for $i<j$ I have $z^Tx_i \neq z^Tx_j$. To that end let $S_{ij} = \{z \in \R^p : z^T(x_i - x_j) \neq 0\}$ and note that $S_{ij}^c$ is just the hyperplane with $x_i-x_j$ as its normal vector. Because the $x_i$ are distinct, $x_i-x_j$ will never be $\0$ so all of the $S_{ij}^c$ are $(p-1)$-dimensional subspaces. This means I want to make sure I don’t pick $z$ from
$$
\left(\bigcap_{i < j} S_{ij}\right)^c = \bigcup_{i<j}S_{ij}^c,
$$
which is a finite union of hyperplanes and therefore has Lebesgue measure zero, so I can produce a valid $z$ with probability $1$ just by sampling from $\mathcal N_p(\0, I)$.

$\square$
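To make the $z$-picking step concrete, here is a small sketch (assuming numpy; the points, dimensions, and seed are arbitrary) that draws $z \sim \mathcal N_p(\0, I)$, forms the $\alpha_i$, and checks that the resulting Vandermonde matrix has full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 5
X = rng.normal(size=(n, p))        # rows are distinct points x_1, ..., x_n

z = rng.normal(size=p)             # z ~ N_p(0, I)
alpha = np.exp(X @ z)              # alpha_i = exp(z^T x_i)

# V[j, i] = alpha_i ** j, so row j encodes the equation sum_i w_i alpha_i^j = 0
V = np.vander(alpha, n, increasing=True).T

print(len(np.unique(alpha)) == n)      # the projections z^T x_i are distinct
print(np.linalg.matrix_rank(V) == n)   # so the Vandermonde matrix is full rank
```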

Result 3: Let $f : \R^p \to\R$ be continuous, integrable w.r.t. the Lebesgue measure, and non-negative. Then $\int f \,\text d\lambda = 0\implies f = 0$ everywhere.

Pf: Suppose $f$ is not zero everywhere. Then there is some $x_0 \in \R^p$ such that $f(x_0) > 0$. Since $f$ is continuous, taking $\e = f(x_0) / 2$ there is some $\delta > 0$ such that
$$
\|x-x_0\| < \delta \implies |f(x_0) - f(x)| < f(x_0) / 2.
$$
Unpacking the absolute value,
$$
-f(x_0) / 2 < f(x_0) - f(x) < f(x_0) / 2 \implies \frac 32 f(x_0) > f(x) > \frac 12 f(x_0),
$$
which means $f$ is positive on $N_\delta(x_0) := \{x \in \R^p : \|x - x_0\| < \delta\}$. Then
$$
\int_{\R^p} f\,\text d\lambda = \int_{N_\delta(x_0)} f\,\text d\lambda + \int_{N_\delta(x_0)^c} f\,\text d\lambda \geq \int_{N_\delta(x_0)} f\,\text d\lambda
$$
since $f \geq 0$ on $N_\delta(x_0)^c$. $N_\delta(x_0)$ is an open ball and therefore has positive measure, and $f$ is strictly positive on $N_\delta(x_0)$, so $\int_{N_\delta(x_0)} f\,\text d\lambda > 0$. This contradicts $\int f \,\text d\lambda = 0$, therefore there is no such $x_0$ and $f = 0$ everywhere.

$\square$

I can now finish my main result: I already showed that for $v\in\R^n$
$$
v^TK v \propto \int_{\R^p} \left(\sum_{i=1}^n v_i h_i\right)^2\,\text d\lambda
$$
with a positive constant of proportionality. Suppose $v^T K v = 0$; then the integral is zero too. $\left(\sum_{i=1}^n v_i h_i\right)^2$ is a continuous, integrable, and non-negative function so Result 3 tells me that $\sum_{i=1}^n v_i h_i = 0$ everywhere. But by Result 2 the $\{h_i\}$ are linearly independent so the only way to sum them into the zero function is if $v = \0$. This means that $K$ is strictly positive definite.
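As a numerical spot check of the conclusion (a sketch assuming numpy; the points and $\gamma$ are arbitrary), the kernel matrix built from distinct points should have a strictly positive smallest eigenvalue and admit a Cholesky factorization.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5
X = rng.normal(size=(10, 2))       # 10 distinct points in R^2

# K_ij = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)

print(np.linalg.eigvalsh(K).min() > 0)   # smallest eigenvalue is strictly positive
np.linalg.cholesky(K)                    # succeeds without raising, so K is PD
```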

Application: interpolation

Suppose I have collected pairs of data $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i\in\R^p$ and $y_i\in\R$. I want to come up with a function $f$ such that
$$
f(x_i) = y_i
$$
for $i=1,\dots,n$, i.e. I want to obtain an interpolator. I’ve looked at polynomial interpolation in a previous post but here I’ll show how to use my previous results to come up with a different interpolator. I’ll be supposing that all of the $x_i$ are distinct so that such a function can exist.

I will consider functions of the form
$$
f = \sum_{i=1}^n \beta_i h_i
$$
i.e. $f$ is a linear combination of the $h_i$, where here I take the Result 2 convention $h_i(x) = \exp(-\gamma\|x - x_i\|^2) = k(x, x_i)$. This means I’m using the $h_i$ as basis functions and picking an interpolator reduces to the finite-dimensional problem of picking a $\hat\beta\in\R^n$. $\gamma$ will be treated as known and fixed.

I need to find $\hat\beta$ by solving
$$
y_1 = \sum_{i=1}^n \beta_i h_i(x_1) \\
\vdots \\
y_n = \sum_{i=1}^n \beta_i h_i(x_n).
$$
I can write this as a linear system: the matrix multiplying $\beta$ has $(j,i)$ element $h_i(x_j) = k(x_j, x_i)$, which is exactly $K_{ji} = K_{ij}$. So I actually have the system
$$
\mathbf y = K \beta \\
\implies \hat\beta = K^{-1}\mathbf y
$$
where I know $K$ is invertible because I showed it’s positive definite in the first section. That’s all I need to fit the interpolator. Below I show an example on simulated data, and the effect of $\gamma$ is clear: when $\gamma$ is small the function fluctuates much more because farther points still matter, whereas as $\gamma$ increases the function becomes increasingly locally determined.
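Here is a minimal sketch of such an example on simulated one-dimensional data (assuming numpy; the data-generating function, seed, and the two $\gamma$ values are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-3, 3, size=8))          # distinct x_i
y = np.sin(x) + 0.1 * rng.normal(size=x.size)    # simulated responses y_i

def fit_predict(x_train, y_train, x_new, gamma):
    # K_ij = k(x_i, x_j); beta_hat = K^{-1} y; f(x) = sum_i beta_i k(x, x_i)
    K = np.exp(-gamma * (x_train[:, None] - x_train[None, :]) ** 2)
    beta_hat = np.linalg.solve(K, y_train)
    K_new = np.exp(-gamma * (x_new[:, None] - x_train[None, :]) ** 2)
    return K_new @ beta_hat

x_mid = 0.5 * (x[:-1] + x[1:])                   # points between the observations
for gamma in (0.1, 10.0):                        # small vs large gamma
    exact = np.allclose(fit_predict(x, y, x, gamma), y)   # exact interpolation
    between = fit_predict(x, y, x_mid, gamma)              # behavior between points
    print(gamma, exact, np.round(between, 2))
```

Both fits reproduce the $y_i$ exactly; the differences show up between the observations, where the small-$\gamma$ fit is influenced by distant points while the large-$\gamma$ fit is dominated by the nearest ones.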

 
