Interpolating Polynomials

In this post I want to take a look at interpolation with a single polynomial. I’ll prove the existence and uniqueness of a minimum degree interpolating polynomial and then I’ll discuss a better basis as well as an example of where a bad choice of input values leads to poor performance. In general I will consider a collection of real-valued data points $D_n := \{(x_1,y_1), \dots, (x_n,y_n)\}$ with $x_1 < \dots < x_n$; the $x_i$ being distinct makes it possible for there to be a function doing the interpolation in the first place.

Interpolating polynomials

Result 1: there is a unique polynomial of degree at most $n-1$ that interpolates $D_{n}$.

Pf: first I’ll find an interpolating polynomial of minimum degree and then I’ll show it is unique. I’ll begin by guessing that a polynomial of degree at most $n-1$ can do the job, and then verify that it does.

So let $p_n(x) = \sum_{k=0}^{n-1} a_k x^k$ be my polynomial. I need to find coefficients $(a_0,\dots,a_{n-1})$ so that
$$
\begin{array}{c} p_n(x_1) = \sum_{k=0}^{n-1} a_kx_1^k = y_1 \\ \vdots \\ p_n(x_n) = \sum_{k=0}^{n-1} a_kx_n^k = y_n\end{array}
$$
At this point $D_n$ is fixed so I actually have a linear system in $a := (a_0,\dots,a_{n-1})$, which I can write as
$$
\text{solve}\hspace{.2cm} Va = y\hspace{2mm}\text{for } a \in \mathbb R^n
$$
where
$$
V = \begin{bmatrix}1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\
1 & x_2 & x_2^2 & \cdots & x_2^{n-1} \\
\vdots & \vdots & \vdots & \cdots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^{n-1}\end{bmatrix}
$$
$V$ is an example of a Vandermonde matrix. If a polynomial of degree $j < n-1$ happens to interpolate $D_n$, this shows up in the solution as $a_{j+1} = \dots = a_{n-1} = 0$.

It turns out that $\det V = \prod_{i < j} (x_j - x_i)$ (I’ll prove this in Lemma 1 after this proof), so by my assumption that $x_1 < \dots < x_n$ I now know $\det V \neq 0$. This means $a = V^{-1}y$ gives a vector of coefficients such that $p_n$ exactly interpolates $D_n$ (ignoring numerical issues).

Now for uniqueness, suppose $b \in \mathbb R^n$ is another set of coefficients so that $Vb = y$. This means $Va=Vb$, and since $V$ is invertible I have $a = b$, so there is actually just one set of coefficients.

$\square$

So I just showed that any set of points $D_n$ with $x_1 < \dots < x_n$ can be interpolated uniquely by a polynomial of degree at most $n-1$. Existence came from the $x_i$ being distinct, which showed up as $V$ being invertible ($\det V \neq 0$), so the system has a solution at all. Uniqueness came from requiring the degree to be at most $n-1$. If, for instance, I allowed the degree to go up to $2n-1$, then $V \in \mathbb R^{n\times 2n}$ would have rank $n$ (and therefore a null space of dimension $n$), so $Va=Vb$ would no longer imply $a=b$.
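
As a small illustration of the remark in the proof that lower-degree structure shows up as trailing zeros in $a$, here’s a quick check on four made-up points that happen to lie on a line (the point values and the inline Vandermonde construction are just for this sketch):

# four points on the line y = 2 + 3x: the quadratic and cubic coefficients vanish
x_line <- c(0.1, 0.3, 0.6, 0.9)
y_line <- 2 + 3 * x_line
solve(outer(x_line, 0:3, `^`), y_line)  # approximately (2, 3, 0, 0)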

Now here’s the proof of the determinant of $V$:

Lemma 1: $\det V = \prod_{i < j} (x_j - x_i)$

Pf: by induction on $n$. Let $V_n$ denote a Vandermonde matrix of size $n$ over a vector $x = (x_1,\dots,x_n)$.

Base case:

$$
\det V_2 = \left\vert \begin{array}{cc}1 & x_1 \\ 1 & x_2 \end{array}\right\vert = x_2 - x_1
$$
so $\det V_2 = x_2 - x_1$ as desired (and I’m leaving $V_1$ aside since its formula involves the empty product).

Inductive case: suppose true for $n=2,\dots,k-1$ and consider $V_{k}$.
$$
\det V_{k} = \left\vert \begin{array}{ccccc}1 & x_1 & x_1^2 & \cdots & x_1^{k-1} \\
1 & x_2 & x_2^2 & \cdots & x_2^{k-1} \\
\vdots & \vdots & \vdots & \cdots & \vdots \\ 1 & x_k & x_k^2 & \cdots & x_k^{k-1}\end{array} \right\vert
$$
Subtract row 1 from each other row (which doesn’t change the determinant) so
$$
\det V_{k} = \left\vert \begin{array}{cccccc}1 & x_1 & x_1^2 & \cdots & x_1^{k-2} & x_1^{k-1} \\
0 & x_2 - x_1 & x_2^2 - x_1^2 & \cdots & x_2^{k-2} - x_1^{k-2} & x_2^{k-1} - x_1^{k-1}\\
\vdots & \vdots & \vdots & \cdots & \vdots& \vdots \\
0 & x_k - x_1 & x_k^2 - x_1^2 & \cdots & x_k^{k-2} - x_1^{k-2} & x_k^{k-1} - x_1^{k-1}\end{array} \right\vert
$$
Next I’ll subtract $x_1$ times column $k-1$ from column $k$ (which again doesn’t change $\det V_k$) so
$$
\det V_{k} = \left\vert \begin{array}{cccccc}1 & x_1 & x_1^2 & \cdots & x_1^{k-2} & x_1^{k-1} - x_1 x_1^{k-2} \\
0 & x_2 - x_1 & x_2^2 - x_1^2 & \cdots & x_2^{k-2} - x_1^{k-2} & x_2^{k-1} - x_1^{k-1} - x_1 (x_2^{k-2} - x_1^{k-2})\\
\vdots & \vdots & \vdots & \cdots & \vdots& \vdots \\
0 & x_k - x_1 & x_k^2 - x_1^2 & \cdots & x_k^{k-2} - x_1^{k-2} & x_k^{k-1} - x_1^{k-1} - x_1(x_k^{k-2} - x_1^{k-2})\end{array} \right\vert
$$

Simplifying the last column (the top entry is $x_1^{k-1} - x_1 \cdot x_1^{k-2} = 0$, and the entry in row $j$ is $x_j^{k-1} - x_1^{k-1} - x_1(x_j^{k-2} - x_1^{k-2}) = (x_j - x_1)x_j^{k-2}$), I get
$$
\det V_{k} = \left\vert \begin{array}{cccccc}1 & x_1 & x_1^2 & \cdots & x_1^{k-2} & 0 \\
0 & x_2 - x_1 & x_2^2 - x_1^2 & \cdots & x_2^{k-2} - x_1^{k-2} & (x_2 - x_1)x_2^{k-2}\\
\vdots & \vdots & \vdots & \cdots & \vdots& \vdots \\
0 & x_k - x_1 & x_k^2 - x_1^2 & \cdots & x_k^{k-2} - x_1^{k-2} & (x_k - x_1) x_k^{k-2} \end{array} \right\vert
$$
I can continue with subtracting $x_1$ times column $k-2$ from column $k-1$, and so on, until finally I subtract $x_1$ times column $1$ from column $2$. This leaves me with
$$
\det V_{k} = \left\vert \begin{array}{cccccc}1 & 0 & 0& \cdots & 0 & 0 \\
0 & x_2 - x_1 & (x_2-x_1)x_2 & \cdots & (x_2 - x_1)x_2^{k-3} & (x_2 - x_1)x_2^{k-2}\\
\vdots & \vdots & \vdots & \cdots & \vdots& \vdots \\
0 & x_k - x_1 & (x_{k} - x_1) x_k & \cdots & (x_{k} - x_1) x_k^{k-3} & (x_k - x_1) x_k^{k-2} \end{array} \right\vert.
$$
Now I can factor $x_j - x_1$ out of row $j$ (for $j = 2, \dots, k$) to get
$$
\det V_k =\left\vert \begin{array}{cccccc}1 & 0 & 0& \cdots & 0 & 0 \\
0 & 1 & x_2 & \cdots & x_2^{k-3} & x_2^{k-2}\\
\vdots & \vdots & \vdots & \cdots & \vdots& \vdots \\
0 & 1 & x_k & \cdots & x_k^{k-3} & x_k^{k-2} \end{array} \right\vert \prod_{j=2}^k (x_j - x_1)
$$
By cofactor expansion along the first row, the determinant on the right-hand side is that of a $(k-1)\times(k-1)$ Vandermonde matrix over $(x_2,\dots,x_k)$, so by the inductive hypothesis
$$
\det V_k =\prod_{j=2}^k (x_j - x_1) \prod_{2 \leq i < j \leq k} (x_j - x_i) = \prod_{1\leq i < j \leq k} (x_j -x_i)
$$
and the result is proved.

$\square$
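
As a quick numerical sanity check of Lemma 1, here’s a small sketch with a made-up set of five nodes, comparing det() to the product formula:

# compare det(V) to the product formula over all pairs i < j
set.seed(1)
xs <- sort(runif(5))
V5 <- outer(xs, 0:4, `^`)
prod_formula <- prod(apply(combn(5, 2), 2, function(ij) xs[ij[2]] - xs[ij[1]]))
c(det(V5), prod_formula)  # the two values agree up to rounding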

So there I have it: I can build an interpolating polynomial via $p(x) = \sum_{k=0}^{n-1} a_k x^{k}$ with $a = V^{-1}y$.

Here’s an example:

# Vandermonde-style basis matrix: column k+1 holds x^k
make_V <- function(x, n) outer(x, 0:(n-1), `^`)
set.seed(123)
n <- 10
xvec <- sort(runif(n))
y <- rnorm(n)
plot(y ~ xvec, main = bquote("Degree" ~ .(n - 1) ~ "Polynomial Interpolation"),
     ylim = 1.5 * range(y), pch = 19, xlab = "x")
V <- make_V(xvec, n)
a_hat <- solve(V, y)  # coefficients a = V^{-1} y
# evaluate the fitted polynomial on a fine grid; curve()'s n = 500 is its grid size,
# while the n inside make_V() refers to the n <- 10 above
curve(make_V(x, n) %*% a_hat, from = 0, to = 1,
      n = 500, col = 2, add = TRUE, lwd = 2)

So this polynomial did do the trick, but in a statistical sense it would be a terrible model. Over the plotted interval $[0, 1]$ the polynomial swings between roughly $-3200$ and $750$.

Thinking of this as a linear regression, I know things get unstable as the predictors become increasingly correlated, and that’s exactly what’s happening here: the high-degree monomial columns are nearly collinear. So my next goal is to see if I can find a better-conditioned system of equations to solve.
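
To put some numbers on this, here’s a quick check reusing V and n from the code above (kappa() is base R’s condition number estimate):

kappa(V)                 # condition number estimate: enormous
cor(V[, n - 1], V[, n])  # the x^8 and x^9 columns are almost perfectly correlated
det(V)                   # and the determinant is vanishingly small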

Lagrange polynomials

I’ve been representing my polynomial in terms of the basis $\{1, x, \dots, x^{n-1}\}$, but this can lead to an extremely ill-conditioned system to solve. If I pick a basis with better properties then perhaps I can at least avoid this ill-conditioning. This can’t fix the fundamental issue of the intense fluctuations the polynomial needs in order to do its job, since I’m still getting the same polynomial, but at least finding its coordinates can be easier.

More specifically, in the interpolation of the previous section I solved $Va = y$, so $a$ is the coordinate vector of the interpolant with respect to the basis $\{1,x,\dots,x^{n-1}\}$. But this basis matrix can quickly become computationally singular: suppose, as in my example, that $0 \leq x_1 < \dots < x_n \leq 1$. Then every factor in $\det V$ is in $(0, 1)$, so $\det V$ becomes tiny very quickly. In the example above $\det V$ is on the order of $10^{-28}$, and that’s for only 10 points. So I want a better-conditioned basis matrix. My interpolator will still be a polynomial of degree at most $n-1$, but now I’m going to represent $y$ in a basis whose matrix of evaluations at the knots is as well-conditioned as possible: in fact, the identity. This is what Lagrange polynomials do.

Consider
$$
L(x) = \sum_{k=1}^{n} a_k \ell_k(x)
$$
where
$$
\ell_k(x) = \prod_{\substack{1 \leq m \leq n \\ m \neq k}} \frac{x - x_m}{x_k - x_m}
$$
so now, rather than each basis function being a single power of $x$, I’m expanding in terms of $n$ polynomials, each of degree $n-1$.

Here’s why these are so great: Consider evaluating $\ell_k$ on a knot $x_j$, so I have
$$
\ell_k(x_j) = \prod_{\substack{1 \leq m \leq n \\ m \neq k}} \frac{x_j - x_m}{x_k - x_m}.
$$
If $k \neq j$ then $x_j - x_j$ appears as one of the numerator factors, so $\ell_k(x_j) = 0$. If instead $j=k$, then every factor is $1$, so $\ell_k(x_k) = 1$. Thus $\mathcal L \in \mathbb R^{n\times n}$, the matrix of basis evaluations $\ell_k(x_j)$ at the knots, is in fact $I_n$! This means $\hat a = y$: the coefficients are just the observed values. Also, the assumption that the $x_i$ are distinct guarantees that $\ell_k$ never involves division by $0$.

Now I’ll repeat the above example except with the Lagrange basis. I’ll include the results of the previous one for comparison. Before I do that, I’ll plot what these basis functions actually look like.

# returns a function that evaluates the k-th Lagrange basis polynomial at a scalar x
lagr_basis_eval_maker <- function(xnodes) {
  function(x, k) prod((x - xnodes[-k]) / (xnodes[k] - xnodes[-k]))
}
# returns a closure that gives a vectorized interpolator: sum_k y_k * l_k(x)
make_interpolator <- function(xvec, y) {
  lagr_basis <- lagr_basis_eval_maker(xvec)
  Vectorize(function(x0) sum(sapply(seq_along(xvec), function(j) lagr_basis(x0, j)) * y))
}
set.seed(123)
n <- 7
xnodes <- sort(runif(n, -1, 1))
# blank canvas, then one curve per basis function and a dashed line at each knot
plot(0, 0, col = "white", xlim = c(-1, 1), ylim = c(-4, 4),
     ylab = "l(x)", xlab = "x", main = "Lagrange Basis Functions")
lagr_eval <- lagr_basis_eval_maker(xnodes)
invisible(sapply(1:n, function(k) curve(sapply(x, lagr_eval, k = k), add = TRUE, col = k, lwd = 2)))
abline(v = xnodes, lty = 2)

The dashed vertical lines mark the knots, and I can see how at each knot exactly one basis function takes the value $1$ while all the rest are $0$, as expected. They look a lot like sines and cosines, and I think that’s sensible enough: this kind of alternating waviness is a natural way to get functions that don’t interfere with each other at the knots (much like with Fourier series).
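
Here’s a quick numerical check, reusing lagr_eval and xnodes from above, that the matrix of basis evaluations at the knots really is the identity:

# evaluate every basis function at every knot; the result should be (numerically) I_n
L_mat <- outer(1:n, 1:n, Vectorize(function(j, k) lagr_eval(xnodes[j], k)))
max(abs(L_mat - diag(n)))  # essentially zero, up to floating point error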

So now I’ll use this to fit the interpolator to the same random data as in the Vandermonde example:

set.seed(123)
n <- 10
xvec <- sort(runif(n))
y <- rnorm(n)
plot(y ~ xvec, main = paste("Degree", n - 1, "Lagrange Basis"), ylim = 1.5 * range(y), pch = 19)
curve(make_interpolator(xvec, y)(x), from = 0, to = 1, n = 500, col = 4, add = TRUE, lwd = 2)
# overlay the earlier Vandermonde-based fit (dashed) for comparison
curve(make_V(x, n) %*% a_hat, from = 0, to = 1, n = 500, col = 2, add = TRUE, lty = 2, lwd = 2)

I can see that the Lagrange-basis interpolator is exactly the same polynomial as the original Vandermonde one, as it should be. I didn’t change the interpolator itself, I just found a better basis in which to compute it. So I fixed the numerical issues with solving the poorly-conditioned Vandermonde system, but I didn’t do anything about the fact that high-degree polynomials are prone to extreme oscillations and are just not the best interpolators, no matter the basis used.
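
To confirm the two curves really do coincide, here’s a quick comparison on a fine grid, reusing xvec, y, n, a_hat, make_V, and make_interpolator from above:

# largest difference between the two representations of the same polynomial
grid <- seq(0, 1, length.out = 500)
p_vand <- drop(make_V(grid, n) %*% a_hat)
p_lagr <- make_interpolator(xvec, y)(grid)
max(abs(p_vand - p_lagr))  # small relative to the size of the curve, which reaches the thousands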

The Runge phenomenon

On the subject of extreme oscillations, I’m going to finish this post with a standard example: the Runge phenomenon. Consider interpolating

$$
f(x) = \frac{1}{1 + 25x^2}
$$
evaluated over an evenly-spaced grid with $-1 \leq x \leq 1$. Below is a gif showing what the interpolators look like for $n$ up to $21$.
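
Here’s a static sketch of a single frame, reusing make_interpolator from above; the helper runge and the choice n_r = 15 are just for illustration:

# interpolate the Runge function on an evenly spaced grid and overlay the true function
runge <- function(x) 1 / (1 + 25 * x^2)
n_r <- 15
x_eq <- seq(-1, 1, length.out = n_r)
curve(make_interpolator(x_eq, runge(x_eq))(x), from = -1, to = 1, n = 500, col = 2, lwd = 2,
      ylab = "f(x)", main = "Runge Function, Evenly Spaced Nodes")
curve(runge, add = TRUE, lwd = 2)
points(x_eq, runge(x_eq), pch = 19)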

I’ll explore this a bit more in my next post. But in the meantime, one way this can be mitigated is by interpolating over a more intelligently chosen grid. One such grid is given by the Chebyshev nodes, which place more points in the problem areas near the endpoints. The same gif, except with Chebyshev nodes, is below.


I’ve included a rug plot to show where the evaluation points are. They are increasingly dense near the ends of the interval, which makes sense since those are the problem areas. For $n \geq 15$ or so this looks pretty good; it definitely fixes the issue.
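
And a static sketch of the Chebyshev version, where cheb_nodes is a small helper I’m defining here (the nodes are $x_k = \cos\big((2k-1)\pi/(2n)\big)$ for $k = 1,\dots,n$), again reusing make_interpolator and the runge and n_r defined above:

# Chebyshev nodes cluster near -1 and 1, taming the oscillation at the ends
cheb_nodes <- function(n) sort(cos((2 * seq_len(n) - 1) * pi / (2 * n)))
x_ch <- cheb_nodes(n_r)
curve(make_interpolator(x_ch, runge(x_ch))(x), from = -1, to = 1, n = 500, col = 4, lwd = 2,
      ylab = "f(x)", main = "Runge Function, Chebyshev Nodes")
curve(runge, add = TRUE, lwd = 2)
points(x_ch, runge(x_ch), pch = 19)
rug(x_ch)  # show where the nodes fall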

Nevertheless, this still makes me uncomfortable with interpolating polynomials, because it’s possible for them to perform so horribly. In a future post I’ll introduce splines, which are piecewise polynomials. They fix this issue because each piece can bend as it needs to without causing major changes elsewhere in the function. They also avoid the need for careful placement of the interpolation points, which definitely makes things easier. I can also interpolate using functions that aren’t polynomials at all, either piecewise or everywhere; I give one example in my post here.
