Updating a linear regression with new data

$\newcommand{\R}{\mathbb R}$$\newcommand{\y}{\mathbf y}$$\newcommand{\thb}{\tilde{\hat\beta}}$$\newcommand{\L}{\Lambda}$$\newcommand{\tr}{\operatorname{tr}}$In this post I’ll look at updating a linear regression with new data. One place where this can be useful is online learning, where observations arrive in a stream and I may not even be able to store all of them. As I’ll see, I can do the update in a way where what I need to store grows only with the dimension of the problem, not with the number of observations.

So suppose I have a full column rank data matrix $X \in \R^{n\times p}$ (so $n \geq p$), and a vector of target values $\y \in \R^n$. Regressing $\y$ on $X$ gives me $\hat\beta = (X^TX)^{-1}X^T\y$. I now will explore what happens if I acquire a new observation pair $(x, y)$ and want to update $\hat\beta$. I’ll treat $x \in \R^p$ as a column vector.
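
To have something concrete to check the algebra against, here’s a minimal numpy sketch that I’ll keep extending below; the sizes, the random data, and names like `X`, `y_vec`, and `beta_hat` are just illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                       # illustrative sizes with n >= p
X = rng.normal(size=(n, p))        # full column rank with probability 1
y_vec = rng.normal(size=n)

# Normal-equations estimate; it should agree with numpy's least-squares solver.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y_vec)
beta_lstsq, *_ = np.linalg.lstsq(X, y_vec, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```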

I’ll append $x$ to $X$ and $y$ to $\y$ to get
$$
\tilde X := {X \choose x^T},\;\; \tilde \y = {\y \choose y}.
$$

Claim: $\text{rank}(\tilde X) = p$ still.

Pf: The row rank of a matrix equals the column rank, so $\text{rank}(\tilde X)$ is the dimension of the span of the rows. The rows of $X$ are among the rows of $\tilde X$ and already span a space of dimension $p$, which is the maximum possible since $\tilde X$ has $p$ columns. Adding the row $x^T$ can’t shrink this span, so $\text{rank}(\tilde X) = p$ too.

$\square$

This is good, as it means I don’t have to worry about whether $\thb$ exists and is unique, given that $\hat\beta$ does.

I know $\thb = (\tilde X^T\tilde X)^{-1}\tilde X^T\tilde \y$ but this requires completely recomputing everything, and I can do better than that. To that end, I can write $\thb$ in terms of $X$, $x$, $\y$, and $y$ as follows:
$$
\thb = (\tilde X^T\tilde X)^{-1}\tilde X^T\tilde \y \\
= (X^TX + xx^T)^{-1} (X^T\y + yx).
$$
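
Continuing the sketch above, here’s a quick numerical check, with a made-up new pair `x`, `y_new`, that appending one row really does reduce to these two rank-one updates:

```python
# Append one new observation (x, y_new) and check that the Gram matrix and
# moment vector update exactly as in the display above.
x = rng.normal(size=p)
y_new = rng.normal()

X_tilde = np.vstack([X, x])
y_tilde = np.append(y_vec, y_new)

assert np.allclose(X_tilde.T @ X_tilde, X.T @ X + np.outer(x, x))
assert np.allclose(X_tilde.T @ y_tilde, X.T @ y_vec + y_new * x)
```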
$X^TX + xx^T$ is a rank 1 update to an invertible matrix so, letting $C = X^TX$, I have
$$
(C + xx^T)^{-1} = C^{-1} - \frac{C^{-1}xx^TC^{-1}}{1 + x^TC^{-1}x}
$$
by the Sherman-Morrison formula. $C^{-1}xx^TC^{-1}$ is a rank one matrix, so my rank 1 update to $C$ leads to a rank 1 update to $C^{-1}$ as well. Furthermore, $1 + x^T C^{-1}x > 0$ since $C^{-1}$ is positive definite (in my post on leverage scores I do a very similar computation except it’s removing an observation).
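
Continuing the same sketch, here’s a sanity check of the Sherman-Morrison update against a direct inversion (just a check, not an efficient implementation):

```python
# Sherman-Morrison: a rank-1 update to C becomes a rank-1 update to C^{-1}.
C = X.T @ X
C_inv = np.linalg.inv(C)
denom = 1.0 + x @ C_inv @ x        # strictly positive since C^{-1} is PD
C_new_inv = C_inv - np.outer(C_inv @ x, C_inv @ x) / denom
assert np.allclose(C_new_inv, np.linalg.inv(C + np.outer(x, x)))
```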

When the eigenvalues of $C$ are bounded away from $0$, so $C$ is invertible (which is my case), $C \mapsto C^{-1}$ is a smooth map, which means I can look at which $x$ lead to large changes in $C$ alone to get some intuition for when the change in $C^{-1}$ is large. Letting $f$ be my objective function, I have
$$
\begin{aligned}f(x) &:= \|C + xx^T\|^2_F = \tr((C + xx^T)^T(C + xx^T)) \\&
= \tr(C^2 + xx^TC + Cxx^T + xx^Txx^T) \\&
= \tr(C^2) + 2 x^TCx + (x^Tx)^2\end{aligned}
$$
where $\|\cdot\|_F^2$ is the squared Frobenius norm. I’ll look at maximizing this over unit vectors, so my Lagrangian is
$$
\begin{aligned}\mathcal L(x, \lambda) &= \tr(C^2) + 2 x^TCx + (x^Tx)^2 + \lambda(1-x^Tx) \\&
\implies \frac{\partial \mathcal L}{\partial x} = 4 Cx + 4(x^Tx)x - 2\lambda x.\end{aligned}
$$
$x$ is constrained to be a unit vector so $x^Tx = 1$ which means I have to solve
$$
4 Cx + 4x - 2\lambda x = 0 \\
\implies Cx = \frac{\lambda - 2}{2} x
$$
therefore $x$ is an eigenvector of $C$. I’ll let $\nu$ be $x$’s eigenvalue. Plugging this in, I have
$$
f(x) = \tr(C^2) + 2 \nu + 1
$$
which is maximized by $\nu$ being the largest eigenvalue of $C$.
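
As a numerical sanity check of this part (continuing the sketch above), I can compare $f$ at the top eigenvector of $C$ against a bunch of random unit directions:

```python
# Over unit vectors, ||C + xx^T||_F^2 should be largest at the top eigenvector.
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
v_top = eigvecs[:, -1]                 # eigenvector for the largest eigenvalue

def f(u):
    return np.linalg.norm(C + np.outer(u, u), "fro") ** 2

dirs = rng.normal(size=(1000, p))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
assert all(f(u) <= f(v_top) + 1e-9 for u in dirs)
```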

I’ll now work directly with $(C + xx^T)^{-1}$ and see whether this confirms my analysis: I have
$$
\begin{aligned}g(x) &:= \|(C + xx^T)^{-1}\|_F^2 \\&
= \left\|C^{-1} - \frac{C^{-1}xx^TC^{-1}}{1 + x^TC^{-1}x}\right\|_F^2 \\&
= \tr(C^{-2}) - 2\frac{x^T C^{-3}x}{1 + x^TC^{-1}x} + \left(\frac{x^T C^{-2}x}{1 + x^TC^{-1}x}\right)^2.\end{aligned}
$$
This is complicated enough that I’m not going to take derivatives; instead I’ll just check my previous result, that at least over the eigenvectors of $C$, $g$ is maximized by the top one.

Let $(x, \lambda)$ be an eigenpair of $C$. Then
$$
\begin{aligned}g(x) &= \tr(C^{-2}) - 2\frac{\lambda^{-3}}{1 + \lambda^{-1}} + \left(\frac{\lambda^{-2}}{1 + \lambda^{-1}}\right)^2 \\&
= \tr(C^{-2}) - \frac{2\lambda + 1}{\lambda^2(\lambda+1)^2}.\end{aligned}
$$
I want this to be large, which means I want the term with the $\lambda$s in it to be as small as possible. This happens when $\lambda$ is as large as possible, so this confirms my analysis that I’ll see large perturbations in $C^{-1}$ when my new observation is close to being a top eigenvector of $C$.
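
Again as a check, continuing the sketch above: evaluating $g$ at each unit eigenvector of $C$ and confirming the top one gives the largest value.

```python
# g evaluated at each eigenvector of C should be maximized at the top one.
def g(u):
    Ci_u = C_inv @ u
    return np.linalg.norm(C_inv - np.outer(Ci_u, Ci_u) / (1.0 + u @ Ci_u), "fro") ** 2

g_vals = [g(eigvecs[:, i]) for i in range(p)]
assert np.argmax(g_vals) == p - 1      # eigh sorts eigenvalues in ascending order
```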

Finishing this off,
$$
\begin{aligned}\thb &= \left(C^{-1} - \frac{C^{-1}xx^TC^{-1}}{1 + x^TC^{-1}x}\right)(X^T\y + yx) \\&
= \hat\beta - \frac{C^{-1}xx^T\hat\beta}{1 + x^TC^{-1}x} + yC^{-1}x - y \frac{C^{-1}xx^TC^{-1}x}{1 + x^TC^{-1}x} \\&
= \hat\beta + \frac{C^{-1}x}{1 + x^TC^{-1}x}\left(-x^T\hat\beta + y(1+x^TC^{-1}x) - yx^TC^{-1}x\right) \\&
= \hat\beta + \frac{(y - x^T\hat\beta)}{1+x^TC^{-1}x} C^{-1}x\end{aligned}
$$
so I can interpret this as taking a step from $\hat\beta$ in the direction given by $C^{-1}x$. It’ll be a small step when the new point $(x,y)$ is consistent with my current model, as indicated by a small residual $y - x^T\hat\beta$, or when $x^T C^{-1}x$ is large, which happens most when $x$ is close to a bottom eigenvector of $X^TX$. This is also where I can see that I can do the update without storing anything that grows with $n$: all I need to keep around is $\hat\beta$ and the $p\times p$ matrix $C^{-1}$.
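
Putting the pieces together, here’s a sketch of what the online update could look like in numpy, continuing the example above (the function name `update_regression` and its interface are my own choices); it keeps only $\hat\beta$ and $C^{-1}$ and matches a full refit on the enlarged data:

```python
def update_regression(beta_hat, C_inv, x, y):
    """Fold one new observation (x, y) into (beta_hat, C_inv).

    Everything stored is O(p^2); nothing grows with n."""
    Ci_x = C_inv @ x
    denom = 1.0 + x @ Ci_x
    beta_new = beta_hat + (y - x @ beta_hat) / denom * Ci_x     # step along C^{-1}x
    C_inv_new = C_inv - np.outer(Ci_x, Ci_x) / denom            # Sherman-Morrison
    return beta_new, C_inv_new

# The recursive update should match refitting on the enlarged data set.
beta_new, C_inv_new = update_regression(beta_hat, C_inv, x, y_new)
beta_refit = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y_tilde)
assert np.allclose(beta_new, beta_refit)
```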

As for the direction, if $x$ is an eigenvector of $C$ then that is the direction the step is in, since I’ll have $C^{-1}x = x / \lambda$. More generally, letting $C = Q\L Q^T$ be the eigendecomposition of $C$, $C^{-1}x = Q\L^{-1}Q^Tx$ rotates $x$ into the eigenbasis, scales the coordinates by $\L^{-1}$, and rotates back into the standard basis. This means the directions of the eigenvectors with small eigenvalues are emphasized and those with large eigenvalues are de-emphasized. Thinking about the loss surface, this makes sense because the old $\hat\beta$ won’t be at the bottom of the new paraboloid and I’m trying to take one step to fix this. That step will likely need to move farther in the directions with smaller eigenvalues because the loss surface is less steep there.

I can arrive at this in a different way: via Newton’s method for minimization. Let $\ell(b) = \frac 12\|\y - Xb\|^2$ be the loss for a particular parameter vector. This gives $\nabla \ell = X^TXb - X^T\y$ and a Hessian of $H = X^TX$. Since $\ell$ is a paraboloid, the quadratic approximation is exact and Newton’s method finds the minimum in a single step. In particular, if I’m currently at $\hat\beta$ and want to update to $\thb$ by using $\tilde X$ and $\tilde \y$ in $\nabla \ell$ and $H$, I’ll have
$$
\thb = \hat\beta - H^{-1}\nabla \ell = \hat\beta - (\tilde X^T\tilde X)^{-1}(\tilde X^T\tilde X\hat\beta - \tilde X^T\tilde \y).
$$
I could just distribute $(\tilde X^T\tilde X)^{-1}$ through and be done, but I want to see how this is equivalent to the other update formula. Expanding $\tilde X$ and $\tilde \y$, I have
$$
\thb = \hat\beta - (C + xx^T)^{-1}(C\hat\beta - X^T\y + xx^T\hat\beta - yx).
$$
$\hat\beta$ is the unique solution to $X^TXb = X^T\y$ so $C\hat\beta - X^T\y = \mathbf 0$ and this leaves me with
$$
\begin{aligned}\thb &= \hat\beta - (C + xx^T)^{-1}(x^T\hat\beta - y)x \\&
= \hat\beta - \left[C^{-1} - \frac{C^{-1}xx^TC^{-1}}{1 + x^TC^{-1}x}\right](x^T\hat\beta - y)x \\&
= \hat\beta + \frac{y - x^T\hat\beta}{1+x^TC^{-1}x} \left[C^{-1}(1 + x^TC^{-1}x) - C^{-1}xx^TC^{-1}\right]x \\&
= \hat\beta + \frac{y - x^T\hat\beta}{1+x^TC^{-1}x} \left[C^{-1}x + x^TC^{-1}x\,C^{-1}x - C^{-1}xx^TC^{-1}x\right] \\&
= \hat\beta + \frac{y - x^T\hat\beta}{1+x^TC^{-1}x} C^{-1}x\end{aligned}
$$
which is exactly what I had before. So I can interpret the update rule as one iteration of Newton’s method with the updated Hessian and gradient, and since Newton’s method uses an exact second-order model of a paraboloid, that single step lands exactly at the minimum.
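
And the same check for the Newton view, continuing the sketch above: one Newton step from $\hat\beta$ with the updated gradient and Hessian reproduces the same coefficients.

```python
# One Newton step from beta_hat using the enlarged data's gradient and Hessian.
H = X_tilde.T @ X_tilde
grad = H @ beta_hat - X_tilde.T @ y_tilde
beta_newton = beta_hat - np.linalg.solve(H, grad)
assert np.allclose(beta_newton, beta_new)
```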
