Comparing a mixed model with a ridge regression

$\newcommand{\var}{\text{Var}}$$\newcommand{\cov}{\text{Cov}}$$\newcommand{\E}{\text{E}}$$\newcommand{\one}{\mathbf 1}$$\newcommand{\0}{\mathbf 0}$$\newcommand{\a}{\alpha}$$\newcommand{\l}{\lambda}$$\newcommand{\s}{\sigma}$$\newcommand{\e}{\varepsilon}$$\newcommand{\t}{\theta}$In this post I’m going to consider a linear model of the form $y = \mu\one + Z\a + \e$ where $\e \sim \mathcal N(0, \s^2 I_n)$ and $Z \in \mathbb R^{n \times m}$. My goal is to compare the results of estimating $\a$ and $\mu$ by treating this as a mixed model with $\a$ as a random effect versus fitting this model as a ridge regression via penalized least squares. I’m using $Z$ and $\a$ rather than $X$ and $\beta$ to be more consistent with mixed model notation.

Throughout this I will be considering a parameter $\l > 0$ that affects the likely magnitude of $\a$, in addition to the parameter $\s^2$ for the error variance, so for convenience I’ll collect these into $\t := (\l, \s^2)$. I also won’t explicitly indicate that I’m conditioning on $Z$, even though I am doing that throughout. I won’t assume that $Z$ and $y$ have been scaled in any way, so I’m adding a mean parameter $\mu$; in practice for a ridge regression I’d definitely want to make sure that the columns are all on the same scale, but I’m not worrying about that here. I’ll use tildes like $\tilde \t$ for the mixed model point estimates and hats like $\hat \t$ for ridge regression.

I’m going to start with the mixed model.

Mixed model

The standard way to get point estimates in a mixed model is to integrate out the random effects and then maximize the resulting integrated likelihood for $(\tilde \mu, \tilde \t)$. $\tilde \a$ then comes from the posterior mean of $\a \mid y$. I won’t be considering REML here. I go through this in the general case here, but in this post I’m effectively just doing random intercepts so I don’t need that generality.

I’m going to model my random effects as $\a \sim \mathcal N(\0, \s^2\l^{-1}I_m)$ which is like what I’d do for a 1-way random intercepts model, although I’m not assuming that $Z$ corresponds to block indicators. This also makes it clear that an increase in $\l$ will tend to flatten out $\a$ towards $\0$. I’ll use $\pi(\a\mid \t)$ to denote the density of $\a$ since this is like a prior.

My (unintegrated) likelihood is
$$
f(y\mid \a, \mu, \t) = \mathcal N(y\mid \mu\one + Z\a, \s^2 I)
$$
and I want to integrate this over $\a$ so I’m considering
$$
\E_\a\left[f(y\mid \a, \mu, \t)\right] = \int_{\mathbb R^m} f(y\mid \a, \mu, \t)\pi(\a\mid \t)\,\text d\a.
$$
I could go through this and complete the square, but this is the convolution of two Gaussians and I know that’s also Gaussian, so I can just get the mean and variance of $y$ unconditioned on $\a$. This turns out to be $\E(y\mid \mu, \t) = \mu\one$ and $\var(y\mid \mu, \t) = \s^2\l^{-1}(ZZ^T + \l I)$ so all together
$$
y\mid \mu, \t \sim \mathcal N\left(\mu\one, \s^2\l^{-1}\left(ZZ^T + \l I\right)\right).
$$
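As a quick numerical sanity check, here’s a small NumPy sketch (with made-up dimensions and example values for $\l$ and $\s^2$) confirming that this factored form of the covariance matches computing $Z\var(\a)Z^T + \s^2 I$ directly:

```python
import numpy as np

# Arbitrary example dimensions and parameter values (not estimates of anything).
rng = np.random.default_rng(0)
n, m = 50, 10
Z = rng.normal(size=(n, m))
lam, sigma2 = 2.0, 1.0

# Var(y) computed directly: Z Var(alpha) Z^T + sigma^2 I, with Var(alpha) = (sigma^2 / lam) I
direct = (sigma2 / lam) * Z @ Z.T + sigma2 * np.eye(n)

# The factored form used above: (sigma^2 / lam) (Z Z^T + lam I)
factored = (sigma2 / lam) * (Z @ Z.T + lam * np.eye(n))

print(np.allclose(direct, factored))  # True
```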
Up to a constant I then have that the integrated log likelihood is
$$
\begin{aligned}\ell(\mu, \t\mid y) &= -\frac{\l}{2\s^2}(y-\mu\one)^T(ZZ^T + \l I)^{-1}(y-\mu\one) \\&\quad - \frac 12 \log\det (ZZ^T + \l I) + \frac n2 \log \l - \frac n2 \log \s^2\end{aligned}
$$
and I’ll get $(\tilde \mu, \tilde \t)$ by maximizing this.

I can make this easier by profiling $\mu$ out since it turns out that there’s a closed form for $\tilde \mu(\t)$ as a function of $\t$.

$$
\begin{aligned}\frac{\partial \ell}{\partial \mu} &= \frac{\l}{\s^2}\left(\one^T(ZZ^T + \l I)^{-1}y - \mu\one^T(ZZ^T + \l I)^{-1}\one\right) \stackrel{\text{set}}= 0 \\&
\implies \tilde \mu(\theta) = \frac{\one^T(ZZ^T + \l I)^{-1}y}{\one^T(ZZ^T + \l I)^{-1}\one}.\end{aligned}
$$
Just to check that this is indeed a maximum and not a minimum, I have
$$
\frac{\partial^2 \ell}{\partial \mu^2} = -\frac{\l}{\s^2}\one^T(ZZ^T + \l I)^{-1}\one
$$
which is a negative constant times a quadratic form with a strictly PD matrix, so this is necessarily negative. That means $\tilde\mu(\t)$ is indeed the global max for that particular $\t$.

For the interpretation of $\tilde \mu$, I can note that this is exactly in the form of a generalized least squares (GLS) simple linear regression of $y$ on $\one$ with working covariance $ZZ^T + \l I$. That makes sense since effectively I got here by considering the model $y = \mu\one + \e'$ where $\e'\sim \mathcal N(\0, ZZ^T + \l I)$ (the covariance scaling factor of $\s^2\l^{-1}$ doesn’t matter for this). I’ve discussed this on cross validated here, but briefly: I can think of $\tilde \mu$ as the coefficient that I’d get from regressing on $\one$ the part of $y$ not explained by a ridge regression on $Z$.

This will be handy later so I’ll do it now: via the Woodbury matrix identity I can rewrite the matrix inverse appearing in $\tilde \mu$ as
$$
\begin{aligned}(\l I + ZZ^T)^{-1} &= \l^{-1}I - \l^{-2}Z(I + \l^{-1}Z^TZ)^{-1}Z^T \\&
= \l^{-1}\left(I - Z(Z^TZ + \l I)^{-1}Z^T\right) = \l^{-1}(I - S_\l)\end{aligned}
$$
where $S_\l = Z(Z^TZ + \l I)^{-1}Z^T$ is the hat matrix for a ridge regression on $Z$. This means I could instead write $\tilde \mu$ as
$$
\tilde\mu(\t) = \frac{\one^T(I - S_\l)y}{\one^T(I - S_\l)\one}
$$
although I think this makes the GLS connection less interpretable. But it will be useful later.
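Here’s a similar NumPy sketch (again with arbitrary example inputs) confirming numerically that the GLS form and the hat-matrix form of $\tilde \mu$ agree:

```python
import numpy as np

# Arbitrary example inputs.
rng = np.random.default_rng(1)
n, m = 50, 10
Z = rng.normal(size=(n, m))
y = rng.normal(size=n)
one = np.ones(n)
lam = 2.0

# GLS form: 1^T (Z Z^T + lam I)^{-1} y / 1^T (Z Z^T + lam I)^{-1} 1
A = Z @ Z.T + lam * np.eye(n)
mu_gls = (one @ np.linalg.solve(A, y)) / (one @ np.linalg.solve(A, one))

# Woodbury / hat-matrix form: 1^T (I - S_lam) y / 1^T (I - S_lam) 1
S = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T)
mu_hat_matrix_form = (one @ (y - S @ y)) / (one @ (one - S @ one))

print(np.isclose(mu_gls, mu_hat_matrix_form))  # True
```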

Ultimately my goal is to maximize $\ell$ over $(\mu, \t)$ jointly. For any given $\t$ I have found the optimal $\mu$, namely $\tilde \mu(\t)$, so I can plug that in to get the profiled likelihood which is in terms of $\t$ only.
$$
\begin{aligned}\ell_p(\t\mid y) &= \ell(\tilde \mu(\t), \t \mid y) \\&
= -\frac{\l}{2\s^2}(y-\tilde \mu\one)^T(ZZ^T + \l I)^{-1}(y-\tilde \mu\one) - \frac 12 \log\det (ZZ^T + \l I) + \frac n2 \log \l - \frac n2 \log \s^2.\end{aligned}
$$
Letting $A = (ZZ^T + \l I)^{-1}$ just for now, I can plug in $\tilde \mu$ and simplify the quadratic form as follows:
$$
\begin{aligned}(y - \tilde \mu \one)^T A (y - \tilde \mu \one) &= y^TAy - 2 \tilde \mu \one^TA y + \tilde \mu^2\one^T A \one \\&
= y^T A y - 2\one^T A y \frac{\one^TAy}{\one^TA\one} + \left(\frac{\one^TAy}{\one^TA\one}\right)^2\one^T A \one \\&
= y^T A y - \frac{(\one^T A y)^2}{\one^T A \one} = y^T A y - \tilde \mu \one^T A y \\&
= (y - \tilde \mu \one)^T A y\end{aligned}
$$
so all together the profiled log likelihood is
$$
\begin{aligned}\ell_p(\t\mid y) &= -\frac{\l}{2\s^2}(y-\tilde \mu\one)^T(ZZ^T + \l I)^{-1}y \\&\quad - \frac 12 \log\det (ZZ^T + \l I) + \frac n2 \log \l - \frac n2 \log \s^2.\end{aligned}
$$
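In practice I’d maximize this numerically. Here’s a rough sketch of one way to do that with NumPy and SciPy, optimizing over $(\log \l, \log \s^2)$ with Nelder-Mead on some simulated data; the data-generating values below are made up and the resulting estimates will be noisy:

```python
import numpy as np
from scipy.optimize import minimize

# Simulate data from the model with made-up "true" values (just for illustration).
rng = np.random.default_rng(2)
n, m = 80, 15
Z = rng.normal(size=(n, m))
true_mu, true_lam, true_sigma2 = 1.5, 2.0, 1.0
alpha = rng.normal(scale=np.sqrt(true_sigma2 / true_lam), size=m)
y = true_mu + Z @ alpha + rng.normal(scale=np.sqrt(true_sigma2), size=n)
one = np.ones(n)

def neg_profiled_loglik(params):
    # Work on the log scale so that lam and sigma^2 stay positive.
    lam, sigma2 = np.exp(params)
    A = Z @ Z.T + lam * np.eye(n)
    # Profiled-out mean: the GLS form of mu-tilde(theta)
    mu = (one @ np.linalg.solve(A, y)) / (one @ np.linalg.solve(A, one))
    resid = y - mu * one
    quad = resid @ np.linalg.solve(A, resid)  # equals (y - mu 1)^T A^{-1} y at mu-tilde
    _, logdet = np.linalg.slogdet(A)
    ll = (-lam / (2 * sigma2) * quad - 0.5 * logdet
          + 0.5 * n * np.log(lam) - 0.5 * n * np.log(sigma2))
    return -ll

res = minimize(neg_profiled_loglik, x0=np.zeros(2), method="Nelder-Mead")
lam_tilde, sigma2_tilde = np.exp(res.x)
print(lam_tilde, sigma2_tilde)  # noisy estimates, roughly in the neighborhood of 2 and 1
```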

Once I’ve maximized this, so that I now have $(\tilde \mu, \tilde \t)$, I’m ready to get $\tilde \a$. I’ve modeled $\a$ as random, so a very natural way to estimate it is by considering the posterior mean $\E(\a \mid y, \mu, \t)$. To that end, I can actually work out the entire posterior pretty easily by noting that $\a$ is Gaussian and $y\mid \a$ is too, so ${\a \choose y}$ is jointly Gaussian. Thus
$$
\begin{aligned}&{\a \choose y} \mid \mu, \t \sim \mathcal N\left({\0 \choose \mu\one}, \left(\begin{array}{c|c}\var(\a) & \cov(\a, y) \\\hline \cov(y, \a) & \var(y) \end{array}\right)\right) \\&
\stackrel{\text{d}}= \mathcal N\left({\0 \choose \mu\one}, \s^2\l^{-1}\left(\begin{array}{c|c}I_m & Z^T \\\hline Z & ZZ^T + \l I_n \end{array}\right)\right).\end{aligned}
$$
Using the standard result about conditional distributions of a multivariate Gaussian, I then know
$$
\a \mid y, \mu, \t \sim \mathcal N(Z^T(ZZ^T + \l I)^{-1}(y - \mu\one), \s^2 \l^{-1} \left(I - Z^T(ZZ^T + \l I)^{-1}Z\right)).
$$
This means I can estimate $\a$ simply by
$$
\tilde \a := \E(\a\mid y, \tilde \mu, \tilde \t) = Z^T(ZZ^T + \tilde \l I)^{-1}(y - \tilde\mu\one).
$$
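In code this is a single linear solve. Here’s a tiny sketch where $\mu$ and $\l$ are just placeholder values standing in for $\tilde \mu$ and $\tilde \l$:

```python
import numpy as np

# Arbitrary example inputs; mu and lam stand in for the estimates (mu-tilde, lam-tilde).
rng = np.random.default_rng(3)
n, m = 50, 10
Z = rng.normal(size=(n, m))
y = rng.normal(size=n)
one = np.ones(n)
mu, lam = 0.3, 2.0

# alpha-tilde = Z^T (Z Z^T + lam I)^{-1} (y - mu 1)
alpha_tilde = Z.T @ np.linalg.solve(Z @ Z.T + lam * np.eye(n), y - mu * one)
print(alpha_tilde.shape)  # (10,)
```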

Ridge regression

Now I’m going to see what happens if instead I ignore all this stuff with likelihoods and priors and just minimize a penalized least squares criterion. I’m only going to penalize the norm of $\a$ since the mean $\mu$ should be allowed to be anything. Thus I’ll be minimizing
$$
\|y - \mu\one - Z\a\|^2 + \l \|\a\|^2 = \|y - X\gamma\|^2 + \l\gamma^T J \gamma
$$
over $\gamma$, where I’ve collected everything into $\gamma = (\mu, \a)$ and $X = (\one \mid Z)$, and then this becomes a generalized ridge regression with penalty matrix $J := \text{diag}(0,1,\dots,1)$. Taking derivatives leads to
$$
\hat \gamma(\l) = (X^TX + \l J)^{-1}X^Ty.
$$
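This is again just one linear solve. Here’s a sketch with arbitrary example inputs:

```python
import numpy as np

# Arbitrary example inputs.
rng = np.random.default_rng(4)
n, m = 50, 10
Z = rng.normal(size=(n, m))
y = rng.normal(size=n)
lam = 2.0

X = np.column_stack([np.ones(n), Z])      # X = (1 | Z)
J = np.diag(np.r_[0.0, np.ones(m)])       # J = diag(0, 1, ..., 1), so mu is unpenalized
gamma_hat = np.linalg.solve(X.T @ X + lam * J, X.T @ y)
mu_hat, alpha_hat = gamma_hat[0], gamma_hat[1:]
print(mu_hat, alpha_hat.shape)
```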
I’m going to begin by finding what exactly $\hat\mu$ is (a lot of this is also done in the cross validated answer of mine linked to above, but knowing that the first column is $\one$ makes things a bit simpler here).

I can write $X^TX + \l J$ out as a $2\times 2$ block matrix
$$
\begin{aligned}X^TX + \l J &= (\one \mid Z)^T(\one \mid Z) + \l J \\&
= \left(\begin{array}{c|c}n & \one^TZ \\\hline Z^T\one & Z^T Z + \l I\end{array}\right)\end{aligned}
$$
which means
$$
\begin{aligned}(X^TX + \l J)^{-1} &= \left(\begin{array}{c|c}n & \one^TZ \\\hline Z^T\one & Z^T Z + \l I\end{array}\right)^{-1} \\&
= \left(\begin{array}{c|c}(n - \one^TS_\l\one)^{-1} & -(n - \one^TS_\l\one)^{-1}\one^TZ(Z^TZ + \l I)^{-1}\\ \hline C & D\end{array}\right)\end{aligned}
$$
(see here, for example). I’m only doing this for $\hat\mu$ so I don’t care about $C$ and $D$.

I next have $X^Ty = (\one \mid Z)^Ty = {\one^Ty \choose Z^Ty}$ so all together
$$
\begin{aligned}\hat\mu(\l) &= (n - \one^TS_\l\one)^{-1}\one^Ty - (n - \one^TS_\l\one)^{-1}\one^TS_\l y \\&
= \frac{\one^Ty - \one^TS_\l y}{n - \one^TS_\l \one}\\&
= \frac{\one^T(I - S_\l)y}{\one^T(I - S_\l)\one}\end{aligned}
$$
which is exactly $\tilde \mu(\t)$!
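Here’s a quick numerical confirmation (arbitrary example inputs, same $\l$ for both routes) that the two intercepts really do match:

```python
import numpy as np

# Arbitrary example inputs, same lam for both routes.
rng = np.random.default_rng(5)
n, m = 50, 10
Z = rng.normal(size=(n, m))
y = rng.normal(size=n)
one = np.ones(n)
lam = 2.0

# Ridge route: first component of (X^T X + lam J)^{-1} X^T y
X = np.column_stack([one, Z])
J = np.diag(np.r_[0.0, np.ones(m)])
mu_hat = np.linalg.solve(X.T @ X + lam * J, X.T @ y)[0]

# Mixed model route: 1^T (Z Z^T + lam I)^{-1} y / 1^T (Z Z^T + lam I)^{-1} 1
A = Z @ Z.T + lam * np.eye(n)
mu_tilde = (one @ np.linalg.solve(A, y)) / (one @ np.linalg.solve(A, one))

print(np.isclose(mu_hat, mu_tilde))  # True
```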

For $\a$, if I let my objective function be
$$
g(\gamma) = \|y - X\gamma\|^2 + \l \gamma^T J \gamma
$$
I can find the Hessian to be
$$
\frac{\partial^2 g}{\partial \gamma^2} = 2(X^TX + \l J).
$$
Suppose $v^T (X^TX + \l J)v = 0$ for some $v \neq \0$. Since $v^TX^TXv$ and $\l v^TJv$ are both nonnegative, each must be zero. Now $v^T J v = 0$ forces every entry of $v$ except the first to be zero, i.e. $v \propto e_1$ (the first standard basis vector), and WLOG I can just take $v = e_1$. But then $v^T X^TXv = \|Xe_1\|^2 = \|\one\|^2 = n > 0$, which is a contradiction, so $X^TX + \l J$ is strictly PD and this optimization is therefore convex (this also proves that $X^TX + \l J$ is always invertible in my situation, which is why I haven’t made any comments about the existence of the inverse so far).
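A quick numerical illustration of this (for one arbitrary example $Z$ and $\l$, so it illustrates rather than proves the claim):

```python
import numpy as np

# One arbitrary example Z and lam; an illustration, not a proof.
rng = np.random.default_rng(6)
n, m = 50, 10
Z = rng.normal(size=(n, m))
lam = 2.0

X = np.column_stack([np.ones(n), Z])
J = np.diag(np.r_[0.0, np.ones(m)])
H = X.T @ X + lam * J
print(np.linalg.eigvalsh(H).min() > 0)  # True: all eigenvalues are strictly positive
```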

Convexity is great here because it lets me minimize over $\a$ with $\mu$ held fixed and then plug in the global $\hat\mu$ that I already found. Doing so leads me to
$$
\hat \a(\mu, \l) = \underset{\a}{\text{argmin }} \|(y - \mu\one) - Z\a\|^2 + \l \a^T\a
$$
which is just a standard ridge regression of $y – \mu\one$ onto $Z$ so
$$
\hat \a(\mu, \l) = (Z^TZ + \l I)^{-1}Z^T(y - \mu\one).
$$
This looks different from $\tilde \a(\mu, \t)$ but they’re actually equal: I found
$$
\tilde \a(\mu, \t) = Z^T(ZZ^T + \l I)^{-1}(y - \mu\one)
$$
so I’ll consider the difference of the linear operators being applied to $y - \mu\one$. This is
$$
\begin{aligned}&(Z^TZ + \l I)^{-1}Z^T - Z^T(ZZ^T + \l I)^{-1} \\&
= (Z^TZ + \l I)^{-1}\left[Z^T(ZZ^T + \l I) - (Z^TZ + \l I)Z^T\right](ZZ^T + \l I)^{-1} \\&
= (Z^TZ + \l I)^{-1}\left[Z^TZZ^T + \l Z^T - Z^TZZ^T - \l Z^T\right](ZZ^T + \l I)^{-1} = \0\end{aligned}
$$
so if I’m using the same $\mu$ and $\t$ for $\tilde \a$ and $\hat \a$ then they agree!
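This push-through identity is also easy to confirm numerically; here’s a sketch with arbitrary example inputs:

```python
import numpy as np

# Arbitrary example inputs.
rng = np.random.default_rng(7)
n, m = 50, 10
Z = rng.normal(size=(n, m))
lam = 2.0

left = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T)   # (Z^T Z + lam I)^{-1} Z^T
right = Z.T @ np.linalg.inv(Z @ Z.T + lam * np.eye(n))   # Z^T (Z Z^T + lam I)^{-1}
print(np.allclose(left, right))  # True
```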

Conclusion

I’ve found that if the same value of $\l$ is used then $\tilde \mu = \hat \mu$ and in turn $\tilde \a = \hat \a$; in other words, aside from hyperparameter estimation, a ridge regression exactly agrees with a mixed model when the random effect covariance structure is $\var(\a) = \s^2\l^{-1} I$. Ultimately this isn’t actually that surprising since the Bayesian take on ridge regression is well known, namely that a Gaussian prior on the coefficient vector leads to the maximum a posteriori estimate being the ridge estimate. And since I’m working in the all-Gaussian case the posterior is Gaussian too, so the MAP and posterior mean coincide, and those are exactly $\hat \a$ and $\tilde \a$ respectively.
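Putting everything together, here’s a final sketch (arbitrary example inputs, one fixed $\l$ shared by both routes) showing that the mixed model formulas and the penalized least squares solve return identical $\mu$ and $\a$:

```python
import numpy as np

# Arbitrary example inputs; the same fixed lam is used for both routes.
rng = np.random.default_rng(8)
n, m = 60, 12
Z = rng.normal(size=(n, m))
y = rng.normal(size=n)
one = np.ones(n)
lam = 2.0

# Mixed model route: GLS intercept, then posterior mean of alpha
A = Z @ Z.T + lam * np.eye(n)
mu_tilde = (one @ np.linalg.solve(A, y)) / (one @ np.linalg.solve(A, one))
alpha_tilde = Z.T @ np.linalg.solve(A, y - mu_tilde * one)

# Ridge route: penalized least squares with J = diag(0, 1, ..., 1)
X = np.column_stack([one, Z])
J = np.diag(np.r_[0.0, np.ones(m)])
gamma_hat = np.linalg.solve(X.T @ X + lam * J, X.T @ y)
mu_hat, alpha_hat = gamma_hat[0], gamma_hat[1:]

print(np.isclose(mu_tilde, mu_hat), np.allclose(alpha_tilde, alpha_hat))  # True True
```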

In practice for a ridge regression $\hat \l$ would likely be estimated via cross-validation rather than maximum marginal likelihood but typically the two aren’t that different.
