Mixed models from scratch: MLEs and BLUPs

$\newcommand{\e}{\varepsilon}$$\newcommand{\Var}{\operatorname{Var}}$$\newcommand{\x}{\mathbf x}$$\newcommand{\one}{\mathbf 1}$$\newcommand{\0}{\mathbf 0}$In this post I will derive the point estimates for mixed models. I’ll focus on interpretation, so I won’t be using things like Cholesky decompositions. Part two will discuss REML and an example with random slopes.

The model is
$$
y = X\beta + Z\gamma + \e
$$
with $X \in \mathbb R^{n\times p}$ and $Z\in\mathbb R^{n\times m}$ both being full rank and $\gamma \sim \mathcal N\left(0, \sigma^2 \Omega^{-1}\right)$ and $\e \sim \mathcal N(0, \sigma^2 I)$ with $\gamma \perp \e$. My parameterization of $\Var(\gamma)$ will make things cleaner later on.
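
To fix notation before going further, here’s a small numpy simulation from this model. The dimensions, $\Omega$, $\beta$, and $\sigma^2$ below are arbitrary choices just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 50, 2, 5          # arbitrary sizes
sigma2 = 1.5                # arbitrary error variance

X = rng.normal(size=(n, p))                 # fixed effects design
Z = rng.normal(size=(n, m))                 # random effects design
A = rng.normal(size=(m, m))
Omega = A @ A.T + m * np.eye(m)             # any positive definite Omega works

beta = np.array([1.0, -2.0])
gamma = rng.multivariate_normal(np.zeros(m), sigma2 * np.linalg.inv(Omega))
eps = rng.normal(scale=np.sqrt(sigma2), size=n)

y = X @ beta + Z @ gamma + eps              # y = X beta + Z gamma + eps
```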

The general procedure for maximum likelihood estimation in mixed models is as follows:

1. parameterize $\Omega^{-1}$ with a real-valued vector $\theta$
2. integrate out $\gamma$ to get the marginal likelihood; optionally integrate $\beta$ out too if REML is desired
3. obtain the profiled likelihood by estimating the non-marginalized parameters in terms of $\theta$
4. estimate $\theta$ by numerically optimizing the profiled likelihood
5. predict $\gamma$ via the BLUP

Marginal likelihood

By assumption $\gamma \perp \e$, so $Z\gamma$ and $\e$ are independent multivariate Gaussians and their sum is Gaussian too (this is a consequence of the fact that convolutions of Gaussians are still Gaussian; see here for instance).

This means marginally
$$
y \sim \mathcal N(X\beta, \sigma^2\left(Z\Omega^{-1}Z^T + I\right)).
$$
Applying the Woodbury matrix identity to $\Var(y)$ I have
$$
(I + Z\Omega^{-1}Z^T)^{-1} = I - Z(Z^TZ + \Omega)^{-1}Z^T,
$$
which is exactly the matrix for getting the residuals from a generalized ridge regression of $y$ on $Z$ with penalty matrix $\Omega$. I’ll let $S_\theta = Z(Z^TZ + \Omega)^{-1}Z^T$ be the hat matrix for this ridge regression so
$$
y \sim \mathcal N(X\beta, \sigma^2 (I - S_\theta)^{-1}).
$$
This is a pretty tidy distribution because it shows that marginally $y$ still has a mean given by $X\beta$, the fixed effects, but the variance depends on how well $Z$ explains $y$ and the magnitude of $\Omega$.
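
Since it’s easy to fool yourself with matrix identities, here’s a quick numerical check of the Woodbury step and the $(I-S_\theta)^{-1}$ form of the marginal covariance, with an arbitrary full-rank $Z$ and positive definite $\Omega$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 5
Z = rng.normal(size=(n, m))
A = rng.normal(size=(m, m))
Omega = A @ A.T + m * np.eye(m)

S = Z @ np.linalg.solve(Z.T @ Z + Omega, Z.T)        # ridge hat matrix S_theta
V = Z @ np.linalg.solve(Omega, Z.T) + np.eye(n)      # Z Omega^{-1} Z^T + I

# Woodbury: (I + Z Omega^{-1} Z^T)^{-1} = I - S_theta
print(np.allclose(np.linalg.inv(V), np.eye(n) - S))  # True

# Equivalently, the marginal covariance of y is proportional to (I - S_theta)^{-1}
print(np.allclose(V, np.linalg.inv(np.eye(n) - S)))  # True
```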

I’ll digress to study $S_\theta$ a bit.

Result 1: All eigenvalues of $S_\theta$ are in $[0,1)$ and the dimension of the eigenspace of $0$ is $n-m$.

Pf: Let $Z = UDV^T$ be the thin SVD of $Z$, so $U \in \mathbb R^{n\times m}$ has orthonormal columns, $D$ is diagonal, and $V \in \mathbb R^{m\times m}$ is orthogonal. $Z$ is full rank so all singular values are positive. This means
$$
\begin{aligned}S_\theta &= Z(Z^TZ + \Omega)^{-1}Z^T \\&
= UDV^T(VD^2V^T + \Omega)^{-1}VDU^T \\&
= UDV^T\left[VD\left(I + D^{-1}V^T\Omega VD^{-1}\right)DV^T\right]^{-1}VDU^T \\&
= U(I + D^{-1}V^T\Omega VD^{-1})^{-1}U^T.\end{aligned}
$$
Letting $v \neq \mathbf 0$ I have
$$
v^TD^{-1}V^T\Omega V D^{-1}v = u^T\Omega u > 0
$$
for $u = VD^{-1}v$, where the inequality holds because $\Omega$ is PD. This shows that $D^{-1}V^T\Omega V D^{-1}$ is PD, which means that all eigenvalues of $I + D^{-1}V^T\Omega VD^{-1}$ are strictly greater than $1$. Since that matrix is real and symmetric I can apply the spectral theorem to conclude that the eigenvalues of $(I + D^{-1}V^T\Omega VD^{-1})^{-1}$ are all in $(0,1)$.

Now considering $S_\theta$, $U$ is an $n\times m$ matrix with orthonormal columns so I can find (via Gram-Schmidt, for example) an $n\times (n-m)$ matrix $U_\perp$ whose columns give an orthonormal basis for the orthogonal complement of the column space of $U$ (equivalently, the null space of $U^T$, which has dimension $n-m$ by the rank-nullity theorem). Let $v \in \mathbb R^n$. The columns of $(U\mid U_\perp)$ form an orthonormal basis for $\mathbb R^n$ so $v$ can be represented as $v = Ua + U_\perp b$. This means
$$
v^TS_\theta v = (Ua + U_\perp b)^TS_\theta(Ua + U_\perp b) \\
= a^TU^TS_\theta Ua + b^TU_\perp^TS_\theta U_\perp b + 2 b^TU_\perp^TS_\theta Ua.$$

$S_\theta U_\perp$ and $U_\perp^T S_\theta$ are both the zero matrix since the columns of $U$ and $U_\perp$ are orthogonal to each other, so
$$
v^TS_\theta v = a^TU^TS_\theta Ua \\
= a^T (I + D^{-1}V^T\Omega VD^{-1})^{-1} a.
$$
So long as $a \neq \mathbf 0$ this will be positive since it’s a quadratic form with a positive definite matrix. Any vector $v$ with $a = \mathbf 0$, meaning it is wholly in the span of $U_\perp$, will lead to $v^T S_\theta v = 0$. This shows that the dimension of the eigenspace of $0$ for $S_\theta$ is $n-m$ and $U_\perp$ gives an orthonormal basis for this; if I wanted explicit eigenvectors of $0$ I could just take the columns of $U_\perp$. I also don’t have to worry about defective matrices since $S_\theta$ is symmetric.

Now for the non-zero eigenvalues, let $v = Ua$ and suppose $v$ is a unit vector. This means
$$
1 = v^Tv = a^TU^TUa = a^Ta
$$
so $a$ is also a unit vector. Therefore
$$
v^TS_\theta v = a^T (I + D^{-1}V^T\Omega VD^{-1})^{-1} a \\
\leq \lambda_\max((I + D^{-1}V^T\Omega VD^{-1})^{-1}) < 1
$$
by Courant-Fischer. This finishes my result since it means every eigenvalue of $S_\theta$ is in $[0,1)$.

$\square$

 

This shows how $S_\theta$ is a shrinkage operator. The usual hat matrix in linear regression has eigenvalues that are all either $0$ or $1$: for a target $y$, the projection onto the column space of the predictors leaves the parts of $y$ in the column space unchanged and eliminates the parts of $y$ perpendicular to the column space (this also gives intuition for why idempotence is so important for a matrix being a projection). With a ridge regression (generalized or not) the hat matrix is no longer a projection.

If I let $Q\Lambda Q^T$ be the eigendecomposition of $S_\theta$, with the eigenvalues ordered so that the $m$ nonzero ones come first, then
$$
S_\theta y = Q\Lambda Q^Ty = \sum_{i=1}^n \lambda_i \langle y, q_i\rangle q_i \\
= \sum_{i=1}^m \lambda_i \langle y, q_i\rangle q_i
$$
where $q_i$ is the $i$th column of $Q$, and the last equality is because $\lambda_i = 0$ for $i > m$. $Q^Ty$ puts $y$ into the eigenbasis so $\langle y, q_i\rangle$ is the $i$th coordinate of $y$ in the eigenbasis. This means that $S_\theta$ puts $y$ into the eigenbasis, chops off all parts not in the column space of $Z$, shrinks the remaining coordinates according to $\lambda_i$, and then puts the result back into the original basis. Since $\lambda_\max(S_\theta) < 1$ no direction is left without at least a little shrinkage.
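
Here’s a quick numerical check of Result 1 and of this eigenbasis picture, again with arbitrary $Z$ and $\Omega$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 5
Z = rng.normal(size=(n, m))
A = rng.normal(size=(m, m))
Omega = A @ A.T + m * np.eye(m)

S = Z @ np.linalg.solve(Z.T @ Z + Omega, Z.T)
lam, Q = np.linalg.eigh(S)                      # S is symmetric, so eigh applies

print(lam.min() >= -1e-10, lam.max() < 1)       # all eigenvalues in [0, 1)
print(np.sum(np.isclose(lam, 0, atol=1e-8)))    # n - m = 45 (numerically) zero eigenvalues

# S_theta y rebuilt coordinate-by-coordinate in the eigenbasis:
y = rng.normal(size=n)
recon = Q @ (lam * (Q.T @ y))                   # sum_i lambda_i <y, q_i> q_i
print(np.allclose(recon, S @ y))                # True
```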

I showed that the marginal variance of $y$ is
$$
\Var(y) \propto (I-S_\theta)^{-1}
$$
which means fundamentally the marginal density of $y$ is determined by the quadratic form
$$
y \mapsto (y-X\beta)^T(I-S_\theta)(y-X\beta).
$$
I’ll let $r = y-X\beta$, the variation around the fixed effects mean. I can make the density large by having $r$ small, so $y\approx X\beta$, but more interesting is the role of $I-S_\theta$. The quadratic form $r^T(I-S_\theta)r$ will be small when $r$ is close to being a bottom eigenvector of $I-S_\theta$, or equivalently a top eigenvector of $S_\theta$. That is the direction whose $S_\theta$ eigenvalue is closest to $1$, so
$$
S_\theta r \approx \lambda_\max(S_\theta)r \approx r.
$$
All together, this means $y$ has a large marginal density when it is either close to the fixed effects mean $X\beta$, or when $y-X\beta$, the part unexplained by the fixed effects, is close to lying in $\text{ColSpace}(Z)$. The fact that $Z$ enters only through its column space reflects the integration over all possible values of $\gamma$.
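
As a quick sanity check on this picture, I can compare the quadratic form for a residual that lies in $\text{ColSpace}(Z)$ against one of the same length that is orthogonal to it (arbitrary $Z$ and $\Omega$ again).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 5
Z = rng.normal(size=(n, m))
A = rng.normal(size=(m, m))
Omega = A @ A.T + m * np.eye(m)
S = Z @ np.linalg.solve(Z.T @ Z + Omega, Z.T)

Qz, _ = np.linalg.qr(Z)                          # orthonormal basis of ColSpace(Z)
r_in = Qz @ rng.normal(size=m)                   # residual inside ColSpace(Z)
r_out = rng.normal(size=n)
r_out -= Qz @ (Qz.T @ r_out)                     # residual orthogonal to ColSpace(Z)
r_in *= np.linalg.norm(r_out) / np.linalg.norm(r_in)   # same length for a fair comparison

def quad(r):
    return r @ (r - S @ r)                       # r^T (I - S_theta) r

print(quad(r_in), quad(r_out))                   # the first is noticeably smaller
```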

Estimating $\beta$ and $\sigma^2$ in terms of $\Omega$

Returning to the estimation problem at hand, I’ll first estimate $\beta$ and $\sigma^2$, after which I’ll estimate $\theta$ (and thereby $\Omega$). The marginal log likelihood is
$$
\ell(\beta,\sigma^2,\theta\mid y)=-\frac 1{2\sigma^2}(y - X\beta)^T(I-S_\theta)(y-X\beta) \\
+ \frac 12 \log\det(I - S_\theta) - \frac n2 \log\sigma^2
$$
up to some constant. Beginning with $\beta$, the only $\beta$-dependent term is the negative of a convex quadratic in $\beta$ (its Hessian is $2X^T(I-S_\theta)X$, which is positive definite), so $\ell$ is concave in $\beta$ and I just need to find the single root of the derivative to get $\hat\beta$ (in terms of $\theta$). Therefore
$$
\begin{aligned}\frac{\partial \ell}{\partial \beta} &= -\frac 1{2\sigma^2}\left(2X^T(I-S_\theta)X\beta - 2X^T(I-S_\theta)y\right) \stackrel{\text{set}}= 0 \\&
\implies \hat\beta = (X^T(I-S_\theta)X)^{-1}X^T(I-S_\theta)y\end{aligned}
$$
which, when $\theta$ is treated as known, is a generalized least squares regression of $y$ on $X$ with weight matrix $I-S_\theta$. This shows how the interplay between the column spaces of $Z$ (as represented by $S_\theta$) and $X$ matters for what happens with $\hat\beta$. If $X$ is basically in the column space of $Z$ and the shrinkage is light then
$$
X^T(I - S_\theta)X = X^TX - X^TS_\theta X \approx \mathbf O
$$
so the weight matrix nearly annihilates $X$ and $\hat\beta$ is only weakly identified: the fixed effects have to be estimated from whatever part of $X$ is not absorbed by the column space of $Z$. The larger the shrinkage is the closer $\hat\beta$ will be to an unweighted regression of $y$ on $X$ since $S_\theta \to \mathbf O$.
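
Here’s what computing $\hat\beta$ for a fixed $\theta$ might look like; the $\Omega$ below is simply assumed known rather than estimated, and the simulated data is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 200, 2, 5
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, m))
Omega = 2.0 * np.eye(m)                           # pretend this came from a fixed theta
beta_true = np.array([1.0, -2.0])
gamma = rng.multivariate_normal(np.zeros(m), np.linalg.inv(Omega))
y = X @ beta_true + Z @ gamma + rng.normal(size=n)

S = Z @ np.linalg.solve(Z.T @ Z + Omega, Z.T)
W = np.eye(n) - S                                 # GLS weight matrix I - S_theta
beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_hat)                                   # should land reasonably near beta_true
```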

For $\sigma^2$ I’ll again differentiate so
$$
\begin{aligned}\frac{\partial \ell}{\partial \sigma^2} &= \frac{1}{2\sigma^4}(y-X\beta)^T(I - S_\theta)(y-X\beta) - \frac n{2\sigma^2} \stackrel{\text{set}}= 0 \\&
\implies \hat\sigma^2 = \frac 1n (y-X\hat\beta)^T(I - S_\theta)(y-X\hat\beta).\end{aligned}
$$
I’m able to put $\hat\beta$ in there because $\hat\beta$ maximizes $\ell$ over $\beta$ no matter what $\sigma^2$ is (given $\theta$, as with everything else here). Letting $S_\theta = Q\Lambda Q^T$ as before, and writing $r = y-X\hat\beta$ for the fixed-effect residuals, I have
$$
\hat\sigma^2 \propto r^TQ(I-\Lambda)Q^Tr = \sum_{i=1}^n (1 - \lambda_i) \langle r, q_i\rangle^2
$$
so this will be large when the fixed-effect residuals are mostly in the span of the eigenvectors of $S_\theta$ with small eigenvalues. In other words, I’ll have a large estimated error variance when $y-X\hat\beta$ is both large itself and poorly explained by $Z$. This makes sense as it’s basically saying the error variance is large when $y$ is not explained well by either $X$ or $Z$.
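
And a quick check that the eigendecomposition form of $\hat\sigma^2$ matches the closed form, with the same kind of made-up setup as before.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 200, 2, 5
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, m))
Omega = 2.0 * np.eye(m)
y = X @ np.array([1.0, -2.0]) + Z @ rng.normal(scale=0.7, size=m) + rng.normal(size=n)

S = Z @ np.linalg.solve(Z.T @ Z + Omega, Z.T)
W = np.eye(n) - S
beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
r = y - X @ beta_hat                              # fixed-effect residuals

sigma2_hat = r @ W @ r / n                        # closed form
lam, Q = np.linalg.eigh(S)
sigma2_eig = np.sum((1 - lam) * (Q.T @ r) ** 2) / n   # eigenbasis form
print(sigma2_hat, sigma2_eig)                     # identical up to rounding
```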

Now that I have MLEs in terms of $\theta$ I can get the profiled likelihood $\ell_p$. The idea is that since I’m trying to jointly maximize $\ell$ over $(\beta,\sigma^2,\theta)$ but for any fixed $\theta$ I have the maximizers over $(\beta,\sigma^2)$ analytically, I can obtain the global max by looking at
$$
\begin{aligned}\ell_p(\theta\mid y) &:= \ell(\hat\beta_\theta, \hat\sigma^2_\theta, \theta\mid y) \\&
= -\frac 1{2\hat\sigma^2}(y - X\hat\beta)^T(I-S_\theta)(y-X\hat\beta) \\&
+ \frac 12 \log\det(I - S_\theta) - \frac n2 \log\hat\sigma^2.\end{aligned}
$$
Noting that
$$
\frac 1{\hat\sigma^2}(y - X\hat\beta)^T(I-S_\theta)(y-X\hat\beta) = n
$$
this simplifies to
$$
\ell_p(\theta\mid y) = \frac 12 \log\det(I - S_\theta) - \frac n2 \log\hat\sigma^2
$$
up to some constant.

Maximizing $\ell_p$ over $\theta$ will generally have to be done numerically, after which I have point estimates $(\hat\beta, \hat\sigma^2, \hat\Omega)$ of all my parameters. All that remains is to handle $\gamma$.
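
Before doing that, here’s a sketch of what the numerical step could look like. To keep the optimization one-dimensional I’m assuming $\Omega = e^\theta I$, which is just one possible parameterization, and the simulated data is again made up.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
n, p, m = 200, 2, 8
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, m))
y = X @ np.array([1.0, -2.0]) + Z @ rng.normal(scale=0.5, size=m) + rng.normal(size=n)

def profiled_nll(theta):
    """Negative profiled log likelihood, up to a constant, for Omega = exp(theta) * I."""
    Omega = np.exp(theta) * np.eye(m)
    S = Z @ np.linalg.solve(Z.T @ Z + Omega, Z.T)
    W = np.eye(n) - S
    beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    r = y - X @ beta_hat
    sigma2_hat = r @ W @ r / n
    sign, logdet = np.linalg.slogdet(W)           # log det(I - S_theta); W is PD
    return -(0.5 * logdet - 0.5 * n * np.log(sigma2_hat))

res = minimize_scalar(profiled_nll, bounds=(-10, 10), method="bounded")
theta_hat = res.x
print(theta_hat, np.exp(theta_hat))               # estimated theta and the implied Omega scale
```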

BLUPs

I modeled $\gamma$ as a random variable rather than a parameter, so it will be predicted rather than estimated. I will predict it with $\operatorname{E}(\gamma\mid y)$ since that’s the minimum-MSE prediction. As I’ll see, this turns out to be a linear function of $y$, which is why it’s called the BLUP (best linear unbiased predictor).

I know
$$
y \mid \gamma \sim \mathcal N(X\beta + Z\gamma, \sigma^2 I)
$$
and
$$
\gamma \sim \mathcal N(\mathbf 0, \sigma^2 \Omega^{-1})
$$
so the distribution of $\gamma\mid y$ is given by Bayes rule as
$$
p(\gamma\mid y) \propto p(y\mid \gamma)p(\gamma)\\
\propto\exp\left(-\frac 1{2\sigma^2}\left[\|y – X\beta – Z\gamma\|^2 + \gamma^T\Omega\gamma\right]\right).
$$
Expanding the inner term, the part that depends on $\gamma$ is $\gamma^T(Z^TZ + \Omega)\gamma - 2\gamma^TZ^T(y - X\beta)$, and completing the square in $\gamma$ leads me to
$$
\gamma\mid y \sim \mathcal N((Z^TZ + \Omega)^{-1}Z^T(y-X\beta), \sigma^2 (Z^TZ + \Omega)^{-1})
$$
which means
$$
\operatorname{E}(\gamma\mid y) = (Z^TZ + \Omega)^{-1}Z^T(y-X\beta)
$$
so
$$
\hat\gamma = (Z^TZ + \hat\Omega)^{-1}Z^T(y-X\hat\beta).
$$
This is a very interpretable form as it shows that the BLUPs are just a generalized ridge regression of the fixed-effect residuals $y-X\beta$ on the random effects design matrix $Z$.
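
To make this concrete, here’s a sketch of computing the BLUPs this way, with an $\hat\Omega$ that is simply assumed rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 200, 2, 8
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, m))
gamma_true = rng.normal(scale=0.5, size=m)
y = X @ np.array([1.0, -2.0]) + Z @ gamma_true + rng.normal(size=n)

Omega_hat = 4.0 * np.eye(m)                       # pretend estimate of Omega
S = Z @ np.linalg.solve(Z.T @ Z + Omega_hat, Z.T)
W = np.eye(n) - S
beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# ridge regression of the fixed-effect residuals on Z with penalty matrix Omega_hat
gamma_hat = np.linalg.solve(Z.T @ Z + Omega_hat, Z.T @ (y - X @ beta_hat))
print(np.c_[gamma_true, gamma_hat])               # side-by-side true vs predicted random effects
```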

Often this is put in a different form as
$$
\tilde\gamma =\hat \Omega^{-1} Z^T \left( Z\hat\Omega^{-1}Z^T + I \right)^{-1}(y - X\hat\beta).
$$
I claim these are equal.

Pf: I’ll show this by proving that
$$
\Omega^{-1}Z^T(Z\Omega^{-1}Z^T + I)^{-1} = (Z^TZ + \Omega)^{-1}Z^T.
$$
I already know via Woodbury that $\left( Z\Omega^{-1}Z^T + I \right)^{-1} = I - Z(Z^TZ + \Omega)^{-1}Z^T$ so
$$
\begin{aligned}&\Omega^{-1}Z^T(Z\Omega^{-1}Z^T + I)^{-1} - (Z^TZ + \Omega)^{-1}Z^T \\&
= \Omega^{-1}Z^T - \Omega^{-1}Z^TZ(Z^TZ + \Omega)^{-1}Z^T - (Z^TZ + \Omega)^{-1}Z^T \\&
= \Omega^{-1}\left[Z^TZ + \Omega - Z^TZ -\Omega\right](Z^TZ + \Omega)^{-1}Z^T \\&
= \mathbf O\end{aligned}
$$
so $\hat\gamma = \tilde \gamma$.

$\square$
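
And a quick numerical confirmation that the two forms agree, for an arbitrary $Z$, $\Omega$, and residual vector.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 60, 6
Z = rng.normal(size=(n, m))
A = rng.normal(size=(m, m))
Omega = A @ A.T + m * np.eye(m)
resid = rng.normal(size=n)                        # stands in for y - X beta-hat

ridge_form = np.linalg.solve(Z.T @ Z + Omega, Z.T @ resid)
V = Z @ np.linalg.solve(Omega, Z.T) + np.eye(n)
standard_form = np.linalg.solve(Omega, Z.T) @ np.linalg.solve(V, resid)
print(np.allclose(ridge_form, standard_form))     # True
```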

In my next post on this subject I derive REML and go through an example with a random slopes model.
