Gaussian distributions with low rank covariance matrices

$\newcommand{\one}{\mathbf 1}\newcommand{\convd}{\stackrel{\text d}\to}\newcommand{\convp}{\stackrel{\text p}\to}\newcommand{\p}{\mathbf p}\newcommand{\0}{\mathbf 0}\newcommand{\X}{\mathbf X}\newcommand{\Mult}{\text{Mult}}\newcommand{\E}{\operatorname{E}}\newcommand{\e}{\varepsilon}\newcommand{\Var}{\operatorname{Var}}\newcommand{\R}{\mathbb R}\newcommand{\rank}{\operatorname{rank}}\newcommand{\H}{\mathcal H}$Let $X$ be a random vector in $\mathbb R^n$ with distribution $P_X$ and let $\lambda$ be the Lebesgue measure on $\mathbb R^n$ (I’ll write $\lambda^n$ if I need to specify the dimension). I’ll assume the mean and variance of each element are well-defined and finite so I’ll have $\E(X) := \mu\in\R^n$ and $\Var(X) := \Sigma \in \R^{n\times n}$. In this post I’ll look at the behavior of $X$ when it has a low rank covariance matrix (i.e. $\Sigma$ is positive semidefinite rather than strictly positive definite).

I’ll first discuss Radon-Nikodym derivatives (RNDs) a bit because much of this section hinges on the fact that $P_X$ may or may not have one w.r.t. $\lambda^n$. For two measures $\nu$ and $\rho$ on a measurable space $(\Omega, \mathscr F)$, I’ll use $\nu \ll \rho$ to denote that $\nu$ is absolutely continuous w.r.t. $\rho$. This means $\rho(A) = 0 \implies \nu(A) = 0$ for all $A\in\mathscr F$, and when $\rho$ is additionally $\sigma$-finite then the Radon-Nikodym theorem gives me the existence of an a.e.-$\rho$ (i.e. almost everywhere w.r.t. $\rho$) unique function $f$ such that
$$
\nu(A) = \int_A f \,\text d\rho
$$
for all $A\in\mathscr F$. Intuitively this means that if the support of $\nu$ is contained within the support of $\rho$ and $\rho$ is $\sigma$-finite then I can reweight $\rho$ via $f$ to recover $\nu$.

Result 1: if $\Sigma$ is low rank then $P_X \not\ll \lambda^n$.

Pf: let $\rank\Sigma = r < n$. $\Sigma$ is real-valued and symmetric so I can diagonalize it as $$ \Sigma = Q\Lambda Q^T $$ where $Q$ is the orthogonal matrix whose columns are the unit eigenvectors and $\Lambda$ is diagonal with the eigenvalues in decreasing order. Because $\Sigma$ is positive semidefinite with $\rank \Sigma = r$ I know $\lambda_1,\dots,\lambda_r > 0$ and $\lambda_{r+1} = \dots = \lambda_n = 0$. This means that I can write $\Sigma$ in low rank form as
$$
\Sigma = Q_r\Lambda_r Q_r^T
$$
where $Q_r\in\mathbb R^{n\times r}$ contains the first $r$ eigenvectors (so only the eigenvectors of non-zero eigenvalues) and $\Lambda_r=\text{diag}(\lambda_1,\dots,\lambda_r)$ is invertible.

Consider the change of basis $Q^TX$. I have
$$
\Var(Q^TX) = Q^T\Sigma Q = \Lambda
$$
which means $Q^TX$ has no variance for coordinates $r+1$ through $n$ and therefore
$$
(Q^TX)_j \stackrel{\text{a.s.}}= (Q^T\mu)_j, \hspace{4mm} j=r+1,\dots,n
$$
This means $X$ is supported on an affine $r$-dimensional subspace, but since $r<n$ this set has $\lambda^n$-measure zero while having $P_X$-measure one, which directly violates absolute continuity, so $P_X\not\ll\lambda^n$.

$\square$
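
Here’s a quick numerical sketch of what’s going on (not part of the proof; the particular $\Sigma$, $\mu$, and dimensions are made up): in the eigenbasis of a rank-deficient covariance matrix, the sampled coordinates along the zero-eigenvalue directions don’t move at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a rank-2 covariance matrix in R^3: one eigenvalue is exactly zero.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a random orthogonal matrix
lam = np.array([2.0, 0.5, 0.0])
Sigma = Q @ np.diag(lam) @ Q.T
mu = np.array([1.0, -1.0, 0.5])

# numpy's sampler accepts positive semidefinite (singular) covariance matrices.
X = rng.multivariate_normal(mu, Sigma, size=10_000)

# In the eigenbasis, the coordinate with zero eigenvalue should not vary at all.
rotated = X @ Q                 # each row is (Q^T x)^T
print(rotated[:, 2].std())      # ~ 0 up to floating point noise
print((Q.T @ mu)[2])            # the constant value (Q^T mu)_3 that it sticks to
```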

I’ll now assume $X\sim\mathcal N(\mu,\Sigma)$ and I’ll continue to have $\text{rank}(\Sigma) = r < n$. Often the multivariate normal distribution (MVN) is defined by saying that $X \sim \mathcal N(\mu, \Sigma)$ precisely when there exists a function $$ f(x; \mu,\Sigma) = \frac{\exp\left(-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\right)}{(2\pi)^{n/2}|\Sigma|^{1/2}} $$ such that $$ P(X\in A) = \int_A f\,\text d\lambda^n $$ for all Borel $A\subseteq\mathbb R^n$. But in this case there is no such function $f$, first because $|\Sigma|=0$ so as written $f$ is undefined, but more deeply because $f$ by definition is the RND $\frac{\text dP_X}{\text d\lambda^n}$ and by Result 1 $\Sigma$ being low rank means that this RND doesn’t exist. So instead I will use a more general definition: $$ X\sim\mathcal N(\mu,\Sigma) \iff a^TX\sim\mathcal N(a^T\mu, a^T\Sigma a) \;\text{ for all }a\in\mathbb R^n. $$ This definition now only relies on the univariate Gaussian. If $a^T\Sigma a = 0$ I’ll treat $a^TX$ as a point mass at $a^T\mu$. The next result shows that this definition pins down the distribution of $X$ through its MGF, so it agrees with the usual density-based definition whenever $\Sigma$ is full rank.

Result 2: Suppose I have a random vector $X\in\mathbb R^n$ where for all $a\in\mathbb R^n$, $a^TX\sim\mathcal N(a^T\mu, a^T\Sigma a)$. Then $M_X(a) = \exp\left(a^T\mu + a^T\Sigma a/2\right)$, the MGF of a $\mathcal N(\mu,\Sigma)$ random vector, so the distribution of $X$ is uniquely determined and agrees with the usual MVN whenever $\Sigma$ is full rank.

Pf: I’ll use moment generating functions. Let $a\in\mathbb R^n$ and consider $M_X(a) = \E(e^{a^TX})$. If $a$ lies in the null space of $\Sigma$ then $a^T\Sigma a = 0$, so the distribution of $a^TX$ is a point mass at $a^T\mu$ and $\E(e^{a^TX}) = e^{a^T\mu}$. Now if $a \notin \text{NullSpace}(\Sigma)$ then $a^T\Sigma a > 0$ so $a^TX$ isn’t degenerate. In this case I can let $Y = a^TX$ so this becomes
$$
\E(e^{a^TX}) = \int_{\mathbb R} e^y f(y; a^T\mu, a^T\Sigma a)\,\text dy.
$$
This is just the MGF of a univariate Gaussian evaluated at $1$ so I know
$$
M_X(a) = \exp\left(a^T\mu + a^T\Sigma a / 2\right)
$$
and this is exactly the MGF of a $\mathcal N(\mu,\Sigma)$ random vector. If $a^T\Sigma a = 0$ then this reduces to $e^{a^T\mu}$, so for all $a\in\mathbb R^n$ I have $M_X(a)$ equal to the MGF of a $\mathcal N(\mu,\Sigma)$ random vector (and since these MGFs are finite in a neighborhood of zero, they uniquely determine the distribution).

$\square$

It’s interesting to note that moment generating functions and characteristic functions, which characterize the distribution, only interact with the random vector through linear combinations $a^TX$, so in light of that this definition isn’t too surprising.
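
As a quick numerical sanity check of Result 2’s MGF computation, here’s a small Monte Carlo sketch (the particular singular $\Sigma$, $\mu$, and direction $a$ are made up): I estimate $\E(e^{a^TX})$ from samples and compare it with $\exp\left(a^T\mu + \frac12 a^T\Sigma a\right)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# A singular covariance matrix built directly as B B^T with B of rank 2.
B = rng.standard_normal((3, 2))
Sigma = B @ B.T
mu = np.array([0.5, -0.2, 0.1])

X = rng.multivariate_normal(mu, Sigma, size=200_000)

a = np.array([0.3, -0.1, 0.2])                      # an arbitrary direction
mc_mgf = np.mean(np.exp(X @ a))                     # Monte Carlo estimate of E[exp(a^T X)]
closed_form = np.exp(a @ mu + 0.5 * a @ Sigma @ a)  # exp(a^T mu + a^T Sigma a / 2)
print(mc_mgf, closed_form)                          # should agree up to Monte Carlo error
```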

Now that I have my MVN well-defined, I’ll consider two questions.

1. Given an affine subspace of $\mathbb R^n$, how do I construct an MVN supported on it?

2. Given an MVN with a low rank covariance matrix, how do I determine the subspace that supports it?

Constructing an MVN on a particular subspace

Let $\H + \mu$ be an affine subspace where $\H$ is a subspace of dimension $r$. I can let $Q_r$ be an $n\times r$ matrix containing an orthonormal basis for $\H$ ($\H$ is a finite dimensional vector space in its own right so it is guaranteed to have a basis [especially since I’m in measure theory land so I have no problems assuming the axiom of choice], and by Gram-Schmidt I can make it orthonormal).

Let $Z\sim\mathcal N_r(\0, I)$ and consider $Q_rZ + \mu$. $Q_r Z$ is a random combination of the basis vectors for $\H$, so $Q_r Z \in \H$, and then the translation by $\mu$ means $Q_rZ + \mu$ is in my target affine space. And since $Z$ is Gaussian I know
$$
Q_r Z + \mu \sim \mathcal N_n(\mu, Q_rQ_r^T)
$$
where $Q_rQ_r^T$ is a rank $r$ covariance matrix for an $n$ dimensional RV.

In the other direction, suppose $X\sim\mathcal N_n(\mu,\Sigma)$ with $\Sigma = Q_r\Lambda_rQ_r^T$ being rank $r$. Then $X$ is equal in distribution to $Q_r Z + \mu$ where $Z\sim\mathcal N_r(\0,\Lambda_r)$ so $X$ is supported on the affine space $\text{span}(Q_r) + \mu$.

Together these two results show that the support of $X$ is completely determined by the space spanned by the eigenvectors of the covariance matrix, which fits with Result 1 where I used the fact that a low rank $\Sigma$ meant there were directions with no variance and so $P_X \not\ll \lambda^n$.
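
Here’s a minimal sketch of this construction in code (the target plane and $\mu$ are made-up examples): I pick a basis for $\H$, orthonormalize it, and push $Z\sim\mathcal N_r(\0, I)$ through $Q_rZ + \mu$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Target affine space: H + mu where H = span{(1,1,0), (0,0,1)} in R^3.
H_basis = np.array([[1.0, 0.0],
                    [1.0, 0.0],
                    [0.0, 1.0]])
Q_r, _ = np.linalg.qr(H_basis)        # orthonormal basis for H (Gram-Schmidt in effect)
mu = np.array([2.0, 0.0, -1.0])

Z = rng.standard_normal((5_000, 2))   # Z ~ N_2(0, I)
X = Z @ Q_r.T + mu                    # each row is Q_r z + mu

# H is orthogonal to v = (1, -1, 0)/sqrt(2), so v^T x should equal v^T mu for every sample.
v = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)
print(np.abs(X @ v - v @ mu).max())   # ~ 0
print(np.cov(X, rowvar=False))        # ~ Q_r Q_r^T, a rank-2 covariance matrix
```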

One consequence of $X \stackrel{\text d}= Q_r Z + \mu$ is that I can compute $P_X$ by integrating a lower-dimensional pdf with respect to $\lambda^r$:
$$
\begin{aligned}P(X\in A) &= P(Q_r Z + \mu \in A) \\&
= P\left(Z \in Q_r^T\big((A-\mu)\cap\H\big)\right) \\ & = \int_{Q_r^T\left((A-\mu)\cap\H\right)} f(z; \0, \Lambda_r)\,\text d\lambda^r(z)\end{aligned}
since $Z$ has a full rank covariance matrix and therefore has a Lebesgue pdf on $\mathbb R^r$ in the usual sense (the intersection with $\H$ is there because $Q_rZ$ can only ever land in $\H$, so only the part of $A-\mu$ inside $\H$ matters). Intuitively, I’m just mapping the relevant part of $A$ down into the $r$-dimensional space where $X$ behaves like a non-degenerate Gaussian and integrating there in the usual way.
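
To make this concrete, here’s a small sketch (assuming numpy and scipy are available; the particular $Q_r$, $\Lambda_r$, $\mu$, and half-space are made up). I take $A$ to be a half-space so there’s a closed form to compare the $r$-dimensional Monte Carlo integral against.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# A degenerate MVN in R^3: Sigma = Q_r Lam_r Q_r^T with r = 2.
Q_r, _ = np.linalg.qr(rng.standard_normal((3, 2)))
Lam_r = np.diag([1.5, 0.4])
mu = np.array([1.0, 2.0, -0.5])
Sigma = Q_r @ Lam_r @ Q_r.T

# Take A to be the half-space {x : c^T x <= b} so the answer has a closed form.
c = np.array([0.7, -0.3, 0.5])
b = 0.8
exact = norm.cdf((b - c @ mu) / np.sqrt(c @ Sigma @ c))   # c^T X ~ N(c^T mu, c^T Sigma c)

# Monte Carlo version of the r-dimensional integral: draw Z ~ N_2(0, Lam_r) and check
# whether Q_r Z + mu lands in A, i.e. whether Z lies in the preimage of A - mu.
Z = rng.standard_normal((200_000, 2)) @ np.sqrt(Lam_r)
X = Z @ Q_r.T + mu
mc = np.mean(X @ c <= b)

print(exact, mc)   # should agree up to Monte Carlo error
```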

Application: multinomial

In the final section I want to develop an application of this idea. I’ll now have $X \sim \text{Mult}(n, \p)$ with $\p = (p_1,\dots,p_K) \in (0,1)^K$ and $\p^T\one = 1$ so
$$
P(X = (x_1,\dots,x_K)) = \binom{n}{x_1,\dots,x_K}\prod_j p_j^{x_j}.
$$
I will be investigating the asymptotic behavior of the estimator $\hat \p := X/n$ of $\p$, and as I’ll see, an MVN with a low rank covariance matrix will naturally arise. I will use a non-boldface $\hat p_i$ to refer to the $i$th component of $\hat\p$ and $\hat\p_n$ will refer to $\hat \p$ coming from a $\text{Mult}(n, \p)$ distribution, i.e. with a sample size of $n$.

Result 3: A $\Mult(n, \p)$ random vector $X$ has covariance matrix $\Var(X) = n\Omega$, where
$$
\Omega := \begin{pmatrix}
p_1(1-p_1) & -p_1p_2 & -p_1p_3 & \dots & -p_1p_K \\
-p_1p_2 & p_2(1-p_2) & -p_2p_3 & \dots & -p_2p_K \\
\vdots & & \ddots & & \vdots \\
-p_1p_K & -p_2p_K & -p_3p_K & \dots & p_K(1-p_K)
\end{pmatrix}
$$
Pf: for the diagonal, each $X_i$ is $\text{Bin}(n, p_i)$ so I easily have $\Var(X_i) = np_i(1-p_i)$. For the off-diagonal terms (so I’m assuming $i\neq j$), I have
$$
\text{Cov}(X_i,X_j) = \E(X_iX_j) - (\E X_i)(\E X_j) = \E(X_iX_j) - n^2p_ip_j.
$$
For $\E(X_iX_j)$ it’ll be easiest to use the fact that a $\text{Mult}(n,\p)$ RV can be represented as the sum of $n$ independent $\text{Cat}(\p)$ RVs, where $\text{Cat}$ is the categorical distribution that generalizes the Bernoulli distribution to more than two options. Thus if $Z_1,\dots,Z_n \stackrel{\text{iid}}\sim \text{Cat}(\p)$ (so each $Z_k \in \{0,1\}^K$ is a one-hot vector indicating which category was drawn) then
$$
X = \sum_{k=1}^nZ_k
$$
and
$$
X_i = \sum_{k=1}^n I(Z_k = i)
$$
where $I(Z_k = i)$ is shorthand for the indicator that the $i$th coordinate of $Z_k$ equals $1$, i.e. that draw $k$ landed in category $i$. This means
$$
X_iX_j = \sum_{k,\ell=1}^n I(Z_k = i)I(Z_\ell = j) = \sum_{k\neq \ell}I(Z_k = i)I(Z_\ell = j)
$$
since when $k=\ell$ the two indicators can’t both be $1$ (because $i\neq j$). Then by independence of the $Z_k$ I have
$$
\E(X_iX_j) = \sum_{k\neq \ell}\E \left[I(Z_k = i)\right]\E\left[I(Z_\ell = j)\right] = \sum_{k\neq\ell} p_ip_j = n(n-1)p_ip_j
$$
therefore
$$
\text{Cov}(X_i,X_j) = n(n-1)p_ip_j - n^2p_ip_j = -np_ip_j.
$$
This means
$$
\Var(X) = n\Omega.
$$
$\square$
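
Here’s a quick empirical sketch of Result 3 (the particular $n$ and $\p$ are arbitrary): the sample covariance of a bunch of multinomial draws should sit close to $n\Omega$.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 50
p = np.array([0.2, 0.3, 0.5])
Omega = np.diag(p) - np.outer(p, p)

X = rng.multinomial(n, p, size=100_000)   # 100k draws from Mult(n, p)
print(np.cov(X, rowvar=False))            # empirical covariance of X
print(n * Omega)                          # Result 3: Var(X) = n * Omega
```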

Result 4: $\Omega$ is rank $K-1$ and the null space is spanned by $\one$.

Pf: $\Omega = \text{diag}(\p) - \p\p^T$ and every element of $\p$ is non-zero by assumption so $\Omega$ is a rank 1 update of an invertible matrix. This means its rank is either $K$ or $K-1$. Using the matrix-determinant lemma I can compute
$$
\det(\Omega) = \det(\text{diag}(\p))\left(1 - \p^T\text{diag}(\p)^{-1}\p\right) = \left(\prod_i p_i \right)\left(1 - \sum_i \frac{p_i^2}{p_i}\right) = 0
$$
since $\sum_i p_i = 1$. So $\Omega$ isn’t full rank, and since its rank is at least $K-1$ this means $\rank\Omega = K-1$. Its null space is therefore one dimensional, so I just need to find a single non-zero null vector and I’m good to go. Trying $\one$ is pretty reasonable so I’ll just plug that in and see what happens:
$$
\Omega\one = \p - \p\p^T\one = \p-\p = \0
$$
so $\one\in\text{NullSpace}(\Omega)$ and I’m done.

$\square$
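
And a quick numerical look at Result 4 (again with a made-up $\p$):

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])       # K = 4
Omega = np.diag(p) - np.outer(p, p)

print(np.linalg.eigvalsh(Omega))          # one eigenvalue ~ 0, the other K - 1 positive
print(np.linalg.matrix_rank(Omega))       # K - 1 = 3
print(Omega @ np.ones(4))                 # ~ the zero vector, so 1 spans the null space
```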

This means that the eigenvectors of $\Omega$ with non-zero eigenvalues give an orthonormal basis for $\{x \in \mathbb R^K : x\perp \one\}$, so in terms of my earlier results this tells me that a mean-zero Gaussian vector with $\Omega$ as its covariance matrix lies in the $K-1$ dimensional subspace orthogonal to $\one$. But when I’m actually working with $\hat \p$ I’ll have $\hat\p^T\one = 1$ always so I’m actually in the affine subspace $\mathcal A := \{x \in \mathbb R^K : x^T\one = 1\}$.

The result that I am interested in, and the connection of this to MVNs with low rank covariance matrices, is the following:

Result 5:
$$
\sqrt n(\hat \p_n - \p) \convd \mathcal N(\0,\Omega).
$$

Pf: I’ll do this via moment generating functions. I’ll let $X_n \sim \Mult(n, \p)$ so $M_{X_n}(t) = \left(\sum_{i=1}^K p_ie^{t_i}\right)^n$. Let
$$
S_n = \sqrt n(\hat \p_n - \p)
$$
and $M_n(t) = \E(e^{t^TS_n})$. Then
$$
\begin{aligned}M_n(t) &= \exp\left(-\p^Tt\sqrt n\right) M_{X_n}(t/\sqrt n) \\&
= \exp\left(-\p^Tt\sqrt n\right)\left[\sum_{i=1}^K p_ie^{t_i / \sqrt n}\right]^n \\&
= \left[\sum_{i=1}^K p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)\right]^n.\end{aligned}
$$
If $t\in\text{NullSpace}(\Omega)$, i.e. $t\propto\one$, then $t_i - \p^Tt = 0$ for every $i$ so at this point I have $M_n(t) = 1$. If $Y\sim\mathcal N(\0,\Omega)$ then $t^TY\sim \mathcal N (0,0)$, which is a point mass at $0$, so $M_Y(t) = \E(e^{t^TY}) = e^0 = 1$ and this agrees with $M_n(t)$. Since this case is established I can assume $t\not\in\text{span}\{\one\}$ going forward.

With this restriction on $t$, plugging in $n=\infty$ leads to the indeterminate $1^\infty$ so I’ll work with the limit on the log scale and see what happens (equivalently I’ll be using the cumulant generating function). Thus I want to evaluate
$$
L := \lim_{n\to\infty}\log M_n(t) = \lim_n \frac{\log\left[\sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)\right]}{1/n}
$$
which is a $\frac 00$ indeterminate so I can use L’Hopital’s rule. This leads me to
$$
\begin{aligned}L &= \frac 12 \lim_n \frac{n^{-3/2}\sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)(t_i -\p^Tt)}{n^{-2}\sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)} \\&
= \frac 12 \lim_n \frac{\sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)(t_i -\p^Tt)}{n^{-1/2}\sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)}.\end{aligned}
$$
This again is a $\frac 00$ indeterminate, so I’ll first absorb the $n^{-1/2}$ in the denominator into the exponential by writing it as $\exp\left(-\frac 12 \log n\right)$ and then apply L’Hopital’s rule again to get
$$
\begin{aligned}L &= \frac 12 \lim_n \frac{\sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)(t_i -\p^Tt)}{\sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n} - \frac 12 \log n\right)} \\
&= \frac 12 \lim_n \frac
{-\frac 12 n^{-3/2}\sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)(t_i -\p^Tt)^2}
{\sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n} - \frac 12 \log n\right)\left(-\frac 12 (t_i - \p^Tt) n^{-3/2} - \frac 1{2n}\right)} \\
&= \frac 12 \lim_n \frac
{ n^{-3/2}\sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)(t_i -\p^Tt)^2}
{ n^{-3/2}\sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n} - \frac 12 \log n\right)(t_i - \p^Tt) +
n^{-1} \sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n} - \frac 12 \log n\right)} \\
&= \frac 12 \lim_n \frac
{ \sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)(t_i -\p^Tt)^2}
{ \sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n} - \frac 12 \log n\right)(t_i - \p^Tt) +
n^{1/2} \sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n} - \frac 12 \log n\right)} \\
&= \frac 12 \lim_n \frac
{ \sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)(t_i -\p^Tt)^2}
{ \sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n} - \frac 12 \log n\right)(t_i - \p^Tt) +
\sum_{i} p_i\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)}\end{aligned}
$$
and now the limit is no longer indeterminate: every $\exp\left(\frac{t_i -\p^Tt}{\sqrt n}\right)$ factor tends to $1$ while every $\exp\left(\frac{t_i -\p^Tt}{\sqrt n} - \frac 12 \log n\right)$ factor tends to $0$, so the denominator tends to $\sum_i p_i = 1$ and I have
$$
\begin{aligned}L &= \frac 12 \sum_{i=1}^K p_i(t_i - \p^Tt)^2 = \frac 12 \sum_i p_i(t_i^2 - 2t_i\p^Tt + (\p^Tt)^2)\\&
= \frac 12 t^T\left(\text{diag}(\p) - \p\p^T\right)t = \frac 12 t^T\Omega t.\end{aligned}
$$
This means
$$
M_n(t) \to \exp\left(\frac 12 t^T\Omega t\right)
$$
which is the MGF of a $\mathcal N(\0,\Omega)$ random vector. This holds for all $t\in\mathbb R^K$, and pointwise convergence of MGFs that are finite in a neighborhood of zero implies convergence in distribution, so
$$
\sqrt n (\hat\p_n - \p) \convd \mathcal N(\0,\Omega).
$$
$\square$
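
Since the proof produced an exact expression for $M_n(t)$, here’s a small numerical sketch (with made-up $\p$ and $t$) watching it approach $\exp\left(\frac12 t^T\Omega t\right)$ as $n$ grows:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])
Omega = np.diag(p) - np.outer(p, p)
t = np.array([0.4, -0.7, 0.1])                   # any t not proportional to 1

def M_n(t, n):
    """Exact finite-n MGF of sqrt(n)(p_hat_n - p), from the derivation above."""
    return np.sum(p * np.exp((t - p @ t) / np.sqrt(n))) ** n

limit = np.exp(0.5 * t @ Omega @ t)              # the claimed limiting MGF
for n in [10, 100, 10_000, 1_000_000]:
    print(n, M_n(t, n), limit)
```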

This confirms that $\hat\p_n$ bounces around $\p$ in a Gaussian way as $n$ increases, but the Gaussian that it converges to is only supported on a subspace.

Here’s an example of 500 simulated values of $\hat\p$ from a multinomial distribution with $K=3$. The red point in the center of the cloud is the true $\p$. This shows how the cloud has a Gaussian-like spread, but with a low rank covariance matrix it’s constrained to this affine subspace rather than filling $\mathbb R^3$, even though the points themselves are vectors in $\mathbb R^3$.
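
Here’s a sketch of the kind of simulation described above (the particular $n$ and $\p$ are made up): every simulated $\hat\p$ sits exactly in the plane $x^T\one = 1$, while the covariance of $\sqrt n(\hat\p - \p)$ comes out close to the rank-deficient $\Omega$.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 50
p = np.array([0.2, 0.3, 0.5])
Omega = np.diag(p) - np.outer(p, p)

p_hat = rng.multinomial(n, p, size=500) / n      # 500 simulated values of p_hat

print(np.abs(p_hat.sum(axis=1) - 1).max())       # 0: every p_hat lies in the plane x^T 1 = 1
S = np.cov(np.sqrt(n) * (p_hat - p), rowvar=False)
print(S)                                         # roughly Omega
print(np.linalg.eigvalsh(S))                     # smallest eigenvalue ~ 0: the cloud is flat
```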
