The lognormal distribution and linear regression

$\newcommand{\e}{\varepsilon}$$\newcommand{\one}{\mathbf 1}$Suppose I have collected data $X \in \mathbb R^{n\times p}$ ($n \geq p$ and $X$ is full column rank by assumption) and $y\in(0,\infty)^n$. I’m going to explore what happens when I want to model the mean of $y$ as
$$
\text E(y_i) = \exp(x_i^T\beta)
$$
but I estimate my model parameters by OLS on $\log y$, i.e. the model I actually fit is
$$
\log y = X\beta + \e, \hspace{5mm} \e \sim \mathcal N(\mathbf 0, \sigma^2 I).
$$
My focus will be on the fitted values for the original scale as there’s nothing but a standard regression happening on the log scale. The proper way to do this would be to acknowledge that I’m really talking about a GLM and to fit this as such (so e.g. a Poisson regression could do the trick, at least for point estimates; for standard errors I may want to consider overdispersion or perhaps robust SEs).

From my model I have $y_i = \exp\left(x_i^T\beta + \e_i\right)$ and $x_i^T\beta + \e_i \sim \mathcal N(x_i^T\beta, \sigma^2)$ so I’m going to begin by deriving the lognormal density and its moments, then I’ll return to the linear regression problem.
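
To make the setup concrete, here’s a minimal simulation sketch (nothing here is from the analysis below; the design, coefficients, and noise level are made up) contrasting OLS on $\log y$ with a log-link GLM when the conditional mean really is $\exp(x_i^T\beta)$:

```python
# Sketch only: simulate data with E(y_i | x_i) = exp(x_i^T beta), then fit
# (1) OLS on log(y) and (2) a Poisson GLM, whose log link targets the mean directly.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, sigma = 2000, 0.8
X = sm.add_constant(rng.normal(size=(n, 2)))   # design with an intercept column
beta = np.array([1.0, 0.5, -0.25])

# log y_i = x_i^T beta - sigma^2/2 + eps_i makes y_i lognormal with mean exp(x_i^T beta)
y = np.exp(X @ beta - sigma**2 / 2 + rng.normal(scale=sigma, size=n))

ols = sm.OLS(np.log(y), X).fit()                         # regression on the log scale
glm = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # GLM for the mean itself

print("target E(y_i)     :", np.exp(X @ beta)[:3])
print("naive exp(OLS fit):", np.exp(X @ ols.params)[:3])  # biased low in this setup
print("GLM fitted mean   :", glm.fittedvalues[:3])
```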

The Lognormal Distribution

I’m going to change my notation for this section to be more standard for probability, and I’ll return to my linear regression notation later.

Suppose $X\sim\mathcal N(\mu,\sigma^2)$. Let $Y = e^X$ and $y > 0$. Then
$$
\begin{aligned}F_Y(y) &= P(Y\leq y) = P(e^X \leq y) \\&
= P(X \leq \log y) \\&
= \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\log y} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\,\text dx\end{aligned}
$$
so by the fundamental theorem of calculus
$$
f_Y(y) = \frac{\text d}{\text d y} F_Y(y) \\
= \frac{1}{y\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(\log y - \mu)^2}{2\sigma^2}\right).
$$
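As a quick numerical sanity check (a sketch; $\mu$ and $\sigma$ are arbitrary), this density matches scipy’s lognormal with shape parameter $\sigma$ and scale $e^\mu$:

```python
# Compare the derived density to scipy.stats.lognorm (shape s = sigma, scale = exp(mu)).
import numpy as np
from scipy import stats

mu, sigma = 0.3, 0.7
y = np.linspace(0.05, 5.0, 200)
manual = np.exp(-(np.log(y) - mu) ** 2 / (2 * sigma**2)) / (y * np.sqrt(2 * np.pi * sigma**2))
assert np.allclose(manual, stats.lognorm.pdf(y, s=sigma, scale=np.exp(mu)))
```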
Now I’ll work out the moments of $Y$. I can use the definition of $Y$ to reduce this to a problem about $X$:
$$
\text E(Y^t) = \text E(e^{tX}) = M_X(t)
$$
where $M_X$ is the moment generating function (MGF) of $X$.

$$
\text E(e^{tX}) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{\mathbb R}\exp\left(tx - \frac 1{2\sigma^2}(x - \mu)^2\right)\,\text dx
$$
so inside the exponential I’ll factor out $-\frac{1}{2\sigma^2}$ to get
$$
\begin{aligned}&-2\sigma^2tx + x^2 - 2x\mu + \mu^2 \\&
= x^2 - 2(\mu + \sigma^2 t)x + \mu^2 \\&
= \left(x - [\mu + \sigma^2 t]\right)^2 + \mu^2 - \mu^2 - 2\mu\sigma^2t - \sigma^4t^2\end{aligned}
$$
so putting it all together I have
$$
\begin{aligned}\text E(e^{tX}) &= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{\mathbb R}\exp\left(- \frac 1{2\sigma^2}\left(x - [\mu + \sigma^2 t]\right)^2 + \mu t + \sigma^2t^2/2\right)\,\text dx \\&
= \exp\left(\mu t + \sigma^2t^2/2\right).\end{aligned}
$$
This means that, for example,
$$
\text E(Y) = \exp\left(\mu + \sigma^2/2\right)
$$
and
$$
\begin{aligned}\text{Var}(Y) &= \text E(Y^2) - \text E(Y)^2 \\&
= \exp\left(2\mu + 2\sigma^2\right) - \exp\left(2\mu + \sigma^2\right)\\&
= e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right).\end{aligned}
$$
This gives an easy way to compute the moments of $Y$, and shows that all positive moments exist and are finite. I didn’t use the MGF of $Y$ itself because it isn’t finite for any $t > 0$, as I’ll now show:
$$
\text E(e^{tY}) = \text E(e^{te^X}) \\
= \frac{1}{\sqrt{2\pi\sigma^2}}\int_{\mathbb R}\exp\left(te^x - \frac 1{2\sigma^2}(x - \mu)^2\right)\,\text dx.
$$
Let $t > 0$. Ignoring the leading constant, I can break this integral up into
$$
\int_{-\infty}^0\exp\left(te^x - \frac 1{2\sigma^2}(x - \mu)^2\right)\,\text dx + \\
\int_0^{\infty}\exp\left(te^x - \frac 1{2\sigma^2}(x - \mu)^2\right)\,\text dx.
$$
In the first term, $0 < te^x \leq t$ so
$$
\exp\left(te^x - \frac 1{2\sigma^2}(x - \mu)^2\right) \leq \\
e^t \cdot\exp\left(- \frac 1{2\sigma^2}(x - \mu)^2\right)
$$
and that integral converges to a finite value on $(-\infty, 0]$. But for the second term with $x \geq 0$,
$$
\lim_{x\to\infty} \exp\left(te^x - \frac 1{2\sigma^2}(x - \mu)^2\right) = \infty
$$
so the integral diverges (the integrand doesn’t even tend to $0$). This means there is no neighborhood of $0$ on which $M_Y$ is finite, which prevents all of the usual things from being done with it (e.g. finding moments by evaluating derivatives at $t=0$).
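
Before returning to regression, here’s a quick Monte Carlo check of the mean and variance formulas above (a sketch with arbitrary $\mu$ and $\sigma$):

```python
# Monte Carlo check of E(Y) = exp(mu + sigma^2/2) and
# Var(Y) = e^{2 mu + sigma^2} (e^{sigma^2} - 1) for Y = e^X, X ~ N(mu, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.9
y = np.exp(rng.normal(mu, sigma, size=2_000_000))

print(y.mean(), np.exp(mu + sigma**2 / 2))                          # mean
print(y.var(), np.exp(2 * mu + sigma**2) * (np.exp(sigma**2) - 1))  # variance
```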

Returning to linear regression

I’ll now again have $X$ be my $n\times p$ feature matrix and $y$ be a random vector taking values in $(0,\infty)^n$.

Before any estimation happens, I am already modeling the wrong thing: I said I wanted $\text E(y) = \exp(X\beta)$ but, using the results of the previous section, I have
$$
\begin{aligned}&\log y_i \sim \mathcal N(x_i^T\beta, \sigma^2) \\&
\implies y_i \sim \text{Lognormal}(x_i^T\beta, \sigma^2) \\&
\implies \text E(y_i) = \exp(x_i^T\beta + \sigma^2/2).\end{aligned}
$$
So I definitely should have just used a GLM.

Next I’ll see what happens when I exponentiate $\widehat{\log y_i} = x_i^T\hat\beta$, which is probably the obvious way to try to get fitted values on the original scale. I’ll use
$$
\tilde y := \exp(X\hat\beta)
$$
for these.

By the assumption of Gaussian errors I have
$$
\hat\beta \sim \mathcal N(\beta, \sigma^2(X^TX)^{-1}) \\
\implies X\hat\beta \sim \mathcal N(X\beta, \sigma^2 H)
$$
where $H = X(X^TX)^{-1}X^T$ is the hat matrix. I’ll let $h_i = H_{ii}$ so
$$
\begin{aligned}&x_i^T\hat\beta \sim \mathcal N(x_i^T\beta, \sigma^2 h_i) \\&
\implies\exp(x_i^T\hat\beta) \sim \text{Lognormal}(x_i^T\beta, \sigma^2 h_i) \\&
\implies \text E(\tilde y_i) = \exp\left(x_i^T\beta + \sigma^2h_i / 2\right)\end{aligned}
$$
therefore $\tilde y_i$ is off from my target of $e^{x_i^T\beta}$ whenever $h_i > 0$, which will always be the case as $h_i \in \left[\frac 1n, 1\right]$ (I prove this in my post here).
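
A small simulation sketch (made-up design, $\beta$, and $\sigma$; many replicated error draws) agrees with this expectation for a chosen observation $i$:

```python
# Monte Carlo check that E[exp(x_i^T beta_hat)] = exp(x_i^T beta + sigma^2 h_i / 2).
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 40, 3, 0.6
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([0.5, 0.3, -0.2])
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # diagonal of the hat matrix
i = 0

reps = 100_000
eps = rng.normal(scale=sigma, size=(n, reps))
logy = (X @ beta)[:, None] + eps              # log-scale responses, one column per replication
beta_hat = XtX_inv @ X.T @ logy               # p x reps OLS estimates
naive = np.exp(X[i] @ beta_hat)               # exp(x_i^T beta_hat) in each replication

print(naive.mean())                                # Monte Carlo average
print(np.exp(X[i] @ beta + sigma**2 * h[i] / 2))   # the expectation derived above
```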

The next question is whether I can change $\tilde y$ to get unbiased point estimates for my target mean of $e^{X\beta}$.

If I could eliminate the $\sigma^2 h_i / 2$ term then $\tilde y_i$ would be unbiased for this target, which suggests the following correction:
$$
\tilde y_i := \exp(x_i^T\hat\beta - \hat\sigma^2h_i/2).
$$
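In code the correction is cheap to compute from a single fit. This is a sketch only; `X` and `y` are assumed to be the design matrix and positive response, and $\hat\sigma^2$ is the MLE $\text{RSS}/n$ used in the result below:

```python
# Corrected fitted values y_tilde_i = exp(x_i^T beta_hat - sigma2_hat * h_i / 2).
import numpy as np

def corrected_fitted(X, y):
    logy = np.log(y)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ logy
    resid = logy - X @ beta_hat
    sigma2_hat = resid @ resid / len(y)          # MLE of sigma^2: RSS / n
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # hat matrix diagonal h_i
    return np.exp(X @ beta_hat - sigma2_hat * h / 2)
```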
For the bias of my new $\tilde y_i$, I will need to use the following:

Result: if $y = X\beta + \e$ with $\e \sim \mathcal N(\mathbf 0, \sigma^2 I)$ then $\hat\beta \perp \hat\sigma^2$ and $\frac{n\hat\sigma^2}{\sigma^2}\sim\chi^2_{n-p}$, where $\hat\sigma^2 = \frac 1n\|y - X\hat\beta\|^2$ is the MLE of $\sigma^2$.

Pf: I proved this in my Cross Validated answer here so I’ll just link to it.

$\square$
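
A simulation sketch consistent with this result (made-up design and parameters; it only checks the moments and the lack of correlation that independence and the $\chi^2_{n-p}$ law imply):

```python
# Check that n * sigma2_hat / sigma^2 has the chi-square(n - p) mean and variance
# and is uncorrelated with beta_hat (a consequence of the independence claim).
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 30, 4, 1.3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)

reps = 100_000
eps = rng.normal(scale=sigma, size=(n, reps))
logy = (X @ beta)[:, None] + eps
beta_hat = np.linalg.solve(X.T @ X, X.T @ logy)   # p x reps
resid = logy - X @ beta_hat
stat = (resid**2).sum(axis=0) / sigma**2          # n * sigma2_hat / sigma^2 per replication

print(stat.mean(), n - p)                         # chi^2_{n-p} mean is n - p
print(stat.var(), 2 * (n - p))                    # ... and variance is 2(n - p)
print(np.corrcoef(stat, beta_hat[1])[0, 1])       # approximately 0
```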

For the expected value of $\tilde y_i$ I will have
$$
\begin{aligned}\text E(\tilde y_i) &= \text E(\exp(x_i^T\hat\beta - \hat\sigma^2h_i/2)) \\&
= \text E(e^{x_i^T\hat\beta}) \cdot \text E(e^{- \hat\sigma^2h_i/2}) \\&
= \exp\left(x_i^T\beta + \sigma^2h_i / 2\right) \cdot \text E(e^{- \hat\sigma^2h_i/2})\end{aligned}
$$
by independence. I now need that remaining expectation, which can be seen to be an evaluation of the moment generating function of $\hat\sigma^2$. I’ll start by working out the MGF of a $\Gamma(a,b)$ RV.

Result: if $Z\sim\Gamma(a,b)$ with density
$$
f_Z(z;a,b) = \frac{1}{b^a\Gamma(a)}z^{a-1}e^{-z/b}, \hspace{5mm} z \geq 0
$$
then
$$
M_Z(t) = \left(1 - tb\right)^{-a}, \hspace{5mm} t< \frac 1b.
$$
Pf:
$$
\begin{aligned}\text E(e^{tZ}) &= \frac{1}{b^a\Gamma(a)}\int_0^\infty z^{a-1}e^{-(1/b-t)z}\,\text dz \\&
= \frac{(1/b-t)^{-a}}{b^a}\cdot\frac{1}{(1/b-t)^{-a}\Gamma(a)}\int_0^\infty z^{a-1}e^{-(1/b-t)z}\,\text dz \\&
= \left(1 - tb\right)^{-a}\end{aligned}
$$
so long as $1/b-t>0$ for the integral to converge, which gives me the requirement that $t< \frac 1b$.

$\square$
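
A quick numerical check of this MGF (a sketch; the shape, scale, and $t$ are arbitrary, chosen with $t < 1/b$):

```python
# Monte Carlo check of E(e^{tZ}) = (1 - tb)^{-a} for Z ~ Gamma(shape=a, scale=b).
import numpy as np
from scipy import stats

a, b, t = 2.5, 1.5, 0.3                # need t < 1/b ~ 0.67
z = stats.gamma.rvs(a, scale=b, size=1_000_000, random_state=0)
print(np.mean(np.exp(t * z)))          # Monte Carlo estimate
print((1 - t * b) ** (-a))             # closed form
```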

$\chi^2_{n-p} \stackrel{\text{d}}= \Gamma\left(\frac {n-p}2, 2\right)$ and I have
$$
\frac{n\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-p}
$$
so for $t < \frac 12$
$$
\text E\left(\exp\left(t \frac{n\hat\sigma^2}{\sigma^2}\right)\right) = (1 - 2t)^{-\frac{n-p}{2}}.
$$
But I want the expected value of $\exp(-\hat\sigma^2h_i/2)$ so this means I’ll take
$$
t = -\frac{\sigma^2 h_i}{2n}
$$
which is always negative, so I don’t need to worry about the MGF being finite. This gives me
$$
\text E(\exp(-\hat\sigma^2h_i/2)) = \text E\left(\exp\left(-\frac{\sigma^2 h_i}{2n}\cdot\frac{n\hat\sigma^2}{\sigma^2}\right)\right) \\
= \left(1 + \frac{\sigma^2h_i}{n}\right)^{-\frac{n-p}{2}}
$$
so
$$
\text E(\tilde y_i) = \exp\left(x_i^T\beta + \sigma^2h_i / 2\right)\left(1 + \frac{\sigma^2h_i}{n}\right)^{-\frac{n-p}{2}}.
$$
This means I haven’t fixed the issue for finite samples, but
$$
\lim_{n\to\infty}\left(1 + \frac{\sigma^2h_i}{n}\right)^{-\frac{n-p}{2}} = \exp\left(-\sigma^2h_i/2\right)
$$
so asymptotically at least the bias goes away for every observation and I recover $\exp(x_i^T\beta)$ with this correction. If I’m just using $\exp(x_i^T\hat\beta)$ for my fitted values then the bias doesn’t have to go away even asymptotically, although it will get smaller for the majority of points: since $\text{tr}(H) = \sum_{i=1}^n h_i = p$, the $h_i$ must generally get smaller as $n$ increases. But the bias still depends on which observation I’m looking at (more specifically, on $x_i$).
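
To close the loop, here’s a simulation sketch (an illustration only; the design, $\beta$, and $\sigma$ are made up) comparing the Monte Carlo average of the corrected $\tilde y_i$ with the exact finite-sample expectation above and with the target $\exp(x_i^T\beta)$:

```python
# Monte Carlo check of
#   E(y_tilde_i) = exp(x_i^T beta + sigma^2 h_i/2) * (1 + sigma^2 h_i / n)^{-(n-p)/2}
# for the corrected estimator y_tilde_i = exp(x_i^T beta_hat - sigma2_hat * h_i / 2).
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 15, 3, 1.0                     # small n so the finite-sample bias is visible
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([0.4, 0.3, -0.2])
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # hat matrix diagonal
i = 0

reps = 200_000
eps = rng.normal(scale=sigma, size=(n, reps))
logy = (X @ beta)[:, None] + eps             # log-scale responses, one column per replication
beta_hat = XtX_inv @ X.T @ logy              # p x reps OLS estimates
sigma2_hat = ((logy - X @ beta_hat) ** 2).sum(axis=0) / n   # MLE of sigma^2 per replication
y_tilde = np.exp(X[i] @ beta_hat - sigma2_hat * h[i] / 2)

exact = np.exp(X[i] @ beta + sigma**2 * h[i] / 2) * (1 + sigma**2 * h[i] / n) ** (-(n - p) / 2)
print(y_tilde.mean())          # Monte Carlo average of the corrected fitted value
print(exact)                   # exact finite-sample expectation derived above
print(np.exp(X[i] @ beta))     # the target, recovered only as n grows
```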
