A characterization of the canonical link in logistic regression

$\newcommand{\logit}{\operatorname{logit}}$$\newcommand{\y}{\mathbf y}$$\newcommand{\one}{\mathbf 1}$$\newcommand{\e}{\varepsilon}$$\newcommand{\0}{\mathbf 0}$$\newcommand{\Var}{\operatorname{Var}}$$\newcommand{\E}{\operatorname{E}}$In this post I’ll look at the gradient of the log-likelihood in logistic regression and explore a simplification that turns out to be a characterization of the canonical link.

In logistic regression I have
$$
y_i\mid x_i \stackrel{\perp}\sim \text{Bern}(\theta_i) \\
\implies \ell(y\mid X) = \sum_{i=1}^n y_i\log\theta_i + (1-y_i)\log(1-\theta_i)
$$
where $\ell$ is the log-likelihood.
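For concreteness, here is this log-likelihood as a couple of lines of numpy (a minimal sketch; the function name is an arbitrary choice):

```python
import numpy as np

# Bernoulli log-likelihood: sum_i y_i log(theta_i) + (1 - y_i) log(1 - theta_i).
# Assumes every theta_i lies strictly in (0, 1) so both logs are defined.
def bernoulli_log_lik(y, theta):
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# e.g. bernoulli_log_lik(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7]))
```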

Let $\eta = X\beta$ be the linear predictor so $\frac{\partial \eta_i}{\partial \beta_j} = x_{ij}$ and set
$$
\theta = g(\eta)
$$
where $g$ is the (vectorized) inverse link function. I’ll assume $\operatorname{image}(g) \subseteq (0,1)$ so that the logs are always defined, and that $g$ is at least twice continuously differentiable. Normally $g$ would denote the link function itself rather than its inverse, but I’m using the inverse throughout this post for notational simplicity.

This means
$$
\ell(y\mid X) = \sum_{i=1}^n y_i \log g(\eta_i) + (1-y_i)\log(1 - g(\eta_i))
$$
so
$$
\begin{aligned}
\frac{\partial \ell}{\partial \beta_j} &= \sum_{i=1}^n y_i \frac{g'(\eta_i)}{g(\eta_i)}\frac{\partial \eta_i}{\partial \beta_j} - (1-y_i) \frac{g'(\eta_i)}{1 - g(\eta_i)}\frac{\partial \eta_i}{\partial \beta_j} \\
&= \sum_i g'(\eta_i) x_{ij} \left(\frac{y_i}{g(\eta_i)} - \frac{1-y_i}{1-g(\eta_i)}\right) \\
&= \sum_i (y_i - g(\eta_i)) x_{ij} \cdot \frac{g'(\eta_i)}{g(\eta_i)(1 - g(\eta_i))}
\end{aligned}
$$
so
$$
\nabla \ell = X^T W_\eta(y - g(\eta))
$$
where
$$
W_\eta = \operatorname{diag}\left(\frac{g'(\eta)}{g(\eta)(1-g(\eta))}\right).
$$
This gradient is quite interpretable: $y - g(\eta)$ is the vector of raw residuals, so $\nabla \ell$ is determined by how close the weighted raw residuals $W_\eta(y-g(\eta))$ are to lying in the null space of $X^T$. Intuitively this means I’ll be moving $\beta$, and therefore $\eta$, around until the weighted residuals (the unexplained part of the regression) are uncorrelated with the columns of $X$.
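To make the algebra concrete, here is a short numpy sketch (the simulated data and names are arbitrary choices, with the probit used as one example of a non-canonical inverse link) that forms $X^TW_\eta(y - g(\eta))$ and checks it against a finite-difference approximation of $\ell$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta = np.array([0.5, -1.0, 0.25])
y = rng.binomial(1, norm.cdf(X @ beta))

# Inverse link g and its derivative; the probit is just one non-canonical choice.
g, g_prime = norm.cdf, norm.pdf

def log_lik(b):
    theta = g(X @ b)
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

def score(b):
    eta = X @ b
    w = g_prime(eta) / (g(eta) * (1 - g(eta)))  # diagonal of W_eta
    return X.T @ (w * (y - g(eta)))             # X^T W_eta (y - g(eta))

# Independent check of the analytic gradient via central finite differences.
eps = 1e-6
fd = np.array([(log_lik(beta + eps * e) - log_lik(beta - eps * e)) / (2 * eps)
               for e in np.eye(p)])
print(np.allclose(score(beta), fd, atol=1e-6))  # should print True
```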

These weights can also be interpreted as follows. If I have a single observation $Z$ and I’m considering $\xi$ as its scalar-valued linear predictor, my model is
$$
Z \sim \text{Bern}(g(\xi))
$$
which means
$$
\frac{g'(\xi)}{g(\xi)(1-g(\xi))} = \frac{\frac{\partial}{\partial \xi} \E(Z)}{\Var(Z)}
$$
so I can interpret these weights as the sensitivity of the mean to changes in the linear predictor relative to the variance at that point. A large weight (in absolute value) means that $\xi$ is a point with a sensitive mean and relatively low variance, so it’s more important to get that point right. This is basically a degree of freedom weight.
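To get a feel for these weights, here is a tiny sketch (my own comparison, again using the probit as an alternative inverse link) that evaluates $g'(\xi)/\{g(\xi)(1-g(\xi))\}$ on a grid of $\xi$ values; the inverse logit gives a constant weight of $1$, which is exactly where the next section picks up:

```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm

# Weight d E(Z)/d xi divided by Var(Z), i.e. g'(xi) / (g(xi) (1 - g(xi))).
def weight(xi, g, g_prime):
    return g_prime(xi) / (g(xi) * (1 - g(xi)))

xi = np.linspace(-3, 3, 7)
print(weight(xi, expit, lambda t: expit(t) * (1 - expit(t))))  # all 1.0
print(weight(xi, norm.cdf, norm.pdf))  # ~1.6 at xi = 0, growing to ~3.3 at |xi| = 3
```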

Unit weights

I will now consider what it takes to have all the weights just be $1$. This can be achieved by finding an inverse link function that satisfies the differential equation
$$
f' = f(1-f)
$$
for $f : \mathbb R \to (0,1)$.

This can be done analytically since this differential equation is separable:
$$
\frac{\text d f}{\text dx} = f(1-f) \\
\implies \int \frac 1{f(1-f)}\,\text df = \int\text dx.
$$
I can use partial fraction decomposition now:
$$
\begin{aligned}\frac{a}{f} + \frac b{1-f}& \stackrel{\text{set}}= \frac 1{f(1-f)}\\&
\implies a(1-f) + bf = 1 \\&
\implies a = b = 1\end{aligned}
$$
so
$$
\begin{aligned}
x + c &= \int \frac{1}{f(1-f)}\,\text df = \int \frac 1f + \frac 1{1-f}\,\text df \\
&= \log f - \log(1-f) \\
&= \logit f \\
\implies f(x) &= \logit^{-1}(x+c) \\
&= \frac{1}{1+e^{-x-c}}.
\end{aligned}
$$
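As a quick symbolic sanity check (a small sympy sketch; the symbol names are arbitrary), this family really does solve $f' = f(1-f)$ for any constant $c$:

```python
import sympy as sp

# Verify that f(x) = 1 / (1 + exp(-x - c)) satisfies f' = f (1 - f) for any c.
x, c = sp.symbols('x c')
f = 1 / (1 + sp.exp(-x - c))
print(sp.simplify(sp.diff(f, x) - f * (1 - f)))  # prints 0
```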
The initial condition $f(0) = \frac 12$ makes sense, i.e. an input of $0$ leads to a predicted probability of $\frac 12$, and it forces $c=0$. Thus the inverse logit, which corresponds to the canonical link, is exactly the inverse link that makes $W_\eta = I$, and the gradient reduces to
$$
\nabla \ell = X^T(y - g(\eta))
$$
so the idea of finding fitted values that make the residuals uncorrelated with the predictors is even more apparent.
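And to see that orthogonality at an actual optimum, here is a bare-bones sketch (a toy Newton-Raphson fit on simulated data, not a production routine) showing that at the fitted $\beta$ the raw residuals are orthogonal to the columns of $X$:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
y = rng.binomial(1, expit(X @ np.array([-0.5, 1.0, 2.0])))

# Newton-Raphson for the canonical-link (inverse logit) fit.
beta = np.zeros(X.shape[1])
for _ in range(25):
    theta = expit(X @ beta)
    grad = X.T @ (y - theta)                    # X^T (y - g(eta)): no weights needed
    hess = X.T @ (X * (theta * (1 - theta))[:, None])
    beta = beta + np.linalg.solve(hess, grad)

print(np.max(np.abs(X.T @ (y - expit(X @ beta)))))  # essentially 0 at the MLE
```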

There are better ways to arrive at the canonical link, but what I really like about this is that the handy simplification $g' = g(1-g)$, along with the initial condition $g(0) = \frac 12$, turns out not just to be a convenient property of the (inverse) logit link but in fact to characterize it.
