Fitting a logistic regression with MCMLE

$\newcommand{\E}{\text{E}}$$\newcommand{\Var}{\text{Var}}$$\newcommand{\Y}{\mathcal Y}$Here I’m going to introduce Monte Carlo maximum likelihood estimation (MCMLE) and I’ll use it to fit a logistic regression.

Suppose I’m modeling a discrete $y \in \Y$ and I decide to do so by coming up with several statistics that I feel capture the useful parts of $y$, say
$$
s(y) = \left(s_1(y), \dots, s_p(y)\right)
$$
so $s : \Y \to \mathbb R^p$. I’ll then say that
$$
f(y|\theta) = P(Y=y|\theta) \propto \exp\left(\theta^Ts(y)\right)
$$
so any aspects of $y$ not described by $s$ are ignored, and $\theta\in\mathbb R^p$ is my parameter of interest.

In order to get this to be a proper distribution it needs to sum to $1$, so I’ll have
$$
f(y|\theta) = \frac{\exp\left(\theta^Ts(y)\right)}{\sum\limits_{z\in \Y} \exp(\theta^Ts(z))}.
$$
I’ll write this as
$$
f(y|\theta) = \frac{q_\theta(y)}{k(\theta)}
$$
where $k$ is the partition function and my goal then is to find
$$
\hat\theta = \underset{\theta}{\text{argmax }} f(y|\theta).
$$
MCMLE can be used to do this optimization when I can’t evaluate $k(\theta)$. One common use-case is when I’m modeling a random graph so $y$ is an $n\times n$ (say) adjacency matrix. I could pick a few statistics that characterize a particular random graph, say
$$
f(y|\theta) \propto \exp\left(\theta_1 \cdot \#\text{edges}(y) + \theta_2 \cdot \#\text{twostars}(y) + \theta_3 \cdot \#\text{triangles}(y)\right)
$$
for example (this is the “triad” model), but then the partition function will involve a sum over all $n\times n$ adjacency matrices which is generally intractable.
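
For concreteness, here’s one way those three statistics could be computed from an adjacency matrix. This is just a sketch assuming an undirected simple graph (symmetric 0/1 entries, zero diagonal), and the function is my own illustration rather than anything I use below.

# Triad-model statistics for an undirected simple graph with 0/1
# adjacency matrix A (symmetric, zero diagonal). Illustrative only.
triad_stats <- function(A) {
  deg <- rowSums(A)
  edges <- sum(A) / 2                        # each edge appears twice in A
  twostars <- sum(choose(deg, 2))            # 2-stars centered at each node
  triangles <- sum(diag(A %*% A %*% A)) / 6  # each triangle counted 6 times in tr(A^3)
  c(edges = edges, twostars = twostars, triangles = triangles)
}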

MCMLE to the rescue

The key idea behind MCMLE is that I may not be able to evaluate $k(\theta)$, but frequently I can sample from $f(y|\theta)$ so I can use that combined with the law of large numbers (LLN) to approximate $k(\theta) / k(\theta_0)$ for some fixed $\theta_0$.

Using standard properties of the argmax I can do the following series of manipulations:
$$
\hat \theta = \text{argmax}_{\theta} f(y|\theta) \\
=\text{argmax}_{\theta} \log f(y|\theta) \\
=\text{argmax}_{\theta} \log f(y|\theta) - \log f(y|\theta_0)
$$
for some constant $\theta_0$. Continuing,
$$
= \text{argmax}_\theta \log q_\theta(y) - \log k(\theta) - \log q_{\theta_0}(y) + \log k(\theta_0) \\
= \text{argmax}_\theta \log \frac{q_\theta(y)}{q_{\theta_0}(y)} - \log \frac{k(\theta)}{k(\theta_0)}.
$$
I have $q_\theta(y) = \exp(\theta^Ts(y))$ so
$$
\log \frac{q_\theta(y)}{q_{\theta_0}(y)} = (\theta - \theta_0)^Ts(y)
$$
which is a pretty tidy form.

Now if I let
$$
B(\theta, \theta_0) = \log \frac{k(\theta)}{k(\theta_0)}
$$
then I could find $\hat \theta$ by
$$
\hat\theta = \text{argmax}_\theta \, (\theta - \theta_0)^Ts(y) - B(\theta, \theta_0)
$$
and it’ll turn out that I can estimate $B$ in a pretty straightforward way.

I’ve assumed that I can sample $Y_1^*, \dots, Y_m^* \stackrel{\text{iid}}\sim f(\cdot|\theta_0)$. This isn’t that big of an assumption because I can evaluate $f(y | \theta) / f(y' | \theta)$ since $k(\theta)$ cancels out, so worst case I can use MCMC with a Metropolis-type algorithm to get these samples. For graphs that’s typically what’s necessary.
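
For example, here’s a bare-bones Metropolis sampler for a binary vector $y$ under $f(y|\theta) \propto \exp(\theta^Ts(y))$ with a single-flip proposal. It only ever touches ratios of $q_\theta$, so $k(\theta)$ never appears. This is purely illustrative and isn’t the sampler I use for the logistic regression below.

# Metropolis sampler for f(y|theta) proportional to exp(theta' s(y))
# over binary y. Only the ratio q_theta(y')/q_theta(y) is needed, so
# the partition function never has to be evaluated. Sketch only.
metropolis_sample <- function(theta, s, y_init, nsteps) {
  y <- y_init
  for (step in seq_len(nsteps)) {
    y_prop <- y
    j <- sample(length(y), 1)
    y_prop[j] <- 1 - y_prop[j]                    # flip one entry (symmetric proposal)
    log_ratio <- sum(theta * (s(y_prop) - s(y)))  # log q_theta(y') - log q_theta(y)
    if (log(runif(1)) < log_ratio) y <- y_prop    # accept with prob min(1, ratio)
  }
  y
}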

Note then that
$$
\begin{aligned}E_{\theta_0}\left(\frac{q_\theta(Y)}{q_{\theta_0}(Y)}\right) &= \sum_{z \in \Y} \frac{q_\theta(z)}{q_{\theta_0}(z)} f(z|\theta_0) \\&
= \sum_{z\in\Y} \frac{q_\theta(z)}{q_{\theta_0}(z)} \frac{q_{\theta_0}(z)}{k(\theta_0)} \\&
= \frac{1}{k(\theta_0)} \sum_{z} q_\theta(z) \\&
= \frac{k(\theta)}{k(\theta_0)} \sum_{z} \frac{q_\theta(z)}{k(\theta)} \\&
= \frac{k(\theta)}{k(\theta_0)} = \exp B(\theta, \theta_0).\end{aligned}
$$
This means I can estimate $B(\theta, \theta_0)$ via
$$
\hat B(\theta, \theta_0, \{Y_i^*\}) = \log\left(\frac 1m\sum_{i=1}^m \frac{q_\theta(Y_i^*)}{q_{\theta_0}(Y_i^*)}\right) \\
= \log\left(\frac 1m\sum_{i=1}^m \exp\left((\theta - \theta_0)^T s(Y_i^*)\right)\right)
$$
and by the LLN this is a consistent estimator. It has the “log-mean-exp” form, which can overflow or underflow if implemented naively, so it’s worth using a well-written implementation.
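
The usual fix is to pull the maximum out before exponentiating. A minimal version of the idea looks like this (the statnet.common implementation I use below is the more careful one):

# Minimal log-mean-exp: subtracting the max keeps exp() from
# overflowing, and the max is added back at the end.
log_mean_exp_naive <- function(x) {
  m <- max(x)
  m + log(mean(exp(x - m)))
}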

All together, the objective function becomes
$$
g(\theta; \theta_0, y, \{Y_i^*\}) = (\theta - \theta_0)^Ts(y) - \log\left(\frac 1m\sum_{i=1}^m \exp \left((\theta - \theta_0)^Ts(Y_i^*)\right)\right)
$$
so this gives me the following algorithm:

1. initialize $\theta^{(0)}$

2. for $t = 1, \dots, T$ or until convergence:

a. sample $Y_1^*,\dots,Y_m^* \stackrel{\text{iid}}\sim f(\cdot | \theta^{(t-1)})$

b. obtain $\theta^{(t)} = \text{argmax}_{\theta} \;g(\theta; \theta^{(t-1)}, y, \{Y^*_i\})$.

Geyer and Thompson (1992) discuss how the approximation deteriorates for a fixed $m$ as $\|\theta-\theta_0\|$ grows, so they recommend an iterative approach where small improvements are made to each successive $\theta^{(t)}$ rather than trying to fully maximize at each step. This is still justified because the log-likelihood is globally concave (a property of exponential families).

Convergence could be defined by checking whether $\text{max}_{\theta} \,g(\theta; \theta^{(t-1)}, y, \{Y^*_i\}) < \varepsilon$ for some $\varepsilon > 0$. This works because $g(\theta^{(t-1)}; \theta^{(t-1)}, y, \{Y^*_i\}) = 0$, so the maximum of $g$ is non-negative, and if it’s tiny then there can only have been a tiny change between $\theta^{(t)}$ and $\theta^{(t-1)}$. Nevertheless I’m still going to check for convergence the old-fashioned way via $\|\theta^{(t)} - \theta^{(t-1)}\| < \varepsilon$, just because I’m not actually finding the maximum at each step and I want this to be robust to any other departures that I consider.

Logistic regression

I’ll now try this out by fitting a logistic regression with it. I’ll have
$$
f(y|\theta) = \prod_{i=1}^n \theta_i^{y_i}(1-\theta_i)^{1-y_i} \\
= \exp\left(\sum_i y_i \text{ logit}(\theta_i) + \log(1-\theta_i)\right).
$$
Now assuming I have a full column rank $n\times p$ matrix of covariates $X$, I can set $\theta = \frac{1}{1+e^{-X\beta}}$ (elementwise), so now I’ll write this as conditioned on $\beta$. I’ll omit the conditioning on $X$ for simplicity. Note that
$$
\text{logit}\left(\frac{1}{1+e^{-z}}\right) = z
$$
(i.e. those are inverses of each other) so I have $\text{logit}(\theta_i) = x_i^T\beta$. Additionally,
$$
\begin{aligned}\log(1-\theta_i) &= \log\left(1 – \frac{1}{1+e^{-x_i^T\beta}}\right) \\&
= \log\left(\frac{e^{-x_i^T\beta}}{1+e^{-x_i^T\beta}}\right) \\&
= \log\left(\frac{1}{1+e^{x_i^T\beta}}\right) = – \log\left(1+e^{x_i^T\beta}\right)\end{aligned}
$$
which is the negative softplus function. Plugging this all in I get
$$
f(y|\beta) = \exp\left(\sum_i y_i x_i^T\beta – \log\left(1+e^{x_i^T\beta}\right) \right) \\
= \frac{\exp(\beta^TX^Ty)}{\prod_{i=1}^n (1 + e^{x_i^T\beta})}.
$$
This means $s(y) = X^Ty$ which makes sense as the parts of $y$ that matter for this are the inner products of $y$ with each predictor.
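
As a quick sanity check of that form (a throwaway snippet on made-up data, separate from the fit below; the `_chk` names are just for this check), the log of this density should agree with the usual sum of Bernoulli log-densities:

# Check: exponential-family form of the log-likelihood vs. dbinom.
set.seed(2)
x_chk <- matrix(rnorm(10 * 3), 10, 3)
beta_chk <- rnorm(3)
y_chk <- rbinom(10, 1, plogis(x_chk %*% beta_chk))

loglik_expfam <- sum(beta_chk * crossprod(x_chk, y_chk)) -
  sum(log1p(exp(x_chk %*% beta_chk)))
loglik_bern <- sum(dbinom(y_chk, 1, plogis(x_chk %*% beta_chk), log = TRUE))
all.equal(loglik_expfam, loglik_bern)  # TRUE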

As written the denominator is not hard to evaluate but I can do my best to fix that.

Suppose I have
$$
\prod_{i=1}^n (1+e^{a_i}).
$$
If I expand this I’ll get
$$
1 + e^{a_1} + \dots + e^{a_n} + e^{a_1 + a_2} + e^{a_1+a_3} + \dots + \dots + e^{a_1 + \dots + a_n}.
$$
Note that the exponents correspond to every value of $z^Ta$ for $z \in \{0,1\}^n$, so I can write this as

$$
\sum_{z \in \{0,1\}^n} e^{z^Ta}.
$$
Plugging this in to $f$ I get
$$
f(y|\beta) = \frac{\exp\left(\beta^Ts(y)\right)}{\sum\limits_{z \in \Y} \exp\left(\beta^Ts(z)\right)}
$$
which is exactly the form that I was looking for. That is a scary looking sum as written so I will now pretend that I don’t know how to efficiently evaluate it and I’ll use MCMLE to fit this.
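
Before doing that, just to convince myself that the rewrite of the denominator is right, for a tiny $n$ I can enumerate $\{0,1\}^n$ directly:

# Check prod(1 + e^{a_i}) against the sum over all z in {0,1}^n of e^{z'a}.
a <- rnorm(4)
z_all <- as.matrix(expand.grid(rep(list(0:1), length(a))))  # all 2^n binary vectors
all.equal(prod(1 + exp(a)), sum(exp(z_all %*% a)))  # TRUE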

Here’s the generic code to run MCMLE:

require(statnet.common)  # for log_mean_exp

# simulates y via a provided `y_simulator` and returns a closure
# that gives B(theta). `...` is for extra arguments to
# `y_simulator` and `s` and will typically just be the data
# matrix `x`. It is assumed that `y_simulator` returns a list
# of the simulated Y_i^*.
make_Bfunc <- function(theta0, s, nsim, y_simulator, ...) {
  smtx <- vapply(y_simulator(theta0, nsim, ...), s, numeric(length(theta0)), ...)
  function(theta) {
    statnet.common::log_mean_exp(t(theta - theta0) %*% smtx)
  }
}

# I'll actually maximize this
objfunc <- function(theta, theta0, s, y, B, ...) {
  sum((theta - theta0) * s(y, ...)) - B(theta)
}

# `...` is extra arguments to `s` and `y_simulator`.
# `maxit` is for the inner optimizer. `do.trace` prints progress every `do.trace` iterations.
MCMLE <- function(y, theta_init, s, y_simulator, nsim, ..., nsteps,
                  maxit=25, max_theta=1e6, tol=1e-8, do.trace=FALSE) {
  theta_curr <- theta_init
  for(i in 1:nsteps) {
    B <- make_Bfunc(theta_curr, s, nsim, y_simulator, ...)
    opt <- optim(
      theta_curr, objfunc, theta0=theta_curr, s=s, y=y, B=B, ...,
      control=list(fnscale=-1, maxit=maxit)
    )
    theta_new <- opt$par
    if(do.trace > 0 && i %% do.trace == 0) {
      print(i)
      print(opt)
    }
    
    if(any(abs(theta_new) > max_theta)) {
      stop("`theta` is escaping to infinity. Try decreasing `maxit`.")
    }
    if(sum((theta_new - theta_curr)^2) < tol) {
      cat("Converged in", i, "iterations.\n")
      break
    }
    theta_curr <- theta_new
  }
  opt$par
}

And now I’ll apply this to a logistic regression. I’m going to try this on a simulated data set.

# s(y) = X'y, the sufficient statistic for logistic regression
s_logiregr <- function(y, x) {
  t(x) %*% y
}

# draws `nsim` independent response vectors from the model at `beta0`
ysim_logiregr <- function(beta0, nsim, x) {
  replicate(nsim, rbinom(nrow(x), 1, plogis(x %*% beta0)), simplify = FALSE)
}

set.seed(1)
n <- 100
p <- 5
x <- matrix(rnorm(n * p), n, p)
betas <- runif(p, -2, 2)
y <- rbinom(n, 1, plogis(x %*% betas))


beta_hat_mcmle <- MCMLE(
  y=y, theta_init=sample(c(-.1, .1), ncol(x), TRUE),
  s=s_logiregr, y_simulator=ysim_logiregr,
  nsim = 5000, x=x, nsteps=100, maxit=30
)

beta_hat_glm <- glm(y~x-1, family=binomial())$coef

round(rbind(beta_hat_mcmle, beta_hat_glm), 3)

This converged in 9 iterations and took just a second or two on my laptop. The results are below.

$$
\begin{array}{c|ccccc}
& 1 & 2 & 3 & 4 & 5 \\ \hline
\hat\beta_{\text{MCMLE}} & 0.163 & 0.984 & -0.491 & 1.845 & -2.126 \\
\hat\beta_{\text{GLM}} & 0.167 & 0.979 & -0.483 & 1.837 & -2.140
\end{array}
$$

So there it is. That’s just a super simple example but it’s neat to be able to replicate a trusted result like that from glm with this new procedure.

In my next post on the subject, I’ll generalize a lot of this because it turns out that this can all be carried out in a general exponential family and the discreteness of $y$ doesn’t actually matter. So I can actually use MCMLE to fit any GLM, not just a logistic regression, and I can do more complicated things like account for dependence between observations.
