Products of exponential family densities

$\newcommand{\X}{\mathcal X}\newcommand{\one}{\mathbf 1}$In this post I’m going to look at the following: suppose I have a random variable $X$ supported on $\X\subseteq \mathbb R$ with a density $f$ w.r.t. some dominating measure $\nu$ (probably the Lebesgue or counting measure), and
$$
f \propto \prod_{i=1}^K f_i
$$
where the $f_i$ are other densities w.r.t. the same dominating measure. I’ll be assuming that the $f_i$ are all exponential family members.

If I had independent RVs $X_1,\dots,X_K$ then their joint density would be $f(x) = \prod_i f_i(x_i)$, but here I’m considering a product of the same form evaluated at a single variable. One consequence is that this product is not guaranteed to be normalized, or even integrable, so care will be needed. I’ll use $f_u = \prod_i f_i$ for the unnormalized density and $f$ for the normalized one, so
$$
f = \frac{f_u}{\int_\X f_u\,\text d\nu}
$$
when the denominator is finite.

Motivating examples

For the examples that follow, I won’t always explicitly note the support but it does matter.

Example 1: I’ll begin with
$$
\begin{aligned}f_u(x;\lambda, p) &= \text{Pois}(x;\lambda) \cdot \text{Geo}(x; p) \\&
= \frac{\lambda^xe^{-\lambda}}{x!}\cdot (1-p)p^x \\&
= \frac{(p\lambda)^x}{x!} e^{-\lambda}(1-p).\end{aligned}
$$
Normalizing, I’ll sum over the intersection of the supports which is all of $\mathbb N$ so
$$
\sum_{x=0}^\infty\frac{(p\lambda)^x}{x!} e^{-\lambda}(1-p) = e^{-\lambda}(1-p)e^{p\lambda}.
$$
This means
$$
f(x; \lambda, p) = \frac{(p\lambda)^x}{x!}e^{-p\lambda} = \text{Pois}(x; p\lambda).
$$
This is very interesting: I ended up with a distribution that’s of the form of one of the ones I started with, and all dependence on $p$ and $\lambda$ separately is lost. I only need the new parameter $p\lambda$.
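As a quick numerical sanity check, here’s a sketch of this in Python (the details are my own assumptions: scipy has no pmf matching the $\text{Geo}(x;p) = (1-p)p^x$ parametrization above, so I write it out directly, and the infinite support is truncated at a point where the remaining tail mass is negligible):

```python
import numpy as np
from scipy.stats import poisson

lam, p = 3.0, 0.4
x = np.arange(200)                           # truncated support; the tail mass here is negligible

f_u = poisson.pmf(x, lam) * (1 - p) * p**x   # Pois(x; lam) * Geo(x; p)
f = f_u / f_u.sum()                          # normalize over the (truncated) support

assert np.allclose(f, poisson.pmf(x, p * lam))   # f is Pois(x; p * lam)
```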

Example 2: Next I’ll try
$$
\begin{aligned}f_u(x;p,q) &= \text{Bin}(x; n, p) \cdot \text{Geo}(x; q) \\&
= {n\choose x} \one_{x\in\{0,\dots,n\}}p^x(1-p)^{n-x} \cdot (1-q)q^x.\end{aligned}
$$
I’ll put this in exponential family form so
$$
f_u(x; p,q)= {n\choose x}\one_{x\in\{0,\dots,n\}}\exp\left[x \log\left(\frac{pq}{1-p}\right) + n\log(1-p) + \log(1-q)\right].
$$
I’ll let $\psi = \log\left(\frac{pq}{1-p}\right)$. To normalize $f_u$ I need to sum over $\{0,\dots,n\}$ so
$$
\sum_{x=0}^nf_u(x;p,q) = \exp\left(n\log(1-p) + \log(1-q)\right)\sum_x{n\choose x} e^{x\psi}.
$$
The sum here is exactly the result of the binomial theorem applied to $(e^\psi + 1)^n$ so all together I have
$$
f(x;p,q) = {n\choose x}\one_{x\in\{0,\dots,n\}}\exp\left(x\psi - n \log (e^\psi + 1)\right)
$$
which is exactly the exponential family form of a $\text{Bin}(n, \text{logit}^{-1}(\psi))$ density. Again, the normalizing constant removed all dependence on $p$ and $q$ that wasn’t expressible just via $\psi$ and I ended up with a distribution that’s of the form of one of the ones I started with.
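The same kind of numerical check works here (again a sketch, with the geometric pmf written out by hand):

```python
import numpy as np
from scipy.stats import binom

n, p, q = 10, 0.3, 0.6
x = np.arange(n + 1)

f_u = binom.pmf(x, n, p) * (1 - q) * q**x    # Bin(x; n, p) * Geo(x; q)
f = f_u / f_u.sum()

psi = np.log(p * q / (1 - p))
p_new = 1 / (1 + np.exp(-psi))               # logit^{-1}(psi)
assert np.allclose(f, binom.pmf(x, n, p_new))
```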

Example 3: I’ll do one more discrete example:
$$
\begin{aligned}f_u(x; \lambda, p) &=\text{Pois}(x; \lambda)\cdot\text{Bin}(x; n,p) \\&
= \frac{\lambda^xe^{-\lambda}}{x!}\cdot{n\choose x}p^x(1-p)^{n-x}\one_{x\in\{0,\dots,n\}}.\end{aligned}
$$
Again putting this in exponential form,
$$
f_u(x; \lambda, p) = \frac 1{x!}{n\choose x}\one_{x\in\{0,\dots,n\}} \exp\left[x\log\left(\frac{p\lambda}{1-p}\right) - \lambda + n\log(1-p)\right].
$$
I can see that the normalizing constant, which is a finite sum so I don’t need to worry about convergence, will cancel out the $-\lambda + n\log(1-p)$ term in the exponential so I’ll omit that. Then I have
$$
A(\psi) := \sum_{x=0}^n \frac{1}{x!}{n\choose x}e^{x\psi}
$$
where now $\psi = \log\left(\frac{p\lambda}{1-p}\right)$. This means
$$
f(x; \lambda, p) = \frac{1}{x!}{n\choose x} \one_{x\in\{0,\dots,n\}}\exp\left[x\psi – \log A(\psi)\right]
$$
but unfortunately $A(\psi)$ does not have a simple recognizable form and so this distribution is neither Poisson nor Binomial. But it is an exponential family, so at this point it really seems like there’s some kind of closure property here for exponential families.
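Even without a recognizable form, $A(\psi)$ is a finite sum and so is easy to evaluate by brute force (a sketch):

```python
import numpy as np
from scipy.special import comb, factorial

n, lam, p = 10, 3.0, 0.4
x = np.arange(n + 1)
psi = np.log(p * lam / (1 - p))

A = np.sum(comb(n, x) / factorial(x) * np.exp(x * psi))       # brute-force A(psi)
f = comb(n, x) / factorial(x) * np.exp(x * psi - np.log(A))
assert np.isclose(f.sum(), 1.0)   # a valid pmf, even though it isn't a named one
```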

Example 4: I’ll do one example with continuous densities before I give some more general results. Let
$$
\begin{aligned}f_u(x; \mu, \lambda) &= \mathcal N(x; \mu, 1)\cdot\text{Exp}(x;\lambda) \\&
= \frac{\lambda}{\sqrt{2\pi}}\exp\left(-\frac 12 \left[(x-\mu)^2 + 2\lambda x\right]\right)\mathbf 1_{x\geq 0}.\end{aligned}
$$
Expanding and completing the square inside the exponential,
$$
\begin{aligned}(x-\mu)^2 + 2\lambda x &= x^2 - 2(\mu - \lambda)x + \mu^2 \\&
= (x - \theta)^2 - \theta^2 + \mu^2\end{aligned}
$$
where $\theta = \mu-\lambda$. Then
$$
f_u(x; \mu,\lambda) = \frac{\lambda}{\sqrt{2\pi}} e^{-\frac 12(\mu^2-\theta^2)}\exp\left(-\frac 12(x-\theta)^2\right)\mathbf 1_{x\geq 0}
$$
so my normalized density is
$$
f(x;\theta) = \frac{\exp\left(-\frac 12(x-\theta)^2\right)\mathbf 1_{x\geq 0}}{\int_0^\infty e^{-(t-\theta)^2/2}\,\text dt}
$$
and the denominator is $\sqrt{2\pi}\,\Phi(\theta)$, where $\Phi$ is the standard normal CDF (essentially the error function), which isn’t expressible in terms of elementary functions, so this is the tidiest it can be. This isn’t Gaussian or Exponential but is a Gaussian truncated to $[0,\infty)$, which I think is a bit surprising: the $e^{-\lambda x}$ part of the $\text{Exp}(\lambda)$ density merely shifted the Gaussian, and it’s the support that truncated it.
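Here scipy’s truncnorm provides a reference density to check against (a sketch: trapezoidal integration on a finite grid stands in for the exact $\sqrt{2\pi}\,\Phi(\theta)$ normalizer, which is fine since the tail mass beyond the grid is negligible):

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm, expon, truncnorm

mu, lam = 1.0, 0.5
theta = mu - lam
x = np.linspace(0, 10, 2001)

f_u = norm.pdf(x, mu, 1) * expon.pdf(x, scale=1 / lam)   # N(x; mu, 1) * Exp(x; lam)
f = f_u / trapezoid(f_u, x)                              # numerical normalization

# truncnorm takes its truncation bounds in standardized units: a = (0 - theta) / 1
assert np.allclose(f, truncnorm.pdf(x, a=-theta, b=np.inf, loc=theta, scale=1), atol=1e-4)
```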

General Exponential Families

For $i=1,\dots,K$ I’ll consider each $f_i$ to be an exponential family member with natural parameter $\theta_i\in\Theta_i\subseteq\mathbb R^{d_i}$, sufficient statistic $s_i : \X\to\mathbb R^{d_i}$, scaling function $h_i : \X\to[0,\infty)$, and log-partition function $A_i : \Theta_i\to\mathbb R$. Thus
$$
f_i(x; \theta_i) = h_i(x)\exp(\langle s_i(x), \theta_i\rangle – A_i(\theta_i))
$$
so
$$
f_u(x; \theta) = \left(\prod_{i=1}^K h_i(x)\right)\exp\left[\sum_{i=1}^K \langle s_i(x), \theta_i\rangle -\sum_{i=1}^K A_i(\theta_i)\right].
$$
I’ll use $H = \prod_i h_i$ so the normalization constant is
$$
\int_{\X} \exp\left[\sum_{i=1}^K \langle s_i(x), \theta_i\rangle -\sum_{i=1}^K A_i(\theta_i)\right] H(x)\,\text d\nu(x).
$$
Assuming for now that this integral converges, I’ll have the terms with the $A_i(\theta_i)$ cancel out from the numerator when I divide by this (which reflects what I was seeing in my examples before) so
$$
f(x;\theta) = H(x)\exp\left(\sum_{i=1}^K \langle s_i(x), \theta_i\rangle - \zeta(\theta_1,\dots,\theta_K)\right)
$$
with
$$
\zeta(\theta_1,\dots,\theta_K) = \log \int_{\X}\exp\left[\sum_{i=1}^K \langle s_i(x), \theta_i\rangle \right] H(x)\,\text d\nu(x)
$$
giving my new log-partition function. Since
$$
\sum_{i=1}^K \langle s_i(x), \theta_i\rangle = \sum_{ij} s_{ij}(x)\theta_{ij}
$$
this is still an inner product between a sufficient statistic
$$
\begin{aligned}&S : \X \to \mathbb R^D \\&
S(x) = (s_1(x)\mid \dots \mid s_K(x))\end{aligned}
$$
with $D = \sum_i d_i$ and an enlarged parameter $(\theta_1\mid \dots\mid\theta_K)$. That means $f$ is still an exponential family member, although it may not be full rank (i.e. the new enlarged parameter space may not contain an open set, e.g. if some of the elements of a $\theta_i$ are shared between different $f_j$).
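To make the stacking concrete, here’s a minimal sketch of the enlarged representation (the function and argument names are mine, not any standard API): each factor contributes its sufficient statistic and natural parameter, and $\log f_u$ is just $\log H(x)$ plus one big inner product.

```python
import numpy as np

def log_f_u(x, stats, thetas, log_hs):
    """Unnormalized log density, up to the additive constant -sum_i A_i(theta_i)."""
    S = np.concatenate([np.atleast_1d(s(x)) for s in stats])     # S(x) = (s_1(x) | ... | s_K(x))
    theta = np.concatenate([np.atleast_1d(t) for t in thetas])   # (theta_1 | ... | theta_K)
    return sum(lh(x) for lh in log_hs) + S @ theta               # log H(x) + <S(x), theta>
```

For Example 1, say, both entries of `stats` would be the identity and the two `log_hs` would be $x\mapsto -\log x!$ and $x\mapsto 0$.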

To show that I’m not unnecessarily concerned with the integral being finite, suppose I have
$$
f_1 = f_2 = \dots = f_K = \text{Beta}\left(\frac 12, 1\right)
$$
(and these are exponential family members) so
$$
f_u(x) = \frac{x^{-K/2}}{\text B(1/2, 1)^K} = 2^{-K} x^{-K/2}.
$$
Then
$$
\int_0^1 f_u(x)\,\text dx = 2^{-K}\int_0^1 x^{-K/2}\,\text dx.
$$
This integral diverges for $K\geq 2$ so $f_u$ cannot be normalized to a proper density here.
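Numerically, you can watch the would-be normalizer blow up as the lower limit of integration approaches $0$ (a sketch with $K=2$, where the integral grows like $-\frac14\log\varepsilon$):

```python
from scipy.integrate import quad

K = 2
for eps in [1e-2, 1e-4, 1e-8]:
    val, _ = quad(lambda t: 2.0**-K * t**(-K / 2), eps, 1)
    print(eps, val)   # 1.15..., 2.30..., 4.60...: no finite limit as eps -> 0
```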

Result 1: if the $f_i$ are all bounded almost everywhere then $\int f_u\,\text d\nu < \infty$, i.e. $f_u$ can be normalized to a valid density (the integral of a non-negative measurable function is defined as the $\sup$ of the integrals of non-negative simple functions below it, and this $\sup$ always exists in $[0,\infty]$, so I only have to worry about the integral of $f_u$ being finite as opposed to not existing).

Pf: Let $B_i = \{x\in\X : f_i(x) > 1\}$ and take $B = \cup_i B_i$ (the $f_i$ being Borel means these sets are all measurable).

I’ll use the fact that $\int_\X f_u\,\text d\nu = \int_{B}f_u\,\text d\nu + \int_{B^c}f_u\,\text d\nu$ so if I show each piece is finite then the integral as a whole is as well.

On $B^c$ all of the $f_i$ are at most $1$ so
$$
f_u(x) = \prod_{i=1}^Kf_i(x) \leq f_1(x)
$$
therefore
$$
\int_{B^c} f_u\,\text d\nu \leq \int_{B^c} f_1\,\text d\nu \leq 1
$$
since $f_1$ is a valid density w.r.t. $\nu$.

Now for $B$, I know the $f_i$ are almost everywhere bounded so for each $i$ there is some finite positive $M_i$ such that $M_i \geq f_i$ a.e.-$\nu$. Then
$$
f_u = \prod_i f_i \leq \prod_i M_i \;\;\text{a.e.-}\nu
$$
on $B$ therefore
$$
\int_B f_u\,\text d\nu \leq \nu(B)\prod_i M_i.
$$
I can use the union bound to get
$$
\nu(B) = \nu\left(\bigcup_i B_i\right) \leq \sum_i \nu(B_i).
$$
Since $f_i > 1$ on each $B_i$, I have
$$
1 \geq \int_{B_i}f_i\,\text d\nu \geq \int_{B_i}1\,\text d\nu = \nu(B_i)
$$
so each $\nu(B_i) \leq 1$, and since there are only $K < \infty$ terms in that sum, it must be finite. This shows that $\nu(B) \prod_i M_i < \infty$, and I’ve now bounded both pieces of $\int_\X f_u\,\text d\nu$, which means it is finite.

$\square$

One consequence of this is that if $\nu$ is the counting measure, so all densities involved are pmfs, then they are always bounded and I’ll always be able to turn $f_u$ into a valid pmf. (This could have been shown directly using the fact that pmfs always satisfy $f_i \leq 1$, therefore $f_u \leq f_1$ everywhere; in the language of Result 1, $B = \emptyset$.) But this result also applies to other distributions, which means that many of the standard exponential family distributions will have normalizable products.
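This suggests a generic recipe for the pmf case, which the discrete examples above all follow (a sketch; for infinite supports the grid truncation is an assumption, justified because $f_u \leq f_1$ forces the tail mass of $f_u$ to vanish at least as fast as $f_1$’s):

```python
import numpy as np

def normalize_pmf_product(pmfs, grid):
    """Renormalize a product of pmfs evaluated on a common (possibly truncated) grid."""
    f_u = np.ones(len(grid))
    for pmf in pmfs:
        f_u *= pmf(grid)
    return f_u / f_u.sum()
```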

Sufficient conditions for closure

I’ll finish by looking at when $f$ is the same as one of the densities I started with. If the $f_i$ are exponential family members and $f_u$ can be normalized, then as before I have
$$
f(x; \theta) = H(x)\exp\left(\sum_{i=1}^K \langle s_i(x), \theta_i\rangle - \zeta(\theta_1,\dots,\theta_K)\right).
$$
I need $f$’s parameter to have the same dimension as one of the inputs so I’ll assume that all the parameter dimensions are the same, i.e. $d_1=\dots=d_K=d$. If the sufficient statistics are the same function $s$ for each of the $f_i$ I’ll have
$$
\sum_i \langle s_i(x), \theta_i\rangle = \left\langle s(x), \sum_i \theta_i\right\rangle
$$
by the bilinearity of $\langle \cdot, \cdot\rangle$. I’ll let $\psi = \sum_i \theta_i\in\mathbb R^d$. I also have
$$
\zeta(\theta_1,\dots,\theta_K) = \log \int_\X \exp\left(\langle s(x), \psi\rangle\right)H(x)\,\text d\nu(x)
$$
which actually only depends on $\psi$ now so I can write
$$
f(x; \psi) = H(x)\exp\left( \langle s(x), \psi\rangle - \zeta(\psi)\right).
$$
If additionally $H = h_i$ for some $i$ then $f$ will be of the same form as $f_i$ except with a new parameter.
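A clean instance of this: two geometrics share the sufficient statistic $s(x)=x$ and both have $h_i = 1$ on $\mathbb N$, so the natural parameters $\log p$ and $\log q$ simply add and the product normalizes to $\text{Geo}(x; pq)$. A quick numerical sketch:

```python
import numpy as np

p, q = 0.6, 0.7
x = np.arange(500)                        # truncated support; (p * q)**500 is negligible

f_u = (1 - p) * p**x * (1 - q) * q**x     # Geo(x; p) * Geo(x; q)
f = f_u / f_u.sum()

assert np.allclose(f, (1 - p * q) * (p * q)**x)   # Geo(x; p * q)
```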

Examples 1 and 2 showed this: in both cases the sufficient statistics for both densities involved are the identity function and both have dimension $1$. Furthermore, one of the $h_i$ was the function $x\mapsto 1$ so $H$ was just the other $h_i$, and the product $f$ had the same form as the density with the non-constant $h_i$.

Example 3 showed that even though the sufficient statistics were the same and had the same dimension, $H$ was not the same as one of the $h_i$ and correspondingly I did not end up with the same distribution as one of the $f_i$.

Finally, Example 4 is subtler. With the variance fixed at $1$, the Gaussian’s $-x^2/2$ term carries no free parameter, so it belongs in its $h_i$, giving $h_i(x) = e^{-x^2/2}/\sqrt{2\pi}$ with sufficient statistic $x$; the $\text{Exp}(\lambda)$ distribution’s $h_i$ is just the indicator function for $[0,\infty)$, also with sufficient statistic $x$. The sufficient statistics therefore match, and indeed the combined natural parameter was $\theta = \mu - \lambda$, but $H$ equals neither $h_i$, so the end result was neither distribution.
