Linear Regression with Functions

$\newcommand{\X}{\mathcal X}$$\newcommand{\dl}{\,\text d\lambda}$$\newcommand{\dx}{\,\text dx}$$\newcommand{\one}{\mathbf 1}$I’m going to take a look at functional linear regression, where I’ll find the best linear (technically affine) approximation to a target function $f : \X \to\mathbb R$. I’ll use $\X = [0,1]^p$ for the domain throughout so I’m only considering functions with bounded support. The setting of this will be $L^2(\X, \mathcal B_{\X}, \lambda)$, i.e. I’m considering Borel functions from $[0,1]^p$ to $\mathbb R$ that are square integrable w.r.t. the Lebesgue measure.

“Best” will be in the sense of the norm on my $L^2$ space, so for an affine $h$ my (squared) loss is
$$
\|f - h\|^2 = \int_{\X}(f-h)^2\dl.
$$
I can expand this to get a helpful form:
$$
\int_{\X}(f-h)^2\dl = \int f^2\dl - 2 \int fh\dl + \int h^2\dl \\
= \|f\|^2 - 2\langle f,h\rangle + \|h\|^2.
$$
$f\in L^2$ by assumption and $h\in L^2$ for any affine $h$ (given that $\X$ is bounded) so all these integrals are well-defined. $\|f\|^2$ is constant so the only things that matter here are the agreement between $f$ and $h$, as given by $\langle f,h\rangle$, and the magnitude of $h$ as given by $\|h\|^2$.

$h$ is affine so I’ll write it as $h(x) = \beta_0 + \beta^Tx$.

Before anything else, I can work on $\|h\|^2$ since I always know how that will behave.

$$
\|h\|^2 = \int_\X (\beta_0 + \beta^Tx)^2\dl(x) \\
= \int_\X (\beta_0^2 + 2\beta_0\beta^Tx + \beta^Txx^T\beta) \dl(x).
$$
$\lambda(\X) = 1$ so, using the linearity of integration, I have
$$
\|h\|^2 = \beta_0^2 + 2 \beta_0 \int_\X \beta^Tx\dx + \int_\X \beta^Txx^T\beta\dx.
$$
I could write $\beta^Tx$ and $\beta^Txx^T\beta$ in terms of scalars and distribute the integral, but instead I’ll note that because $\lambda(\X) = 1$ I can view these as expectations involving a $X\sim \text{Unif}(\X)$ random variable. This means I have
$$
\|h\|^2 = \beta_0^2 + 2\beta_0 \text E(\beta^TX) + \text E(\beta^TXX^T\beta) \\
= \beta_0^2 + 2\beta_0 \beta^T\text E(X) + \beta^T\text E(XX^T)\beta.
$$
$\text E X = \frac 12 \one$ and
$$
\text E(XX^T)_{ij} = \begin{cases}\frac 14 & i\neq j \\ \frac 13 & i = j \end{cases}
$$
so all together
$$
\|h\|^2 = \beta_0^2 + \beta_0 \beta^T\one + \beta^TA\beta.
$$
where I’ve let $A = \frac 14 \one\one^T + \frac 1{12} I$. This is a strictly convex parabola in $\beta_0$, and $A$ is PD so it’s also strictly convex in $\beta$ which is good news for me.
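
As a quick numerical sanity check of this closed form (just an illustrative sketch, not part of the derivation; the dimension $p$, the coefficients, and the Monte Carlo sample size below are arbitrary choices), the formula can be compared against a direct Monte Carlo estimate of the integral:

```python
# Monte Carlo check of ||h||^2 = beta0^2 + beta0 * beta'1 + beta' A beta
# (illustrative sketch; p, beta0, beta, and the sample size are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
p = 3
beta0, beta = 1.5, np.array([0.2, -1.0, 0.7])

# A = (1/4) 11' + (1/12) I
A = np.full((p, p), 0.25) + np.eye(p) / 12
closed_form = beta0**2 + beta0 * beta.sum() + beta @ A @ beta

# Monte Carlo estimate of the integral of (beta0 + beta'x)^2 over [0,1]^p
x = rng.random((1_000_000, p))
mc_estimate = np.mean((beta0 + x @ beta) ** 2)

print(closed_form, mc_estimate)  # should agree to a few decimal places
```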

Now for the other piece:
$$
\langle f, h\rangle = \int_\X f(x) (\beta_0 + \beta^Tx)\dl(x) \\
= \beta_0 \langle f, \one\rangle + \beta^T\text E[Xf(X)]
$$
where $Xf(X) \in \mathbb R^p$ is the vector whose $i$th element is $X_if(X)$, so $\text E[Xf(X)]$ has entries $\text E[X_if(X)] = \int_\X x_i f(x)\dx$.

The constant function is square-integrable over $\X$ so $\one \in L^2$ and therefore $\langle f, \one\rangle$ is well-defined. Furthermore, the functions $x\mapsto x_i$ are also in $L^2$ so the elements of $\text E(Xf(X))$ are also inner products of two elements of $L^2$ and so that’s all good too.
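
A similar sketch (again not part of the argument; the test function below is an arbitrary square-integrable choice, as are $p$ and the coefficients) confirms that $\langle f, h\rangle$ splits into these two pieces:

```python
# Check that <f, h> = beta0 <f, 1> + beta' E[X f(X)] for an arbitrary test f
# (illustrative only; f, p, and the coefficients are made-up choices).
import numpy as np

rng = np.random.default_rng(1)
p = 3
beta0, beta = 0.5, np.array([1.0, -2.0, 0.3])
f = lambda x: np.sin(x.sum(axis=1))  # any square-integrable f works here

x = rng.random((1_000_000, p))
fx = f(x)
lhs = np.mean(fx * (beta0 + x @ beta))                         # <f, h> by Monte Carlo
rhs = beta0 * fx.mean() + beta @ (x * fx[:, None]).mean(axis=0)
print(lhs, rhs)  # should match up to Monte Carlo error
```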

Putting this all together, I have my objective function as
$$
g(\beta, \beta_0) = \beta_0^2 + \beta_0 \beta^T\one + \beta^TA\beta - 2 \beta_0 \langle f, \one\rangle - 2 \beta^T\text E[Xf(X)]
$$
(excluding $\|f\|^2$ which doesn’t depend on my parameters).

Differentiating leads me to
$$
\frac{\partial g}{\partial \beta} = \beta_0\one - 2 \text E[Xf(X)] + 2 A\beta \\
\frac{\partial g}{\partial \beta_0} = 2\beta_0 + \beta^T\one - 2\langle f, \one\rangle
$$
and then for the Hessian I have
$$
\frac{\partial^2 g}{\partial \beta^2} = 2A \\
\frac{\partial^2 g}{\partial \beta\partial \beta_0} = \one \\
\frac{\partial^2 g}{\partial \beta_0^2} = 2
$$
so
$$
H = \left(\begin{array}{c|c}\frac 12 \one\one^T + \frac 1{6} I & \one \\ \hline \one^T & 2\end{array}\right)
$$

Result: $H$ is strictly PD.

Pf: Consider $\det(H-\lambda I)$. $H-\lambda I$ is a $2\times 2$ block matrix so I can use the result that, for arbitrary matrices $A$, $B$, $C$, and $D$,
$$
\det \left(\begin{array}{cc} A & B \\ C & D \end{array}\right) = \det D \cdot \det(A - BD^{-1}C)
$$
when $D$ is invertible to get
$$
\det (H-\lambda I) = (2-\lambda) \det\left(2A - \lambda I - (2-\lambda)^{-1}\one\one^T\right)
$$
under the assumption that $\lambda \neq 2$ ($A$ is now my specific matrix again).

Inside the determinant I have
$$
2A - \lambda I - (2-\lambda)^{-1}\one\one^T = \left(\frac 16 - \lambda \right) I + \left[\frac 12 - (2-\lambda )^{-1}\right]\one\one^T.
$$
Further assuming $\lambda\neq\frac 16$, this is a rank 1 update to an invertible matrix so by the matrix determinant lemma I have
$$
\begin{aligned}&\det \left(2A - \lambda I - (2-\lambda)^{-1}\one\one^T\right) \\&
= \left(\frac 16 - \lambda \right)^{p} \left[1 + \left(\frac 16 - \lambda \right)^{-1}(2^{-1} - (2-\lambda)^{-1})p\right] \\&
= -\frac 16 \left(\frac 16 - \lambda \right)^{p-1} \left[6\lambda - 3p - 1 + \frac{6p}{2-\lambda}\right]\\&
= -\frac 16 (2-\lambda)^{-1}\left(\frac 16 - \lambda \right)^{p-1} \left[(6\lambda - 3p - 1)(2-\lambda) + 6p\right].\end{aligned}
$$
The $(2-\lambda)^{-1}$ cancels so all together I have
$$
\det(H-\lambda I) \propto (6\lambda - 1)^{p-1}\left[(6\lambda - 3p - 1)(2-\lambda) + 6p\right].
$$
This gives me $p-1$ eigenvalues of $\frac 16$, which are positive (the assumptions $\lambda\neq 2$ and $\lambda\neq\frac 16$ are harmless: both sides are polynomials in $\lambda$, so the identity extends to all $\lambda$). The only worry is the quadratic in brackets. Expanding and combining terms, I have
$$
(6\lambda - 3p - 1)(2-\lambda) + 6p = -6\lambda^2 + (13 + 3p)\lambda - 2
$$
which equals zero when
$$
\lambda = \frac{13 + 3p \pm\sqrt{(13+3p)^2-48}}{12}.
$$
$p>0$ and
$$
(13+3p)^2 > (13+3p)^2 - 48 \\
\implies 13+3p > \sqrt{(13+3p)^2-48}
$$
(and the square root is always defined because even with $p=1$ I have $16^2>48$). Thus every eigenvalue of $H$ is positive so $H$ is PD, and the problem is convex in $(\beta, \beta_0)$.

$\square$
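
To double-check the eigenvalue bookkeeping (my own numerical verification, not part of the proof), I can build $H$ for a few values of $p$ and compare its spectrum to the predicted $p-1$ copies of $\frac16$ plus the two roots of the quadratic above:

```python
# Verify: eig(H) = {1/6 with multiplicity p-1} together with the roots of
# -6λ² + (13+3p)λ - 2, and all eigenvalues are positive (check for a few p).
import numpy as np

for p in (1, 2, 5, 20):
    A = np.full((p, p), 0.25) + np.eye(p) / 12
    H = np.block([[2 * A, np.ones((p, 1))],
                  [np.ones((1, p)), np.array([[2.0]])]])
    eig = np.sort(np.linalg.eigvalsh(H))

    b = 13 + 3 * p                                    # quadratic is -6λ² + bλ - 2
    roots = (b + np.array([-1.0, 1.0]) * np.sqrt(b**2 - 48)) / 12
    predicted = np.sort(np.concatenate([np.full(p - 1, 1 / 6), roots]))
    print(p, np.allclose(eig, predicted), eig.min() > 0)   # expect: True True
```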

Returning to my derivatives, I need to solve
$$
\frac{\partial g}{\partial \beta} = \beta_0\one - 2 \text E[Xf(X)] + 2 A\beta \\
\frac{\partial g}{\partial \beta_0} = 2\beta_0 + \beta^T\one - 2\langle f, \one\rangle
$$
for zero. Solving $\frac{\partial g}{\partial \beta_0} = 0$ for $\hat\beta_0$ gives me
$$
\hat\beta_0 = \langle f,\one\rangle - \frac 12 \hat\beta^T\one
$$
and plugging this in to $\frac{\partial g}{\partial \beta}$ gives me
$$
\begin{aligned}&\langle f,\one\rangle\one - \frac 12 \one\one^T\hat\beta - 2 \text E[Xf(X)] + 2 A\hat\beta = 0\\&
\implies \left(2A - \frac 12 \one\one^T\right)\hat\beta = 2 \text E[Xf(X)] - \langle f,\one\rangle\one .\end{aligned}
$$
$$
2A - \frac 12 \one\one^T = \frac 16 I
$$
so all together this means
$$
\hat\beta = 12 \text E[Xf(X)] - 6\langle f,\one\rangle\one
$$
and
$$
\hat\beta_0 = (1+3p)\langle f, \one\rangle - 6 \cdot\one^T\text E[Xf(X)] .
$$
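
These formulas only involve $\langle f,\one\rangle$ and the $p$ numbers $\text E[X_if(X)]$, so they're easy to approximate for any given $f$. Here's a small helper that does this by Monte Carlo (a sketch only: the function name, defaults, and sampling approach are my own choices, and a quadrature rule would work just as well):

```python
# Approximate the closed-form coefficients
#   beta_hat  = 12 E[X f(X)] - 6 <f, 1> 1
#   beta0_hat = (1 + 3p) <f, 1> - 6 1'E[X f(X)]
# by Monte Carlo over Unif([0,1]^p). Sketch only; names and defaults are arbitrary.
import numpy as np

def best_affine(f, p, n=1_000_000, seed=0):
    x = np.random.default_rng(seed).random((n, p))
    fx = f(x)
    mean_f = fx.mean()                      # estimate of <f, 1>
    mean_xf = (x * fx[:, None]).mean(0)     # estimate of E[X f(X)]
    beta = 12 * mean_xf - 6 * mean_f * np.ones(p)
    beta0 = (1 + 3 * p) * mean_f - 6 * mean_xf.sum()
    return beta0, beta

# e.g. the constant function f ≡ 1 should give beta0 ≈ 1 and beta ≈ 0
print(best_affine(lambda x: np.ones(len(x)), p=3))
```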

Examples

One quick example is that if $f \stackrel{\text{ae}}= 0$ then $\langle f, h\rangle = 0$ so $\hat h$ will be the affine function that minimizes $\|h\|^2$, i.e. $\hat h = \mathbf 0$.

More interestingly, let $f(x) = x^Tx$. Then
$$
\langle f, \one\rangle = \int_\X x^Tx\dx = \text E(X^TX) = p \text E(X_1^2) = \frac p3.
$$
Furthermore,
$$
\begin{aligned}\text E(XX^TX)_i &= \int x_i x^Tx\dx = \int \sum_{j=1}^px_ix_j^2\dx \\&
= \int x_i^3\dx + \sum_{j\neq i} \int x_j^2x_i\dx \\&
= \frac 14 + \frac{p-1}{6}\end{aligned}
$$
and this does not depend on $i$. Overall this means
$$
\hat\beta = 12\left(\frac 14 + \frac{p-1}{6}\right) \one - 2p\one \\
= \one
$$
and
$$
\hat\beta_0 = (1+3p)\frac p3 - \one^T\left( p+\frac 12\right)\one \\
= \frac p3 + p^2 - p^2 - \frac p2 \\
= -\frac p6
$$
so all together the best affine approximation to $f(x) = x^Tx$ is
$$
\hat h(x) = \one^Tx - \frac p6.
$$
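
Numerically (again just a sanity check, with an arbitrary $p$ and sample size), the Monte Carlo coefficients for $f(x) = x^Tx$ do come out as $\hat\beta\approx\one$ and $\hat\beta_0\approx -p/6$:

```python
# Check the x'x example: expect beta_hat ≈ (1, ..., 1) and beta0_hat ≈ -p/6.
import numpy as np

p = 4
x = np.random.default_rng(2).random((2_000_000, p))
fx = (x**2).sum(axis=1)                         # f(x) = x'x
mean_f = fx.mean()                              # ≈ p/3
mean_xf = (x * fx[:, None]).mean(axis=0)        # each entry ≈ 1/4 + (p-1)/6
print(12 * mean_xf - 6 * mean_f)                # ≈ (1, 1, 1, 1)
print((1 + 3 * p) * mean_f - 6 * mean_xf.sum(), -p / 6)   # both ≈ -p/6
```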

As a final example, suppose $f(x) = \one^Tx$ so $f$ is itself affine. I should get $\hat h = f$.

I end up with $\langle f, \one\rangle = \frac p2$ and
$$
\text E(Xf(X)) = \text E(XX^T)\one = A\one = \left(\frac p4 + \frac 1{12}\right)\one
$$
so all together
$$
\hat\beta_0 = (1+3p)\frac p2 - 6 \cdot\one^T\left(\frac p4 + \frac 1{12}\right)\one \\
= \frac p2 + \frac 32 p^2 - \frac 32 p^2 - \frac p2 = 0
$$
and
$$
\hat\beta = \left(3p + 1\right)\one - 3p\one = \one
$$
so I do indeed have $\hat h(x) = x^T\one$.
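
The exact recovery also shows up in the loss: for the affine $f$ the fitted $\hat h$ drives $\|f-\hat h\|^2$ to (numerically) zero, while for $f(x)=x^Tx$ it can't. A quick check along the lines of the sketches above ($p$ and sample size arbitrary):

```python
# Compare the achieved loss ||f - h_hat||^2 for the two examples:
# ≈ 0 for the affine f(x) = 1'x, strictly positive for f(x) = x'x.
import numpy as np

p = 3
x = np.random.default_rng(5).random((2_000_000, p))
for name, fx, beta0, beta in [
    ("1'x", x.sum(axis=1), 0.0, np.ones(p)),           # h_hat = f exactly
    ("x'x", (x**2).sum(axis=1), -p / 6, np.ones(p)),    # h_hat(x) = 1'x - p/6
]:
    loss = np.mean((fx - (beta0 + x @ beta)) ** 2)
    print(name, loss)
```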

Recovering linear regression

Here I’ll note how I can recover the usual linear regression by changing $\X$ and my measure $\lambda$. Suppose I have observed $f$ at $n$ points $x_1,\dots, x_n$, and let $y_i = f(x_i)$. I’ll now take $\X = \{x_1,\dots,x_n\}$ so I’m only interested in the loss at these $n$ points, and I’ll replace $\lambda$ with the counting measure $c$. Then my loss is
$$
\int_{\X}(f - h)^2\,\text dc = \sum_{i=1}^n \int_{\{x_i\}}(f(x_i) - h(x_i))^2\,\text dc \\
= \sum_{i=1}^n (y_i - \beta^Tx_i - \beta_0)^2
$$
which is exactly the usual squared loss in linear regression (when the intercept is not subsumed into $\beta$).
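
So in that setting the minimizer is just the ordinary least squares fit, which can be checked directly (a sketch with simulated points; any OLS routine would do):

```python
# With the counting measure on n observed points, minimizing the loss is OLS:
# sum_i (y_i - beta0 - beta'x_i)^2, solved here with numpy's lstsq on simulated data.
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 3
X = rng.random((n, p))
y = 2.0 + X @ np.array([1.0, -0.5, 3.0]) + rng.normal(scale=0.1, size=n)

design = np.column_stack([np.ones(n), X])        # explicit intercept column
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print("beta0_hat:", coef[0])
print("beta_hat: ", coef[1:])
```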
