February 27, 2022
I spent the first few weeks of my PhD understanding predictive coding (PC) as a computational model for perception and learning in the brain, whose alternating nature is, at first sight, reminiscent of the Expectation-Maximization (EM) algorithm. Realizing that it is in fact a special case of the EM algorithm, I decided to write it down in this (and also my first!) scientific blog.
The content of this blog has been extensively discussed in Karl Friston’s papers, for example this one. However, since I came to the field of theoretical neuroscience from a statistical background, this blog may be particularly helpful for statisticians/mathematicians/computer scientists who want to understand predictive coding, an interesting theory of how our brain works.
The motivation for the EM algorithm is to build up a model of the observed world \(x\), parameterized by \(\theta\). The goal is to maximize the log likelihood with respect to \(\theta\):
\[\log{p_{\theta}(x)}\]
In particular, the EM algorithm assumes that there is a hidden, or 'latent' cause \(z\) of the observations \(x\), and therefore the log likelihood can be written as:
\[\log\int_z p_{\theta}(z,x) dz\]
and we want to maximize this quantity over \(\theta\) (the latent \(z\) will be handled by introducing a distribution over it). Direct optimization of this integral is typically intractable, so the following mathematical manipulations are introduced to work around it.
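As a concrete illustration (my own toy example, not one taken from the literature above): suppose the latent cause is mapped to the observations through a nonlinear function \(f_{\theta}\), e.g.
\[ z \sim \mathcal{N}(0, I), \qquad x \mid z \sim \mathcal{N}(f_{\theta}(z), \sigma^2 I), \]
where \(\mathcal{N}\) denotes a Gaussian distribution. Then \(\int_z p_{\theta}(z,x) dz = \int_z p_{\theta}(z)\,p_{\theta}(x|z) dz\) generally has no closed form unless \(f_{\theta}\) is linear, which is why the manipulations below are needed.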
Suppose now that we have an arbitrary distribution \(q(z)\) over \(z\). Jensen’s inequality gives:
\[ \begin{aligned} \log\int_z p_{\theta}(z,x) dz &= \log\int q(z)\frac{p_{\theta}(z,x)}{q(z)} dz \\ &\geq \int q(z)\log \frac{p_{\theta}(z,x)}{q(z)}dz \end{aligned} \]
We define \(F:=\int q(z)\log \frac{p_{\theta}(z,x)}{q(z)}dz\) and thus:
\[ \begin{aligned} F &= \int q(z) \log p_{\theta} (z,x)dz - \int q(z)\log q(z)dz \\ &= \mathbb{E}_q\left[\log p_{\theta} (z,x) - \log q(z) \right] \\ &= \mathbb{E}_q\left[\log p_{\theta} (z,x)\right] + H[q] \end{aligned} \]
where \(H[q]\) is the entropy of \(q(z)\). \(F\) is usually called the ‘free energy’, by analogy with physics, as it is composed of a (negative) energy term and an entropy term. Notice that \(F\) provides a lower bound on the log likelihood \(\log\int_z p_{\theta}(z,x) dz\). Therefore, in order to maximize the log likelihood of our model of the world, we can instead maximize the lower bound \(F\).
The natural question that arises at this stage is how we can maximize \(F\). Before we proceed to that, however, it is helpful to understand the model from another perspective, which views the modelling of the world as an inverse problem: knowing that there is a cause \(z\) of the observations \(x\), can we model the inverse of the generative process, i.e. the posterior distribution \(p_{\theta}(z|x)\)? We may resort to Bayes’ rule, \(p_{\theta}(z|x) = \frac{p_{\theta}(x|z)p_{\theta}(z)}{p_{\theta}(x)}\), but again the denominator is intractable. To solve this problem, we introduce an approximate distribution \(q(z)\) for \(p_{\theta}(z|x)\) and try to minimize the KL divergence between \(q(z)\) and \(p_{\theta}(z|x)\):
\[ \begin{aligned} KL[q(z) \Vert p_{\theta}(z|x)] &= \mathbb{E}_q\left[\log q(z) - \log p_{\theta}(z|x)\right] \\ &= \mathbb{E}_q\left[\log q(z) - \log \frac{p_{\theta}(z,x)}{p_{\theta}(x)}\right] \\ &= \mathbb{E}_q\left[\log q(z) - \log p_{\theta}(z,x)\right] + \log p_{\theta}(x) \\ &= -F + \log p_{\theta}(x) \\ \end{aligned} \]
Notice that since the \(KL\) divergence is always greater than or equal to 0, we again reach the conclusion that \(F\) is a lower bound for \(\log p_{\theta}(x)\). Therefore, to maximize the log likelihood of the model, we can maximize \(F\).
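For readers who like to see numbers, here is a minimal numerical sketch (my own, on an assumed toy model rather than anything from the papers above): take \(z \sim \mathcal{N}(0,1)\) and \(x|z \sim \mathcal{N}(z,1)\), so that \(\log p_{\theta}(x)\) and the exact posterior \(\mathcal{N}(x/2, 1/2)\) are available in closed form, and check that \(F \leq \log p_{\theta}(x)\), with equality when \(q(z)\) equals the posterior.

```python
# Toy check that F = E_q[log p(z,x) - log q(z)] lower-bounds log p(x).
# Assumed model for illustration: z ~ N(0,1), x|z ~ N(z,1), so p(x) = N(0,2)
# and the exact posterior is p(z|x) = N(x/2, 1/2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3  # a single observation

def free_energy(q_mean, q_std, n_samples=200_000):
    """Monte Carlo estimate of F for a Gaussian q(z) = N(q_mean, q_std^2)."""
    z = rng.normal(q_mean, q_std, size=n_samples)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)  # log p(z,x)
    return np.mean(log_joint - norm.logpdf(z, q_mean, q_std))

print("log p(x)                  :", norm.logpdf(x, 0.0, np.sqrt(2.0)))
print("F with an arbitrary q     :", free_energy(0.0, 1.0))            # strictly smaller
print("F with the exact posterior:", free_energy(x / 2, np.sqrt(0.5)))  # matches log p(x)
```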
Given the objective of maximizing the lower bound \(F\), the Expectation-Maximization (EM) algorithm alternates between the E-step and the M-step:
E(xpectation) step: optimize \(F(q(z),\theta)\) with respect to the distribution \(q(z)\) over the hidden variables, holding the parameters \(\theta\) fixed:
\[q^{(k)}(z) := \arg\max_{q(z)} F(q(z),\theta^{(k-1)})\]
M(aximization) step: optimize \(F(q(z),\theta)\) with respect to the parameters \(\theta\), holding the hidden distribution \(q(z)\) fixed:
\[\theta^{(k)} := \arg\max_{\theta} F(q^{(k)}(z),\theta) = \arg\max_{\theta}\mathbb{E}_{q^{(k)}(z)}\left[\log p_{\theta} (z,x)\right]\]
The second equality in the M-step comes from the fact that the entropy term \(H[q]\) is constant with respect to \(\theta\). Also, the name of the first step may be a bit confusing, as no “expectation” is computed there. However, if we recall that \(F=-KL[q(z) \Vert p_{\theta}(z|x)]+\log p_{\theta}(x)\), we can see that the maximum of \(F\) over \(q(z)\) is reached when the \(KL\) term goes to 0, i.e. when \(q(z) = p_{\theta}(z|x)\). Therefore, theoretically, we can view the E-step as “setting \(q(z) = p_{\theta}(z|x)\) and computing the expectation \(\mathbb{E}_{ p_{\theta}(z|x)}\left[\log p_{\theta} (z,x)\right]\)” and the M-step as “maximizing the computed expectation”.
A classic example of the EM algorithm is fitting a mixture of Gaussians, where the latent variable \(z\) indicates which component generated each observation; a minimal sketch is given below.
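In the sketch below (my own minimal implementation, on made-up synthetic data), the E-step sets \(q(z)\) to the exact posterior over component labels (the ‘responsibilities’), and the M-step maximizes \(\mathbb{E}_{q}\left[\log p_{\theta}(z,x)\right]\) over \(\theta\) in closed form.

```python
# EM for a two-component 1-D Gaussian mixture (illustrative sketch).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussians.
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

# Initial parameters theta = (mixing weights, means, standard deviations).
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibilities q(z=k) = p(z=k | x, theta) for each data point.
    log_r = np.log(pi) + norm.logpdf(x[:, None], mu, sigma)
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: closed-form maximization of E_q[log p(z, x)] over theta.
    n_k = r.sum(axis=0)
    pi = n_k / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)

print("weights:", pi, "means:", mu, "stds:", sigma)
```

On this synthetic data the recovered means and weights should come out close to the generating values (roughly \(-2\) and \(3\), with weights \(0.3\) and \(0.7\)).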
So how is predictive coding (PC) related to the EM algorithm? Fundamentally, the idea goes back to Rajesh P. N. Rao’s 1999 paper An optimal estimation approach to visual perception and learning, which describes the purpose of neural activity as building an internal model of the world, i.e. solving the inverse problem of finding the hidden causes of the brain’s perceptions. This is exactly the latent variable model that the EM algorithm tries to optimize. Specifically, instead of searching over the space of distributions to find the optimal \(q(z)\), we now try to find the most likely value of \(z\), denoted \(\phi\) (the analogy is not fully accurate, but think of \(z\) as the neural activities!). In other words, we assume \(q(z)\) is a delta function with its mass concentrated at \(\phi\) and perform maximum a posteriori (MAP) estimation. Notice that in this case we may no longer be able to decrease the \(KL\) term to 0, as \(p_{\theta}(z|x)\) is not necessarily a delta function.
Now, recall that our definition of the lower bound \(F\) is \(\int q(z)\log \frac{p_{\theta}(z,x)}{q(z)}dz\). With the delta-function assumption on \(q(z)\), we can rewrite it as:
\[ \begin{aligned} F &= \int q(z)\log \frac{p_{\theta}(z,x)}{q(z)}dz \\ &= \int q(z)\log p_{\theta}(z,x) dz - \int q(z)\log q(z) dz \\ &= \log p_\theta(x, \phi) + const. \end{aligned} \]
The third equality comes from the fact that for a delta function \(\delta(\cdot)\) concentrated at \(\phi\), the term \(\int \delta(z)h(z)dz\) equals \(h(\phi)\) for any function \(h(\cdot)\), and the fact that the second term \(\int q(z)\log q(z) dz\) does not depend on \(\phi\) or \(\theta\). Considering that \(\log p_\theta(x, \phi) = \log p_{\theta}(x|\phi)+\log p_{\theta}(\phi)\), where \(p_{\theta}(\phi)\) is the prior of \(\phi\), the EM algorithm becomes:
E(xpectation) step: with \(\theta\) fixed \[\phi^{(k)} = \arg\max_{\phi} \left[\log p_{\theta^{(k-1)}}(x|\phi)+\log p_{\theta^{(k-1)}}(\phi)\right]\]
M(aximization) step: with \(\phi\) fixed \[\theta^{(k)} = \arg\max_{\theta} \left[\log p_{\theta}(x|\phi^{(k)})+\log p_{\theta}(\phi^{(k)})\right]\]
We now assume that \(x|\phi \sim \mathcal{N}(f_w(\phi), \Sigma_x)\), where \(\mathcal{N}\) denotes the Gaussian density function, \(f_w(\cdot)\) is a function parameterized by \(w\), and \(\Sigma_x\) is the covariance matrix of this Gaussian distribution, and that the prior of \(\phi\) is \(\mathcal{N}(\mu, \Sigma_{\phi})\). The whole model is therefore parameterized by \(\theta = \{w,\Sigma_x,\mu,\Sigma_{\phi} \}\), and the objective function becomes:
\[ \begin{aligned} L &= \log p_\theta(x, \phi) \\ &= \log p_{\theta}(x|\phi)+\log p_{\theta}(\phi) \\ &= -\frac{1}{2}(x-f_w(\phi))^T \Sigma_x^{-1}(x-f_w(\phi)) - \frac{1}{2}\log|\Sigma_x| \\ &\quad - \frac{1}{2}(\phi-\mu)^T \Sigma_\phi^{-1}(\phi-\mu) - \frac{1}{2}\log|\Sigma_\phi| + const. \end{aligned} \]
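This objective is straightforward to evaluate numerically. The sketch below is my own illustration (the prediction function \(f_w(\phi)=\tanh(W\phi)\) and all dimensions are assumptions, not anything prescribed by the model above); it computes \(L = \log p_{\theta}(x|\phi)+\log p_{\theta}(\phi)\) directly from Gaussian log densities.

```python
# Evaluate L = log p(x|phi) + log p(phi) for the Gaussian model.
import numpy as np
from scipy.stats import multivariate_normal as mvn

def f(w, phi):
    # An assumed, illustrative choice of the prediction function f_w.
    return np.tanh(w @ phi)

def objective_L(x, phi, w, Sigma_x, mu, Sigma_phi):
    log_lik = mvn.logpdf(x, mean=f(w, phi), cov=Sigma_x)     # log p(x | phi)
    log_prior = mvn.logpdf(phi, mean=mu, cov=Sigma_phi)      # log p(phi)
    return log_lik + log_prior

# Tiny usage example with made-up dimensions (3-D observation, 2-D latent).
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 2))
phi, mu = rng.normal(size=2), np.zeros(2)
x = rng.normal(size=3)
print(objective_L(x, phi, w, np.eye(3), mu, np.eye(2)))
```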
To maximize \(L\) over \(\phi\) and \(\theta\), we can compute the gradients \(\frac{\partial L}{\partial \phi}\) and \(\frac{\partial L}{\partial \theta}\), which leads to a neural network implementation that is also biologically plausible. An in-depth discussion of this implementation is beyond the scope of this blog and can be found in Prof. Rafal Bogacz’s tutorial.
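To give a flavour of where the biological interpretation comes from (a sketch in my own notation, not a quotation of the tutorial), differentiating \(L\) with respect to \(\phi\) and, for example, \(\mu\) gives updates driven by precision-weighted prediction errors:
\[ \begin{aligned} \varepsilon_x &:= \Sigma_x^{-1}\big(x - f_w(\phi)\big), \qquad \varepsilon_\phi := \Sigma_\phi^{-1}\big(\phi - \mu\big), \\ \frac{\partial L}{\partial \phi} &= \left(\frac{\partial f_w(\phi)}{\partial \phi}\right)^T \varepsilon_x - \varepsilon_\phi, \qquad \frac{\partial L}{\partial \mu} = \varepsilon_\phi. \end{aligned} \]
Gradient ascent on \(\phi\) (the E-step, i.e. inference) interleaved with gradient steps on the parameters (the M-step, i.e. learning) then reproduces the alternating dynamics of predictive coding, with \(\varepsilon_x\) and \(\varepsilon_\phi\) corresponding to the prediction errors.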
This blog is written to summarize my perspective on the predictive coding algorithm as a special case of the EM algorithm, formulated during the first few months of my PhD research. There are two questions that I have been thinking about. First, although Prof. Rafal Bogacz’s tutorial has laid out ways of representing the parameters \(\theta\) in a network implementation, the covariance matrices \(\Sigma\) seem less likely to be computed directly in the brain. So how are they represented biologically? The answer is that they can be encoded in the recurrent connections that are ubiquitous in the cortex, and we can mathematically derive the equivalence between the covariance matrices and recurrent connections! I’m currently working on a paper with this theoretical result and a bunch of simulations related to it.
Second, PC is a simplification of the EM algorithm in which we assume that \(q(z)\) is a delta function. This assumption seems a bit too strict, both mathematically and biologically: it prevents the E-step from reaching its theoretical optimum, and it does not account for the stochasticity of neural activities. Can we use a less strict assumption on the distribution \(q(z)\)? This remains an interesting direction to pursue in the future.