Ever since Hopfield
Networks is All You Need, many papers in the past few years have
pointed to the connection between Associative Memory (AM) models and the
attention mechanism in Transformers. To name but a few: the Universal Hopfield Network,
Kernel Memory Networks,
and the Tolman-Eichenbaum
Machine. Although a title like ‘Linking Transformers to how the
brain performs computations’ sounds very cool, at the heart of these
connections is the outer product between two matrices $Q \in \mathbb{R}^{N \times d}$ and $K \in \mathbb{R}^{M \times d}$, which
results in an $N \times M$ matrix $QK^\top$. In AM models like Hopfield
Networks, this happens when you try to compare your query with each of
the stored memories, while in Transformers, this is the ‘affinity’
matrix $QK^\top$. These matrices are nice because they
store all the pair-wise relationships between what you have
($K$, the stored memories or keys) and what you are
presented/queried with ($Q$).
Since I’ve been working on AM for quite a while, I recently noticed
another fundamental model that might be (loosely) connected to
attention: linear regression. I find this connection
much more interesting than the connection between AM models and
Transformers, as AM models seek to memorize training patterns, while we
know that Transformers’ power goes far beyond memorizing training data:
they generalize. Meanwhile, as we learned in our first class of
statistics, regression models also generalize.
In the rest of this blog, I will write down the exact
mathematical expressions behind this connection between linear regression
and self-attention, and try to provide an interpretation and some open
questions following this line. I do not seek to claim that ‘this is
going to be an interesting research topic’; rather, it’s just an
interesting finding that is worth transcribing from my scratch paper
to a markdown file. Whether it is worth being moved further into a formal
LaTeX file remains to be discovered - if you are reading this and come
up with a new research idea, let me know.
The Maths
Let’s start from the simplest linear regression that we learned in
STATS101 (although I remember having a whole course on this - what did I
learn?). Let’s say we have some training data: independent variables
$X \in \mathbb{R}^{N \times d}$ and
dependent variables $Y \in \mathbb{R}^{N \times p}$. Also, let’s call the regression coefficients/parameters
$W \in \mathbb{R}^{d \times p}$.
The objective function of linear regression is:

$$\mathcal{L}(W) = \lVert Y - XW \rVert_F^2$$

To get the optimal $W$,
we take the derivative of the squared Frobenius norm with respect to
$W$ and set it to 0. The optimal $W$
can be written as (I’ll
skip the steps because you can literally find it everywhere):

$$W^* = (X^\top X)^{-1} X^\top Y$$

Then, according to this linear model, the fitted values should
be:

$$\hat{Y} = X W^* = X (X^\top X)^{-1} X^\top Y$$

Without loss of generality, if we assume the data points have zero mean, the middle
matrix $(X^\top X)^{-1}$ is the inverse
of the (unnormalized) covariance matrix of the dataset. Let’s call it $\Sigma^{-1}$, then the fitted values become:

$$\hat{Y} = X \, \Sigma^{-1} X^\top \, Y$$

(this is the form that will be useful later). That’s all about linear
regression.
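As a sanity check, here is a minimal numpy sketch (with made-up data, assuming zero mean and taking $\Sigma$ as the unnormalized $X^\top X$) verifying that the closed-form solution and the $X\Sigma^{-1}X^\top Y$ rewriting give the same fitted values:

```python
# A minimal sketch: the OLS closed form and the X Sigma^{-1} X^T Y form agree.
import numpy as np

rng = np.random.default_rng(0)
N, d, p = 50, 10, 3            # N samples, d features, p output dimensions
X = rng.standard_normal((N, d))
X -= X.mean(axis=0)            # zero-mean the data, as assumed in the text
Y = rng.standard_normal((N, p))

# Optimal coefficients: W* = (X^T X)^{-1} X^T Y
W_star = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat_ols = X @ W_star

# Same fitted values written as X Sigma^{-1} X^T Y,
# with Sigma = X^T X the (unnormalized) covariance matrix of the zero-mean data
Sigma_inv = np.linalg.inv(X.T @ X)
Y_hat_cov = X @ Sigma_inv @ X.T @ Y

print(np.allclose(Y_hat_ols, Y_hat_cov))  # True
```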
Now let’s look at the attention mechanism. For clarity and
simplicity, I’ll focus on self-attention here. We again assume some
data $X \in \mathbb{R}^{N \times d}$,
where $N$ now denotes the input
sequence length and $d$ the size of
each embedded token vector. We then multiply $X$ with three trainable weight matrices
$W_Q$, $W_K$ and $W_V$
respectively to get the query $Q = XW_Q$, key
$K = XW_K$ and value $V = XW_V$. The attention is then calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

Now, let’s express the
$Q$, $K$ and $V$ matrices as $X$ multiplied by their corresponding weight matrices:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{X \, W_Q W_K^\top \, X^\top}{\sqrt{d_k}}\right) X W_V$$

I hope at this point, the equations have made it
quite clear that calculating the fitted values of linear regression and computing
self-attention are doing something similar: they both take the form of
$X M X^\top (\cdot)$. For linear regression
this $M$ is the inverse covariance
matrix $\Sigma^{-1}$ of the data, and for self-attention this $M$ is the product of the learnable
parameters $W_Q W_K^\top$. In
addition, self-attention has a softmax function that performs a
nonlinear transformation of this matrix product.
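Here is a small numpy sketch of this rewriting (a toy single-head, unmasked self-attention with random weights; the shapes and names are my own choices), confirming that the standard formulation and the $X \, W_Q W_K^\top \, X^\top$ form produce the same output:

```python
# A sketch of single-head self-attention, rewritten as
# softmax(X (W_Q W_K^T) X^T / sqrt(d_k)) X W_V.
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d, d_k = 6, 16, 8                     # sequence length, embedding dim, head dim
X = rng.standard_normal((N, d))
W_Q = rng.standard_normal((d, d_k))
W_K = rng.standard_normal((d, d_k))
W_V = rng.standard_normal((d, d_k))

# Standard formulation: Q = X W_Q, K = X W_K, V = X W_V
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
attn_standard = softmax(Q @ K.T / np.sqrt(d_k)) @ V

# Rewritten with the single matrix M = W_Q W_K^T sandwiched between X and X^T
M = W_Q @ W_K.T
attn_rewritten = softmax(X @ M @ X.T / np.sqrt(d_k)) @ (X @ W_V)

print(np.allclose(attn_standard, attn_rewritten))  # True
```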
Interpretations
Having set up the mathematical similarity between these two, we can
come up with some interpretations of both linear regression and
self-attention.

First, for linear regression: We can now interpret
linear regression as a type of Universal Hopfield Network. The fitted
values $\hat{Y}$ are now a weighted
sum of the original dependent variables $Y$ in the training set, where
the weights are some kind of ‘similarity measure’ $x_i^\top \Sigma^{-1} x_j$ between each pair of
training data $x_i$ and $x_j$. When $\Sigma^{-1}$ is the identity, this similarity measure is
simply the dot product $x_i^\top x_j$, and I have hand-drawn a diagram of this process
above. Imagine if the $x_i$’s are a bunch
of orthonormal vectors; then the dot product will be zero unless $i = j$. In our example above this will give
us exactly the desired output $\hat{y}_i = y_i$.
However, we almost never get orthonormal vectors, and similarity
by plain dot product is almost never a good idea. The original Universal Hopfield Networks
paper discussed this, so I will not go into details here.
When $\Sigma^{-1}$ is not the identity, the
product $x_i^\top \Sigma^{-1} x_j$ is in
fact a ‘whitened’ dot product that makes the similarity measure more
robust to variances and correlations between features, because we can
effectively decompose $\Sigma^{-1}$ into whitening
matrices. I had some experiments in my recent paper that show the
benefits of this whitening step before the dot product. Conceptually, the
division by $\sqrt{d_k}$ in
self-attention seems to serve a similar purpose of handling the variance
of the key/query products, although it doesn’t handle correlated features
the way a whitening matrix does.
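To make the ‘decompose $\Sigma^{-1}$ into whitening matrices’ point concrete, here is a small numpy sketch (my own toy example, using PCA whitening) showing that $x_i^\top \Sigma^{-1} x_j$ is exactly an ordinary dot product between whitened vectors:

```python
# A sketch of the whitening view: Sigma^{-1} = W_white^T W_white, so the
# 'whitened' similarity x_i^T Sigma^{-1} x_j is a plain dot product between
# the whitened vectors W_white x_i and W_white x_j.
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.standard_normal((N, d)) @ rng.standard_normal((d, d))  # correlated features
X -= X.mean(axis=0)

Sigma = X.T @ X / N                            # covariance of the zero-mean data
eigval, eigvec = np.linalg.eigh(Sigma)
W_white = np.diag(eigval ** -0.5) @ eigvec.T   # PCA whitening matrix

x_i, x_j = X[0], X[1]
whitened_sim = (W_white @ x_i) @ (W_white @ x_j)
direct_sim = x_i @ np.linalg.inv(Sigma) @ x_j
print(np.allclose(whitened_sim, direct_sim))   # True
```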
Another important point to make about this interpretation is that
although linear regression itself isn’t an AM model, in the case where
we query it with an original training point, say $x_i$ (instead of a new data point), what it
does is essentially AM, recalling the memorized pattern $y_i$. When $N \le d$ we can actually recall perfectly (imagine fitting a line
with two dots on a 2D plane), so the capacity of a linear
regression AM model (on condition of perfect recall) should be
$d$.
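A quick toy check of this recall behaviour (a sketch with random data; I use the pseudo-inverse so the $N < d$ case is handled):

```python
# Toy check: when the number of stored patterns N is at most the input
# dimension d, querying the fitted regression with a training point x_i
# recalls its y_i exactly.
import numpy as np

rng = np.random.default_rng(0)
N, d, p = 8, 20, 4           # N <= d: few patterns, high-dimensional inputs
X = rng.standard_normal((N, d))
Y = rng.standard_normal((N, p))

# least-squares fit (pinv handles the N < d case gracefully)
W_star = np.linalg.pinv(X) @ Y

query = X[3]                  # query with a stored pattern
recalled = query @ W_star
print(np.allclose(recalled, Y[3]))   # True: perfect recall
```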
Second, for attention: This similarity brings us to
interpret the attention mechanism as some type of regression operation.
If we map the value $V$ to our
dependent variable $Y$, attention can
be considered as the fitted value of a regression model trained on $(K, V)$ pairs, given a seen example (self-attention) or an unseen example
(cross-attention). So at
the end of the day, the attention layer is secretly learning a
linear regression between its key-value pairs.
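To illustrate the analogy (an analogy only, not a numerical equivalence), here is a sketch where the same stored $(X, Y)$ pairs are ‘queried’ with an unseen point, once through the regression form and once through a softmax attention readout; both are affinity-weighted sums of the stored $Y$:

```python
# Attention as 'regression on stored (key, value) pairs': both predictions
# below are weighted sums of the stored Y, weighted by an affinity between
# the new query and each stored input. The two outputs are analogous, not equal.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, d, p = 30, 6, 2
X = rng.standard_normal((N, d)); X -= X.mean(axis=0)   # stored inputs / 'keys'
Y = rng.standard_normal((N, p))                         # stored outputs / 'values'
x_new = rng.standard_normal(d)                          # unseen query

# Linear regression: prediction = sum_i (x_new^T Sigma^{-1} x_i) * y_i
Sigma_inv = np.linalg.inv(X.T @ X)
reg_weights = X @ Sigma_inv @ x_new     # affinity of x_new to each stored input
reg_pred = reg_weights @ Y

# Cross-attention readout: prediction = sum_i softmax(x_new^T x_i / sqrt(d))_i * y_i
attn_weights = softmax(X @ x_new / np.sqrt(d))
attn_pred = attn_weights @ Y

print(reg_pred, attn_pred)   # two affinity-weighted readouts of the same stored Y
```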
Some open questions worth looking into
Despite the similarity, one can argue that the connection between
attention and linear regression is quite loose. Indeed, their differences
are also quite obvious: attention uses a softmax function that makes the
largest affinities between vectors stand out, whereas linear regression
doesn’t have this mechanism; the matrix $W_Q W_K^\top$ in attention is trainable, whereas the inverse
covariance matrix $\Sigma^{-1}$
is directly calculated from the embedded tokens. These differences open
up a few questions that I wanted to look into (if time and energy allow,
of course…):
- What does the matrix $W_Q W_K^\top$
look like in a trained model?
After training with an autoregressive objective, does it implicitly learn
the inverse covariance matrix of the embeddings, so as to perform the
(beneficial) whitening operation when we compare queries and keys?
- If it doesn’t learn the whitening operation, can we hand-craft a
whitening step into the Transformer architecture and see if it helps
with performance? The implementation should be quite
straightforward because we don’t actually need to invert the covariance
matrix of the embeddings directly - there are plenty of ways to turn it
into whitening matrices (for example, via eigendecomposition). Or maybe layer
normalization already achieves this?
- Instead of training two weights $W_Q$ and $W_K$ separately, can we train a single
matrix $W_{QK} = W_Q W_K^\top$, and, in cases where we
need to access the individual weights, decompose $W_{QK}$ to get some pseudo-$W_Q$ and pseudo-$W_K$ that are sufficient to give us some
representation of $Q$ and $K$? (See the sketch after this list for one way such a decomposition could look.)
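On that last point, here is a purely hypothetical sketch (the names $W_{QK}$, pseudo-$W_Q$ and pseudo-$W_K$ are mine, not an existing implementation) showing that an SVD gives one possible factorization whose product reproduces $W_{QK}$ exactly:

```python
# A hypothetical factorization of a single trained W_QK into pseudo-W_Q and
# pseudo-W_K via SVD: their product reproduces W_QK exactly when the kept
# rank r is at least the rank of W_QK.
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 16, 8
# stand-in for a trained W_QK of rank d_k, shape (d, d)
W_QK = rng.standard_normal((d, d_k)) @ rng.standard_normal((d_k, d))

U, S, Vt = np.linalg.svd(W_QK)
r = d_k                                          # keep the top-r factors
pseudo_W_Q = U[:, :r] * np.sqrt(S[:r])           # shape (d, r)
pseudo_W_K = Vt[:r].T * np.sqrt(S[:r])           # shape (d, r)

print(np.allclose(pseudo_W_Q @ pseudo_W_K.T, W_QK))   # True
```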
As I’m not an expert in Transformers or LLMs (been in the niche of
NeuroAI for too long…), some of these questions have probably been asked
or addressed. If you are reading this blog and happen to know any works
related to these questions, please let me know. At the same time, if any
of these findings/questions sparks an idea, please also let me know and
I’m happy to discuss :)