TL;DR — LayerNorm in transformers isn't just for training stability — it projects keys onto a hyperplane and rescales them, fundamentally changing which keys attention can select. Without LayerNorm, certain keys are "un-selectable" regardless of the query.
The Problem
Standard dot-product attention computes scores as q · k for each key, then applies softmax to produce attention weights. This seems flexible enough to attend to any key — but it has a blind spot.
Consider what happens when one key vector has a much larger magnitude than the others. Any query aimed at a smaller key lying near the large key also produces a large dot product with the large key (since the two keys point in similar directions), so the large key's score overwhelms the neighbor's. Such neighboring keys can never receive the highest attention weight, no matter what query is used. This limits the expressivity of the attention mechanism.
Key insight: In standard attention without LayerNorm, when one key has a sufficiently large norm, its neighboring keys become effectively "un-selectable" — no query can give them the dominant attention weight, because the large key's dot product overwhelms theirs whenever the query points in their direction.
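This blind spot is easy to reproduce numerically. The sketch below uses toy numbers of our own choosing (not from the article): Key 2 lies strictly inside the convex hull of the other keys, so its score q · k2 is a convex combination of the other scores and can never be the strict maximum for any query.

```python
import numpy as np

rng = np.random.default_rng(0)

# Key 2 = 0.174*Key1 + 0.084*Key3 + 0.742*Key4 (a convex combination),
# so q·k2 is always a weighted average of the other scores -- it can
# never strictly exceed all of them. Key 2 is "un-selectable".
keys = np.array([
    [10.0,  0.0],   # Key 1: large-norm key
    [ 1.0,  0.1],   # Key 2: its small neighbor
    [ 0.0, 10.0],   # Key 3
    [-1.0, -1.0],   # Key 4
])

wins = 0
for _ in range(10_000):
    q = rng.normal(size=2) * 5.0   # random query
    scores = keys @ q              # q · k for each key
    if scores.argmax() == 1:       # did Key 2 get the top score?
        wins += 1

print(f"Key 2 won the argmax in {wins} / 10000 random queries")  # -> 0
```

The count is zero by construction, not by chance: a convex combination of scores is bounded above by their maximum.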
The Key Idea
LayerNorm, typically viewed as a training stabilizer, actually plays a fundamental role in the expressivity of attention. We show that LayerNorm decomposes into two distinct geometric operations on the key (and query) vectors:
1. Mean-Centering (Projection)
Subtracting the mean of each vector's components projects it onto a hyperplane orthogonal to the all-ones vector 1. This removes one dimension of information — the "uniform magnitude" component — and ensures that vectors are compared by their direction relative to the hyperplane, not by their absolute position.
2. Variance Normalization (Scaling)
Dividing by the standard deviation rescales all vectors to have similar norms. This neutralizes the magnitude advantage that made certain keys un-selectable. After scaling, attention weights depend primarily on the angle between query and key vectors.
LayerNorm = Projection onto hyperplane + Rescaling to unit variance
Interactive Demo
Explore how LayerNorm's two effects change which keys receive attention. Increase Key 3's magnitude to see how its close neighbor, Key 2, becomes un-selectable as Key 3's large dot product overwhelms it; then toggle Projection and Scaling to restore selectability.

[Interactive widget: "Attention with and without LayerNorm" — attention weights shown as the softmax of q · k.]
How It Works
LayerNorm is typically written as a single operation, but it naturally decomposes into two geometric steps. Given an input vector x in dimension d:
Step 1: Projection
Subtracting the mean is equivalent to projecting onto the hyperplane {v : 1^T v = 0}. Formally, x - mean(x)·1 = (I - 11^T/d) x, where 1 is the all-ones vector and I - 11^T/d is the projection matrix onto its orthogonal complement. This removes the component of x along the uniform direction, reducing the effective dimensionality by one.
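The identity between mean-subtraction and multiplication by I - 11^T/d can be checked directly (a minimal numpy sketch):

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
x = rng.normal(size=d)

ones = np.ones(d)
P = np.eye(d) - np.outer(ones, ones) / d   # projection matrix I - 11^T/d
centered = x - x.mean()                    # mean-subtraction (broadcast)

assert np.allclose(P @ x, centered)        # the two forms agree
assert np.isclose(centered @ ones, 0.0)    # result lies on {v : 1^T v = 0}
assert np.allclose(P @ P, P)               # P is idempotent, i.e. a projection
print("mean-centering == projection onto the hyperplane orthogonal to 1")
```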
Step 2: Scaling
After projection, dividing by the standard deviation normalizes the projected vector to unit variance, giving it norm sqrt(d). This ensures all keys live on (approximately) the same sphere, so dot products depend only on angles, not on magnitudes.
This decomposition reveals that LayerNorm is not merely a numerical trick — it is a geometric operation that reshapes the space of possible attention patterns. The projection removes a degree of freedom, and the scaling equalizes vector norms, together enabling attention patterns that are provably impossible without LayerNorm.
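Putting the two steps together, here is a minimal numpy sketch of the decomposition (plain LayerNorm without the learned gain and bias, which the geometric argument does not need):

```python
import numpy as np

def layernorm(x, eps=1e-5):
    """LayerNorm without gain/bias, written as its two geometric steps."""
    # Step 1: project onto the hyperplane {v : 1^T v = 0}
    centered = x - x.mean(axis=-1, keepdims=True)
    # Step 2: rescale to unit variance
    std = np.sqrt((centered ** 2).mean(axis=-1, keepdims=True) + eps)
    return centered / std

rng = np.random.default_rng(2)
# Keys with wildly different magnitudes (scales drawn from [0.1, 10])
keys = rng.normal(size=(5, 16)) * rng.uniform(0.1, 10.0, size=(5, 1))
normed = layernorm(keys)

# After LayerNorm every key has (almost) the same norm: sqrt(d) = 4 here
print(np.linalg.norm(normed, axis=-1))   # all ~ 4.0 for d = 16
```

The equalized norms are exactly the "same sphere" property: once magnitudes are identical, dot-product scores are decided by angle alone.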
Results
We prove several theoretical results about the interaction between LayerNorm and attention:
- Without LayerNorm, there exist sets of key vectors where no query can produce a "sharp" attention pattern (i.e., assign most weight to a single key). This happens whenever one key has disproportionately large norm.
- With the projection component alone, mean-centering can change the relative geometry of keys, but does not fully solve the selectability problem since magnitude disparities can remain after projection.
- With both projection and scaling, all keys become selectable. For any key in the set, there exists a query that assigns it the highest attention weight. The variance normalization ensures that after centering, all keys have comparable norms.
- Empirically, removing LayerNorm from pre-trained transformers degrades performance on tasks requiring sharp, selective attention patterns — confirming the theoretical predictions.
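The selectability result can be checked numerically. In the toy setup below (illustrative numbers, not from the article), Key 2 lies in the convex hull of the other keys, so no query can select it from the raw keys; after LayerNorm, all keys share (almost) the same norm, so a query pointing along Key 2's own normalized direction makes it the argmax.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # Projection onto {v : 1^T v = 0}, then rescaling to unit variance
    c = x - x.mean(axis=-1, keepdims=True)
    return c / np.sqrt((c ** 2).mean(axis=-1, keepdims=True) + eps)

# Key 2 is a convex combination of the other keys, hence un-selectable raw.
keys = np.array([
    [10.0,  0.0, 0.0, 0.0],   # large-norm key
    [ 1.0,  0.1, 0.0, 0.0],   # its small neighbor
    [ 0.0, 10.0, 0.0, 0.0],
    [-1.0, -1.0, 0.0, 0.0],
])

normed = layernorm(keys)
# All normalized keys have equal norm, so by Cauchy-Schwarz a query equal
# to Key 2's normalized vector maximizes the dot product at Key 2 itself.
q = normed[1]
scores = normed @ q
print(scores.argmax())   # -> 1 : Key 2 is now selectable
```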
LayerNorm enables attention patterns that are provably impossible without it. It should be understood not just as a training aid, but as a fundamental component of the transformer's expressive power.