ICLR 2022

How Attentive are Graph Attention Networks?

Shaked Brody, Uri Alon, Eran Yahav

TL;DR — The popular GAT (Graph Attention Network) computes a limited form of "static" attention that ranks neighbors the same way regardless of the query node. GATv2 fixes this by moving the final projection by a to after the nonlinearity, enabling truly dynamic attention.

The Problem: Static Attention

Graph Attention Networks (GATs) were introduced as a powerful way to let each node attend to its neighbors, weighting their messages by learned attention coefficients. The idea is intuitive: not all neighbors are equally important, so let the network learn which ones matter most for each query node.

But there is a subtle problem. GAT computes attention as:

e(hi, hj) = LeakyReLU(aT [Whi || Whj])

Because the weight vector a is applied directly to the concatenation, the score decomposes into separate terms for i and j: the top half of a scores only Whi and the bottom half scores only Whj. The LeakyReLU, applied to the whole sum, is monotonic and cannot restore any interaction between the two terms. As a result, the ranking of neighbors j does not depend on the identity of the query node i: it is static.

This means that for any two query nodes i and i', GAT will rank their shared neighbors in the exact same order. The attention is not truly "attending" to the interaction between query and key.
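To make this concrete, here is a minimal pure-Python sketch of a GAT-style score. The weights and node features are made up for illustration, and W is taken as the identity for readability; the point is only that two very different queries produce the identical neighbor ranking:

```python
def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# GAT score with W = identity for readability:
#   e(hi, hj) = LeakyReLU(al . hi + ar . hj)
# where al / ar are the top and bottom halves of a.
al, ar = [0.0, 1.0], [1.0, 0.0]

neighbors = {"a": [3.0, 0.0], "b": [1.0, 0.0], "c": [2.0, 0.0]}

def gat_ranking(query):
    scores = {j: leaky_relu(dot(al, query) + dot(ar, hj))
              for j, hj in neighbors.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Two very different query nodes, identical neighbor ranking:
print(gat_ranking([0.0, 1.0]))   # ['a', 'c', 'b']
print(gat_ranking([5.0, -2.0]))  # ['a', 'c', 'b']
```

The query only shifts every score by the same constant dot(al, query); since LeakyReLU is monotonic, that shift can never reorder the neighbors.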

The Key Idea: GATv2

GATv2 fixes the problem with a simple reordering of operations. Instead of applying the nonlinearity to the projected concatenation, GATv2 first concatenates the raw features, then applies the linear transformation W, then LeakyReLU, and only then projects with a:

GAT (static)

e(hi, hj) =
LeakyReLU(aT [Whi || Whj])

GATv2 (dynamic)

e(hi, hj) =
aT LeakyReLU(W [hi || hj])

In GATv2, W mixes the features of both the query and the key, and the LeakyReLU sits between that mixing and the final projection by a. The score therefore no longer decomposes into separate query and key terms, which makes the attention function capable of computing every possible ranking of neighbors: it is universally expressive. Neighbor rankings now genuinely change based on the query node.

Order of operations

GAT

  1. Apply W to hi and hj separately
  2. Concatenate [Whi || Whj]
  3. Apply aT (linear projection)
  4. Apply LeakyReLU

GATv2

  1. Concatenate [hi || hj]
  2. Apply W (linear transformation)
  3. Apply LeakyReLU
  4. Apply aT (linear projection)
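The two orderings above can be sketched side by side. This is a toy pure-Python illustration with 1-dimensional node features and hand-picked weights (not trained parameters from the paper); the GATv2 weight matrix is deliberately chosen so the score depends on the interaction between query and key:

```python
def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def gat_score(hi, hj, w=1.0, a=(1.0, 1.0)):
    # 1) apply W separately  2) concatenate  3) project with a  4) LeakyReLU
    return leaky_relu(a[0] * w * hi + a[1] * w * hj)

def gatv2_score(hi, hj, W=((1.0, -1.0), (-1.0, 1.0)), a=(1.0, 1.0)):
    # 1) concatenate  2) apply W  3) LeakyReLU elementwise  4) project with a
    z = matvec(W, [hi, hj])
    return dot(a, [leaky_relu(x) for x in z])

neighbors = [1.0, 2.0, 3.0]

def ranking(score, query):
    return sorted(neighbors, key=lambda hj: score(query, hj), reverse=True)

print(ranking(gat_score, 0.0))    # [3.0, 2.0, 1.0]
print(ranking(gat_score, 4.0))    # [3.0, 2.0, 1.0]  -- static: same order
print(ranking(gatv2_score, 0.0))  # [3.0, 2.0, 1.0]
print(ranking(gatv2_score, 4.0))  # [1.0, 2.0, 3.0]  -- dynamic: order flips
```

With these hand-picked weights the GATv2 score works out to be proportional to |hi − hj|, so which neighbor ranks first depends on where the query sits; no choice of GAT's parameters can reproduce that query-dependent reordering.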

Interactive Demo

GAT vs. GATv2: Static vs. Dynamic Attention

Click any node to select it as the query node. Compare how attention weights change (or don't) between GAT and GATv2.

[Interactive demo: two side-by-side panels, GAT (Static Attention) and GATv2 (Dynamic Attention), each showing neighbor rankings for the selected query node.]

Edge thickness and opacity encode attention weight. Notice how GAT's neighbor rankings stay identical regardless of the query node, while GATv2's rankings change.

How It Works

The core issue comes down to the expressiveness of the attention function. An attention mechanism is a function e : R^d × R^d → R that scores how relevant key j is to query i.

Why GAT is limited

In GAT, the attention function can be decomposed:

aT [Whi || Whj] = alT Whi + arT Whj

where al and ar are the top and bottom halves of a. The LeakyReLU, being a monotonic function applied to this sum, preserves the ranking induced by the additive terms. Since the term arT Whj does not depend on i, the ranking of neighbors j is the same for every query i.
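The decomposition is just block-wise linear algebra, and easy to check numerically. The matrices and vectors below are arbitrary example values, with list concatenation playing the role of ||:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

W = [[0.5, 1.0], [2.0, -1.0]]
a = [1.0, -2.0, 0.5, 3.0]
al, ar = a[:2], a[2:]            # top and bottom halves of a

hi, hj = [1.0, 2.0], [-1.0, 0.5]
Whi, Whj = matvec(W, hi), matvec(W, hj)

lhs = dot(a, Whi + Whj)          # a^T [Whi || Whj]  (list + is concatenation)
rhs = dot(al, Whi) + dot(ar, Whj)
assert abs(lhs - rhs) < 1e-12
```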

Why GATv2 is universal

GATv2 applies LeakyReLU before the final linear mapping, so the score aT LeakyReLU(W [hi || hj]) is a single-hidden-layer MLP applied to the concatenated query and key. By standard universal approximation results, with a sufficiently wide hidden layer such a network can approximate any continuous scoring function of query and key, so GATv2 can realize any ranking of neighbors.

Results

The paper supports these claims with both a theoretical analysis and extensive experiments; in the reported benchmarks, GATv2 consistently matches or outperforms GAT and is notably more robust to noisy edges.

Key takeaway: GATv2 is a drop-in replacement for GAT. It has the same time complexity and a comparable parameter count, yet unlocks strictly more expressive attention. There is no reason to use the original GAT formulation.

@inproceedings{brody2022how,
  title={How Attentive are Graph Attention Networks?},
  author={Shaked Brody and Uri Alon and Eran Yahav},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2022},
  url={https://arxiv.org/abs/2105.14491}
}