TL;DR — Different RNN architectures (LSTM, GRU, Elman, etc.) can recognize fundamentally different classes of formal languages. This paper establishes a formal hierarchy showing LSTMs are strictly more powerful than GRUs, which are strictly more powerful than Elman RNNs.
The Problem
We know from years of empirical experience that some RNN architectures work better than others. LSTMs routinely outperform vanilla RNNs on tasks that require long-range dependencies. GRUs sit somewhere in the middle. But why?
Is there something fundamentally different about what these architectures can compute? Are there tasks that one architecture can solve in principle but another cannot, no matter how you train it? Or is it just a matter of optimization and training dynamics?
This paper answers these questions by going back to the foundations: formal language theory. Instead of comparing architectures on benchmarks, it asks what classes of formal languages each architecture can recognize when given unbounded precision in its internal state.
The Key Idea
The paper analyzes RNN architectures in the saturated regime—where activation functions are pushed to their extremes (sigmoids become step functions, tanh becomes sign). This idealization reveals the fundamental computational structure of each architecture, stripping away the noise of finite precision.
Under this lens, the key differentiator between architectures is their counting mechanism:
- Elman RNNs have bounded state: their hidden state is determined entirely by the saturated activations, giving them finite memory. They recognize exactly the regular languages.
- GRUs can implement a limited form of counting through their update gate, but the gating structure prevents them from doing full integer counting. They can recognize some but not all counter languages.
- LSTMs have a cell state that can be incremented and decremented by fixed amounts, providing a true unbounded counter. They can recognize languages like aⁿbⁿ and the one-bracket Dyck language (balanced parentheses of a single type).
The result is a strict hierarchy: Elman ⊊ GRU ⊊ LSTM. Each level is strictly more expressive than the one below it.
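The contrast at the two ends of the hierarchy can be simulated directly. The sketch below (my own illustration with hand-picked toy weights, not code from the paper) runs a saturated Elman update and a saturated LSTM cell update on the input aⁿ: the Elman hidden state can only visit finitely many values, while the LSTM cell grows with n.

```python
# Toy illustration (hand-set weights, assumed for demonstration):
# saturated Elman state lives in {-1, +1}^d, saturated LSTM cell counts.

def sign(x):
    return 1.0 if x >= 0 else -1.0

def saturated_elman_states(n, d=2):
    """Saturated Elman update h <- sign(Wx + Uh + b), here with W = U = I, b = 0.
    Returns the set of distinct states visited: always finite."""
    h = [-1.0] * d
    seen = set()
    for _ in range(n):
        h = [sign(1.0 + hi) for hi in h]  # input symbol 'a' encoded as 1.0
        seen.add(tuple(h))
    return seen

def saturated_lstm_cell(n):
    """Saturated LSTM cell: with f = i = 1 and g = +1 on every 'a',
    the update c = f*c + i*g increments the cell without bound."""
    c = 0.0
    for _ in range(n):
        f, i, g = 1.0, 1.0, 1.0  # saturated gate values
        c = f * c + i * g
    return c
```

On a string of 100 a's, the Elman state set stays tiny while the LSTM cell reaches 100: finite memory versus an unbounded counter.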
Interactive Demo
Architecture Hierarchy Explorer
Select an architecture to see where it sits in the hierarchy and which formal languages it can recognize.
String Processing Simulator
Choose a language and see how each architecture processes an input string. The animation shows internal state evolving as each character is consumed.
How It Works
Saturated Networks
The analysis relies on a key abstraction: saturated networks. In a saturated network, all sigmoid activations resolve to 0 or 1, and tanh activations resolve to -1 or +1. This transforms the continuous, differentiable network into a discrete computational device whose behavior can be analyzed with formal language theory.
Why is this reasonable? In practice, well-trained networks often push gate values close to 0 or 1. The saturated model captures the "intended" discrete computation that the network learns to approximate.
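One way to see the saturated regime concretely is to scale the pre-activations up: as the scale grows, sigmoid approaches a step function and tanh approaches sign. A minimal numerical sketch (the `saturate` helper is my own, chosen for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def saturate(act, x, scale=1000.0):
    """Approximate the saturated limit by scaling the pre-activation."""
    return act(scale * x)

# In the large-scale limit, gate activations resolve to discrete values:
# sigmoid -> {0, 1}, tanh -> {-1, +1}, turning the network into a
# discrete computational device.
```

For example, `saturate(sigmoid, 0.5)` is numerically 1.0 and `saturate(math.tanh, -0.5)` is exactly -1.0 in floating point.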
Counting Mechanisms
The critical distinction between architectures comes down to how they count:
| Architecture | State mechanism | Counting ability | Language class |
|---|---|---|---|
| Elman RNN | h_t = tanh(W x_t + U h_{t-1} + b) | None—finite states only | Regular |
| GRU | Gated update with reset | Conditional update, no true increment | Between regular and counter |
| LSTM | Cell + input/forget/output gates | Cell state acts as unbounded counter | Real-time counter languages |
In an LSTM, the cell state update has the form c_t = f * c_{t-1} + i * g. When the forget gate f = 1 and the input gate i = 1, the cell state is incremented by g. This gives the LSTM a true counter that can grow without bound.
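This additive update is easy to simulate with saturated gate values. The sketch below (hand-set gates g = ±1 chosen for illustration, not learned weights) counts a's up and b's down:

```python
# Saturated LSTM cell update c_t = f * c_{t-1} + i * g used as a counter.
# Gate values are hand-set to their saturated extremes for this toy example.

def lstm_counter(string):
    c = 0.0
    for ch in string:
        f, i = 1.0, 1.0                   # forget/input gates fully open
        g = 1.0 if ch == "a" else -1.0    # candidate value: +1 on 'a', -1 on 'b'
        c = f * c + i * g
    return c
```

After "aaabbb" the counter returns to 0, and after "aaab" it holds 2: the cell state tracks the running difference between a's and b's exactly.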
The GRU lacks this additive structure. Its update h_t = (1-z) * h_{t-1} + z * h' is an interpolation—it can only move the state toward a target value, never past it. This makes it unable to implement unbounded counting.
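The interpolation constraint shows up immediately in a toy simulation (again with hand-set saturated gate values, for illustration only): no matter how many a's arrive, the state is pinned inside [-1, 1] and cannot accumulate a count.

```python
# Saturated GRU update h_t = (1-z) * h_{t-1} + z * h', with z in {0, 1}
# and target h' in {-1, +1}. Each step either keeps h or overwrites it
# with +-1 -- the state never leaves [-1, 1].

def gru_run(string):
    h = 0.0
    for ch in string:
        z = 1.0                               # saturated update gate: overwrite
        h_target = 1.0 if ch == "a" else -1.0
        h = (1 - z) * h + z * h_target
    return h
```

Note that "ab" and "aaabbb" leave the state in the same place, so this update cannot distinguish strings by their counts.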
Results
Main theorem: Under the saturated assumption, the language classes recognized by Elman RNNs, GRUs, and LSTMs form a strict hierarchy. There exist languages recognizable by LSTMs but not GRUs, and languages recognizable by GRUs but not Elman RNNs.
The separating languages serve as concrete witnesses:
- Elman vs. GRU: The GRU can recognize certain languages beyond the regular class by using its update gate to conditionally preserve state, something the Elman RNN's fully-overwriting update cannot achieve.
- GRU vs. LSTM: The language aⁿbⁿ (equal numbers of a's and b's) is recognizable by LSTMs but not by GRUs. The LSTM's additive cell state can count up and count down; the GRU's interpolating update cannot.
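A full aⁿbⁿ recognizer in the LSTM style can be sketched as follows. This is my own toy construction, not the paper's: the cell update c = f*c + i*g does the counting, and a small amount of finite-state control (the `seen_b` flag, which a saturated hidden unit could also track) enforces that all a's precede all b's.

```python
# Sketch of a^n b^n recognition with an LSTM-style saturated counter.
# Hand-set gates (f = i = 1, g = +-1); finite-state control via seen_b.

def recognizes_anbn(string):
    c = 0.0
    seen_b = False
    for ch in string:
        if ch == "a":
            if seen_b:
                return False          # 'a' after 'b': reject
            c = 1.0 * c + 1.0 * 1.0   # c_t = f*c_{t-1} + i*g with g = +1
        else:
            seen_b = True
            c = 1.0 * c + 1.0 * (-1.0)  # g = -1: count down
            if c < 0:
                return False          # more b's than a's so far
    return c == 0.0                   # accept iff the counter returns to zero
```

The GRU's interpolating update has no analogue of the `c = c + g` step, which is exactly why it cannot implement this recognizer.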
This provides a formal explanation for the empirical observation that LSTMs tend to outperform GRUs on tasks requiring precise counting or tracking nested structure, while GRUs and Elman RNNs perform similarly on tasks that only require finite-state pattern matching.