TL;DR — Different RNN architectures (LSTM, GRU, Elman, etc.) can recognize fundamentally different classes of formal languages. This paper establishes a formal hierarchy showing LSTMs are strictly more powerful than GRUs, which are strictly more powerful than Elman RNNs.
The Problem
We know from years of empirical experience that some RNN architectures work better than others. LSTMs routinely outperform vanilla RNNs on tasks that require long-range dependencies. GRUs sit somewhere in the middle. But why?
Is there something fundamentally different about what these architectures can compute? Are there tasks that one architecture can solve in principle but another cannot, no matter how you train it? Or is it just a matter of optimization and training dynamics?
This paper answers these questions by going back to the foundations: formal language theory. Instead of comparing architectures on benchmarks, it asks what classes of formal languages each architecture can recognize when given unbounded precision in its internal state.
The Key Idea
The paper analyzes RNN architectures in the saturated regime—where activation functions are pushed to their extremes (sigmoids become step functions, tanh becomes sign). This idealization reveals the fundamental computational structure of each architecture, stripping away the noise of finite precision.
Under this lens, the key differentiator between architectures is their counting mechanism:
- Elman RNNs have bounded state: their hidden state is determined entirely by the saturated activations, giving them finite memory. They recognize exactly the regular languages.
- GRUs can implement a limited form of counting through their update gate, but the gating structure prevents them from doing full integer counting. They can recognize some but not all counter languages.
- LSTMs have a cell state that can be incremented and decremented by fixed amounts, providing a true unbounded counter. They can recognize languages like aⁿbⁿ and the one-bracket Dyck language (balanced parentheses of a single type).
The result is a strict hierarchy: Elman ⊊ GRU ⊊ LSTM. Each level is strictly more expressive than the one below it.
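The contrast at the two ends of the hierarchy can be simulated directly. The sketch below (my own illustration with hand-picked toy weights, not code from the paper) runs a saturated Elman update and a saturated LSTM cell update on the input aⁿ: the Elman hidden state can only visit finitely many values, while the LSTM cell grows with n.

```python
# Toy illustration (hand-set weights, assumed for demonstration):
# saturated Elman state lives in {-1, +1}^d, saturated LSTM cell counts.

def sign(x):
    return 1.0 if x >= 0 else -1.0

def saturated_elman_states(n, d=2):
    """Saturated Elman update h <- sign(Wx + Uh + b), here with W = U = I, b = 0.
    Returns the set of distinct states visited: always finite."""
    h = [-1.0] * d
    seen = set()
    for _ in range(n):
        h = [sign(1.0 + hi) for hi in h]  # input symbol 'a' encoded as 1.0
        seen.add(tuple(h))
    return seen

def saturated_lstm_cell(n):
    """Saturated LSTM cell: with f = i = 1 and g = +1 on every 'a',
    the update c = f*c + i*g increments the cell without bound."""
    c = 0.0
    for _ in range(n):
        f, i, g = 1.0, 1.0, 1.0  # saturated gate values
        c = f * c + i * g
    return c
```

On a string of 100 a's, the Elman state set stays tiny while the LSTM cell reaches 100: finite memory versus an unbounded counter.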
Interactive Demo
Architecture Hierarchy Explorer
Select an architecture to see where it sits in the hierarchy and which formal languages it can recognize.
String Processing Simulator
Choose a language and see how each architecture processes an input string. The animation shows internal state evolving as each character is consumed.
How It Works
Saturated Networks
The analysis relies on a key abstraction: saturated networks. In a saturated network, all sigmoid activations resolve to 0 or 1, and tanh activations resolve to -1 or +1. This transforms the continuous, differentiable network into a discrete computational device whose behavior can be analyzed with formal language theory.
Why is this reasonable? In practice, well-trained networks often push gate values close to 0 or 1. The saturated model captures the "intended" discrete computation that the network learns to approximate.
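One way to see the saturated regime concretely is to scale the pre-activations up: as the scale grows, sigmoid approaches a step function and tanh approaches sign. A minimal numerical sketch (the `saturate` helper is my own, chosen for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def saturate(act, x, scale=1000.0):
    """Approximate the saturated limit by scaling the pre-activation."""
    return act(scale * x)

# In the large-scale limit, gate activations resolve to discrete values:
# sigmoid -> {0, 1}, tanh -> {-1, +1}, turning the network into a
# discrete computational device.
```

For example, `saturate(sigmoid, 0.5)` is numerically 1.0 and `saturate(math.tanh, -0.5)` is exactly -1.0 in floating point.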
Counting Mechanisms
The critical distinction between architectures comes down to how they count:
| Architecture | State mechanism | Counting ability | Language class |
|---|---|---|---|
| Elman RNN | h_t = tanh(W x_t + U h_{t-1} + b) | None—finite states only | Regular |
| GRU | Gated update with reset | Conditional update, no true increment | Between regular and counter |
| LSTM | Cell + input/forget/output gates | Cell state acts as unbounded counter | Real-time counter languages |
In an LSTM, the cell state update has the form c_t = f * c_{t-1} + i * g. When the forget gate f = 1 and the input gate i = 1, the cell state is incremented by g. This gives the LSTM a true counter that can grow without bound.
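This additive update is easy to simulate with saturated gate values. The sketch below (hand-set gates g = ±1 chosen for illustration, not learned weights) counts a's up and b's down:

```python
# Saturated LSTM cell update c_t = f * c_{t-1} + i * g used as a counter.
# Gate values are hand-set to their saturated extremes for this toy example.

def lstm_counter(string):
    c = 0.0
    for ch in string:
        f, i = 1.0, 1.0                   # forget/input gates fully open
        g = 1.0 if ch == "a" else -1.0    # candidate value: +1 on 'a', -1 on 'b'
        c = f * c + i * g
    return c
```

After "aaabbb" the counter returns to 0, and after "aaab" it holds 2: the cell state tracks the running difference between a's and b's exactly.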
The GRU lacks this additive structure. Its update h_t = (1-z) * h_{t-1} + z * h' is an interpolation—it can only move the state toward a target value, never past it. This makes it unable to implement unbounded counting.
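The interpolation constraint shows up immediately in a toy simulation (again with hand-set saturated gate values, for illustration only): no matter how many a's arrive, the state is pinned inside [-1, 1] and cannot accumulate a count.

```python
# Saturated GRU update h_t = (1-z) * h_{t-1} + z * h', with z in {0, 1}
# and target h' in {-1, +1}. Each step either keeps h or overwrites it
# with +-1 -- the state never leaves [-1, 1].

def gru_run(string):
    h = 0.0
    for ch in string:
        z = 1.0                               # saturated update gate: overwrite
        h_target = 1.0 if ch == "a" else -1.0
        h = (1 - z) * h + z * h_target
    return h
```

Note that "ab" and "aaabbb" leave the state in the same place, so this update cannot distinguish strings by their counts.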
Results
Main theorem: Under the saturated assumption, the language classes recognized by Elman RNNs, GRUs, and LSTMs form a strict hierarchy. There exist languages recognizable by LSTMs but not GRUs, and languages recognizable by GRUs but not Elman RNNs.
The separating languages serve as concrete witnesses:
- Elman vs. GRU: The GRU can recognize certain languages beyond the regular class by using its update gate to conditionally preserve state, something the Elman RNN's fully-overwriting update cannot achieve.
- GRU vs. LSTM: The language aⁿbⁿ (equal numbers of a's and b's) is recognizable by LSTMs but not by GRUs. The LSTM's additive cell state can count up and count down; the GRU's interpolating update cannot.
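A full aⁿbⁿ recognizer in the LSTM style can be sketched as follows. This is my own toy construction, not the paper's: the cell update c = f*c + i*g does the counting, and a small amount of finite-state control (the `seen_b` flag, which a saturated hidden unit could also track) enforces that all a's precede all b's.

```python
# Sketch of a^n b^n recognition with an LSTM-style saturated counter.
# Hand-set gates (f = i = 1, g = +-1); finite-state control via seen_b.

def recognizes_anbn(string):
    c = 0.0
    seen_b = False
    for ch in string:
        if ch == "a":
            if seen_b:
                return False          # 'a' after 'b': reject
            c = 1.0 * c + 1.0 * 1.0   # c_t = f*c_{t-1} + i*g with g = +1
        else:
            seen_b = True
            c = 1.0 * c + 1.0 * (-1.0)  # g = -1: count down
            if c < 0:
                return False          # more b's than a's so far
    return c == 0.0                   # accept iff the counter returns to zero
```

The GRU's interpolating update has no analogue of the `c = c + g` step, which is exactly why it cannot implement this recognizer.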
This provides a formal explanation for the empirical observation that LSTMs tend to outperform GRUs on tasks requiring precise counting or tracking nested structure, while GRUs and Elman RNNs perform similarly on tasks that only require finite-state pattern matching.