ACL 2020

A Formal Hierarchy of RNN Architectures

William Merrill, Gail Weiss, Yoav Goldberg, Roy Schwartz, Noah A. Smith, Eran Yahav

TL;DR — Different RNN architectures (LSTM, GRU, Elman, etc.) can recognize fundamentally different classes of formal languages. This paper establishes a formal hierarchy showing LSTMs are strictly more powerful than GRUs, which are strictly more powerful than Elman RNNs.

The Problem

We know from years of empirical experience that some RNN architectures work better than others. LSTMs routinely outperform vanilla RNNs on tasks that require long-range dependencies. GRUs sit somewhere in the middle. But why?

Is there something fundamentally different about what these architectures can compute? Are there tasks that one architecture can solve in principle but another cannot, no matter how you train it? Or is it just a matter of optimization and training dynamics?

This paper answers these questions by going back to the foundations: formal language theory. Instead of comparing architectures on benchmarks, it asks what classes of formal languages each architecture can recognize when given unbounded precision in its internal state.

The Key Idea

The paper analyzes RNN architectures in the saturated regime—where activation functions are pushed to their extremes (sigmoids become step functions, tanh becomes sign). This idealization reveals the fundamental computational structure of each architecture, stripping away the noise of finite precision.

Under this lens, the key differentiator between architectures is their counting mechanism: whether, and how, an architecture can maintain an unbounded counter in its internal state.

The result is a strict hierarchy: Elman ⊊ GRU ⊊ LSTM. Each level is strictly more expressive than the one below it.

Interactive Demo

Architecture Hierarchy Explorer

Select an architecture to see where it sits in the hierarchy and which formal languages it can recognize.

Elman RNN: regular languages
GRU: limited counting
LSTM: full counter languages

Example languages in the demo: a^n b^n, Dyck-1, w#w^R, (ab)*, a*b*.

String Processing Simulator

Choose a language and see how each architecture processes an input string. The animation shows the internal state evolving as each character is consumed.


How It Works

Saturated Networks

The analysis relies on a key abstraction: saturated networks. In a saturated network, all sigmoid activations resolve to 0 or 1, and tanh activations resolve to -1 or +1. This transforms the continuous, differentiable network into a discrete computational device whose behavior can be analyzed with formal language theory.

Why is this reasonable? In practice, well-trained networks often push gate values close to 0 or 1. The saturated model captures the "intended" discrete computation that the network learns to approximate.
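To make this concrete, here is a minimal sketch (not from the paper; the function names are illustrative) showing that scaling a gate's pre-activation pushes a sigmoid toward a hard 0/1 step, which is exactly the limit the saturated analysis takes:

```python
import math

def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Hypothetical scalar "gate": as the scale factor grows, the output
# saturates to 0 or 1, turning a soft gate into a hard switch.
def gate(pre_activation, scale=1.0):
    return sigmoid(scale * pre_activation)

print(gate(0.5))              # soft: ~0.62
print(gate(0.5, scale=1e6))   # saturated: 1.0
print(gate(-0.5, scale=1e6))  # saturated: 0.0
```

The same limit applied to tanh yields the sign function, which is why the saturated state components take only discrete values.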

Counting Mechanisms

The critical distinction between architectures comes down to how they count:

Architecture | State mechanism | Counting ability | Language class
Elman RNN | h_t = tanh(W x_t + U h_{t-1} + b) | none (finite states only) | regular
GRU | gated update with reset | conditional update, no true increment | between regular and counter
LSTM | cell state + input/forget/output gates | cell state acts as unbounded counter | real-time counter languages

In an LSTM, the cell state update has the form c_t = f * c_{t-1} + i * g. When the forget gate f = 1 and the input gate i = 1, the cell state is incremented by g. This gives the LSTM a true counter that can grow without bound.
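The resulting counting behavior can be sketched as a hand-set counter for a^n b^n. This is an illustrative simulation of the saturated cell state (gates fixed to 0/1, g to ±1), not a trained network, and the function name is hypothetical:

```python
# Saturated-LSTM-style counter recognizing a^n b^n (n >= 0).
# On 'a': f=1, i=1, g=+1, so c_t = c_{t-1} + 1.
# On 'b': f=1, i=1, g=-1, so c_t = c_{t-1} - 1.
# Accept iff the counter returns exactly to 0 and no 'a' follows a 'b'.
def recognizes_anbn(s):
    count = 0          # cell state c_t acting as the counter
    seen_b = False
    for ch in s:
        if ch == 'a':
            if seen_b:           # 'a' after 'b' breaks the a^n b^n shape
                return False
            count += 1
        elif ch == 'b':
            seen_b = True
            count -= 1
            if count < 0:        # more b's than a's so far
                return False
        else:
            return False
    return count == 0

print(recognizes_anbn("aaabbb"))  # True
print(recognizes_anbn("aabbb"))   # False
```

No finite-state device can do this, since the counter must distinguish unboundedly many prefixes a, aa, aaa, and so on.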

The GRU lacks this additive structure. Its update h_t = (1-z) * h_{t-1} + z * h' is an interpolation—it can only move the state toward a target value, never past it. This makes it unable to implement unbounded counting.
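A scalar sketch (illustrative, not the paper's construction) makes the contrast visible: because the GRU update is a convex combination, and the saturated candidate state h' is bounded in [-1, 1], the hidden state can approach the bound but never accumulate past it:

```python
# GRU-style scalar update: a convex combination of old state and target.
def gru_step(h_prev, z, h_target):
    return (1 - z) * h_prev + z * h_target

h = 0.0
for _ in range(1000):              # try to "count" by pushing toward +1
    h = gru_step(h, z=0.5, h_target=1.0)
print(h)  # converges toward 1.0 but never exceeds it
```

Contrast this with the LSTM's additive cell update, where each step with f = i = 1 adds g to the state, so the state grows without bound.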

Results

Main theorem: Under the saturated assumption, the language classes recognized by Elman RNNs, GRUs, and LSTMs form a strict hierarchy. There exist languages recognizable by LSTMs but not GRUs, and languages recognizable by GRUs but not Elman RNNs.

The separating languages serve as concrete witnesses. For example, a^n b^n and Dyck-1 require an unbounded counter, so a saturated LSTM can recognize them but a GRU cannot, while regular languages such as (ab)* and a*b* are within reach of all three architectures.

This provides a formal explanation for the empirical observation that LSTMs tend to outperform GRUs on tasks requiring precise counting or tracking nested structure, while GRUs and Elman RNNs perform similarly on tasks that only require finite-state pattern matching.

@inproceedings{merrill-etal-2020-formal,
  title = "A Formal Hierarchy of {RNN} Architectures",
  author = "Merrill, William and Weiss, Gail and Goldberg, Yoav and Schwartz, Roy and Smith, Noah A. and Yahav, Eran",
  booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
  year = "2020",
  url = "https://arxiv.org/abs/2004.08500"
}