TL;DR — RNNs are black boxes, but you can extract interpretable DFAs from them using Angluin's L* algorithm. The RNN serves as the "oracle" — answering membership and equivalence queries — while L* constructs a minimal DFA that approximates the RNN's behavior.
The Problem
Recurrent neural networks can learn to recognize formal languages from examples, but their internal representations — continuous hidden states evolving through nonlinear transformations — are opaque. A trained RNN may achieve perfect accuracy on a test set, yet we have no way to state what language it has actually learned. Is it the target language? A superset? Something subtly different that happens to agree on the training distribution?
We need a way to extract a concise, human-readable description of the language recognized by an RNN. Ideally, this description would be a deterministic finite automaton (DFA): the machine model that exactly characterizes the regular languages, and one that is easy to inspect, compare, and formally verify.
The Key Idea
The paper's central insight is to treat the trained RNN as a black-box oracle for Angluin's L* algorithm — a classic algorithm from automata learning theory that constructs a minimal DFA by asking two types of questions:
- Membership queries: feed a string to the RNN, threshold the output to get accept/reject. "Does the string 0110 belong to the language?"
- Equivalence queries: sample random strings to find disagreements between the current hypothesis DFA and the RNN. "Is my current DFA correct, or is there a counterexample?"
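The membership side of this interface is simple to sketch. Below, `rnn_accept_prob` is a hypothetical stand-in for a trained network's scalar output on a string over {0, 1} (the real setup would run a GRU/LSTM and read its final output); the thresholding step is the part the paper's adaptation actually needs:

```python
def rnn_accept_prob(s: str) -> float:
    # Stand-in for a trained RNN: pretend the network has learned
    # "even number of 1s" and outputs a probability near 1 for members.
    return 0.95 if s.count("1") % 2 == 0 else 0.05

def membership_query(s: str, threshold: float = 0.5) -> bool:
    """Threshold the RNN's continuous output to a hard accept/reject."""
    return rnn_accept_prob(s) >= threshold

print(membership_query("0110"))  # two 1s -> accepted -> True
```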
L* iteratively builds an observation table that records query results, and from it constructs a sequence of hypothesis DFAs that converge to the RNN's actual behavior. Each counterexample refines the hypothesis, adding states until the DFA matches the RNN on all tested inputs.
Interactive Demo: RNN to DFA Extraction
The demo below simulates the L* extraction process on an RNN trained to recognize the language of strings over {0, 1} with an even number of 1s. Watch L* build the observation table, query the "RNN," and construct the DFA step by step.
[Interactive demo: step-by-step L* extraction on the language "even number of 1s". Each step shows L* querying the RNN oracle, filling in the observation table, and refining the hypothesis DFA. The closing comparison contrasts the trained RNN (black-box: continuous hidden states, opaque decision boundary) with the extracted DFA (interpretable: 2 states, clear transitions, provably minimal).]
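For reference, the 2-state automaton the demo converges to can be written out directly. A minimal sketch in plain Python (state names are illustrative):

```python
# Minimal DFA for "even number of 1s": two states tracking the parity
# of 1s seen so far. Reading a 0 keeps the state; reading a 1 flips it.
PARITY_DFA = {
    "start": "even",
    "accepting": {"even"},
    "delta": {
        ("even", "0"): "even", ("even", "1"): "odd",
        ("odd", "0"): "odd",   ("odd", "1"): "even",
    },
}

def dfa_accepts(dfa, s: str) -> bool:
    """Run the DFA on s and report acceptance."""
    state = dfa["start"]
    for ch in s:
        state = dfa["delta"][(state, ch)]
    return state in dfa["accepting"]

print(dfa_accepts(PARITY_DFA, "0110"))  # two 1s -> True
print(dfa_accepts(PARITY_DFA, "101"))   # two 1s -> True
```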
How It Works
Angluin's L* algorithm, published in 1987, learns an unknown regular language by interacting with a Minimally Adequate Teacher (MAT) that answers two types of queries. In our adaptation, the trained RNN plays the role of this teacher:
The L* Algorithm Adapted for RNNs
- Initialize an observation table with the empty string and the alphabet symbols. Fill in table entries by sending membership queries to the RNN.
- Make the table closed and consistent. If a row in the extension (bottom) part matches no row in the top part, promote it to the top (closedness). If two top rows are identical but differ after appending some symbol, add a distinguishing suffix to the columns (consistency).
- Construct a hypothesis DFA from the closed, consistent table. Each distinct row in the top part becomes a state; transitions are determined by the table's structure.
- Check equivalence by sampling random strings and comparing the hypothesis DFA's answer with the RNN's answer. If a disagreement (counterexample) is found, add it and all its prefixes to the table.
- Repeat from step 2 until no counterexample is found within the sampling budget.
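The loop above can be sketched compactly. This sketch replaces the RNN oracle with a ground-truth membership function and an exhaustive bounded-length equivalence check, and it handles counterexamples by adding their suffixes to the column set (a standard L* variant that keeps all rows distinct, so no separate consistency step is needed); all names are illustrative:

```python
from itertools import product

def lstar(member, alphabet, equiv):
    """Compact L* sketch. `member(s)` answers membership queries;
    `equiv(hyp)` returns a counterexample string, or None to accept."""
    S, E = [""], [""]  # access prefixes (future states), distinguishing suffixes

    def row(s):
        return tuple(member(s + e) for e in E)

    while True:
        # Closedness: every one-symbol extension must match a row of S.
        while True:
            rows = {row(s) for s in S}
            new = next((s + a for s in S for a in alphabet
                        if row(s + a) not in rows), None)
            if new is None:
                break
            S.append(new)  # promote the unmatched extension to a state
        # Build the hypothesis DFA: one state per (distinct) row of S.
        state = {row(s): s for s in S}
        hyp = {
            "start": state[row("")],
            "accepting": {s for s in S if row(s)[0]},  # E[0] == ""
            "delta": {(s, a): state[row(s + a)]
                      for s in S for a in alphabet},
        }
        cex = equiv(hyp)
        if cex is None:
            return hyp
        for i in range(len(cex) + 1):  # add every suffix of the counterexample
            if cex[i:] not in E:
                E.append(cex[i:])

def run_dfa(hyp, s):
    q = hyp["start"]
    for a in s:
        q = hyp["delta"][(q, a)]
    return q in hyp["accepting"]

target = lambda s: s.count("1") % 2 == 0  # stand-in for the RNN

def exhaustive_equiv(hyp):  # bounded-length check standing in for sampling
    for n in range(8):
        for t in product("01", repeat=n):
            s = "".join(t)
            if run_dfa(hyp, s) != target(s):
                return s
    return None

dfa = lstar(target, "01", exhaustive_equiv)
print(sorted({q for q, _ in dfa["delta"]}))  # two states, as expected
```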
Key adaptation: Since we cannot truly verify equivalence with a neural network (it would require checking infinitely many strings), equivalence queries are approximated by random sampling. The more strings we sample, the more confident we are that the extracted DFA is correct — but we can never be 100% sure.
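A sampling-based equivalence check under these assumptions might look like the following. Here `rnn_accepts` is a hypothetical thresholded RNN, and the sampling budget and length cap are illustrative parameters:

```python
import random

def run_dfa(hyp, s):
    q = hyp["start"]
    for a in s:
        q = hyp["delta"][(q, a)]
    return q in hyp["accepting"]

def sampled_equivalence(hyp, rnn_accepts, alphabet="01",
                        n_samples=1000, max_len=20, seed=0):
    """Approximate equivalence query: sample random strings and return
    the first disagreement between the hypothesis DFA and the RNN."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        length = rng.randrange(max_len + 1)
        s = "".join(rng.choice(alphabet) for _ in range(length))
        if run_dfa(hyp, s) != rnn_accepts(s):
            return s      # counterexample: hypothesis and RNN disagree
    return None           # no disagreement found within the budget

# Demo: a deliberately wrong 1-state hypothesis ("accept everything")
# against a stand-in RNN that accepts strings with an even number of 1s.
always_accept = {"start": "q", "accepting": {"q"},
                 "delta": {("q", "0"): "q", ("q", "1"): "q"}}
rnn = lambda s: s.count("1") % 2 == 0
cex = sampled_equivalence(always_accept, rnn)
print(cex)  # some string with an odd number of 1s
```

Returning `None` here means only "probably equivalent": a larger budget or longer samples raises confidence but never certainty, which is exactly the caveat above.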
Results
The extraction method was applied to RNNs (GRUs and LSTMs) trained on a range of regular languages, including parity, the Tomita grammars, and balanced parentheses of bounded depth. The key findings:
- Faithful extraction: In most cases, L* extracted a DFA that perfectly matched the target language, confirming the RNN had learned the correct language.
- Revealing mistakes: In some cases, the extracted DFA revealed that the RNN had learned a subtly wrong language — one that agreed with the training set but diverged on longer or rarer strings.
- Architecture insights: GRUs and LSTMs learned different languages for the same training data in some cases, and the extracted DFAs made these differences immediately visible.
- Compact representations: The extracted DFAs were small (often matching the minimal DFA for the target language), providing interpretable models from opaque neural networks.
This work establishes a practical bridge between neural network learning and formal language theory — turning black-box RNNs into white-box automata that can be inspected, verified, and compared.