ICML 2020

Structural Language Models of Code

Uri Alon, Roy Sadaka, Omer Levy, Eran Yahav

TL;DR — Instead of treating code as a flat sequence of tokens, this model generates code by predicting AST nodes — leveraging the tree structure. Given partial code, it uses AST paths to the missing "hole" to predict completions with higher accuracy than token-based models.

The Problem

Standard language models treat source code as a flat sequence of tokens — just like natural language text. But code is fundamentally different: it has a well-defined tree structure called an Abstract Syntax Tree (AST). Every valid program corresponds to a tree where nodes represent syntactic constructs (statements, expressions, operators) and leaves represent concrete tokens (variable names, literals, keywords).

By ignoring this tree structure, token-based models miss important structural constraints. For example, a closing brace } must match an opening brace from the same block, possibly hundreds of tokens earlier. A flat sequence model must learn these long-range dependencies implicitly, while the AST encodes them explicitly as parent-child relationships.

Key insight: Code is not just text. Its inherent tree structure (AST) captures syntactic constraints and long-range relationships that flat token sequences cannot represent efficiently. A language model for code should generate trees, not token sequences.
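This tree structure is easy to see with Python's stdlib `ast` module (used here purely for illustration; the paper's model operates on other languages' ASTs, but the structure is analogous):

```python
import ast

# Parse a small snippet into Python's built-in AST.
tree = ast.parse("if x > 0:\n    y = x + 1")

# Internal nodes are syntactic constructs; leaves hold concrete tokens.
print(ast.dump(tree, indent=2))  # indent= requires Python 3.9+

# Block membership is explicit as parent-child structure, with no
# brace-matching to learn implicitly:
if_node = tree.body[0]
print(type(if_node).__name__)          # If
print(type(if_node.test).__name__)     # Compare  (the condition x > 0)
print(type(if_node.body[0]).__name__)  # Assign   (the then-branch y = x + 1)
```

Note how the condition and the then-branch are direct children of the `If` node, no matter how many tokens separate them in the flat source text.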

The Key Idea

Instead of predicting the next token, predict the next AST node. The model generates code by expanding a partial AST one node at a time, choosing which grammar rule to apply at each step.

The critical question is: what context should the model use for each prediction? Rather than relying on a flat window of preceding tokens, the model extracts AST paths — sequences of nodes connecting relevant parts of the partial tree to the current "hole" (the node being predicted). These paths capture structural relationships that span the entire program, regardless of token distance.

Each AST path walks up from a context node to an ancestor, then down to the hole, encoding the syntactic relationship between the two. Multiple such paths from different context nodes are combined to form a rich, structure-aware representation for prediction.
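The up-then-down walk can be sketched in a few lines. This is a minimal illustration with made-up node kinds and a toy tree, not the paper's actual data structures:

```python
# Minimal sketch of AST path extraction: walk up from a context node to
# the lowest common ancestor, then down to the hole. Node kinds and the
# example tree are illustrative, not the paper's exact representation.

class Node:
    def __init__(self, kind, parent=None):
        self.kind = kind
        self.parent = parent

def ancestors(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain  # the node itself first, root last

def ast_path(context, hole):
    up = ancestors(context)
    down = ancestors(hole)
    common = next(n for n in up if n in down)  # lowest common ancestor
    up_part = [n.kind for n in up[: up.index(common) + 1]]
    down_part = [n.kind for n in reversed(down[: down.index(common)])]
    return up_part + down_part

# Tiny tree for: if (x) { <hole> }
root = Node("If")
cond = Node("Cond", root)
x = Node("Name:x", cond)
block = Node("Block", root)
hole = Node("Hole", block)

print(ast_path(x, hole))  # ['Name:x', 'Cond', 'If', 'Block', 'Hole']
```

The resulting sequence of node kinds is exactly the "syntactic relationship" the model encodes: here, `x` is the condition of the `If` whose block contains the hole.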



How It Works

The model follows a generate-by-expanding approach on the AST. At each step, it identifies the leftmost non-terminal leaf in the partial AST (the "hole") and predicts how to expand it.

1. Parse Partial AST — The incomplete code is parsed into a partial AST containing one or more non-terminal leaves (holes).

2. Extract AST Paths — Paths from context nodes (existing leaves) to the target hole are extracted, each encoding a structural relationship.

3. Encode Paths — Each path is encoded with an LSTM that reads the sequence of AST node types along the path.

4. Predict Node — Path representations are aggregated with attention and decoded to predict the next AST node (a grammar rule or terminal value).
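The generate-by-expanding loop itself can be sketched as a leftmost derivation over a toy grammar. The grammar and the `predict_rule` stub below are illustrative stand-ins for the paper's learned, path-based predictor:

```python
# Toy sketch of generate-by-expanding: repeatedly find the leftmost
# non-terminal (the "hole") and apply a predicted grammar rule.
GRAMMAR = {
    "Stmt":   [["Assign"], ["IfStmt"]],
    "IfStmt": [["if", "Expr", "Stmt"]],
    "Assign": [["Name", "=", "Expr"]],
    "Expr":   [["Name"]],
    "Name":   [["x"]],
}

def predict_rule(symbol):
    # Stand-in for the learned model: the paper scores applicable rules
    # using attention over encoded AST paths; here we always take the
    # first rule so the toy derivation terminates.
    return GRAMMAR[symbol][0]

form = ["Stmt"]  # sentential form; symbols in GRAMMAR are holes
steps = []
while any(s in GRAMMAR for s in form):
    i = next(j for j, s in enumerate(form) if s in GRAMMAR)  # leftmost hole
    rule = predict_rule(form[i])
    steps.append((form[i], rule))
    form = form[:i] + rule + form[i + 1:]

print(form)  # ['x', '=', 'x']
```

Because every step replaces a hole with the right-hand side of a real grammar rule, the final output is syntactically valid by construction.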

Why AST Paths Work

An AST path from a context node to the hole encodes the syntactic relationship between two parts of the program. For example, a path might say: "the variable x is the condition of an if-statement whose then-branch contains the hole." This path captures that x was just tested, so the hole is likely to use x or related variables — even if they are many tokens apart.

Multiple paths from different context nodes provide complementary views of the same hole. The model combines them with an attention mechanism, learning which paths are most informative for each prediction.

By working directly on the AST, the structural model naturally respects the grammar of the language — it cannot generate syntactically invalid code. Every prediction corresponds to a valid grammar rule application or terminal value.
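The attention-based combination of paths can be sketched numerically. The path vectors and query below are made-up numbers; in the real model each vector comes from an LSTM encoding of one AST path, and the query is learned:

```python
import math

# Illustrative path representations (2-dimensional for readability).
paths = {
    "x -> Cond -> If -> Block -> Hole": [0.9, 0.1],
    "y -> Assign -> Body -> Hole":      [0.2, 0.7],
    "lit -> Return -> Hole":            [0.1, 0.2],
}
query = [1.0, 0.0]  # learned in the real model; fixed here

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Softmax over path-query scores -> attention weights.
scores = {k: dot(v, query) for k, v in paths.items()}
z = sum(math.exp(s) for s in scores.values())
weights = {k: math.exp(s) / z for k, s in scores.items()}

# Weighted sum of path vectors = the structure-aware context vector
# that is decoded into the next AST node.
context = [sum(w * v[i] for (k, v), w in zip(paths.items(), weights.values()))
           for i in range(2)]
print(max(weights, key=weights.get))  # the most informative path
```

The weights sum to one, and the highest-scoring path contributes most to the context vector, which is how the model learns which structural relationships matter for each prediction.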

Results

The structural language model consistently outperforms flat token-based models on code completion tasks, especially for predictions requiring long-range context or structural understanding.

Model                        Exact Match (%)   Syntactically Valid (%)
Token-level LSTM             12.4              82.1
Token-level Transformer      14.7              84.3
Structural LM (AST paths)    18.0              100.0

Key findings include:

- Exact-match accuracy improves from 14.7% (the strongest token-level baseline) to 18.0%.
- Every generated output is syntactically valid by construction (100%), versus 82-84% for the token-level baselines.

The structural language model demonstrates that leveraging the inherent tree structure of code is both feasible and beneficial. By predicting AST nodes instead of tokens, the model achieves higher accuracy and guarantees syntactic validity — a property no token-based model can offer.

@inproceedings{alon2020structural,
  title     = {Structural Language Models of Code},
  author    = {Alon, Uri and Sadaka, Roy and Levy, Omer and Yahav, Eran},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning (ICML)},
  year      = {2020},
  url       = {https://arxiv.org/abs/1910.00577}
}