ICML 2020

Structural Language Models of Code

Uri Alon, Roy Sadaka, Omer Levy, Eran Yahav

TL;DR — Instead of treating code as a flat sequence of tokens, this model generates code by predicting AST nodes — leveraging the tree structure. Given partial code, it uses AST paths to the missing "hole" to predict completions with higher accuracy than token-based models.

The Problem

Standard language models treat source code as a flat sequence of tokens — just like natural language text. But code is fundamentally different: it has a well-defined tree structure called an Abstract Syntax Tree (AST). Every valid program corresponds to a tree where nodes represent syntactic constructs (statements, expressions, operators) and leaves represent concrete tokens (variable names, literals, keywords).

By ignoring this tree structure, token-based models miss important structural constraints. For example, a closing brace } must match an opening brace from the same block, possibly hundreds of tokens earlier. A flat sequence model must learn these long-range dependencies implicitly, while the AST encodes them explicitly as parent-child relationships.

Key insight: Code is not just text. Its inherent tree structure (AST) captures syntactic constraints and long-range relationships that flat token sequences cannot represent efficiently. A language model for code should generate trees, not token sequences.
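This tree structure is easy to see with Python's stdlib `ast` module (used here purely for illustration; the paper's model operates on other languages' ASTs, but the structure is analogous):

```python
import ast

# Parse a small snippet into Python's built-in AST.
tree = ast.parse("if x > 0:\n    y = x + 1")

# Internal nodes are syntactic constructs; leaves hold concrete tokens.
print(ast.dump(tree, indent=2))  # indent= requires Python 3.9+

# Block membership is explicit as parent-child structure, with no
# brace-matching to learn implicitly:
if_node = tree.body[0]
print(type(if_node).__name__)          # If
print(type(if_node.test).__name__)     # Compare  (the condition x > 0)
print(type(if_node.body[0]).__name__)  # Assign   (the then-branch y = x + 1)
```

Note how the condition and the then-branch are direct children of the `If` node, no matter how many tokens separate them in the flat source text.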

The Key Idea

Instead of predicting the next token, predict the next AST node. The model generates code by expanding a partial AST one node at a time, choosing which grammar rule to apply at each step.

The critical question is: what context should the model use for each prediction? Rather than relying on a flat window of preceding tokens, the model extracts AST paths — sequences of nodes connecting relevant parts of the partial tree to the current "hole" (the node being predicted). These paths capture structural relationships that span the entire program, regardless of token distance.

Each AST path walks up from a context node to an ancestor, then down to the hole, encoding the syntactic relationship between the two. Multiple such paths from different context nodes are combined to form a rich, structure-aware representation for prediction.
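The up-then-down walk can be sketched in a few lines. This is a minimal illustration with made-up node kinds and a toy tree, not the paper's actual data structures:

```python
# Minimal sketch of AST path extraction: walk up from a context node to
# the lowest common ancestor, then down to the hole. Node kinds and the
# example tree are illustrative, not the paper's exact representation.

class Node:
    def __init__(self, kind, parent=None):
        self.kind = kind
        self.parent = parent

def ancestors(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain  # the node itself first, root last

def ast_path(context, hole):
    up = ancestors(context)
    down = ancestors(hole)
    common = next(n for n in up if n in down)  # lowest common ancestor
    up_part = [n.kind for n in up[: up.index(common) + 1]]
    down_part = [n.kind for n in reversed(down[: down.index(common)])]
    return up_part + down_part

# Tiny tree for: if (x) { <hole> }
root = Node("If")
cond = Node("Cond", root)
x = Node("Name:x", cond)
block = Node("Block", root)
hole = Node("Hole", block)

print(ast_path(x, hole))  # ['Name:x', 'Cond', 'If', 'Block', 'Hole']
```

The resulting sequence of node kinds is exactly the "syntactic relationship" the model encodes: here, `x` is the condition of the `If` whose block contains the hole.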



How It Works

The model follows a generate-by-expanding approach on the AST. At each step, it identifies the leftmost non-terminal leaf in the partial AST (the "hole") and predicts how to expand it.

1. Parse Partial AST — The incomplete code is parsed into a partial AST containing one or more non-terminal leaves (holes).

2. Extract AST Paths — Paths from context nodes (existing leaves) to the target hole are extracted, each encoding a structural relationship.

3. Encode Paths — Each path is encoded with an LSTM that reads the sequence of AST node types along the path.

4. Predict Node — Path representations are aggregated with attention and decoded to predict the next AST node (a grammar rule or terminal value).
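The generate-by-expanding loop itself can be sketched as a leftmost derivation over a toy grammar. The grammar and the `predict_rule` stub below are illustrative stand-ins for the paper's learned, path-based predictor:

```python
# Toy sketch of generate-by-expanding: repeatedly find the leftmost
# non-terminal (the "hole") and apply a predicted grammar rule.
GRAMMAR = {
    "Stmt":   [["Assign"], ["IfStmt"]],
    "IfStmt": [["if", "Expr", "Stmt"]],
    "Assign": [["Name", "=", "Expr"]],
    "Expr":   [["Name"]],
    "Name":   [["x"]],
}

def predict_rule(symbol):
    # Stand-in for the learned model: the paper scores applicable rules
    # using attention over encoded AST paths; here we always take the
    # first rule so the toy derivation terminates.
    return GRAMMAR[symbol][0]

form = ["Stmt"]  # sentential form; symbols in GRAMMAR are holes
steps = []
while any(s in GRAMMAR for s in form):
    i = next(j for j, s in enumerate(form) if s in GRAMMAR)  # leftmost hole
    rule = predict_rule(form[i])
    steps.append((form[i], rule))
    form = form[:i] + rule + form[i + 1:]

print(form)  # ['x', '=', 'x']
```

Because every step replaces a hole with the right-hand side of a real grammar rule, the final output is syntactically valid by construction.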

Why AST Paths Work

An AST path from a context node to the hole encodes the syntactic relationship between two parts of the program. For example, a path might say: "the variable x is the condition of an if-statement whose then-branch contains the hole." This path captures that x was just tested, so the hole is likely to use x or related variables — even if they are many tokens apart.

Multiple paths from different context nodes provide complementary views of the same hole. The model combines them with an attention mechanism, learning which paths are most informative for each prediction.

By working directly on the AST, the structural model naturally respects the grammar of the language — it cannot generate syntactically invalid code. Every prediction corresponds to a valid grammar rule application or terminal value.
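The attention-based combination of paths can be sketched numerically. The path vectors and query below are made-up numbers; in the real model each vector comes from an LSTM encoding of one AST path, and the query is learned:

```python
import math

# Illustrative path representations (2-dimensional for readability).
paths = {
    "x -> Cond -> If -> Block -> Hole": [0.9, 0.1],
    "y -> Assign -> Body -> Hole":      [0.2, 0.7],
    "lit -> Return -> Hole":            [0.1, 0.2],
}
query = [1.0, 0.0]  # learned in the real model; fixed here

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Softmax over path-query scores -> attention weights.
scores = {k: dot(v, query) for k, v in paths.items()}
z = sum(math.exp(s) for s in scores.values())
weights = {k: math.exp(s) / z for k, s in scores.items()}

# Weighted sum of path vectors = the structure-aware context vector
# that is decoded into the next AST node.
context = [sum(w * v[i] for (k, v), w in zip(paths.items(), weights.values()))
           for i in range(2)]
print(max(weights, key=weights.get))  # the most informative path
```

The weights sum to one, and the highest-scoring path contributes most to the context vector, which is how the model learns which structural relationships matter for each prediction.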

Results

The structural language model consistently outperforms flat token-based models on code completion tasks, especially for predictions requiring long-range context or structural understanding.

Model                        Exact Match (%)   Syntactically Valid (%)
Token-level LSTM             12.4              82.1
Token-level Transformer      14.7              84.3
Structural LM (AST paths)    18.0              100.0

Key findings include:

- Exact-match accuracy improves from 14.7% (the strongest token-level baseline) to 18.0%.
- Every generated output is syntactically valid by construction (100%), versus 82-84% for the token-level baselines.

The structural language model demonstrates that leveraging the inherent tree structure of code is both feasible and beneficial. By predicting AST nodes instead of tokens, the model achieves higher accuracy and guarantees syntactic validity — a property no token-based model can offer.

@inproceedings{alon2020structural,
  title     = {Structural Language Models of Code},
  author    = {Alon, Uri and Sadaka, Roy and Levy, Omer and Yahav, Eran},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning (ICML)},
  year      = {2020},
  url       = {https://arxiv.org/abs/1910.00577}
}