TL;DR — AST paths (sequences of nodes connecting two terminals in a syntax tree) are a powerful general-purpose representation for code. They capture both syntactic and semantic patterns, enabling prediction of variable names, method names, types, and more.
The Problem
How do you represent program elements — variables, methods, expressions — in a way that captures their structural context? Traditional approaches either flatten the code into token sequences (losing tree structure) or attempt to encode the full AST with expensive recursive neural networks.
Token-based features miss critical relationships: they cannot see that a variable appears inside a loop condition, or that a parameter flows into a return statement. Full tree encoders, on the other hand, are computationally expensive and often struggle with long-range dependencies in deep trees.
What we need is a representation that captures structural patterns efficiently — one that is rich enough to encode meaningful program properties, yet simple enough to scale.
The Key Idea
The core insight is to extract paths between pairs of terminal nodes in the Abstract Syntax Tree (AST). Each path walks upward from one leaf node to a common ancestor, then downward to another leaf. These paths naturally capture syntactic patterns such as "variable is assigned inside a loop" or "parameter is passed to a method call."
For example, a path from variable x to variable y might traverse: x ↑ Assignment ↑ ForStatement ↓ Condition ↓ y. This single path encodes the fact that x is assigned in the body of a for-loop whose condition involves y.
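This up-then-down walk can be reproduced concretely with Python's built-in `ast` module. The sketch below is illustrative (the helper names `ancestor_chains` and `ast_path` are not from any library), and it identifies leaves by their variable name for simplicity:

```python
import ast

def ancestor_chains(tree):
    """Map each Name leaf's identifier to its root-to-leaf ancestor chain."""
    chains = {}
    def walk(node, ancestors):
        if isinstance(node, ast.Name):
            chains[node.id] = [type(a).__name__ for a in ancestors]
        for child in ast.iter_child_nodes(node):
            walk(child, ancestors + [node])
    walk(tree, [])
    return chains

def ast_path(tree, a, b):
    """Path from leaf a up to the lowest common ancestor, then down to leaf b."""
    chains = ancestor_chains(tree)
    pa, pb = chains[a], chains[b]
    i = 0  # length of the common prefix = depth below the lowest common ancestor
    while i < min(len(pa), len(pb)) and pa[i] == pb[i]:
        i += 1
    # a, up through a's ancestors, the common ancestor, down through b's ancestors, b
    return [a] + list(reversed(pa[i:])) + [pa[i - 1]] + pb[i:] + [b]

tree = ast.parse("while y > 0:\n    x = 1")
print(ast_path(tree, "x", "y"))
# → ['x', 'Assign', 'While', 'Compare', 'y']
```

The printed path reads as x ↑ Assign ↑ While ↓ Compare ↓ y: x is assigned inside a while-loop whose condition involves y.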
By collecting a bag of paths for each program element, we obtain a rich, fixed-size representation that captures diverse structural patterns — without needing to process the entire tree.
How It Works
Path Extraction Algorithm
Given a program element (e.g., a method body), the algorithm works as follows:

1. Parse the code into an AST.
2. Enumerate pairs of terminal (leaf) nodes.
3. For each pair, walk from one terminal up to the pair's lowest common ancestor, then down to the other terminal, recording the node types visited along the way.
4. Collect the resulting paths (typically filtered by a maximum length, since very long paths are rare and noisy) into the element's bag of path-contexts.
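As a concrete sketch, here is a minimal path-context extractor built on Python's `ast` module. The `max_length` cutoff and the `^`-joined path encoding are illustrative choices, not the paper's exact scheme:

```python
import ast
import itertools

def extract_path_contexts(code, max_length=8):
    """Enumerate ⟨terminal, path, terminal⟩ triples for every pair of Name leaves."""
    tree = ast.parse(code)
    leaves = []  # (identifier, root-to-leaf ancestor chain)
    def walk(node, ancestors):
        if isinstance(node, ast.Name):
            leaves.append((node.id, [type(a).__name__ for a in ancestors]))
        for child in ast.iter_child_nodes(node):
            walk(child, ancestors + [node])
    walk(tree, [])

    contexts = []
    for (a, pa), (b, pb) in itertools.combinations(leaves, 2):
        i = 0  # depth below the lowest common ancestor
        while i < min(len(pa), len(pb)) and pa[i] == pb[i]:
            i += 1
        path = list(reversed(pa[i:])) + [pa[i - 1]] + pb[i:]
        if len(path) <= max_length:  # discard overly long paths
            contexts.append((a, "^".join(path), b))
    return contexts

for ctx in extract_path_contexts("while y > 0:\n    x = g(y)"):
    print(ctx)
```

Running this on the two-line loop yields one triple per pair of leaves, e.g. `('x', 'Assign^Call', 'g')` for the assignment of a call result to x.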
Learning Representations
Each path-context triple ⟨ts, p, te⟩ is embedded using three learned embedding matrices: one for the source terminal's value, one for the path, and one for the target terminal's value. The three embeddings are concatenated and fed through a fully connected layer, then aggregated (via attention) over all path-contexts in a code snippet to produce a single fixed-length vector.
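A minimal numpy sketch of this embed-concatenate-attend pipeline; the dimensions, vocabularies, and random weights are all illustrative, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4  # embedding width (illustrative)
value_vocab = {"x": 0, "y": 1, "n": 2}
path_vocab = {"Assign^While^Compare": 0, "Assign^Call": 1}

value_emb = rng.normal(size=(len(value_vocab), d))  # terminal-value embeddings
path_emb = rng.normal(size=(len(path_vocab), d))    # path embeddings
W = rng.normal(size=(3 * d, d))                     # fully connected layer
attn = rng.normal(size=d)                           # global attention vector

def encode(contexts):
    """Embed each ⟨ts, p, te⟩ triple, then attention-pool into one code vector."""
    combined = []
    for ts, p, te in contexts:
        c = np.concatenate([value_emb[value_vocab[ts]],
                            path_emb[path_vocab[p]],
                            value_emb[value_vocab[te]]])
        combined.append(np.tanh(c @ W))  # combined context vector
    combined = np.stack(combined)
    scores = combined @ attn
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax attention weights
    return alpha @ combined              # weighted sum: fixed-length code vector

v = encode([("x", "Assign^While^Compare", "y"), ("x", "Assign^Call", "n")])
print(v.shape)
# → (4,)
```

However many path-contexts a snippet contains, the attention-weighted sum always produces a vector of the same width, which is what downstream prediction heads consume.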
This vector is the code representation. It can be used to predict method names (the model learns that certain path patterns correspond to certain method purposes), variable types, or other program properties.
Key advantage: Unlike sequential models (which see code as a flat token stream) or tree-based models (which process the full AST recursively), the path-based representation decomposes the tree into a bag of local structural patterns. This is both computationally efficient and surprisingly expressive.
Results
The path-based representation achieves strong results across multiple tasks and programming languages:
The model accurately predicts variable names, method names, and types across Java, JavaScript, Python, and C#. It significantly outperforms token-based approaches and matches or exceeds tree-based neural networks at a fraction of the computational cost.
The path-based representation became the foundation for code2vec and code2seq, two widely-used models for learning code representations, demonstrating that this decomposition of ASTs into paths is a powerful and general principle.