TL;DR — AST paths (sequences of nodes connecting two terminals in a syntax tree) are a powerful general-purpose representation for code. They capture both syntactic and semantic patterns, enabling prediction of variable names, method names, types, and more.
The Problem
How do you represent program elements — variables, methods, expressions — in a way that captures their structural context? Traditional approaches either flatten the code into token sequences (losing tree structure) or attempt to encode the full AST with expensive recursive neural networks.
Token-based features miss critical relationships: they cannot see that a variable appears inside a loop condition, or that a parameter flows into a return statement. Full tree encoders, on the other hand, are computationally expensive and often struggle with long-range dependencies in deep trees.
What we need is a representation that captures structural patterns efficiently — one that is rich enough to encode meaningful program properties, yet simple enough to scale.
The Key Idea
The core insight is to extract paths between pairs of terminal nodes in the Abstract Syntax Tree (AST). Each path walks upward from one leaf node to a common ancestor, then downward to another leaf. These paths naturally capture syntactic patterns such as "variable is assigned inside a loop" or "parameter is passed to a method call."
For example, a path from variable x to variable y might traverse: x ↑ Assignment ↑ ForStatement ↓ Condition ↓ y. This single path encodes the fact that x is assigned in the body of a for-loop whose condition involves y.
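This up-then-down walk can be reproduced concretely with Python's built-in `ast` module. The sketch below is illustrative (the helper names `ancestor_chains` and `ast_path` are not from any library), and it identifies leaves by their variable name for simplicity:

```python
import ast

def ancestor_chains(tree):
    """Map each Name leaf's identifier to its root-to-leaf ancestor chain."""
    chains = {}
    def walk(node, ancestors):
        if isinstance(node, ast.Name):
            chains[node.id] = [type(a).__name__ for a in ancestors]
        for child in ast.iter_child_nodes(node):
            walk(child, ancestors + [node])
    walk(tree, [])
    return chains

def ast_path(tree, a, b):
    """Path from leaf a up to the lowest common ancestor, then down to leaf b."""
    chains = ancestor_chains(tree)
    pa, pb = chains[a], chains[b]
    i = 0  # length of the common prefix = depth below the lowest common ancestor
    while i < min(len(pa), len(pb)) and pa[i] == pb[i]:
        i += 1
    # a, up through a's ancestors, the common ancestor, down through b's ancestors, b
    return [a] + list(reversed(pa[i:])) + [pa[i - 1]] + pb[i:] + [b]

tree = ast.parse("while y > 0:\n    x = 1")
print(ast_path(tree, "x", "y"))
# → ['x', 'Assign', 'While', 'Compare', 'y']
```

The printed path reads as x ↑ Assign ↑ While ↓ Compare ↓ y: x is assigned inside a while-loop whose condition involves y.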
By collecting a bag of paths for each program element, we obtain a rich, fixed-size representation that captures diverse structural patterns — without needing to process the entire tree.
How It Works
Path Extraction Algorithm
Given a program element (e.g., a method body), the algorithm works as follows:

1. Parse the code into an AST.
2. Enumerate pairs of terminal (leaf) nodes.
3. For each pair, walk from one terminal up to the pair's lowest common ancestor, then down to the other terminal, recording the node types visited along the way.
4. Collect the resulting paths (typically filtered by a maximum length, since very long paths are rare and noisy) into the element's bag of path-contexts.
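As a concrete sketch, here is a minimal path-context extractor built on Python's `ast` module. The `max_length` cutoff and the `^`-joined path encoding are illustrative choices, not the paper's exact scheme:

```python
import ast
import itertools

def extract_path_contexts(code, max_length=8):
    """Enumerate ⟨terminal, path, terminal⟩ triples for every pair of Name leaves."""
    tree = ast.parse(code)
    leaves = []  # (identifier, root-to-leaf ancestor chain)
    def walk(node, ancestors):
        if isinstance(node, ast.Name):
            leaves.append((node.id, [type(a).__name__ for a in ancestors]))
        for child in ast.iter_child_nodes(node):
            walk(child, ancestors + [node])
    walk(tree, [])

    contexts = []
    for (a, pa), (b, pb) in itertools.combinations(leaves, 2):
        i = 0  # depth below the lowest common ancestor
        while i < min(len(pa), len(pb)) and pa[i] == pb[i]:
            i += 1
        path = list(reversed(pa[i:])) + [pa[i - 1]] + pb[i:]
        if len(path) <= max_length:  # discard overly long paths
            contexts.append((a, "^".join(path), b))
    return contexts

for ctx in extract_path_contexts("while y > 0:\n    x = g(y)"):
    print(ctx)
```

Running this on the two-line loop yields one triple per pair of leaves, e.g. `('x', 'Assign^Call', 'g')` for the assignment of a call result to x.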
Learning Representations
Each path-context triple ⟨ts, p, te⟩ is embedded using three learned embedding matrices: one for the source terminal's value, one for the path, and one for the target terminal's value. The three embeddings are concatenated and fed through a fully connected layer, then aggregated (via attention) over all path-contexts in a code snippet to produce a single fixed-length vector.
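A minimal numpy sketch of this embed-concatenate-attend pipeline; the dimensions, vocabularies, and random weights are all illustrative, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4  # embedding width (illustrative)
value_vocab = {"x": 0, "y": 1, "n": 2}
path_vocab = {"Assign^While^Compare": 0, "Assign^Call": 1}

value_emb = rng.normal(size=(len(value_vocab), d))  # terminal-value embeddings
path_emb = rng.normal(size=(len(path_vocab), d))    # path embeddings
W = rng.normal(size=(3 * d, d))                     # fully connected layer
attn = rng.normal(size=d)                           # global attention vector

def encode(contexts):
    """Embed each ⟨ts, p, te⟩ triple, then attention-pool into one code vector."""
    combined = []
    for ts, p, te in contexts:
        c = np.concatenate([value_emb[value_vocab[ts]],
                            path_emb[path_vocab[p]],
                            value_emb[value_vocab[te]]])
        combined.append(np.tanh(c @ W))  # combined context vector
    combined = np.stack(combined)
    scores = combined @ attn
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax attention weights
    return alpha @ combined              # weighted sum: fixed-length code vector

v = encode([("x", "Assign^While^Compare", "y"), ("x", "Assign^Call", "n")])
print(v.shape)
# → (4,)
```

However many path-contexts a snippet contains, the attention-weighted sum always produces a vector of the same width, which is what downstream prediction heads consume.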
This vector is the code representation. It can be used to predict method names (the model learns that certain path patterns correspond to certain method purposes), variable types, or other program properties.
Key advantage: Unlike sequential models (which see code as a flat token stream) or tree-based models (which process the full AST recursively), the path-based representation decomposes the tree into a bag of local structural patterns. This is both computationally efficient and surprisingly expressive.
Results
The path-based representation achieves strong results across multiple tasks and programming languages:
The model accurately predicts variable names, method names, and types across Java, JavaScript, Python, and C#. It significantly outperforms token-based approaches and matches or exceeds tree-based neural networks at a fraction of the computational cost.
The path-based representation became the foundation for code2vec and code2seq, two widely-used models for learning code representations, demonstrating that this decomposition of ASTs into paths is a powerful and general principle.