TL;DR — code2vec represents code snippets as fixed-length vectors by decomposing them into AST path-contexts: syntactic paths connecting pairs of leaf tokens through the syntax tree. These path-contexts capture structural meaning, enabling accurate prediction of method names from method bodies.
The Problem
How do you represent a code snippet as a fixed-length vector that captures its semantics? Natural language processing benefits from word embeddings that place similar words nearby in vector space, but code is fundamentally different from prose. A flat sequence of tokens misses the hierarchical structure — the nesting of blocks, the scoping of variables, the tree-shaped syntax — that makes code meaningful.
Previous approaches either treated code as a flat token sequence (losing structural information) or required expensive whole-tree models that struggled to scale. The challenge is finding a representation that is both structurally aware and practically efficient.
The Key Idea
Decompose each code snippet into a bag of path-contexts. A path-context is a triple (start-node, path, end-node) extracted from the Abstract Syntax Tree (AST). Each path walks up from one terminal (leaf) node, through a common ancestor, then down to another terminal node.
For example, in a method that swaps two variables, one path-context might connect the token tmp to the token a via the path NameExpr ↑ AssignExpr ↓ NameExpr. This captures the assignment relationship between these two variables through the tree structure.
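To make the extraction concrete, here is a minimal illustrative sketch in Python, using the standard `ast` module as a stand-in for the Java parser code2vec actually uses. The function names and the `MAX_LEN` cutoff are our own choices for illustration; the real extractor differs in detail.

```python
import ast
from itertools import combinations

MAX_LEN = 8  # code2vec bounds path length; the exact limit here is arbitrary

def extract_path_contexts(source):
    """Return (start-token, path, end-token) triples for every pair of Name leaves."""
    tree = ast.parse(source)
    trails = []  # one root-to-leaf chain of AST nodes per identifier leaf
    def walk(node, trail):
        trail = trail + [node]
        if isinstance(node, ast.Name):
            trails.append(trail)
        for child in ast.iter_child_nodes(node):
            walk(child, trail)
    walk(tree, [])

    contexts = []
    for ta, tb in combinations(trails, 2):
        i = 0  # length of the shared prefix; ta[i-1] is the lowest common ancestor
        while i < min(len(ta), len(tb)) and ta[i] is tb[i]:
            i += 1
        up = [type(n).__name__ for n in reversed(ta[i-1:])]   # leaf_a up to the LCA
        down = [type(n).__name__ for n in tb[i:]]             # below the LCA down to leaf_b
        if len(up) + len(down) > MAX_LEN:
            continue  # drop overly long paths to keep the bag tractable
        path = " ↑ ".join(up) + (" ↓ " + " ↓ ".join(down) if down else "")
        contexts.append((ta[-1].id, path, tb[-1].id))
    return contexts

print(extract_path_contexts("tmp = a"))
# prints: [('tmp', 'Name ↑ Assign ↓ Name', 'a')]
```

Running it on the one-line snippet `tmp = a` recovers exactly the path-context described above: the two identifiers linked through their common `Assign` ancestor.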
Each path-context is embedded into a vector, and then all path-context vectors are aggregated using a learned attention mechanism into a single fixed-length code vector. The attention weights tell the model which structural patterns matter most for the prediction task. The resulting code vector can be used to predict method names, classify code, or measure code similarity.
Interactive Demo
Code → Path-Contexts → Prediction
Select an example method to see how code2vec decomposes it into AST path-contexts and predicts its name. Highlighted rows show the highest-attention paths.
Each cell represents one dimension of the 128-d code vector. Color intensity encodes magnitude.
How It Works
1. Path Extraction from the AST
Given a method body, code2vec first parses it into an Abstract Syntax Tree. It then enumerates all pairs of terminal (leaf) nodes and records the syntactic path connecting them. A path is a sequence of AST node types, annotated with up (↑) or down (↓) movement. To keep the representation tractable, only paths up to a bounded length are retained.
Path example: tmp ──> NameExpr ↑ AssignExpr ↓ NameExpr ──> a
This path encodes: "tmp is assigned from a" via the AST structure.
2. Attention-Based Aggregation
Each path-context triple is mapped to a combined embedding by concatenating the embeddings of its start token, path, and end token, then applying a fully connected layer. A global attention vector is learned during training. The dot product of each path-context vector with this attention vector yields a scalar weight, which is normalized via softmax across all path-contexts in the snippet.
The final code vector is the weighted sum of all path-context vectors. This lets the model focus on the most informative structural patterns while ignoring noise. The attention weights are interpretable — they reveal which parts of the code the model considers most relevant.
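As a rough numerical sketch of the aggregation step, with random weights standing in for learned ones and `numpy` in place of the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128          # 128 matches the code-vector size mentioned above
n_contexts = 4   # a toy bag of 4 path-contexts

# Each combined embedding concatenates the start-token, path, and end-token
# embeddings (3*d values), then passes through a fully connected layer W.
raw = rng.normal(size=(n_contexts, 3 * d))
W = rng.normal(size=(3 * d, d)) / np.sqrt(3 * d)
c = np.tanh(raw @ W)                  # combined context vectors, shape (n, d)

a = rng.normal(size=d)                # global attention vector (learned in training)
scores = c @ a                        # one scalar per path-context
weights = np.exp(scores - scores.max())
weights /= weights.sum()              # softmax over the contexts in this snippet

code_vector = weights @ c             # attention-weighted sum, shape (d,)
print(code_vector.shape)              # prints: (128,)
```

The softmax weights are exactly the attention scores that the demo above highlights: one scalar per path-context, summing to 1 across the snippet.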
3. Training
The model is trained end-to-end on a method name prediction task: given the body of a method, predict its name. The code vector is multiplied by a target embedding matrix, and the loss is computed via cross-entropy over the vocabulary of method name subtokens. This task serves as a proxy for learning general-purpose code representations — a model that can accurately predict a method's name must understand what the method does.
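The prediction layer can be sketched in the same style (random toy values; the real model uses a large subtoken vocabulary and learned embeddings):

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab = 128, 5                     # toy vocabulary of 5 method-name subtokens

code_vector = rng.normal(size=d)      # output of the attention aggregation
targets = rng.normal(size=(vocab, d)) # target embedding matrix, one row per subtoken

logits = targets @ code_vector        # similarity of the code vector to each name
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax over the name vocabulary

true_name = 2                         # index of the ground-truth subtoken
loss = -np.log(probs[true_name])      # cross-entropy; gradients flow back through
                                      # attention, path embeddings, and token embeddings
```

Because the loss backpropagates through every stage, the token embeddings, path embeddings, and attention vector are all learned jointly from the naming objective alone.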
Results
code2vec achieves state-of-the-art results on method name prediction across a large Java corpus, outperforming both token-based and tree-based neural models while being significantly faster to train.
Key results: The model achieves an F1 score of 19.04 on the Java-large dataset for method name prediction, substantially outperforming ConvAttention (17.99) and previous TreeLSTM-based approaches. Training is over an order of magnitude faster than tree-based models due to the embarrassingly parallel path decomposition.
Beyond accuracy, the learned code vectors capture genuine semantic similarity. Methods with similar functionality cluster together in vector space, even when they have very different surface-level token sequences. The attention mechanism provides interpretability: inspecting the top-weighted paths reveals which structural patterns drive each prediction.
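Similarity between two methods is then just a vector comparison; a minimal helper (the example vectors are made up, truncated to three dimensions for readability):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two code vectors; 1.0 means identical direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical code vectors for two similarly named methods;
# in practice these would come from the trained model.
v_reverse = np.array([0.9, 0.1, -0.3])
v_invert  = np.array([0.8, 0.2, -0.2])
print(cosine(v_reverse, v_invert))    # high similarity (close to 1.0)
```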
code2vec also generalizes beyond method naming. The same path-based representation has been applied to code search, code summarization, and bug detection, demonstrating that AST paths are a versatile structural feature for machine learning on code.