ICLR 2019

code2seq: Generating Sequences from Structured Representations of Code

Uri Alon, Shaked Brody, Omer Levy, Eran Yahav

TL;DR — code2seq extends code2vec by generating sequences (like method names or documentation) from AST paths, using an encoder-decoder architecture with attention over path-contexts. Instead of predicting a single label, it decodes output tokens one at a time, attending to different paths at each step.

The Problem

code2vec demonstrated that AST paths are a powerful representation for source code — but it predicts a single label. Many important tasks require sequence output: method names composed of multiple subtokens (e.g., binary|search|recursive), code summaries, documentation strings, and more.

With a fixed-vocabulary classifier, you can only predict names you have seen during training. If the correct name is calculate|average|score, the model cannot produce it unless that exact combination appears in the training data. What we need is a way to generate names token by token, composing novel sequences from a subtoken vocabulary.
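As a toy illustration of this subtoken view, method names can be decomposed with a few lines of Python (`split_subtokens` is a hypothetical helper written for this post, not part of any code2seq release):

```python
import re

def split_subtokens(name):
    """Split a method name into lowercase subtokens.

    Handles camelCase and snake_case -- a toy stand-in for how a
    subtoken vocabulary is built from raw method names.
    """
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", name)
    return [p.lower() for p in parts if p]

print(split_subtokens("binarySearchRecursive"))   # ['binary', 'search', 'recursive']
print(split_subtokens("calculate_average_score"))  # ['calculate', 'average', 'score']
```

The decoder's job is then the reverse direction: emit `calculate`, `average`, `score` one at a time, even if that exact combination never occurred in training.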

The challenge is to preserve code2vec's structural representation — the path-contexts from the AST — while replacing its classification head with a generative decoder that can produce arbitrarily long sequences.

The Key Idea

code2seq introduces an encoder-decoder architecture with attention over AST path-contexts. The encoder processes each path-context using bidirectional LSTMs, and the decoder generates output subtokens one at a time, attending to different paths at each generation step.

Encoder
Encoder: each AST path-context is encoded by a bidirectional LSTM into a fixed-length vector.
Attention: at each decoder step, attention weights select the most relevant encoded paths.
Decoder: an LSTM decoder generates subtokens sequentially using the attended context.

The key advantage is that the decoder can attend to different path-contexts when generating each subtoken. When generating "binary", it might focus on paths involving a comparison and recursive call; when generating "search", it might shift attention to paths involving array access and midpoint computation.
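This per-step shift can be sketched with a heavily simplified toy decoder. Every vector below is hand-crafted and purely illustrative; a real code2seq decoder uses learned LSTM states, learned attention, and a softmax output layer rather than this argmax-by-dot-product stand-in:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Hand-crafted 2-D vectors, purely illustrative: encoded path-contexts
# (attention keys) and subtoken output vectors. Not learned parameters.
paths = {"cmp_path": [1.0, 0.0], "array_path": [0.0, 1.0], "return_path": [-1.0, -1.0]}
vocab = {"binary": [1.0, 0.0], "search": [0.0, 1.0], "<EOS>": [-1.0, -1.0]}

# Stand-ins for the decoder LSTM's hidden state at each generation step.
queries = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]

def decode():
    out = []
    keys = list(paths.values())
    for q in queries:
        # dot-product attention over the encoded path-contexts
        weights = softmax([sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        # weighted sum of path vectors -> context vector for this step
        ctx = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(2)]
        # toy output layer: emit the subtoken most aligned with the context
        tok = max(vocab, key=lambda t: sum(c * v for c, v in zip(ctx, vocab[t])))
        if tok == "<EOS>":
            break
        out.append(tok)
    return out

print(decode())  # ['binary', 'search'] -- attention shifts from cmp_path to array_path
```

The first query attends mostly to `cmp_path` and emits "binary"; the second shifts attention to `array_path` and emits "search"; the third attends to `return_path`, which maps to end-of-sequence.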

How It Works

Path-Context Encoding

Like code2vec, code2seq extracts path-contexts from the AST. Each path-context is a triple: (start terminal, AST path, end terminal). But instead of embedding the full path as a single token, code2seq encodes the path node sequence with a bidirectional LSTM. The forward and backward hidden states are concatenated and combined with the terminal token embeddings to produce a single vector per path-context.
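Shape-wise, the encoding step can be sketched as follows. This is a schematic toy, not the paper's model: the BiLSTM is replaced by a simple tanh recurrence, and the embeddings are deterministic pseudo-vectors (code2seq additionally passes the concatenation through a learned projection):

```python
import hashlib
import math

D = 8  # toy embedding / hidden size

def emb(token):
    # Deterministic pseudo-embedding for illustration only; a real model
    # uses learned embedding matrices.
    h = int(hashlib.md5(token.encode()).hexdigest(), 16)
    return [((h >> (4 * i)) & 0xF) / 7.5 - 1.0 for i in range(D)]

def rnn_pass(vectors):
    # Toy recurrent update standing in for the paper's LSTM:
    # h_t = tanh(0.5 * h_{t-1} + 0.5 * x_t); returns the final hidden state.
    h = [0.0] * D
    for x in vectors:
        h = [math.tanh(0.5 * hi + 0.5 * xi) for hi, xi in zip(h, x)]
    return h

def encode_path_context(start, path_nodes, end):
    node_vecs = [emb(n) for n in path_nodes]
    h_fwd = rnn_pass(node_vecs)        # forward pass over the path's nodes
    h_bwd = rnn_pass(node_vecs[::-1])  # backward pass over the same nodes
    # Concatenate [start token; forward state; backward state; end token]
    # into one vector per path-context.
    return emb(start) + h_fwd + h_bwd + emb(end)

z = encode_path_context("elements", ["NameExpr", "MethodCall", "IfStmt"], "mid")
print(len(z))  # 4 * D = 32
```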

Attention-Based Decoding

The decoder is an LSTM that generates one subtoken per step. At each step, it computes attention weights over all encoded path-contexts. The attention mechanism uses the decoder's hidden state as a query against the encoded paths, producing a weighted context vector. This context vector, along with the previous subtoken embedding, is fed into the decoder LSTM to produce the next subtoken.
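Written out as a single decoder step, the context computation looks like the sketch below. Plain dot-product scoring stands in for the paper's learned attention transform, and all vectors are illustrative:

```python
import math

def attention_step(decoder_state, encoded_paths):
    """One attention step: score each encoded path against the decoder
    state, normalize with softmax, and return (weights, context vector)."""
    scores = [sum(q * k for q, k in zip(decoder_state, p)) for p in encoded_paths]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoded_paths[0])
    # context = attention-weighted sum of the encoded path vectors
    context = [sum(w * p[i] for w, p in zip(weights, encoded_paths)) for i in range(dim)]
    return weights, context

weights, context = attention_step([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
# The context vector is then combined with the previous subtoken's embedding
# to form the next decoder LSTM input (shown here as a simple concatenation).
prev_subtoken_emb = [0.1, -0.2]  # hypothetical embedding of the last emitted subtoken
lstm_input = prev_subtoken_emb + context
print([round(w, 2) for w in weights])
```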

Because the decoder attends to different path-contexts at each step, it can compose information from multiple parts of the code. The first subtoken might draw from paths related to the algorithm's structure, while later subtokens draw from paths related to the data types or operations involved.

Handling Rare and Unseen Names

By generating subtokens rather than whole names, code2seq can produce method names it has never seen during training. Even if serialize|json|array never appears in the training set, the model can compose it from familiar subtokens. This is a critical advantage over code2vec's fixed vocabulary approach.

Results

code2seq was evaluated on the tasks of method name prediction and code summarization. It significantly outperforms code2vec and other baselines, particularly on multi-subtoken predictions where the ability to generate sequences is essential.

59.19: F1 score on method name prediction (Java-large)
+23%: relative F1 improvement over code2vec
Seq: generates novel multi-subtoken names
Attn: interpretable path attention at each decoding step

The attention mechanism also provides interpretability: by examining which paths the decoder attends to at each generation step, we can understand which parts of the code influenced each subtoken in the predicted name. This makes code2seq's predictions more transparent and debuggable than a flat classifier.

code2seq demonstrates that structured code representations (AST paths) can be effectively combined with sequence-to-sequence models. The path-context encoder preserves the structural inductive bias of code2vec, while the attention-based decoder enables flexible sequence generation — getting the best of both worlds.

@inproceedings{alon2019code2seq,
  title={code2seq: Generating Sequences from Structured Representations of Code},
  author={Alon, Uri and Brody, Shaked and Levy, Omer and Yahav, Eran},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2019}
}