PLDI 2014

Code Completion with Statistical Language Models

Veselin Raychev, Martin Vechev, Eran Yahav

TL;DR: IDE code completion typically uses type-based filtering, producing long alphabetical lists. This paper applies n-gram language models trained on large code corpora to rank completions by probability — putting the most likely completion first.

The Problem

Traditional code completion in IDEs like Eclipse or IntelliJ shows all type-correct options sorted alphabetically. When you type list., the IDE presents every method available on an ArrayList — from add() all the way to trimToSize() — in alphabetical order.

The right choice is often buried deep in a long list. If you want size(), you scroll past dozens of methods. If you want get(), you still have to scan through the list or type enough characters to narrow it down. The result: developers spend keystrokes and time navigating completions that could have been predicted from context.

What developers actually want is the most likely completion at the top — the one that fits the current code context, not just the one that comes first alphabetically.

The Key Idea

The core insight is simple: code is repetitive and predictable. Just as a model of natural language can predict the next word in a sentence, a language model trained on millions of lines of code can predict the most probable next token in a program.

The approach works in three steps:

1. Train an n-gram language model

Collect a large corpus of code (millions of lines of Java). Tokenize each program and build an n-gram model that captures the probability of each token given its preceding context.

2. Query the model at completion time

When the developer triggers completion, take the preceding tokens as context and ask the model: what is the most likely next token? The model returns a ranked list of candidates with their probabilities.

3. Combine with the type system

Filter the model's predictions through the type system to ensure every suggestion is type-correct. This gives you the best of both worlds: statistical ranking for relevance, type-checking for correctness.
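The three steps above can be sketched end-to-end on a toy corpus. This is an illustrative miniature, not the paper's implementation: the corpus, the trigram context length, and the candidate sets are all made up for the example.

```python
from collections import Counter, defaultdict

# Step 1: train a trigram model on a toy "corpus" of token sequences.
corpus = [
    ["for", "(", "int", "i", "=", "0", ";", "i", "<", "list", ".", "size", "(", ")", ";"],
    ["for", "(", "int", "i", "=", "0", ";", "i", "<", "arr", ".", "length", ";"],
    ["str", ".", "length", "(", ")", ";"],
]
counts = defaultdict(Counter)  # (prev2, prev1) -> Counter of next tokens
for tokens in corpus:
    padded = ["<s>", "<s>"] + tokens
    for a, b, c in zip(padded, padded[1:], padded[2:]):
        counts[(a, b)][c] += 1

# Step 2: given the two preceding tokens, rank candidate next tokens.
def rank(context):
    c = counts[tuple(context)]
    total = sum(c.values())
    return [(tok, n / total) for tok, n in c.most_common()]

# Step 3: keep only the candidates the type system would accept.
def complete(context, type_correct):
    return [(tok, p) for tok, p in rank(context) if tok in type_correct]

ranked = complete(["i", "<"], {"list", "arr", "10"})  # most probable first
```

On this tiny corpus, the context `i <` makes both loop receivers (`list`, `arr`) equally likely, and `str .` makes `length` the top prediction, matching the patterns described above.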

The n-gram model captures patterns like: after for (int i = 0; i <, the next token is very likely list.size() or arr.length. After str., the most common method is length(), not codePointAt().

Token sequence: list . get ( i

An n-gram model uses the preceding tokens as context to predict the next token.

Interactive Demo

The original page embeds an interactive comparison of statistical completion (the paper's approach) with traditional alphabetical completion. For each example context, the intended completion sits at rank #1 in the statistical list but is buried much lower in the alphabetical one, which translates directly into keystrokes saved.

How It Works

N-gram language models for code

An n-gram model estimates the probability of a token given the previous n-1 tokens. For a 3-gram (trigram) model, the probability of token t given the two preceding tokens t_{-2} t_{-1} is estimated from the training corpus by counting:

P(t | t_{-2}, t_{-1}) = count(t_{-2}, t_{-1}, t) / count(t_{-2}, t_{-1})

In practice, the paper uses Witten-Bell smoothed n-gram models (with n=3, i.e., trigrams). Smoothing is crucial because many valid token sequences are rare or unseen in the training data, and the model must still assign them non-zero probability.
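The counting formula plus Witten-Bell smoothing can be sketched as follows. This is an illustrative re-implementation, not the paper's actual code; in particular, the add-one uniform base case at the unigram level is my assumption, chosen so unseen tokens still get non-zero probability.

```python
from collections import Counter, defaultdict

def train(sequences, n=3):
    """Collect counts for all context lengths 0..n-1 over padded token sequences."""
    counts = defaultdict(Counter)  # context tuple -> Counter of next tokens
    for toks in sequences:
        padded = ["<s>"] * (n - 1) + toks
        for i in range(n - 1, len(padded)):
            for k in range(n):  # contexts of length 0, 1, ..., n-1
                counts[tuple(padded[i - k:i])][padded[i]] += 1
    return counts

def p_wb(counts, context, token, vocab_size):
    """Witten-Bell smoothed P(token | context): interpolate the maximum-likelihood
    estimate with the same estimate for the shortened (backoff) context."""
    c = counts[tuple(context)]
    total = sum(c.values())
    distinct = len(c)  # number of distinct continuations seen after this context
    if not context:
        # Base case (assumed here): add-one smoothing over the vocabulary.
        return (c[token] + 1) / (total + vocab_size)
    if total == 0:
        return p_wb(counts, context[1:], token, vocab_size)
    lam = total / (total + distinct)  # weight on the observed counts
    return lam * c[token] / total + (1 - lam) * p_wb(counts, context[1:], token, vocab_size)

counts = train([["a", "b", "c"], ["a", "b", "d"]])
p_seen = p_wb(counts, ("a", "b"), "c", vocab_size=10)
p_unseen = p_wb(counts, ("a", "b"), "z", vocab_size=10)
```

The key property the sketch demonstrates: a continuation seen in training (`c` after `a b`) gets a higher probability than an unseen one (`z`), but the unseen one is still strictly positive thanks to smoothing.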

Training on large code corpora

The model is trained on a massive corpus of open-source Android Java code (over 3 million methods from GitHub). Each file is tokenized — identifiers, keywords, operators, and literals become individual tokens — and the n-gram counts are collected across the entire corpus.

A key design choice is the tokenization strategy: the level of abstraction at which programs are turned into token sequences determines which patterns the model can learn.
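A minimal lexical tokenizer of the kind this step needs can be sketched with a regular expression. This is a toy: the paper's pipeline works from a real Java front end, and the token classes below are a simplification.

```python
import re

# Alternation order matters: multi-character operators must come
# before the single-character fallback.
TOKEN_RE = re.compile(r"""
    \s*(
      "(?:[^"\\]|\\.)*"        # string literal
    | \d+(?:\.\d+)?            # number literal
    | [A-Za-z_]\w*             # identifier or keyword
    | <=|>=|==|!=|\+\+|--      # multi-character operators
    | [^\sA-Za-z_0-9]          # single-character punctuation/operator
    )""", re.VERBOSE)

def tokenize(code):
    """Split Java-like source text into a flat list of tokens."""
    return TOKEN_RE.findall(code)

tokens = tokenize("for (int i = 0; i < list.size(); i++)")
```

Feeding such token lists into the trigram counts is what lets the model learn patterns like `list . size` being a likely continuation of `i <`.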

Combining with the type system

The language model alone might suggest tokens that are syntactically plausible but type-incorrect. The paper combines the model with a type-based completion engine: the type system generates the set of valid completions, and the language model ranks them. This ensures every suggestion is both type-safe and ranked by likelihood.
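That division of labor can be sketched in a few lines. The candidate methods and the scores below are invented for illustration; in the real system the scores would come from the smoothed n-gram model.

```python
def rank_completions(candidates, context, score):
    """The type checker supplies `candidates`; the language model's
    score(context, token) orders them, most probable first."""
    return sorted(candidates, key=lambda tok: score(context, tok), reverse=True)

# Toy probabilities standing in for real n-gram scores after `list.`
toy_scores = {"size": 0.61, "get": 0.25, "add": 0.09, "trimToSize": 0.001}
candidates = ["add", "get", "size", "trimToSize"]  # type-correct set, alphabetical
ranked = rank_completions(candidates, ["list", "."],
                          lambda ctx, tok: toy_scores.get(tok, 0.0))
```

Note the shape of the interface: the model never proposes a token the type checker rejected, and the type checker never decides the order, which is exactly the "best of both worlds" split described above.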

Results

The statistical approach dramatically improves completion ranking compared to traditional alphabetical ordering:

On Java/Android completions, the desired completion appears in the top 3 suggestions over 90% of the time, instead of being scattered across a long alphabetical list, and the average number of keystrokes needed to reach the correct completion drops significantly.

The paper also shows that the approach generalizes well: a model trained on one set of projects performs well when applied to entirely different projects, confirming that code indeed follows predictable statistical patterns.

This work was an early and influential demonstration that statistical language models — the same kind used for natural language processing — can dramatically improve developer tools. It helped pave the way for the neural code completion systems (including large language models) that are now ubiquitous in modern IDEs.

Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code Completion with Statistical Language Models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). ACM, New York, NY, USA, 419-428. https://doi.org/10.1145/2594291.2594321