ONWARD 2016

Leveraging a Corpus of Natural Language Descriptions for Program Similarity

Meital Zilberstein, Eran Yahav

TL;DR: Code similarity is usually measured by comparing code structure or tokens. This paper takes a different approach: use natural language descriptions of code (from Q&A sites like StackOverflow) to bridge between code snippets that look different but do the same thing.

The Problem

Measuring code similarity is fundamental to many software engineering tasks: code search, clone detection, plagiarism detection, and recommendation systems. The traditional approaches rely on comparing the structure of code -- tokens, abstract syntax trees (ASTs), or control-flow graphs.

But these approaches have a blind spot. Structurally different code can be semantically equivalent. A for loop and a while loop doing the same thing will look very different to a token-based or AST-based similarity metric. Using a StringBuilder vs. string concatenation, iterative vs. recursive approaches, different API choices for the same task -- all of these create a gap between syntactic similarity and semantic similarity.

Code that looks different can do the same thing. Traditional similarity metrics, which compare code structure, miss these connections.

The Key Idea

The insight is simple but powerful: if two code snippets are described in similar natural language, they are probably similar in what they do -- regardless of how different their code looks.

Q&A sites like StackOverflow are a natural source for this pairing. Each answer contains a code snippet alongside a natural language description (the question, the surrounding text). By collecting these pairs, we build a corpus that links code to its intent.

Two snippets with similar NL descriptions are likely semantically similar, even if their code is completely different. The natural language acts as a semantic bridge between syntactically different implementations.
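The bridge idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: a plain bag-of-words cosine over two invented StackOverflow-style descriptions (the paper uses richer NLP features).

```python
# Minimal sketch of the "NL bridge": compare two code snippets by the
# cosine similarity of their natural language descriptions.
# The two descriptions below are hypothetical StackOverflow-style texts.
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Descriptions of two structurally different implementations of one task:
desc_a = "how to reverse a string in java using a loop"
desc_b = "reverse a string in java with StringBuilder"

# The descriptions share most of their content words, so the similarity
# is high even though the underlying code would look nothing alike.
print(bow_cosine(desc_a, desc_b))
```

Even this crude measure captures the intuition: the shared vocabulary ("reverse", "string", "java") dominates, regardless of which implementation the answer contains.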

The NL Bridge

[Figure: Snippet A (a for-loop implementation) and Snippet B (a StringBuilder implementation) both map to the same NL description, "reverse a string". Natural language descriptions bridge the gap between syntactically different but semantically equivalent code.]

Example: NL-Bridged Similarity

Consider a pair of code snippets that solve the same task with completely different syntax. A traditional code-similarity metric (based on tokens or structure) gives the pair a low score, yet their StackOverflow descriptions are nearly identical -- revealing high semantic similarity.
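This contrast can be made concrete. The sketch below uses two hypothetical snippets that both reverse a string, with plain Jaccard token overlap standing in for both metrics; the paper's actual metrics are more sophisticated.

```python
# Hedged illustration: the code-level token overlap of two equivalent
# snippets is low, while the overlap of their (invented) NL descriptions
# is high. Jaccard set overlap stands in for both similarity metrics.
import re

def jaccard(tokens_a, tokens_b):
    """Set-overlap similarity in [0, 1]."""
    sa, sb = set(tokens_a), set(tokens_b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def code_tokens(code):
    """Split code into identifier and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", code)

# Two structurally different ways to reverse a string:
snippet_a = 'result = ""\nfor ch in s:\n    result = ch + result'
snippet_b = "s[::-1]"

# Hypothetical StackOverflow-style descriptions of each snippet:
desc_a = "reverse a string using a for loop"
desc_b = "reverse a string using slicing"

code_score = jaccard(code_tokens(snippet_a), code_tokens(snippet_b))
nl_score = jaccard(desc_a.split(), desc_b.split())
print(f"code similarity: {code_score:.2f}, NL similarity: {nl_score:.2f}")
```

The snippets share almost no tokens, so the code score is near zero, while the descriptions overlap on most of their words.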

How It Works

The approach builds a similarity metric in several stages, drawing on the rich pairing of code and natural language available in Q&A forums.

1. Corpus Collection -- Extract code-NL pairs from StackOverflow: each code snippet is paired with its surrounding question and answer text.
2. NL Embedding -- Compute vector representations of the NL descriptions using standard NLP techniques (TF-IDF, word embeddings).
3. Similarity Lookup -- Given a new code snippet, find its closest NL descriptions in the corpus and use them as proxies for semantic meaning.
4. NL-Based Comparison -- Compare two snippets by comparing their NL descriptions. High NL similarity implies high semantic similarity.
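The four stages can be sketched end to end under strong simplifying assumptions: a toy three-entry corpus, bag-of-words "embeddings", and nearest-neighbor lookup by token overlap (the real system's corpus and features are far richer).

```python
# Toy sketch of the four-stage pipeline. Corpus entries, tokenization,
# and the lookup are all simplified assumptions, not the paper's system.
import math
from collections import Counter

# Stage 1: hypothetical code-NL pairs mined from Q&A posts.
corpus = [
    ("for ch in s: out = ch + out", "reverse a string with a loop"),
    ("''.join(reversed(s))",        "reverse a string using reversed"),
    ("s.split(',')",                "split a comma separated string"),
]

def embed(text):
    """Stage 2: a bag-of-words stand-in for a real embedding."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_description(code):
    """Stage 3: map a new snippet to the NL description of its
    closest corpus snippet, used as a proxy for its meaning."""
    q = embed(code)
    return max(corpus, key=lambda pair: cosine(q, embed(pair[0])))[1]

def nl_similarity(code_a, code_b):
    """Stage 4: compare two snippets via their NL proxies."""
    da, db = nearest_description(code_a), nearest_description(code_b)
    return cosine(embed(da), embed(db))
```

For example, a loop-based and a `reversed`-based snippet retrieve different corpus entries, but the two retrieved descriptions overlap heavily, so the pair scores as similar.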

The key insight in the pairing process is that StackOverflow provides a natural mapping: questions describe what the code should do, and answers provide how. Multiple answers to the same question give us different implementations of the same task -- exactly the kind of semantic equivalences we want to capture.

Combining Code and NL Signals

The approach does not discard code-based similarity entirely. Instead, it combines the NL-based signal with traditional code similarity to get the best of both worlds. When code is structurally similar, the code metric captures it. When code looks different but the intent is the same, the NL metric fills the gap.
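One simple way to realize this combination -- an assumption for illustration, not the paper's exact formula -- is to take the maximum of the two signals, so whichever metric detects the similarity wins:

```python
# Hedged sketch of combining the two signals: either a strong structural
# match or a strong NL match yields a high overall score. The paper's
# actual combination may differ.
def combined_similarity(code_sim: float, nl_sim: float) -> float:
    """Both inputs in [0, 1]; high if EITHER signal is high."""
    return max(code_sim, nl_sim)

# Structurally similar clones: the code metric already catches them.
print(combined_similarity(0.9, 0.4))   # -> 0.9
# Different-looking implementations of one task: the NL metric fills the gap.
print(combined_similarity(0.1, 0.85))  # -> 0.85
```

A weighted blend of the two scores is an equally plausible choice; the max makes the "fills the gap" behavior easiest to see.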

Results

The evaluation shows that NL-based similarity better captures semantic relationships between code snippets than purely structural approaches do.

Natural language descriptions from Q&A sites provide a powerful complementary signal for code similarity. By bridging through NL, the approach captures semantic relationships that purely structural methods miss.

@inproceedings{zilberstein2016leveraging,
  title        = {Leveraging a Corpus of Natural Language Descriptions for Program Similarity},
  author       = {Zilberstein, Meital and Yahav, Eran},
  booktitle    = {Proceedings of the 2016 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!)},
  year         = {2016},
  organization = {ACM}
}