TL;DR: Code similarity is usually measured by comparing code structure or tokens. This paper takes a different approach: use natural language descriptions of code (from Q&A sites like StackOverflow) to bridge between code snippets that look different but do the same thing.
The Problem
Measuring code similarity is fundamental to many software engineering tasks: code search, clone detection, plagiarism detection, and recommendation systems. Traditional approaches rely on comparing the structure of the code -- tokens, abstract syntax trees (ASTs), or control-flow graphs.
But these approaches have a blind spot. Structurally different code can be semantically equivalent. A for loop and a while loop doing the same thing will look very different to a token-based or AST-based similarity metric. Using a StringBuilder vs. string concatenation, iterative vs. recursive approaches, different API choices for the same task -- all of these create a gap between syntactic similarity and semantic similarity.
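To make the blind spot concrete, here is a minimal sketch of a token-based metric (Jaccard similarity over token sets, a common baseline) applied to two implementations of the same function. The tokenizer and snippets are illustrative, not from the paper.

```python
import re

def tokenize(code: str) -> set[str]:
    """Crude lexer: identifiers plus single-character operator tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", code))

def jaccard(a: set[str], b: set[str]) -> float:
    """Token-set overlap: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Two semantically identical implementations of summing a list.
iterative = """
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
"""

recursive = """
def total(xs):
    if not xs:
        return 0
    return xs[0] + total(xs[1:])
"""

sim = jaccard(tokenize(iterative), tokenize(recursive))
print(f"token similarity: {sim:.2f}")  # well below 1.0, despite identical behavior
```

The two snippets compute exactly the same result, yet the token-overlap score is far from 1.0 -- the metric sees the control-flow difference, not the shared intent.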
Code that looks different can do the same thing. Traditional similarity metrics, which compare code structure, miss these connections.
The Key Idea
The insight is simple but powerful: if two code snippets are described in similar natural language, they are probably similar in what they do -- regardless of how different their code looks.
Q&A sites like StackOverflow are a natural source for this pairing. Each answer contains a code snippet alongside a natural language description (the question, the surrounding text). By collecting these pairs, we build a corpus that links code to its intent.
Two snippets with similar NL descriptions are likely semantically similar, even if their code is completely different. The natural language acts as a semantic bridge between syntactically different implementations.
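A toy illustration of the bridge: comparing the natural language descriptions instead of the code. This uses plain bag-of-words cosine similarity as a stand-in for the paper's NL metric (real systems would use TF-IDF weighting or embeddings), and the two descriptions are hypothetical StackOverflow-style titles.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words vectors -- a simple stand-in
    for a real NL similarity metric (TF-IDF, embeddings, etc.)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Hypothetical descriptions of two structurally different snippets.
desc_a = "how to reverse a string in java"
desc_b = "reverse a string in java using a loop"
print(f"NL similarity: {bow_cosine(desc_a, desc_b):.2f}")  # high overlap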
The NL Bridge
Natural language descriptions bridge the gap between syntactically different but semantically equivalent code.
Interactive Demo: NL-Bridged Similarity
Select a pair of code snippets below. Each pair solves the same task but uses completely different syntax. Traditional code similarity (based on tokens/structure) gives a low score, but their StackOverflow descriptions are nearly identical -- revealing high semantic similarity.

How It Works
The approach builds a similarity metric in several stages, drawing on the rich pairing of code and natural language available in Q&A forums.
The key insight in the pairing process is that StackOverflow provides a natural mapping: questions describe what the code should do, and answers provide how. Multiple answers to the same question give us different implementations of the same task -- exactly the kind of semantic equivalences we want to capture.
Combining Code and NL Signals
The approach does not discard code-based similarity entirely. Instead, it combines the NL-based signal with traditional code similarity to get the best of both worlds. When code is structurally similar, the code metric captures it. When code looks different but the intent is the same, the NL metric fills the gap.
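One simple way to combine the two signals is linear interpolation; this is a sketch under that assumption (the paper's exact combination rule may differ, and the weight would be tuned on labeled data).

```python
def combined_similarity(code_sim: float, nl_sim: float,
                        alpha: float = 0.5) -> float:
    """Blend code-based and NL-based similarity scores.

    alpha controls the trade-off: alpha=1 ignores the NL signal,
    alpha=0 ignores the code signal.
    """
    return alpha * code_sim + (1 - alpha) * nl_sim

# Structurally different snippets: low code similarity, high NL similarity.
score = combined_similarity(code_sim=0.2, nl_sim=0.9)
print(f"combined: {score:.2f}")
```

When the code metric is high the blend preserves it; when the code looks different but the descriptions agree, the NL term pulls the combined score up -- the complementarity the evaluation section reports.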
Results
The evaluation shows that NL-based similarity better captures semantic relationships between code snippets compared to purely structural approaches. In particular:
- Code pairs with different syntax but identical intent score much higher with NL-based similarity than with token- or AST-based similarity.
- The combined metric (code + NL) outperforms either signal alone, confirming that the two sources of information are complementary.
- The approach is especially effective for "cross-idiom" similarity -- detecting when different programming patterns (e.g., iterator vs. index-based loop) achieve the same goal.
Natural language descriptions from Q&A sites provide a powerful complementary signal for code similarity. By bridging through NL, the approach captures semantic relationships that purely structural methods miss.