TL;DR — Can natural language descriptions help measure code similarity? This early work explores using textual descriptions paired with code to create similarity metrics that capture semantic intent beyond syntactic structure.
The Problem
Determining whether two code snippets do "the same thing" is undecidable in general. Traditional approaches rely on syntactic similarity — comparing tokens, AST structures, or control-flow graphs. While these capture surface-level resemblance, they fundamentally miss semantically equivalent but structurally different implementations.
Consider a simple example: a for loop and a while loop that both compute the sum of an array. Token overlap is low; AST structure differs. Yet any programmer would recognize they do the same thing. How can we bridge this gap automatically?
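As a concrete sketch of the example above, here are two Python implementations of the array-sum task. The function names are invented for illustration; the point is that the tokens and AST shapes differ while the behavior is identical.

```python
def sum_for(xs):
    """Sum an array with a for loop."""
    total = 0
    for x in xs:
        total += x
    return total

def sum_while(xs):
    """Sum an array with a while loop: different tokens, different AST."""
    total, i = 0, 0
    while i < len(xs):
        total += xs[i]
        i += 1
    return total

# Any programmer sees these as equivalent; token and AST metrics do not.
print(sum_for([1, 2, 3]), sum_while([1, 2, 3]))  # prints: 6 6
```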
The Key Idea
Leverage the connection between code and its natural language descriptions. If two code snippets have similar descriptions, they likely serve similar purposes. Rather than reasoning about program equivalence directly, we can use natural language as a semantic abstraction layer over code structure.
Developers routinely write comments, documentation, and specifications that describe what code does in natural language. By pairing code snippets with their NL descriptions and comparing those descriptions, we obtain a similarity signal that is aligned with human intent rather than syntactic form.
Interactive Demo: Code Similarity Explorer
[Interactive widget: select any two code snippets to compare. Three similarity metrics are computed for the pair: token-based, structural (AST), and NL-description-based. NL similarity captures semantic equivalence that the other metrics miss.]
How It Works
Code-NL Pairing
The approach begins by associating each code snippet with a natural language description. These descriptions can come from multiple sources: inline comments, Javadoc-style documentation, StackOverflow question-answer pairs, or manual annotations. The key requirement is that each description captures the intent of the code — what it accomplishes — rather than a line-by-line narration of how it works.
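As a minimal sketch of such a pairing (the snippets and descriptions below are invented for illustration), the paired corpus can be as simple as a list of (code, description) tuples, where each description states intent rather than mechanics:

```python
# Hypothetical paired corpus: code snippet + intent-level NL description.
corpus = [
    ("total = 0\nfor x in xs:\n    total += x",
     "compute the sum of the elements in an array"),
    ("return sorted(xs)[len(xs) // 2]",
     "find the median element of an array"),
]

# Every snippet must carry a description of *what* it does, not *how*.
for code, desc in corpus:
    assert code and desc
```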
Similarity Computation
Given two code snippets and their associated NL descriptions, we compute similarity along three dimensions:
- Token similarity — measures lexical overlap by comparing the multisets of tokens in each snippet. This captures naming conventions and shared vocabulary but is blind to structural reorganization.
- Structural similarity — compares AST-level features such as node types, tree depth, and subtree patterns. This is more robust than token matching but still tied to syntactic form.
- NL-based similarity — compares the natural language descriptions using standard text similarity measures. This metric is decoupled from code syntax entirely, operating in the space of human intent.
The NL-based metric complements the syntactic measures. When two implementations are structurally different but serve the same purpose, their NL descriptions converge, yielding a high similarity score where the other metrics report low similarity.
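The contrast between the token-based and NL-based dimensions can be sketched with Jaccard similarity over sets, a simple stand-in for the text similarity measures mentioned above (the snippets, descriptions, and tokenizer here are assumptions for illustration, not the original system's implementation):

```python
import re

def code_tokens(s):
    # Crude lexer: identifiers/keywords, or any single non-space character.
    return set(re.findall(r"[A-Za-z_]\w*|\S", s))

def jaccard(a, b):
    # Set overlap: |A ∩ B| / |A ∪ B|.
    return len(a & b) / len(a | b) if a | b else 1.0

# Two structurally different snippets with the same intent.
code_a = "total = 0\nfor x in xs:\n    total += x"
code_b = "i = 0\ns = 0\nwhile i < len(xs):\n    s = s + xs[i]\n    i = i + 1"

desc_a = "compute the sum of the elements in an array"
desc_b = "return the sum of all array elements"

tok_sim = jaccard(code_tokens(code_a), code_tokens(code_b))
nl_sim = jaccard(set(desc_a.split()), set(desc_b.split()))

# The NL descriptions overlap more than the code tokens do.
print(f"token: {tok_sim:.2f}  nl: {nl_sim:.2f}")
```

In practice the NL comparison would use a stronger text similarity measure than word-set Jaccard, but even this toy version ranks the semantically equivalent pair higher than token overlap does.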
Results
Natural language descriptions provide a useful semantic signal for code similarity. The NL-based metric successfully identifies semantically equivalent code pairs that syntactic approaches miss, particularly for implementations using different algorithms, control structures, or programming idioms to achieve the same goal. This work laid early groundwork for the idea that natural language can serve as an effective bridge for reasoning about code semantics — a theme that would become central to later work on code representation learning.