TL;DR — Searching for similar code in binaries is like finding a needle in a haystack. Tracelets — short sequences of instructions along execution paths — provide a robust unit of comparison that survives compilation differences.
The Problem
Given a known binary function — for example, a vulnerable function in a library — we want to find similar functions in other binaries. This is critical for vulnerability detection: if a security flaw is found in one compiled version of a function, we need to locate all other binaries that contain a version of the same function.
But direct comparison fails. Different compilers rearrange instructions, use different registers, apply different optimizations, and even change the control flow structure. A function compiled with GCC -O0 can look completely different from the same function compiled with Clang -O2, even though they perform the exact same computation.
The Key Idea
Instead of comparing entire functions at once, we decompose each function into tracelets — short, bounded-length paths through the control flow graph. Each tracelet captures a small unit of computation: a sequence of instructions that might execute consecutively along some path.
The key insight is that while compilers may rearrange the global structure of a function, these small local computations tend to be preserved. By matching tracelets between a query function and target functions, we can identify similar code even when the overall structure differs dramatically.
Interactive Demo: Tracelet Search
Tracelet-Based Binary Code Search
Query Function (vulnerable_parse)
Target Binary
Click "Extract Tracelets" to decompose the query function into tracelets along CFG paths.
How It Works
CFG Decomposition
Each binary function is disassembled and its control flow graph (CFG) is recovered. The CFG represents basic blocks as nodes and control flow transitions as edges. This is the standard starting point for binary analysis.
Tracelet Extraction
We enumerate paths of bounded length k through the CFG. Each path visits at most k basic blocks. The instructions along each path form a tracelet — a concrete sequence of instructions that might execute in order. By keeping the length bounded, we get a manageable number of tracelets while still capturing meaningful computation patterns.
Tracelet Matching
To compare two functions, we match their tracelets using a similarity metric that accounts for register renaming, constant differences, and minor instruction reordering. Each tracelet from the query is matched against the best-fitting tracelet in the target. The overall similarity score is the fraction of query tracelets that find a good match.
Results
Tracelet-based search effectively finds similar code across different compilers (GCC, Clang, ICC) and optimization levels (O0 through O3). The decomposition into tracelets provides robustness: even when a compiler completely restructures the control flow, the local computation patterns captured by tracelets remain recognizable. This enables practical vulnerability search in stripped binaries at scale.
The approach was evaluated on real-world binaries compiled with different compilers and optimization levels. Key findings include:
- High precision in cross-compiler search, correctly identifying functions compiled from the same source code across GCC, Clang, and ICC.
- Resilience to optimization level differences — functions compiled at O0 can be matched to their O2 or O3 counterparts.
- Scalability to large codebases, enabling practical vulnerability search across repositories of compiled software.