PLDI 2014

Tracelet-Based Code Search in Executables

Yaniv David, Eran Yahav

TL;DR — Searching for similar code in binaries is like finding a needle in a haystack. Tracelets — short sequences of instructions along execution paths — provide a robust unit of comparison that survives compilation differences.

The Problem

Given a known binary function — for example, a vulnerable function in a library — we want to find similar functions in other binaries. This is critical for vulnerability detection: if a security flaw is found in one compiled version of a function, we need to locate all other binaries that contain a version of the same function.

But direct comparison fails. Different compilers rearrange instructions, use different registers, apply different optimizations, and even change the control flow structure. A function compiled with GCC -O0 can look completely different from the same function compiled with Clang -O2, even though they perform the exact same computation.

The Key Idea

Instead of comparing entire functions at once, we decompose each function into tracelets — short, bounded-length paths through the control flow graph. Each tracelet captures a small unit of computation: a sequence of instructions that might execute consecutively along some path.

The key insight is that while compilers may rearrange the global structure of a function, these small local computations tend to be preserved. By matching tracelets between a query function and target functions, we can identify similar code even when the overall structure differs dramatically.

Interactive Demo: Tracelet Search

Tracelet-Based Binary Code Search

1
2
3

Query Function (vulnerable_parse)

B0: entry push rbp mov rbp, rsp cmp rdi, 0 B1: loop mov rax, [rsi] add rsi, 8 cmp rax, rdx jne .B2 B2: check test rdi, rdi je .B4 dec rdi B3: body lea rcx, [rax+8] mov [rcx], rbx call memcpy jmp .B4 B4: exit xor eax, eax pop rbp ret

Target Binary

parse_input
push rbp mov rbp, rsp sub rsp, 16 test rdi, rdi je .L3 mov rax, [rsi] add rsi, 8 cmp rax, rdx jne .L2 lea rcx, [rax+8] mov [rcx], rbx call memcpy .L2: dec rdi jne .L1 .L3: xor eax, eax leave ret
hash_lookup
push rbx mov rbx, rdi shr rdi, 3 and edi, 0xff mov rax, [rsi+rdi*8] test rax, rax je .done cmp [rax], rbx je .found mov rax, [rax+8] jmp .check .found: mov eax, 1 pop rbx ret .done: xor eax, eax pop rbx ret
init_buffer
push rbp mov rbp, rsp push rbx mov rdi, 256 call malloc test rax, rax je .fail mov rbx, rax xor ecx, ecx mov [rbx+rcx], cl inc ecx cmp ecx, 256 jne .fill mov rax, rbx pop rbx pop rbp ret .fail: xor eax, eax pop rbx pop rbp ret

Click "Extract Tracelets" to decompose the query function into tracelets along CFG paths.

How It Works

1

CFG Decomposition

Each binary function is disassembled and its control flow graph (CFG) is recovered. The CFG represents basic blocks as nodes and control flow transitions as edges. This is the standard starting point for binary analysis.

2

Tracelet Extraction

We enumerate paths of bounded length k through the CFG. Each path visits at most k basic blocks. The instructions along each path form a tracelet — a concrete sequence of instructions that might execute in order. By keeping the length bounded, we get a manageable number of tracelets while still capturing meaningful computation patterns.

3

Tracelet Matching

To compare two functions, we match their tracelets using a similarity metric that accounts for register renaming, constant differences, and minor instruction reordering. Each tracelet from the query is matched against the best-fitting tracelet in the target. The overall similarity score is the fraction of query tracelets that find a good match.

Results

Tracelet-based search effectively finds similar code across different compilers (GCC, Clang, ICC) and optimization levels (O0 through O3). The decomposition into tracelets provides robustness: even when a compiler completely restructures the control flow, the local computation patterns captured by tracelets remain recognizable. This enables practical vulnerability search in stripped binaries at scale.

The approach was evaluated on real-world binaries compiled with different compilers and optimization levels. Key findings include:

Yaniv David, Eran Yahav. "Tracelet-Based Code Search in Executables." In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2014. DOI: 10.1145/2594291.2594343