Statistical Similarity of Binaries

TL;DR — How similar are two binary functions? This paper decomposes functions into short instruction sequences called "strands" and uses statistical methods to measure similarity, enabling cross-compiler and cross-optimization binary comparison.

The Problem

Binary code comparison is essential for vulnerability detection, plagiarism detection, and patch analysis. When a critical vulnerability is discovered in a library, security analysts need to determine which deployed binaries contain the vulnerable code. When software is suspected of plagiarism, analysts must compare binaries even without access to source code.

The fundamental challenge is that the same source code, compiled with different compilers (GCC vs. Clang) or different optimization levels (O0 vs. O3), produces dramatically different binary code. Register assignments change, instructions are reordered, loops are unrolled, and functions are inlined. Traditional exact-matching approaches fail completely in this setting.

The Key Idea

Instead of comparing entire functions monolithically, the paper proposes decomposing each binary function into strands — short, semantically meaningful instruction sequences extracted by following data-flow dependencies in the binary's data-flow graph. Each strand captures a small, self-contained computation within the function.

The key insight is that even when a compiler completely reorganizes a function, the underlying data-flow computations remain largely the same. By extracting strands and comparing the sets of strands statistically rather than requiring exact structural matches, the approach achieves robustness to compiler variations while preserving semantic precision.

Interactive Demo

Strand-Based Binary Similarity

Select a comparison scenario, then step through the analysis to see how strand decomposition enables cross-compiler binary similarity.

Same source compiled with GCC -O1 twice — nearly identical binaries.

Function A (GCC -O1)

Function B (GCC -O1)

How It Works

Strand Extraction

Given a binary function, the approach first constructs its data-flow graph, where nodes represent instructions and edges represent data dependencies. A strand is extracted by selecting an instruction and tracing backward through its data-flow dependencies for a bounded number of steps. This produces a short sequence of instructions that together compute one logical value.

Each function yields a set of strands, one rooted at each instruction. Strands naturally capture the semantic building blocks of the function: an address computation, a conditional test, an arithmetic expression, etc.

Statistical Comparison

To compare two functions, the approach compares their strand sets using a statistical similarity measure. For each strand in function A, it finds the best matching strand in function B. The overall similarity score aggregates these pairwise strand similarities. Because the comparison operates on sets rather than sequences, it is inherently robust to instruction reordering, register renaming, and other compiler-induced transformations.

Results

The approach was evaluated on real-world binaries compiled with different compilers (GCC and Clang/LLVM) and different optimization levels (O0 through O3). Key findings include:

High accuracy across compilers: the approach correctly identifies matching functions even when compiled with completely different compilers, achieving high precision and recall.
Robust to optimization levels: similarity scores remain high for the same function compiled at O0 vs. O3, despite radical code transformations.
Scalable: the strand-based decomposition enables efficient comparison of large binaries with thousands of functions.
Practical applications: the technique was applied to real vulnerability search tasks, successfully locating known vulnerabilities in binary firmware images.

By decomposing binary functions into data-flow strands and comparing them statistically, this work provides a principled and practical approach to binary similarity that is robust to the many transformations introduced by different compilers and optimization settings.

Yaniv David, Nimrod Partush, and Eran Yahav. 2016. Statistical Similarity of Binaries. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '16). ACM, New York, NY, USA, 266-280. https://doi.org/10.1145/2908080.2908126