Similarity of Binaries through Re-optimization

TL;DR — Comparing binary functions compiled by different compilers is hard because they produce very different code. This paper's key insight: re-optimize both binaries through the same compiler's optimization pipeline, bringing them to a canonical form that reveals their similarity.

The Problem

Binary similarity is essential for finding vulnerabilities across software, detecting plagiarism, and understanding malware variants. But the same source code looks completely different when compiled with GCC vs. Clang, or with -O0 vs. -O3.

Consider a simple function like strlen. GCC might unroll the loop and use SSE instructions, while Clang might emit a compact byte-scanning loop. The register choices differ, the instruction sequences differ, even the control flow graph structure differs. Traditional binary comparison techniques that rely on syntactic matching or graph isomorphism struggle with these compiler-induced variations.

This is not a minor issue. The differences introduced by compiler choice and optimization level can be far greater than the differences between genuinely unrelated functions, making naive similarity measures unreliable.

The Key Idea

The core insight is elegantly simple: if two binaries were compiled from the same source but look different because of different compilers, we can neutralize those differences by re-optimizing both through the same compiler.

The approach works in three steps:

Lift both binary functions to an intermediate representation (LLVM IR) using binary-to-IR lifting.
Re-optimize both lifted IR representations through LLVM's optimization pipeline. The optimizer canonicalizes the code, stripping away compiler-specific quirks.
Compare the re-optimized results. Since both went through the same optimization pipeline, compiler-specific differences are largely eliminated, and the underlying algorithmic similarity is revealed.

Binary A
(GCC)

→

Lift to IR

→

LLVM Re-optimize

→

Compare

←

LLVM Re-optimize

←

Lift to IR

←

Binary B
(Clang)

Interactive Demo: Re-optimization in Action

Select a function below, then click Re-optimize to see how running both binaries through the same optimization pipeline reveals their hidden similarity.

Binary A (GCC -O2)

Binary B (Clang -O3)

Similarity: 32%

32%

Click Re-optimize to lift both binaries to LLVM IR and re-optimize them through the same pipeline.

How It Works

Step 1: Lifting to IR

The binary is disassembled and translated into LLVM's intermediate representation. This is a lossy but surprisingly effective transformation. While perfect decompilation is undecidable in general, modern lifting tools (like McSema or rev.ng) handle the common cases well enough for the similarity task. The lifted IR preserves the function's computational semantics even if some low-level details are approximated.

Step 2: LLVM Re-optimization

The lifted IR is then fed through LLVM's standard optimization passes: constant propagation, dead code elimination, instruction combining, loop canonicalization, and more. These passes normalize the code into a canonical form. Two different IR representations of the same algorithm will converge toward the same optimized form, because the optimizer makes the same decisions when faced with equivalent computations.

Step 3: Strand-based Comparison

After re-optimization, the comparison uses strands -- sequences of dependent instructions extracted via data-flow analysis. Strands capture the essential computation of a function while being robust to instruction reordering and register renaming. Two functions are similar if their sets of strands overlap significantly.

Why strands? Even after re-optimization, minor differences may remain (e.g., different register allocation choices). Strands abstract away these superficial differences by focusing on data-flow dependencies rather than exact instruction sequences.

Results

The re-optimization approach significantly outperforms direct binary comparison methods. On a benchmark of functions compiled with different compilers (GCC, Clang, ICC) and optimization levels (O0 through O3), the approach achieves:

Cross-compiler matching: Functions compiled by different compilers that were previously indistinguishable from unrelated functions can now be matched with high confidence.
Cross-optimization matching: The gap between -O0 and -O3 code, which typically destroys similarity, is largely bridged by re-optimization.
Scalability: The strand-based comparison is efficient enough to handle large-scale binary databases, enabling practical vulnerability search across entire software repositories.

The method was evaluated on real-world vulnerability search tasks, successfully identifying known vulnerabilities in binaries compiled under different settings than the reference vulnerability signature.

Yaniv David, Nimrod Partush, and Eran Yahav. 2017. Similarity of Binaries through Re-optimization. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '17). ACM, New York, NY, USA, 79-94. https://doi.org/10.1145/3062341.3062387