PLDI 2017

Similarity of Binaries through Re-optimization

Yaniv David, Nimrod Partush, Eran Yahav

TL;DR — Comparing binary functions compiled by different compilers is hard because they produce very different code. This paper's key insight: re-optimize both binaries through the same compiler's optimization pipeline, bringing them to a canonical form that reveals their similarity.

The Problem

Binary similarity is essential for finding vulnerabilities across software, detecting plagiarism, and understanding malware variants. But the same source code looks completely different when compiled with GCC vs. Clang, or with -O0 vs. -O3.

Consider a simple function like strlen. GCC might unroll the loop and use SSE instructions, while Clang might emit a compact byte-scanning loop. The register choices differ, the instruction sequences differ, even the control flow graph structure differs. Traditional binary comparison techniques that rely on syntactic matching or graph isomorphism struggle with these compiler-induced variations.

This is not a minor issue. The differences introduced by compiler choice and optimization level can be far greater than the differences between genuinely unrelated functions, making naive similarity measures unreliable.

The Key Idea

The core insight is elegantly simple: if two binaries were compiled from the same source but look different because of different compilers, we can neutralize those differences by re-optimizing both through the same compiler.

The approach works in three steps:

  1. Lift both binary functions to an intermediate representation (LLVM IR) using binary-to-IR lifting.
  2. Re-optimize both lifted IR representations through LLVM's optimization pipeline. The optimizer canonicalizes the code, stripping away compiler-specific quirks.
  3. Compare the re-optimized results. Since both went through the same optimization pipeline, compiler-specific differences are largely eliminated, and the underlying algorithmic similarity is revealed.
Binary A
(GCC)
Lift to IR
LLVM Re-optimize
Compare
LLVM Re-optimize
Lift to IR
Binary B
(Clang)

Interactive Demo: Re-optimization in Action

Select a function below, then click Re-optimize to see how running both binaries through the same optimization pipeline reveals their hidden similarity.

Binary A (GCC -O2)

Binary B (Clang -O3)

Similarity: 32%
32%

Click Re-optimize to lift both binaries to LLVM IR and re-optimize them through the same pipeline.

How It Works

Step 1: Lifting to IR

The binary is disassembled and translated into LLVM's intermediate representation. This is a lossy but surprisingly effective transformation. While perfect decompilation is undecidable in general, modern lifting tools (like McSema or rev.ng) handle the common cases well enough for the similarity task. The lifted IR preserves the function's computational semantics even if some low-level details are approximated.

Step 2: LLVM Re-optimization

The lifted IR is then fed through LLVM's standard optimization passes: constant propagation, dead code elimination, instruction combining, loop canonicalization, and more. These passes normalize the code into a canonical form. Two different IR representations of the same algorithm will converge toward the same optimized form, because the optimizer makes the same decisions when faced with equivalent computations.

Step 3: Strand-based Comparison

After re-optimization, the comparison uses strands -- sequences of dependent instructions extracted via data-flow analysis. Strands capture the essential computation of a function while being robust to instruction reordering and register renaming. Two functions are similar if their sets of strands overlap significantly.

Why strands? Even after re-optimization, minor differences may remain (e.g., different register allocation choices). Strands abstract away these superficial differences by focusing on data-flow dependencies rather than exact instruction sequences.

Results

The re-optimization approach significantly outperforms direct binary comparison methods. On a benchmark of functions compiled with different compilers (GCC, Clang, ICC) and optimization levels (O0 through O3), the approach achieves:

The method was evaluated on real-world vulnerability search tasks, successfully identifying known vulnerabilities in binaries compiled under different settings than the reference vulnerability signature.

Yaniv David, Nimrod Partush, and Eran Yahav. 2017. Similarity of Binaries through Re-optimization. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '17). ACM, New York, NY, USA, 79-94. https://doi.org/10.1145/3062341.3062387