Estimating Types in Binaries using Predictive Modeling

TL;DR: Stripped binaries lose all type information — every variable is just a register or memory location. This paper uses machine learning to predict variable types (int, float, pointer, struct) from binary code patterns, recovering high-level type information.

The Problem

When a program is compiled and the resulting binary is stripped, the rich type information from the original source code is lost. Variable names disappear. Type annotations vanish. Structs and their field layouts are flattened into raw memory offsets. What remains is a sea of registers and memory addresses, with no indication of whether eax holds an integer, a pointer, or the bit pattern of a float.

For reverse engineers, recovering types is one of the most time-consuming and error-prone parts of binary analysis. A human analyst stares at assembly code, mentally tracking how each register is used: Is it added to? Dereferenced? Passed to a floating-point instruction? These clues accumulate, but the process is painfully manual. And yet, the types are crucial — without them, understanding what a function actually does is nearly impossible.

The Key Idea

The central insight is that how a variable is used reveals its type. Even though the type annotation is gone, the ghost of the type persists in the instruction patterns:

Integers are manipulated with arithmetic instructions like ADD, SUB, IMUL, and compared with CMP.
Floats pass through the FPU or SSE pipeline — MOVSS, ADDSS, CVTSI2SS.
Pointers get dereferenced through memory operands such as MOV eax, [reg], used in LEA for address computation, and passed to or returned from memory-management functions such as malloc and free.
Structs show up as base-plus-offset access patterns — MOV eax, [ebx+0x8] followed by MOV ecx, [ebx+0xC].
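
To make these clues concrete, here is a minimal Python sketch that votes on a type for one register based on the opcodes and operand shapes listed above. It is a hand-written heuristic for illustration only, not the paper's method. The opcode sets, the (opcode, operands) tuple format, and the names usage_hint and guess_type are all assumptions.

```python
from collections import Counter

# Hypothetical opcode groups taken from the examples above (illustration only).
INT_OPS = {"add", "sub", "imul", "cmp"}
FLOAT_OPS = {"movss", "addss", "cvtsi2ss"}

def usage_hint(opcode, operands, var):
    """Return a coarse type hint for one instruction, or None if it gives no clue."""
    for op in operands:
        if op.startswith("[") and var in op:
            # var appears inside a memory operand, so it holds an address.
            return "struct" if "+" in op else "pointer"
    if var in operands:
        if opcode in FLOAT_OPS:
            return "float"
        if opcode in INT_OPS:
            return "int"
    return None

def guess_type(instructions, var):
    """Majority vote over the hints collected from every instruction."""
    votes = Counter()
    for opcode, operands in instructions:
        hint = usage_hint(opcode, operands, var)
        if hint:
            votes[hint] += 1
    return votes.most_common(1)[0][0] if votes else "unknown"

# Toy listing in an assumed (opcode, [operands]) format.
code = [
    ("mov", ["eax", "[ebx+0x8]"]),
    ("mov", ["ecx", "[ebx+0xC]"]),
    ("add", ["eax", "ecx"]),
    ("cmp", ["eax", "0x10"]),
    ("movss", ["xmm0", "[esi]"]),
    ("addss", ["xmm0", "xmm1"]),
]

for reg in ("eax", "ebx", "esi", "xmm0"):
    print(reg, guess_type(code, reg))
```

Fixed rules like these break down quickly on real code, which is the motivation for learning the patterns statistically instead.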

The paper extracts these usage patterns as sequences of instructions (object tracelets) and uses Statistical Language Models (SLMs) based on Variable-order Markov Models (VMMs) to predict types. The SLMs capture the characteristic instruction patterns for each type by learning the statistical regularities in how typed values are manipulated.

How It Works

The approach follows a structured pipeline from raw binary to typed variables:

1. Disassemble: binary to assembly instructions
2. Identify variables: registers and stack locations
3. Extract features: usage patterns per variable
4. SLM prediction: statistical type inference
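
Steps 2 and 3 can be sketched roughly as follows, assuming step 1 has already produced a list of (opcode, operands) tuples. The register set, the ebp-relative stack-slot pattern, and the helper names variables_in and collect_usage are assumptions made for this example; the paper's actual variable identification is more involved.

```python
import re
from collections import defaultdict

# Assumed register set and stack-slot syntax, for illustration only.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi"}
STACK_SLOT = re.compile(r"\[ebp[+-]0x[0-9a-fA-F]+\]")
TOKEN = re.compile(r"[a-z0-9]+")

def variables_in(operand):
    """Yield the variables one operand mentions: a stack slot, or any registers."""
    if STACK_SLOT.fullmatch(operand):
        yield operand
        return
    for token in TOKEN.findall(operand):
        if token in REGISTERS:
            yield token

def collect_usage(instructions):
    """Map each variable to the ordered list of opcodes that touch it."""
    usage = defaultdict(list)
    for opcode, operands in instructions:
        seen = set()
        for operand in operands:
            for var in variables_in(operand):
                if var not in seen:  # count each instruction once per variable
                    seen.add(var)
                    usage[var].append(opcode)
    return dict(usage)

listing = [
    ("mov", ["eax", "[ebp-0x8]"]),
    ("add", ["eax", "0x1"]),
    ("mov", ["[ebp-0x8]", "eax"]),
    ("mov", ["ecx", "[ebx+0x4]"]),
]
usage = collect_usage(listing)
print(usage["eax"])
print(usage["[ebp-0x8]"])
```

Each per-variable opcode sequence is a crude stand-in for the usage pattern that the later prediction step consumes.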

Feature Extraction

For each variable (register or stack slot), the system extracts a rich set of features that capture how the variable is used in the binary code:

Instruction context: Which opcodes touch this variable? Arithmetic (ADD, MUL), floating-point (MOVSS, ADDSD), memory access (LEA, dereferences), or comparison (CMP, TEST)?
Data flow: Where does the value come from and where does it go? If it flows into a known library function like printf or malloc, the expected argument types provide strong signal.
Access patterns: Is the variable used as a base register with varying offsets? That is a strong indicator of a struct or array pointer.
Constants: Are constants involved? Small constants in comparisons suggest integer loop bounds. Hexadecimal masks suggest bit manipulation on integers.
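
As a rough illustration of three of these feature families (instruction context, access patterns, and constants), the sketch below counts a few simple signals for one register. The feature names, opcode groupings, and the 0xFF threshold for a "small" constant are assumptions for this example, and data-flow features are left out to keep it short.

```python
import re

# Assumed opcode groupings, for illustration only.
ARITH = {"add", "sub", "mul", "imul"}
FLOAT = {"movss", "addss", "addsd"}
COMPARE = {"cmp", "test"}
BITWISE = {"and", "or", "xor"}
BASE_OFFSET = re.compile(r"\[(\w+)\+(0x[0-9a-fA-F]+)\]")
IMMEDIATE = re.compile(r"0x[0-9a-fA-F]+")

def extract_features(instructions, var):
    """Count simple usage signals for one variable in a toy instruction list."""
    feats = {"arith": 0, "float": 0, "compare": 0,
             "small_const": 0, "mask_const": 0, "distinct_offsets": 0}
    offsets = set()
    for opcode, operands in instructions:
        # Access patterns: var used as a base register with an offset.
        for op in operands:
            m = BASE_OFFSET.fullmatch(op)
            if m and m.group(1) == var:
                offsets.add(int(m.group(2), 16))
        if var not in operands:
            continue
        # Instruction context: which kind of opcode touches var directly.
        if opcode in ARITH:
            feats["arith"] += 1
        elif opcode in FLOAT:
            feats["float"] += 1
        elif opcode in COMPARE:
            feats["compare"] += 1
        # Constants: small bounds in comparisons, masks in bitwise operations.
        for op in operands:
            if IMMEDIATE.fullmatch(op):
                if opcode == "cmp" and int(op, 16) <= 0xFF:
                    feats["small_const"] += 1
                elif opcode in BITWISE:
                    feats["mask_const"] += 1
    feats["distinct_offsets"] = len(offsets)
    return feats

listing = [
    ("mov", ["eax", "[ebx+0x8]"]),
    ("mov", ["ecx", "[ebx+0xC]"]),
    ("add", ["eax", "ecx"]),
    ("and", ["eax", "0xFF00"]),
    ("cmp", ["eax", "0xA"]),
]
print(extract_features(listing, "eax"))
print(extract_features(listing, "ebx"))
```

In this toy listing, eax collects arithmetic, mask, and comparison signals, while ebx is only ever seen as a base register with two distinct offsets.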

SLM-Based Prediction

For each candidate type, the method trains a separate Statistical Language Model that captures the typical instruction sequences for variables of that type. Given a new variable's tracelet (the sequence of instructions that use it), the method queries each type-specific SLM and assigns the type whose model gives the highest probability. The VMM-based SLMs can capture variable-length instruction patterns, making them more expressive than fixed-order n-grams.
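
The train-one-model-per-type, pick-the-most-likely scheme can be sketched as below. This is not the paper's VMM implementation: it swaps in a simple back-off model with add-one smoothing that conditions on the longest context seen in training, which only loosely mimics variable-order behavior. The class name BackoffModel, the function predict_type, and the tiny training tracelets are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

class BackoffModel:
    """Toy stand-in for a VMM: conditions on the longest context seen in training."""

    def __init__(self, max_order=2):
        self.max_order = max_order
        self.counts = defaultdict(Counter)  # context tuple -> next-opcode counts
        self.vocab = set()

    def train(self, tracelets):
        for seq in tracelets:
            self.vocab.update(seq)
            for i, sym in enumerate(seq):
                # Record sym under every context length from 0 up to max_order.
                for k in range(self.max_order + 1):
                    if i - k >= 0:
                        self.counts[tuple(seq[i - k:i])][sym] += 1

    def log_prob(self, seq, vocab_size):
        total = 0.0
        for i, sym in enumerate(seq):
            # Back off from the longest context to shorter ones until one was seen.
            for k in range(min(self.max_order, i), -1, -1):
                ctx = tuple(seq[i - k:i])
                if ctx in self.counts:
                    break
            seen = self.counts.get(ctx, Counter())
            # Add-one smoothing keeps unseen opcodes from zeroing the probability.
            total += math.log((seen[sym] + 1) / (sum(seen.values()) + vocab_size))
        return total

def predict_type(models, tracelet):
    """Return the type whose model assigns the tracelet the highest log-probability."""
    vocab = set(tracelet).union(*(m.vocab for m in models.values()))
    return max(models, key=lambda t: models[t].log_prob(tracelet, len(vocab)))

# Made-up training tracelets (opcode sequences) per type.
training = {
    "int": [["mov", "add", "cmp"], ["mov", "sub", "cmp"], ["add", "imul", "cmp"]],
    "float": [["movss", "addss", "movss"], ["cvtsi2ss", "addss", "movss"]],
    "pointer": [["lea", "mov", "mov"], ["mov", "lea", "mov"]],
}
models = {}
for type_name, tracelets in training.items():
    models[type_name] = BackoffModel()
    models[type_name].train(tracelets)

print(predict_type(models, ["mov", "add", "cmp"]))
print(predict_type(models, ["movss", "addss"]))
```

Sharing one vocabulary size across all models keeps the smoothed probabilities comparable between types, which matters because the final decision is a direct comparison of the per-model scores.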

Results

The approach was evaluated on real-world binaries compiled from C programs with known types, which serve as ground truth. The SLM-based predictor achieves high accuracy in predicting variable types in stripped x86 binaries across the different type categories. Pointer and integer types are the easiest to recover, which shows that instruction-level usage patterns carry a strong type signal even in stripped binaries.

The results demonstrate that machine learning can automate a substantial portion of the type recovery task that previously required expert human analysis. The predicted types can be fed into decompilers to produce more readable and accurate decompiled code, directly benefiting reverse engineering workflows.

@inproceedings{katz2016estimating,
  title={Estimating Types in Binaries using Predictive Modeling},
  author={Katz, Omer and El-Yaniv, Ran and Yahav, Eran},
  booktitle={Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL)},
  year={2016},
  publisher={ACM}
}