POPL 2016

Estimating Types in Binaries using Predictive Modeling

Omer Katz, Ran El-Yaniv, Eran Yahav

TL;DR: Stripped binaries lose all type information — every variable is just a register or memory location. This paper uses machine learning to predict variable types (int, float, pointer, struct) from binary code patterns, recovering high-level type information.

The Problem

When a compiler produces a binary, it strips away all the rich type information from the original source code. Variable names disappear. Type annotations vanish. Structs and their field layouts are flattened into raw memory offsets. What remains is a sea of registers and memory addresses — no indication of whether eax holds an integer, a pointer, or the bit pattern of a float.

For reverse engineers, recovering types is one of the most time-consuming and error-prone parts of binary analysis. A human analyst stares at assembly code, mentally tracking how each register is used: Is it added to? Dereferenced? Passed to a floating-point instruction? These clues accumulate, but the process is painfully manual. And yet, the types are crucial — without them, understanding what a function actually does is nearly impossible.

The Key Idea

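The central insight is that how a variable is used reveals its type. Even though the type annotation is gone, the ghost of the type persists in the instruction patterns: a value that is dereferenced behaves like a pointer, and a value passed to a floating-point instruction behaves like a float.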

The paper extracts these usage patterns as sequences of instructions (object tracelets) and uses Statistical Language Models (SLMs) based on Variable-order Markov Models (VMMs) to predict types. The SLMs capture the characteristic instruction patterns for each type by learning the statistical regularities in how typed values are manipulated.

How It Works

The approach follows a structured pipeline from raw binary to typed variables:

1. Disassemble: binary to assembly instructions
2. Identify variables: registers and stack locations
3. Extract features: usage patterns per variable
4. SLM prediction: statistical type inference

Feature Extraction

For each variable (register or stack slot), the system extracts features that capture how the variable is used in the binary code. The central feature is the variable's tracelet: the sequence of instructions that use it.
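As a rough illustration of what collecting a usage pattern might look like, the sketch below walks a toy instruction list and records every instruction that touches one variable. The `(mnemonic, operands)` encoding, the `extract_tracelet` helper, and the `mnemonic:role` event tokens are assumptions made for this example; they are not the paper's actual tracelet format.

```python
import re
from typing import List, Sequence, Tuple

# Hypothetical toy encoding: (mnemonic, operands), operands as plain strings.
Instruction = Tuple[str, Tuple[str, ...]]


def extract_tracelet(instructions: Sequence[Instruction], var: str) -> List[str]:
    """Return the usage events for `var`, in program order.

    An event is "<mnemonic>:op<N>" when `var` appears directly as operand N,
    or "<mnemonic>:deref" when `var` is used inside a memory operand such as
    "[eax+4]" (i.e. it is being dereferenced).
    """
    inside_brackets = re.compile(rf"\b{re.escape(var)}\b")
    tracelet = []
    for mnemonic, operands in instructions:
        for pos, operand in enumerate(operands):
            if operand == var:
                tracelet.append(f"{mnemonic}:op{pos}")
            elif operand.startswith("[") and inside_brackets.search(operand):
                tracelet.append(f"{mnemonic}:deref")
    return tracelet


# Illustrative snippet: load a value, adjust it, then read through it.
code = [
    ("mov", ("eax", "[ebp-8]")),
    ("add", ("eax", "4")),
    ("mov", ("ecx", "[eax]")),
    ("push", ("ecx",)),
]

eax_tracelet = extract_tracelet(code, "eax")
print(eax_tracelet)
```

Here the `mov:deref` event is the kind of clue an analyst would read as "this register holds an address"; a real system would need proper disassembly and data-flow tracking rather than string matching.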

SLM-Based Prediction

For each candidate type, the method trains a separate Statistical Language Model that captures the typical instruction sequences for variables of that type. Given a new variable's tracelet (the sequence of instructions that use it), the method queries each type-specific SLM and assigns the type whose model gives the highest probability. The VMM-based SLMs can capture variable-length instruction patterns, making them more expressive than fixed-order n-grams.
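The train-one-model-per-type, pick-the-most-likely-type scheme can be sketched as follows. This is a simplified stand-in, not the paper's VMM: it uses the longest previously seen context (up to `max_order`) with add-one smoothing. The class name, the event tokens, and the tiny training set are all hypothetical.

```python
import math
from collections import defaultdict


class BackoffSequenceModel:
    """Toy variable-length-context model (assumption: not the paper's VMM)."""

    def __init__(self, max_order=2):
        self.max_order = max_order
        # counts[context][symbol] = times `symbol` followed `context`
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, sequences):
        for seq in sequences:
            self.vocab.update(seq)
            for i, sym in enumerate(seq):
                # Record the symbol under every context length that fits.
                for k in range(min(self.max_order, i) + 1):
                    self.counts[tuple(seq[i - k:i])][sym] += 1

    def log_prob(self, seq, vocab_size):
        total = 0.0
        for i, sym in enumerate(seq):
            # Back off from the longest context to the empty one.
            for k in range(min(self.max_order, i), -1, -1):
                ctx = tuple(seq[i - k:i])
                if ctx in self.counts:
                    break
            followers = self.counts.get(ctx, {})
            seen = sum(followers.values())
            # Add-one smoothing keeps unseen events at non-zero probability.
            total += math.log((followers.get(sym, 0) + 1) / (seen + vocab_size))
        return total


def predict_type(models, tracelet):
    """Score the tracelet under every type's model; return the best label."""
    shared_vocab = set().union(*(m.vocab for m in models.values()))
    vocab_size = len(shared_vocab) + 1  # +1 slot for unseen events
    scores = {t: m.log_prob(tracelet, vocab_size) for t, m in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical labeled tracelets, for illustration only.
training = {
    "pointer": [["mov:op0", "add:op0", "mov:deref"],
                ["mov:op0", "mov:deref", "mov:deref"]],
    "int": [["mov:op0", "add:op0", "imul:op0"],
            ["mov:op0", "cmp:op0", "add:op0"]],
}
models = {}
for type_name, tracelets in training.items():
    models[type_name] = BackoffSequenceModel(max_order=2)
    models[type_name].train(tracelets)

label, scores = predict_type(models, ["mov:op0", "mov:deref"])
print(label)
```

On this toy data the query is labeled `pointer`, because the dereference event is likely under the pointer model and only receives smoothing mass under the integer model. Sharing one vocabulary size across models keeps the per-type scores comparable.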

Results

The approach was evaluated on real-world binaries compiled from C programs with known types, which served as ground truth.

The SLM-based predictor achieves high accuracy on predicting variable types in stripped x86 binaries. Pointer types and integer types are the easiest to recover, demonstrating that instruction-level usage patterns carry strong type signal even in stripped binaries.

The results demonstrate that machine learning can automate a substantial portion of the type recovery task that previously required expert human analysis. The predicted types can be fed into decompilers to produce more readable and accurate decompiled code, directly benefiting reverse engineering workflows.

@inproceedings{katz2016estimating,
  title={Estimating Types in Binaries using Predictive Modeling},
  author={Katz, Omer and El-Yaniv, Ran and Yahav, Eran},
  booktitle={Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL)},
  year={2016},
  publisher={ACM}
}