OOPSLA 2020

Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs

Yaniv David, Uri Alon, Eran Yahav

TL;DR — When binaries are stripped of debug info, function names disappear and disassemblers fall back to meaningless placeholder labels like sub_401000. NERO uses neural networks on the control flow graph (CFG) to predict meaningful function names, aiding reverse engineering.

The Problem

Compilers translate human-readable source code into machine code. When software is released, it is typically stripped of all debug information — function names, variable names, type annotations — to reduce binary size and hinder reverse engineering. The result is a binary full of anonymous functions labeled sub_401000, sub_401080, and so on.

Reverse engineers — security researchers analyzing malware, vulnerability hunters auditing closed-source software, or developers understanding legacy systems — must manually read assembly code to figure out what each function does. This is a tedious, time-consuming process that demands deep expertise and can take hours per function.

The Key Idea

NERO treats each binary function as a control flow graph (CFG), where each node is a basic block — a straight-line sequence of assembly instructions with a single entry and exit point. The edges represent branches and jumps between blocks.
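To make the basic-block notion concrete, here is a toy sketch of how a linear instruction list splits into blocks. This is illustrative only, not NERO's actual lifter: real tools work on disassembled machine code with addresses, while here instructions carry simplified "L<n>" labels, and the branch mnemonic set is a small placeholder.

```python
# Toy basic-block splitter. A "leader" starts a new block: the first
# instruction, any branch target, and any instruction after a branch.
BRANCHES = {"jmp", "je", "jne", "jg", "jl", "ret"}

def split_blocks(instrs):
    # instrs: list of (label, mnemonic, operand) tuples
    targets = {op for _, mn, op in instrs if mn in BRANCHES and op}
    leaders = set()
    for i, (label, mn, _) in enumerate(instrs):
        if i == 0 or label in targets:
            leaders.add(i)
        if mn in BRANCHES and i + 1 < len(instrs):
            leaders.add(i + 1)
    starts = sorted(leaders)
    return [instrs[s:e] for s, e in zip(starts, starts[1:] + [len(instrs)])]

code = [
    ("L0", "cmp", "eax"),
    ("L1", "je",  "L4"),   # branch: L2 and target L4 both start blocks
    ("L2", "mov", "ebx"),
    ("L3", "jmp", "L5"),
    ("L4", "xor", "ebx"),
    ("L5", "ret", ""),
]
blocks = split_blocks(code)   # four blocks: [L0,L1], [L2,L3], [L4], [L5]
```

Each resulting block is straight-line code; the CFG edges then connect a block to the blocks its final branch can reach.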

The model works in three stages:

  1. LSTM encodes each basic block. The assembly instructions within each block are fed sequentially into an LSTM, producing a fixed-length vector representation that captures the block's semantics.
  2. GNN propagates information across the CFG. A graph neural network passes messages between neighboring blocks, allowing each node to incorporate structural context — how it relates to the rest of the function.
  3. Sequence decoder predicts the function name. The aggregated graph representation is decoded into a sequence of sub-tokens (e.g., binary + search), generating a human-readable name for the function.

Interactive Demo

See how stripping erases meaningful names and how NERO recovers them from the binary's structure.


How It Works

Step 1: LSTM (basic blocks) → Step 2: GNN (CFG structure) → Step 3: Decoder (name prediction)

The LSTM reads the assembly instructions in each basic block one by one, building a dense vector that encodes the block's behavior. These per-block embeddings are then placed on the nodes of the function's CFG.
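As a minimal sketch of this step (illustrative numpy, not the paper's implementation; the embedding table and weight matrices below are random placeholders), a single-layer LSTM cell folds a block's instruction tokens into one fixed-length vector:

```python
import numpy as np

def encode_block(tokens, emb, Wx, Wh, b):
    """Fold a block's instruction tokens into one vector with an LSTM cell.
    Wx, Wh, b stack the four gates (input, forget, output, candidate)."""
    d = Wh.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for tok in tokens:
        gates = Wx @ emb[tok] + Wh @ h + b   # all four gates at once, (4d,)
        i, f, o, g = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h  # final hidden state = block embedding

rng = np.random.default_rng(0)
d, e = 8, 4                                  # hidden and embedding sizes
emb = {t: rng.normal(size=e) for t in ["mov", "cmp", "je"]}
Wx = rng.normal(size=(4 * d, e))
Wh = rng.normal(size=(4 * d, d))
vec = encode_block(["mov", "cmp", "je"], emb, Wx, Wh, np.zeros(4 * d))
```

The final hidden state `vec` is what gets attached to the block's node in the CFG.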

The GNN performs several rounds of message passing on this graph. In each round, every node aggregates information from its neighbors, allowing the model to capture patterns that span multiple blocks — for instance, a loop structure or an error-handling branch.
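One propagation round can be sketched as follows. This is a deliberately simple sum-and-mix update with random placeholder weights, standing in for the learned, gated updates a trained GNN would use:

```python
import numpy as np

def gnn_round(H, edges, W_self, W_msg):
    """H: (n, d) node embeddings; edges: (src, dst) CFG edges.
    Each node sums incoming neighbor states, then mixes them with its own."""
    msgs = np.zeros_like(H)
    for src, dst in edges:
        msgs[dst] += H[src]
    return np.tanh(H @ W_self + msgs @ W_msg)

rng = np.random.default_rng(1)
n, d = 4, 8
H = rng.normal(size=(n, d))
W_self, W_msg = rng.normal(size=(d, d)), rng.normal(size=(d, d))
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]     # a diamond-shaped if/else CFG
for _ in range(3):                           # several message-passing rounds
    H = gnn_round(H, edges, W_self, W_msg)
```

After a few rounds, node 3's embedding reflects both branches that flow into it, which is exactly the kind of multi-block pattern the text describes.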

Finally, the graph-level representation (computed by aggregating all node embeddings) is fed into an attention-based sequence decoder. The decoder generates the function name one sub-token at a time, naturally handling compound names like binary_search or read_config_file.
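The decoding step, again as an illustrative numpy sketch (the real decoder is a trained attention LSTM; the weights and the tiny sub-token vocabulary here are placeholders), attends over the node embeddings and greedily emits sub-tokens:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode(H, W_out, vocab, max_len=4):
    """Greedy decoding: at each step, attend over node embeddings H using
    the previous context as the query, then pick the best sub-token."""
    name, state = [], H.mean(axis=0)         # graph-level start state
    for _ in range(max_len):
        attn = softmax(H @ state)            # attention weights over nodes
        ctx = attn @ H                       # context vector
        tok = vocab[int(np.argmax(W_out @ ctx))]
        if tok == "<end>":
            break
        name.append(tok)
        state = ctx                          # feed context back as query
    return "_".join(name)

rng = np.random.default_rng(2)
vocab = ["binary", "search", "read", "<end>"]
H = rng.normal(size=(5, 8))                  # node embeddings from the GNN
W_out = rng.normal(size=(len(vocab), 8))
predicted = decode(H, W_out, vocab)
```

Emitting sub-tokens rather than whole names is what lets the model compose names like read_config_file it has never seen verbatim in training.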

Results

NERO was evaluated on a large dataset of real-world stripped binaries, and accurately recovers meaningful function names even from complex, optimized code.

NERO significantly outperforms prior approaches at predicting function names from stripped binaries, demonstrating that the control flow graph structure — not just the raw instruction sequence — carries rich semantic information about a function's purpose.

The approach generalizes across different compilers and optimization levels, and the predicted names are often exact matches or close semantic equivalents of the original names.

@article{david2020neural,
  title={Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs},
  author={David, Yaniv and Alon, Uri and Yahav, Eran},
  journal={Proceedings of the ACM on Programming Languages},
  volume={4},
  number={OOPSLA},
  year={2020},
  publisher={ACM}
}