TL;DR — When binaries are stripped of debug info, function names become meaningless labels like sub_401000. NERO uses neural networks on the control flow graph (CFG) to predict meaningful function names, aiding reverse engineering.
The Problem
Compilers translate human-readable source code into machine code. When software is released, it is typically stripped of all debug information — function names, variable names, type annotations — to reduce binary size and hinder reverse engineering. The result is a binary full of anonymous functions labeled sub_401000, sub_401080, and so on.
Reverse engineers — security researchers analyzing malware, vulnerability hunters auditing closed-source software, or developers understanding legacy systems — must manually read assembly code to figure out what each function does. This is a tedious, time-consuming process that demands deep expertise and can take hours per function.
The Key Idea
NERO treats each binary function as a control flow graph (CFG), where each node is a basic block — a straight-line sequence of assembly instructions with a single entry and exit point. The edges represent branches and jumps between blocks.
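This CFG view can be sketched as a small data structure: blocks holding instruction lists, plus edges for branches. A minimal Python sketch follows; the block labels and assembly instructions are made-up illustrations, not output of any real disassembler.

```python
# Toy CFG for a function resembling "min(a, b)". Labels and instructions
# are hypothetical examples, not taken from a real binary.
from dataclasses import dataclass

@dataclass
class BasicBlock:
    instructions: list  # straight-line assembly, single entry/exit

@dataclass
class CFG:
    blocks: dict  # block label -> BasicBlock
    edges: list   # (src, dst) label pairs for branches and jumps

cfg = CFG(
    blocks={
        "entry": BasicBlock(["push rbp", "mov rbp, rsp", "cmp edi, esi", "jge .L2"]),
        ".L1":   BasicBlock(["mov eax, edi", "jmp .L3"]),
        ".L2":   BasicBlock(["mov eax, esi"]),
        ".L3":   BasicBlock(["pop rbp", "ret"]),
    },
    edges=[("entry", ".L1"), ("entry", ".L2"), (".L1", ".L3"), (".L2", ".L3")],
)
```

In a stripped binary, only the instructions and edges survive; every label on the left is the kind of information NERO tries to recover.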
The model works in three stages:
- LSTM encodes each basic block. The assembly instructions within each block are fed sequentially into an LSTM, producing a fixed-length vector representation that captures the block's semantics.
- GNN propagates information across the CFG. A graph neural network passes messages between neighboring blocks, allowing each node to incorporate structural context — how it relates to the rest of the function.
- Sequence decoder predicts the function name. The aggregated graph representation is decoded into a sequence of sub-tokens (e.g., binary + search), generating a human-readable name for the function.
Interactive Demo
See how stripping erases meaningful names and how NERO recovers them from the binary's structure.
[Interactive demo: a compile-and-strip simulator. Select an example function and click "Compile & Strip" to see the full pipeline — source code, its control flow graph, and NERO's predicted name.]
How It Works
The LSTM reads the assembly instructions in each basic block one by one, building a dense vector that encodes the block's behavior. These per-block embeddings are then placed on the nodes of the function's CFG.
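A per-block LSTM encoder can be sketched in plain Python. The sketch below uses random, untrained weights and a made-up tokenization, so the resulting vector is arbitrary; in the actual model the weights are learned end-to-end.

```python
import math, random

random.seed(0)
DIM = 8  # embedding / hidden size (illustrative)

def rand_mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# One weight matrix and bias per LSTM gate: input, forget, output, candidate.
# Random here; learned during training in practice.
W = {g: rand_mat(DIM, 2 * DIM) for g in "ifoc"}
B = {g: [0.0] * DIM for g in "ifoc"}
EMB = {}  # token -> random embedding (hypothetical vocabulary)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def embed(tok):
    if tok not in EMB:
        EMB[tok] = [random.uniform(-0.1, 0.1) for _ in range(DIM)]
    return EMB[tok]

def lstm_encode(tokens):
    """Read instruction tokens one by one; return the final hidden state
    as the basic block's embedding."""
    h, c = [0.0] * DIM, [0.0] * DIM
    for tok in tokens:
        x = embed(tok) + h  # concatenate token embedding with previous hidden
        gates = {}
        for g in "ifoc":
            pre = [sum(W[g][j][k] * x[k] for k in range(2 * DIM)) + B[g][j]
                   for j in range(DIM)]
            gates[g] = [math.tanh(v) if g == "c" else sigmoid(v) for v in pre]
        c = [gates["f"][j] * c[j] + gates["i"][j] * gates["c"][j] for j in range(DIM)]
        h = [gates["o"][j] * math.tanh(c[j]) for j in range(DIM)]
    return h

block_vec = lstm_encode(["mov", "eax", "[rbp-4]", "cmp", "eax", "0", "jle"])
```

The key property is that the output has a fixed length regardless of how many instructions the block contains, so blocks of any size map to comparable vectors.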
The GNN performs several rounds of message passing on this graph. In each round, every node aggregates information from its neighbors, allowing the model to capture patterns that span multiple blocks — for instance, a loop structure or an error-handling branch.
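One such round can be illustrated with simple mean aggregation — a simplified stand-in for whatever learned update the real GNN applies. The graph, dimensions, and 0.5 mixing weight below are all illustrative choices.

```python
def message_passing_round(node_vecs, edges):
    """One round of mean-aggregation message passing.
    node_vecs: {node: vector}; edges: (src, dst) pairs, treated as undirected."""
    neighbors = {n: [] for n in node_vecs}
    for s, d in edges:
        neighbors[s].append(d)
        neighbors[d].append(s)
    updated = {}
    for n, vec in node_vecs.items():
        msgs = [node_vecs[m] for m in neighbors[n]]
        if msgs:
            mean = [sum(v[j] for v in msgs) / len(msgs) for j in range(len(vec))]
        else:
            mean = [0.0] * len(vec)
        # Update rule: average the node's own state with the aggregated message.
        updated[n] = [0.5 * (vec[j] + mean[j]) for j in range(len(vec))]
    return updated

# Toy 3-block CFG with 2-dimensional node states.
vecs = {"entry": [1.0, 0.0], "loop": [0.0, 1.0], "exit": [0.0, 0.0]}
edges = [("entry", "loop"), ("loop", "exit")]
vecs = message_passing_round(vecs, edges)
# After one round, "exit" has absorbed part of "loop"'s state;
# after k rounds, information travels up to k hops across the CFG.
```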
Finally, the graph-level representation (computed by aggregating all node embeddings) is fed into an attention-based sequence decoder. The decoder generates the function name one sub-token at a time, naturally handling compound names like binary_search or read_config_file.
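The decoding step can be sketched as greedy attention over node embeddings. Everything below — the two node vectors, the three-entry sub-token vocabulary, and the feed-context-back state update — is a hypothetical toy setup; with untrained embeddings the emitted name is arbitrary, whereas the trained model scores sub-tokens meaningfully.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decode(node_vecs, subtoken_vecs, max_len=3):
    """Greedy attention decoding: at each step, attend over the node
    embeddings, then emit the sub-token whose (toy) embedding best
    matches the attention context."""
    dims = range(len(node_vecs[0]))
    state = [sum(v[j] for v in node_vecs) / len(node_vecs) for j in dims]
    out = []
    for _ in range(max_len):
        attn = softmax([dot(state, v) for v in node_vecs])
        context = [sum(a * v[j] for a, v in zip(attn, node_vecs)) for j in dims]
        scores = {tok: dot(context, vec) for tok, vec in subtoken_vecs.items()}
        tok = max(scores, key=scores.get)
        if tok == "<eos>":
            break
        out.append(tok)
        state = context  # feed the context back as the next decoder state
    return "_".join(out)

nodes = [[0.4, 0.1], [0.2, 0.3]]                     # toy node embeddings
vocab = {"read": [0.5, 0.1], "file": [0.1, 0.6], "<eos>": [-1.0, -1.0]}
name = decode(nodes, vocab)
```

Decoding sub-tokens rather than whole names is what lets the model compose names it has never seen verbatim, as long as the individual pieces appeared in training.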
Results
NERO was evaluated on a large dataset of real-world stripped binaries. The model recovers meaningful function names even from complex binaries compiled at different optimization levels.
NERO significantly outperforms prior approaches at predicting function names from stripped binaries, demonstrating that the control flow graph structure — not just the raw instruction sequence — carries rich semantic information about a function's purpose.
The approach generalizes across different compilers and optimization levels, and the predicted names are often exact matches or close semantic equivalents of the original names.