OOPSLA 2020

Adversarial Examples for Models of Code

Noam Yefet, Uri Alon, Eran Yahav

TL;DR — Neural models of code (like code2vec) can be fooled by simple variable renaming. DAMP (Discrete Adversarial Manipulation of Programs) systematically renames variables to change a model's prediction while preserving program semantics.

The Problem

Neural models trained on source code have achieved impressive results on tasks like method name prediction, code summarization, and bug detection. But how robust are they? It turns out, not very.

These models often rely heavily on variable names rather than the underlying program structure. An attacker can exploit this by applying a simple, semantics-preserving transformation: renaming variables. The renamed program computes exactly the same thing, but the model produces an entirely different (and wrong) prediction.

This is a serious concern. If a model can be fooled by renaming arr to sorted, how much can we trust its predictions in downstream applications like vulnerability detection or code review?
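As a tiny illustration of why this matters: the two Python functions below are behaviorally identical, yet a name-sensitive model sees very different inputs. The identifiers are illustrative, not taken from the paper.

```python
# A linear-scan membership check.
def contains(arr, target):
    for elem in arr:
        if elem == target:
            return True
    return False

# The same method after an adversarial-style rename of `arr` to `sorted_list`.
# The computation is unchanged; only the identifier differs.
def contains_renamed(sorted_list, target):
    for elem in sorted_list:
        if elem == target:
            return True
    return False
```

Any test that passes for one passes for the other; only a model that leans on identifier names can tell them apart.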

The Key Idea

DAMP — Discrete Adversarial Manipulation of Programs — is a targeted attack algorithm for models of code. The key insight is to treat variable renaming as a discrete optimization problem:

DAMP Algorithm

1. Compute variable influence. Use gradient-based analysis to determine which variable in the program has the most influence on the model's current prediction.
2. Find the best replacement name. Search for a new name that maximally shifts the model's prediction away from the correct label (untargeted) or toward a chosen wrong label (targeted).
3. Rename and repeat. Apply the renaming and iterate. At each step the program remains semantically identical, but the model's confidence in the correct answer drops.

Because the transformation is limited to renaming local variables, the adversarial example is guaranteed to be a valid, semantics-preserving program. The attacker never changes what the code does — only what the model thinks it does.
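The three steps can be sketched as a brute-force untargeted loop. Everything here is a toy stand-in: `toy_predict` mimics a name-sensitive model by substring matching, whereas the real DAMP queries a model such as code2vec and uses gradients to rank variables.

```python
def toy_predict(var_names):
    """Toy stand-in for a name-sensitive model: score each candidate
    label by how many variable names contain it as a substring."""
    labels = ["sort", "contains", "reverse"]
    return max(labels, key=lambda lab: sum(lab in v for v in var_names))

def damp_untargeted(var_names, vocab):
    """Return the first single rename that flips the prediction.
    (DAMP proper iterates, using gradients to pick the variable.)"""
    original = toy_predict(var_names)
    for i in range(len(var_names)):                  # step 1 (brute force here)
        for candidate in vocab:                      # step 2: try candidate names
            trial = var_names[:i] + [candidate] + var_names[i + 1:]
            if toy_predict(trial) != original:       # step 3: prediction flipped
                return trial, toy_predict(trial)
    return var_names, original

adv, pred = damp_untargeted(["sorted", "idx"], ["containsFlag", "tmp"])
# pred flips from "sort" to "contains" under this toy model
```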

Interactive Demo

[Interactive demo in the original post: a Java method alongside a code2vec model's prediction (initially sort, 92% confidence). Clicking Attack runs DAMP, which iteratively renames variables until the model is fooled.]

How It Works

Gradient-Based Variable Importance

To determine which variable to rename first, DAMP computes the gradient of the loss with respect to each variable's embedding. Variables whose embeddings have the largest gradient magnitude are the ones the model relies on most — and therefore the best targets for manipulation.
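In code, the selection rule is just an argmax over gradient norms. The gradients below are invented numbers standing in for what one backward pass through a real model would produce.

```python
import math

# Hypothetical gradients of the loss w.r.t. each variable's embedding,
# as would come out of one backward pass; the values are made up.
grads = {
    "arr":    [0.9, -1.2, 0.4],
    "idx":    [0.1, 0.0, 0.05],
    "result": [0.2, -0.1, 0.1],
}

def most_influential(grads):
    # Selection rule: the variable whose embedding gradient has the
    # largest L2 norm is the one the model relies on most.
    return max(grads, key=lambda v: math.sqrt(sum(g * g for g in grads[v])))

print(most_influential(grads))  # "arr" for these invented gradients
```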

Targeted Renaming

Once the most influential variable is identified, DAMP searches the model's vocabulary for a replacement name that maximizes the change in the model's output distribution. This is a discrete search: the algorithm evaluates candidate names and picks the one that most effectively shifts the prediction.
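A minimal version of that discrete search, scoring each candidate name by how much it lowers the probability of the true label. Again, a toy softmax-style model stands in for the real one; all names are hypothetical.

```python
def predict_proba(var_names):
    """Toy probability model: a label's weight grows with the number of
    variable names containing it as a substring."""
    labels = ["sort", "contains"]
    raw = {lab: 1 + sum(lab in v for v in var_names) for lab in labels}
    total = sum(raw.values())
    return {lab: raw[lab] / total for lab in labels}

def best_replacement(var_names, index, vocab, true_label):
    """Pick the vocabulary name that minimizes P(true_label) when
    substituted at position `index` (the discrete search step)."""
    def p_true(candidate):
        trial = var_names[:index] + [candidate] + var_names[index + 1:]
        return predict_proba(trial)[true_label]
    return min(vocab, key=p_true)

choice = best_replacement(["sorted", "idx"], 0,
                          ["containsFlag", "tmp", "count"], "sort")
```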

The combination of gradient-based variable selection and targeted renaming makes DAMP both efficient (typically only 2-4 renamings needed) and effective (high attack success rate across different model architectures).

Results

DAMP was evaluated on code2vec and other neural models of code across multiple tasks. The results show that current models are highly vulnerable to adversarial variable renaming.

- Attack success rate (untargeted): 94%
- Attack success rate (targeted): 89%
- Variable renames needed: 2-4
- Robustness gain with adversarial training: +14%

Importantly, the paper also proposes a defense: adversarial training. By augmenting the training data with adversarial examples, the model learns to rely more on structural features and less on variable names. This significantly improves robustness while maintaining accuracy on clean examples.

Adversarial training reduces the attack success rate from 94% to around 60%, and the adversarially trained model maintains nearly the same accuracy on unperturbed code. This suggests that models can learn to look beyond variable names when properly trained.
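The defense can be sketched as dataset augmentation: run the attack on a fraction of the training programs and train on the renamed copies with their original labels. Here `attack` is a placeholder for DAMP, and the training loop itself is omitted.

```python
import random

def adversarially_augment(dataset, attack, ratio=0.5, seed=0):
    """Augment (program, label) pairs with attacked copies. Because the
    attack only renames variables, the original label stays valid."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for program, label in dataset:
        if rng.random() < ratio:
            augmented.append((attack(program), label))
    return augmented

# Illustrative use with a trivial "attack" that renames one identifier.
data = [("int sorted = 0;", "sort")]
bigger = adversarially_augment(
    data, lambda p: p.replace("sorted", "containsFlag"), ratio=1.0)
```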

@article{yefet2020adversarial,
  title     = {Adversarial Examples for Models of Code},
  author    = {Yefet, Noam and Alon, Uri and Yahav, Eran},
  journal   = {Proceedings of the ACM on Programming Languages},
  volume    = {4},
  number    = {OOPSLA},
  pages     = {1--30},
  year      = {2020},
  publisher = {ACM}
}