WSDM 2017

Synthesis of Forgiving Data Extractors

Adi Omari, Sharon Shoham, Eran Yahav

TL;DR: Web data extraction is brittle — small HTML changes break scrapers. “Forgiving” extractors use multiple redundant strategies to locate data, so if one approach breaks (e.g., the CSS class changes), another still works.

The Problem

Traditional web scrapers rely on a single extraction strategy — typically an XPath expression or CSS selector — to pull data from a page. This works fine until the website updates its layout: a class name gets renamed, an element is wrapped in a new container, or the ordering of sibling elements shifts. A single HTML change can break extraction entirely.

This is not a rare event. Websites change constantly — redesigns, A/B tests, new features, framework migrations. Each change threatens to silently break every scraper that depends on the old structure. Maintaining scrapers is an arms race against moving HTML.

The Key Idea

Instead of relying on a single fragile selector, we synthesize multiple extraction strategies for each data field. Each strategy locates the target data through a different structural signal, such as a CSS selector, a text pattern, the element's position, or a nearby label.

These strategies are then combined with a voting mechanism. When one strategy breaks due to an HTML change, the others compensate. As long as a majority of strategies still agree, the extractor returns the correct value — hence the name forgiving.

Figure: an HTML page feeds four strategies (CSS Selector, Text Pattern, Positional, Label-based), whose outputs pass through Voting / Agreement to yield the Extracted Value. Multiple strategies independently extract a candidate value; a voting layer resolves disagreements.

How It Works

Multi-Strategy Synthesis

Given a set of example pages where the target data is labeled, the system automatically synthesizes multiple extraction programs for each field, each using a different feature of the HTML to locate the data.

Because the strategies draw on orthogonal signals, their failure modes are largely independent: a single HTML change is unlikely to break all of them simultaneously.
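To make the idea of orthogonal signals concrete, here is a minimal sketch with three hand-written strategies for a price field. It is an illustration only: the paper's synthesizer generates such programs from labeled examples, whereas the function names, the sample HTML, and the use of regular expressions in place of a real HTML parser are all assumptions made for brevity.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical product snippet and a mutated copy in which the CSS class is renamed.
ORIGINAL = (
    '<div class="product"><h1 class="name">Acme Kettle</h1>'
    '<span>Price:</span> <span class="price">$24.99</span></div>'
)
MUTATED = ORIGINAL.replace('class="price"', 'class="amount"')


def by_css_class(html: str) -> Optional[str]:
    """Selector-like strategy: text of the element whose class is 'price'."""
    m = re.search(r'class="price"[^>]*>([^<]+)<', html)
    return m.group(1).strip() if m else None


def by_text_pattern(html: str) -> Optional[str]:
    """Text-pattern strategy: first substring that looks like a dollar amount."""
    m = re.search(r'\$\d+(?:\.\d{2})?', html)
    return m.group(0) if m else None


def by_label(html: str) -> Optional[str]:
    """Label-based strategy: first text that follows the 'Price:' label."""
    m = re.search(r'Price:\s*(?:<[^>]+>\s*)*([^<\s][^<]*)', html)
    return m.group(1).strip() if m else None


# Each strategy is an independent function from page HTML to a candidate value.
STRATEGIES: Dict[str, Callable[[str], Optional[str]]] = {
    "css_class": by_css_class,
    "text_pattern": by_text_pattern,
    "label": by_label,
}


def run_all(html: str) -> Dict[str, Optional[str]]:
    """Run every strategy and collect its candidate (None means it failed)."""
    return {name: strategy(html) for name, strategy in STRATEGIES.items()}


if __name__ == "__main__":
    print(run_all(ORIGINAL))
    print(run_all(MUTATED))
```

On the mutated page only the class-based strategy fails; the text-pattern and label-based strategies do not depend on the class name, so they still find the price.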

Voting-Based Combination

At extraction time, every strategy runs independently and produces a candidate value. A voting mechanism then compares these candidates and returns the value that a majority of strategies agree on.

This is analogous to ensemble methods in machine learning, where combining multiple weak learners produces a robust predictor. Here, each "weak learner" is an extraction strategy that may fail under certain mutations but succeeds under others.
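The combination step can be sketched as a simple vote over candidate values. This page does not spell out the exact voting rule, so the strict-majority rule below (a value wins only if more than half of all strategies return it) is an assumption, and `majority_vote` is a hypothetical name.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(candidates: Sequence[Optional[str]]) -> Optional[str]:
    """Return the value proposed by a strict majority of strategies, else None.

    A candidate of None means the strategy failed to extract anything.
    Failed strategies still count toward the total, so a lone surviving
    strategy cannot win by default (an assumed, conservative rule).
    """
    votes = Counter(c for c in candidates if c is not None)
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    return value if count * 2 > len(candidates) else None


if __name__ == "__main__":
    # One of three strategies broke; the other two still agree.
    print(majority_vote(["$24.99", None, "$24.99"]))
    # No value reaches a majority, so the extractor abstains.
    print(majority_vote(["$24.99", "$19.99", None]))
```

Abstaining when no majority exists trades recall for precision; a more permissive rule (for example, plurality among the strategies that succeeded) would keep returning values after more strategies break, at the cost of more wrong answers.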

Results

The forgiving extractor approach significantly outperforms traditional single-strategy extractors. In experiments on real-world websites that underwent layout changes, forgiving extractors maintained correct extraction in cases where rigid, single-selector extractors failed completely. The redundancy of multiple strategies provides graceful degradation: even when several strategies break, the remaining ones can still produce the correct output.

Key findings from the evaluation:

@inproceedings{omari2017synthesis,
  title     = {Synthesis of Forgiving Data Extractors},
  author    = {Omari, Adi and Shoham, Sharon and Yahav, Eran},
  booktitle = {Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM)},
  year      = {2017},
  doi       = {10.1145/3018661.3018698}
}