Synthesis of Forgiving Data Extractors

TL;DR: Web data extraction is brittle — small HTML changes break scrapers. “Forgiving” extractors use multiple redundant strategies to locate data, so if one approach breaks (e.g., the CSS class changes), another still works.

The Problem

Traditional web scrapers rely on a single extraction strategy — typically an XPath expression or CSS selector — to pull data from a page. This works fine until the website updates its layout: a class name gets renamed, an element is wrapped in a new container, or the ordering of sibling elements shifts. A single HTML change can break extraction entirely.
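To make the failure mode concrete, here is a minimal sketch of a rigid extractor. The product snippet, the class names, and the function `rigid_price` are all hypothetical, and the standard library's limited XPath support stands in for a real CSS selector engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical product snippet before and after a redesign renames the class.
BEFORE = '<div><h1>Widget</h1><span class="price">$19.99</span></div>'
AFTER = '<div><h1>Widget</h1><span class="product-cost">$19.99</span></div>'


def rigid_price(html):
    """Single-strategy extractor: depends entirely on class="price"."""
    root = ET.fromstring(html)
    node = root.find(".//span[@class='price']")
    return node.text if node is not None else None


print(rigid_price(BEFORE))  # $19.99
print(rigid_price(AFTER))   # None
```

The visible page is identical in both versions, yet the extractor silently returns nothing after the rename.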

This is not a rare event. Websites change constantly — redesigns, A/B tests, new features, framework migrations. Each change threatens to silently break every scraper that depends on the old structure. Maintaining scrapers is an arms race against moving HTML.

The Key Idea

Instead of relying on a single fragile selector, we synthesize multiple extraction strategies for each data field. Each strategy locates the target data through a different structural signal:

CSS selector — matches by element type and class name
Text pattern — matches by the format or surrounding text of the value
Positional — matches by the element's position in the DOM tree
Label-based — matches by proximity to a descriptive label

These strategies are then combined with a voting mechanism. When one strategy breaks due to an HTML change, the others compensate. As long as a majority of strategies still agree, the extractor returns the correct value — hence the name forgiving.

HTML Page → CSS Selector / Text Pattern / Positional / Label-based (run in parallel) → Voting / Agreement → Extracted Value

Multiple strategies independently extract a candidate value; a voting layer resolves disagreements.

How It Works

Multi-Strategy Synthesis

Given a set of example pages where the target data is labeled, the system automatically synthesizes multiple extraction programs for each field. Each program uses a different feature of the HTML to locate the data:

Structural features — the tag hierarchy and CSS classes surrounding the target
Content features — regular-expression patterns that match the value's format (e.g., \$\d+\.\d{2} for prices; the dollar sign is escaped because a bare $ is an end-of-line anchor in regex syntax)
Relational features — the position of the target relative to landmark elements like labels or headings

Because the strategies draw on orthogonal signals, they tend to fail independently: a single HTML change is unlikely to break all of them at once.
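The three feature families can be illustrated with hand-written strategies for a price field. This is only a sketch: in the paper the programs are synthesized from labeled examples, whereas here they are written by hand, and the page snippet and the function names (`by_structure`, `by_content`, `by_label`, `run_all`) are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product snippet (an assumption for illustration).
PAGE = (
    '<div class="product">'
    '<h1 class="name">Widget</h1>'
    '<p><b>Price:</b> <span class="price">$19.99</span></p>'
    '<p><b>Rating:</b> <span class="rating">4.5</span></p>'
    '</div>'
)

PRICE_RE = re.compile(r"\$\d+\.\d{2}")


def by_structure(root):
    """Structural feature: tag name plus CSS class."""
    node = root.find(".//span[@class='price']")
    return node.text if node is not None else None


def by_content(root):
    """Content feature: first text fragment that looks like a price."""
    for text in root.itertext():
        match = PRICE_RE.search(text)
        if match:
            return match.group(0)
    return None


def by_label(root):
    """Relational feature: the sibling right after a 'Price:' label."""
    for parent in root.iter():
        children = list(parent)
        for label, value in zip(children, children[1:]):
            if (label.text or "").strip() == "Price:":
                return value.text
    return None


STRATEGIES = [by_structure, by_content, by_label]


def run_all(html):
    """Run every strategy on the page; None marks a strategy that found nothing."""
    root = ET.fromstring(html)
    return [strategy(root) for strategy in STRATEGIES]


renamed = PAGE.replace('class="price"', 'class="cost"')
print(run_all(PAGE))     # ['$19.99', '$19.99', '$19.99']
print(run_all(renamed))  # [None, '$19.99', '$19.99']
```

Renaming the class breaks only the structural strategy. A different mutation, such as wrapping the price span in a new element, would instead break the label-based strategy while leaving the other two intact.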

Voting-Based Combination

At extraction time, every strategy runs independently and produces a candidate value. A voting mechanism compares these candidates:

If a majority agree, that value is returned with high confidence.
If there is no majority, the system can flag the result for manual review or fall back to a confidence-weighted selection.
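A minimal sketch of such a voting step follows. The function name `vote`, the status strings, and the choice to count failed strategies (represented as `None`) toward the majority threshold are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def vote(candidates):
    """Return (value, status) for a list of per-strategy candidates.

    None means the strategy found nothing. A value wins outright only if
    a strict majority of all strategies, failed ones included, returned it.
    """
    votes = Counter(c for c in candidates if c is not None)
    if not votes:
        return None, "no_candidates"
    value, count = votes.most_common(1)[0]
    if count * 2 > len(candidates):
        return value, "majority"
    # No majority: hand back the most common value, flagged for review.
    return value, "needs_review"


print(vote(["$19.99", None, "$19.99"]))   # ('$19.99', 'majority')
print(vote(["$19.99", "$4.50", None]))    # ('$19.99', 'needs_review')
print(vote([None, None, None]))           # (None, 'no_candidates')
```

Counting failed strategies in the denominator is the conservative choice: a lone surviving strategy never produces a "majority" result by itself. A confidence-weighted fallback would replace the plain count with per-strategy weights.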

This is analogous to ensemble methods in machine learning, where combining multiple weak learners produces a robust predictor. Here, each "weak learner" is an extraction strategy that may fail under certain mutations but succeeds under others.
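The ensemble intuition can be checked with a little arithmetic. The sketch below assumes an idealized model that the source does not state: each of n strategies breaks independently with the same probability p, and every unbroken strategy returns the correct value. The figure p = 0.2 is made up for illustration and is not a measured result.

```python
from math import comb


def majority_failure_probability(n, p):
    """Probability that the working strategies no longer form a strict majority.

    Under the independence assumption the number of broken strategies is
    Binomial(n, p); the vote fails once at least ceil(n / 2) of them break.
    """
    threshold = (n + 1) // 2  # ceil(n / 2)
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


for n in (1, 3, 5):
    print(n, round(majority_failure_probability(n, 0.2), 5))
# 1 0.2
# 3 0.104
# 5 0.05792
```

Under these assumptions, going from one strategy to three roughly halves the failure probability, and five strategies lower it further. Real strategies are rarely fully independent, so the actual gain is likely smaller than this model suggests.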

Results

The forgiving extractor approach significantly outperforms traditional single-strategy extractors. In experiments on real-world websites that underwent layout changes, forgiving extractors maintained correct extraction in cases where rigid, single-selector extractors failed completely. The redundancy of multiple strategies provides graceful degradation: when some strategies break, the remaining ones can still agree on the correct output.

Key findings from the evaluation:

Forgiving extractors tolerate a wide range of common HTML mutations — class renames, element reordering, structural wrapping, and attribute changes.
The synthesis algorithm is efficient enough to generate multi-strategy extractors from a small number of labeled examples.
The voting mechanism adds negligible overhead at extraction time compared to running a single selector.

@inproceedings{omari2017synthesis,
  title     = {Synthesis of Forgiving Data Extractors},
  author    = {Omari, Adi and Shoham, Sharon and Yahav, Eran},
  booktitle = {Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM)},
  year      = {2017},
  doi       = {10.1145/3018661.3018698}
}