TL;DR: Building web crawlers manually for each site is tedious. This paper synthesizes crawlers automatically by using correctly extracted data from one website to supervise extraction from similar websites — “cross-supervision” between sites of the same domain.
The Problem
The web is full of sites that contain the same type of structured information — recipe sites listing ingredients and cooking times, job boards with titles and salaries, e-commerce pages with product names and prices. Yet every site wraps that data in different HTML structures, CSS classes, and layouts.
If you want to extract structured data from one of these sites, you write a custom scraper: carefully inspect the DOM, craft CSS selectors or XPath expressions, and test against edge cases. That is manageable for one site. But when you need to cover 100 similar sites — 100 recipe portals, 100 job boards — writing 100 hand-crafted scrapers is impractical, fragile, and expensive to maintain.
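To make the cost concrete, here is what one of those hand-crafted scrapers looks like. This is a minimal sketch using only the standard library; the page markup and the `recipe-title` class name are invented for illustration — a second site with different markup would need a second scraper written from scratch.

```python
from html.parser import HTMLParser

class RecipeTitleScraper(HTMLParser):
    """Hand-crafted scraper for one hypothetical recipe site.
    The h1/"recipe-title" pattern is specific to this site's markup."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and ("class", "recipe-title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data.strip()

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_title = False

# Invented sample page for this one site.
page = '<html><body><h1 class="recipe-title">Lemon Tart</h1></body></html>'
scraper = RecipeTitleScraper()
scraper.feed(page)
print(scraper.title)  # -> Lemon Tart
```

Every selector choice here (`h1`, the class name) is an assumption about one site's DOM — exactly the kind of per-site effort cross-supervision aims to eliminate.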
The Key Idea
Suppose you already have a correctly extracted dataset from one recipe site (Site A) — you know which text corresponds to the recipe name, ingredients, and cooking time. Can you use that knowledge to automatically build an extractor for a completely different recipe site (Site B)?
That is exactly what cross-supervision does. The known data from Site A acts as a training signal: the system searches for HTML patterns in Site B that produce output matching the same kind of data it already knows about. By aligning the values across sites — not the HTML structure — it synthesizes extraction rules for the new site without any human labeling.
The key insight is that even though two websites look completely different structurally, the data they contain often overlaps or shares the same semantic types. A recipe name on Site A might also appear on Site B (or at least follow the same textual patterns), allowing the system to bootstrap extraction on the new site.
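The value-alignment idea can be sketched in a few lines. Everything here is a toy illustration under invented data: Site A's known field values are matched against Site B's text nodes (flattened to hypothetical XPath/text pairs), and each match becomes a labeled location on the new site.

```python
# Known, correctly extracted values from Site A (invented examples).
site_a_labels = {"name": "Lemon Tart", "cook_time": "45 min"}

# Site B's page flattened to (xpath, text) pairs -- assumed preprocessing.
site_b_nodes = [
    ("/html/body/div[1]/h2", "Lemon Tart"),
    ("/html/body/div[2]/span", "45 min"),
    ("/html/body/div[3]/p", "A zesty dessert."),
]

def align(labels, nodes):
    """Map each known field to the Site B node whose text matches its value."""
    return {field: path
            for field, value in labels.items()
            for path, text in nodes
            if text == value}

matches = align(site_a_labels, site_b_nodes)
print(matches)
# -> {'name': '/html/body/div[1]/h2', 'cook_time': '/html/body/div[2]/span'}
```

Exact string equality is the simplest possible matcher; a real system would also use fuzzy matching and shared textual patterns, since values rarely overlap verbatim across sites.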
How It Works
Cross-Site Feature Alignment
The system begins with a set of pages from Site A where the relevant fields (e.g., recipe name, ingredients, cooking time) are already identified. It extracts features from these fields — not just CSS selectors, but semantic properties like text patterns, DOM depth, sibling structure, and value distributions.
When presented with pages from Site B, it searches for HTML elements whose features align with those learned from Site A. For example, if recipe names on Site A were always inside an h1 near the top of a specific container, the system looks for similar structural and semantic patterns on Site B — even if the exact tags and class names are different.
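The alignment step can be sketched as scoring candidates against a learned feature profile. The feature set and the unweighted similarity below are assumptions for illustration, not the paper's exact formulation:

```python
def features(tag, depth, position):
    """Simple structural description of a DOM element (toy feature set)."""
    return {"tag": tag, "depth": depth, "position": position}

# Profile of the "recipe name" field learned from Site A's labeled pages.
site_a_profile = features("h1", depth=3, position=0)

def similarity(profile, candidate):
    """Count agreeing features; a real system would weight them."""
    return sum(profile[k] == candidate[k] for k in profile)

# Candidate elements found on Site B (invented).
site_b_candidates = [
    features("h1", depth=4, position=0),    # likely the recipe name
    features("span", depth=6, position=2),  # probably not
]

best = max(site_b_candidates, key=lambda c: similarity(site_a_profile, c))
print(best["tag"])  # -> h1
```

Note that the best candidate need not match the profile exactly — here the depth differs — which is the point: alignment tolerates structural variation between sites.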
Rule Synthesis
Once candidate elements on Site B are identified through feature alignment, the system synthesizes extraction rules — concrete XPath or CSS-like expressions that reliably select the correct elements. It uses the overlapping data between sites as a verification signal: if an extracted value from Site B matches known data from the domain (or structurally resembles it), the rule is likely correct.
The synthesis process iterates, refining rules until they consistently extract the right fields across multiple pages from Site B. The result is a complete, site-specific extractor that required zero manual labeling of Site B.
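The synthesize-and-verify loop can be illustrated with a toy example. The candidate rules, sample pages, and known values below are all invented; the shape of the loop — propose selectors, keep only those whose extractions match known data on every page — is the part that mirrors the description above.

```python
# Each sample page is simulated as a mapping from XPath rule to the
# value that rule would extract on that page (invented data).
sample_pages = [
    {"//h1[@class='title']": "Lemon Tart", "//h1": "Lemon Tart"},
    {"//h1[@class='title']": "Beef Stew",  "//h1": "Site Banner"},
]
known_names = {"Lemon Tart", "Beef Stew"}  # verification signal

def verify(rule, pages, known):
    """A rule survives if its extraction is a known value on every page."""
    return all(page.get(rule) in known for page in pages)

candidates = ["//h1", "//h1[@class='title']"]
surviving = [r for r in candidates if verify(r, sample_pages, known_names)]
print(surviving)  # -> ["//h1[@class='title']"]
```

The overly general `//h1` rule fails on the second page (it grabs the site banner), so only the more specific selector survives — the verification signal does the work a human labeler would otherwise do.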
Results
The cross-supervision approach automatically creates extractors for new websites with minimal human effort. In experiments, the method achieves high precision and recall across multiple web domains, producing extractors that rival hand-crafted ones — while eliminating the need to manually inspect and label each target site.
- Tested across multiple domains (recipes, job listings, products) with structurally diverse websites.
- Achieves high extraction accuracy by leveraging cross-site data overlap rather than site-specific labels.
- Scales to many target sites from a single labeled source — one correct extraction can bootstrap dozens of new extractors.
- Synthesized rules are interpretable and editable, unlike black-box ML approaches.