KDD 2016

Lossless Separation of Web Pages into Layout Code and Data

Adi Omari, Benny Kimelfeld, Sharon Shoham, Eran Yahav

TL;DR: A web page is a mix of layout/template code and actual data content. This paper automatically separates the two by comparing multiple pages from the same site — the template is what stays constant, the data is what changes.

The Problem

Web pages embed data within HTML templates. A product page on an e-commerce site, for example, wraps product names, prices, and descriptions inside a fixed HTML structure with headers, footers, navigation bars, and styling markup. To extract structured data at scale, you need to separate the reusable template from the varying content — but doing this manually doesn't scale. With millions of websites, each with its own template structure, we need an automatic method.

Existing web extraction approaches often rely on hand-crafted wrappers or fragile heuristics. What if we could automatically and losslessly decompose any web page into its template and data components?

The Key Idea

Given multiple pages from the same template (e.g., product pages from the same e-commerce site), the method aligns their DOM trees and identifies which nodes are constant (template) versus varying (data). The separation is "lossless" — you can reconstruct any original page from the template plus the extracted data, with no information lost.

The intuition is simple: if you look at three product pages from the same site, the shared HTML skeleton is the template, and the parts that differ — product name, price, image, description — are the data. The algorithm formalizes this intuition through DOM tree alignment and a classification of each node as either constant or variant.

Example

Template-Data Separation

Three product pages share the same HTML template but contain different data. Aligning their DOMs classifies each node as template or data.

Page 1
<div class="product">
<img src="shoe1.jpg">
<h2>Running Shoe</h2>
<span class="price">$89</span>
<p>Lightweight running shoe.</p>
<a class="btn">Add to Cart</a>
</div>
Page 2
<div class="product">
<img src="boot1.jpg">
<h2>Hiking Boot</h2>
<span class="price">$129</span>
<p>Waterproof hiking boot.</p>
<a class="btn">Add to Cart</a>
</div>
Page 3
<div class="product">
<img src="sandal1.jpg">
<h2>Beach Sandal</h2>
<span class="price">$45</span>
<p>Comfortable beach sandal.</p>
<a class="btn">Add to Cart</a>
</div>
Separating these pages yields an extracted template with data holes and an extracted data table with one row per page and the columns Image, Title, Price, and Description. Any row, combined with the template, reconstructs its page.

How It Works

DOM Tree Alignment

The first step is to align the DOM trees of multiple pages generated from the same template. The algorithm treats each web page as a rooted, ordered tree of DOM nodes. Given two or more such trees, it computes an alignment that matches corresponding nodes across pages. Nodes that appear in all pages at the same structural position are candidate template nodes; nodes whose content differs are candidate data nodes.

The alignment algorithm extends classical tree-edit-distance techniques. It finds the minimum-cost mapping between nodes of different pages, where the cost reflects both structural position and textual content. Nodes with identical tag names, attributes, and text that appear in matching positions across all pages are aligned together.
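To make the alignment step concrete, here is a minimal sketch in Python. It is not the paper's algorithm: instead of a tree-edit-distance computation, it walks two trees top-down and pairs children with a longest-common-subsequence match on tag names, using the standard library's `difflib`. The names `Node`, `TreeBuilder`, `parse`, and `align` are hypothetical, and the two pages are the first two demo pages with inter-tag whitespace dropped.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # elements with no end tag


@dataclass
class Node:
    tag: str                    # element name, or "#text" for text nodes
    attrs: tuple = ()           # (name, value) pairs sorted by name
    text: str = ""              # only set on "#text" nodes
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a small ordered tree from well-formed HTML (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, tuple(sorted(attrs, key=lambda kv: kv[0])))
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", text=data.strip()))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def align(a, b, path="", out=None):
    """Pair up nodes of two trees; label each pair constant or variant."""
    if out is None:
        out = []
    same = a.attrs == b.attrs and a.text == b.text
    out.append((path or "/", "constant" if same else "variant"))
    # Match children by tag name (LCS-style), then recurse into matched pairs.
    matcher = SequenceMatcher(
        None, [c.tag for c in a.children], [c.tag for c in b.children],
        autojunk=False,
    )
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            child_a = a.children[block.a + k]
            child_b = b.children[block.b + k]
            align(child_a, child_b, f"{path}/{child_a.tag}[{block.a + k}]", out)
    return out


PAGE_1 = ('<div class="product"><img src="shoe1.jpg"><h2>Running Shoe</h2>'
          '<span class="price">$89</span><p>Lightweight running shoe.</p>'
          '<a class="btn">Add to Cart</a></div>')
PAGE_2 = ('<div class="product"><img src="boot1.jpg"><h2>Hiking Boot</h2>'
          '<span class="price">$129</span><p>Waterproof hiking boot.</p>'
          '<a class="btn">Add to Cart</a></div>')

result = dict(align(parse(PAGE_1), parse(PAGE_2)))
for node_path, label in result.items():
    print(f"{label:9} {node_path}")
```

Matching on tag names alone is a deliberate simplification: it is enough for pages that share one skeleton, but a cost-based alignment is needed once pages contain optional sections or repeated items.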

Constant vs Variant Classification

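Once nodes are aligned, the method classifies each position in the template as either constant or variant. A position is constant when its aligned nodes are identical across all input pages, and variant when they differ in at least one page.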

The template is the tree with constant nodes intact and variant positions replaced by "holes." The data is a table where each row corresponds to one input page and each column corresponds to one hole in the template.
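The following Python sketch illustrates the template-with-holes and data-table idea on the three demo pages. It assumes a much simpler setting than the paper's: the pages are split into a flat stream of tag and text tokens, and the streams are assumed to line up token for token, so no real alignment is performed. The names `tokenize` and `separate` are hypothetical, and inter-tag whitespace is dropped from the pages.

```python
import re

PAGES = [
    '<div class="product"><img src="shoe1.jpg"><h2>Running Shoe</h2>'
    '<span class="price">$89</span><p>Lightweight running shoe.</p>'
    '<a class="btn">Add to Cart</a></div>',
    '<div class="product"><img src="boot1.jpg"><h2>Hiking Boot</h2>'
    '<span class="price">$129</span><p>Waterproof hiking boot.</p>'
    '<a class="btn">Add to Cart</a></div>',
    '<div class="product"><img src="sandal1.jpg"><h2>Beach Sandal</h2>'
    '<span class="price">$45</span><p>Comfortable beach sandal.</p>'
    '<a class="btn">Add to Cart</a></div>',
]

# A token is either a whole tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(page):
    return TOKEN.findall(page)


def separate(pages):
    """Return (template, rows): holes are None, one data row per page."""
    streams = [tokenize(page) for page in pages]
    if len({len(stream) for stream in streams}) != 1:
        raise ValueError("sketch assumes the pages align token for token")
    template = []
    rows = [[] for _ in pages]
    for column in zip(*streams):
        if len(set(column)) == 1:
            template.append(column[0])      # constant: same in every page
        else:
            template.append(None)           # variant: becomes a hole
            for row, value in zip(rows, column):
                row.append(value)
    return template, rows


template, rows = separate(PAGES)
print(template)
for row in rows:
    print(row)
```

In this sketch the whole `<img>` tag becomes one hole because tokens are compared as units; a finer-grained version would treat the `src` attribute value alone as the variant part.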

Lossless Reconstruction

The separation is lossless: given the extracted template and any row from the data table, you can reconstruct the original page exactly. No information is lost in the separation process. This property is what distinguishes this approach from lossy extraction methods that may drop structural details or fail to capture all data fields.

Formally, for any page p generated from the template, there exists a data record d such that Template(d) = p. The template acts as a function from data records to pages, and the separation inverts this function.
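The property Template(d) = p can be checked directly on a small case. The Python sketch below hard-codes a template with four holes and one data record for the third demo page, fills the holes in order, and compares the result with the original page. The representation (a list with `None` marking holes) and the names `HOLE`, `TEMPLATE`, `RECORD`, and `reconstruct` are assumptions made for illustration, not the paper's data structures.

```python
HOLE = None  # marker for a variant position in the template

# Constant fragments with holes where the pages differ.
TEMPLATE = [
    '<div class="product">', HOLE,
    '<h2>', HOLE, '</h2>',
    '<span class="price">', HOLE, '</span>',
    '<p>', HOLE, '</p>',
    '<a class="btn">Add to Cart</a>',
    '</div>',
]

# One row of the data table: image, title, price, description.
RECORD = ['<img src="sandal1.jpg">', 'Beach Sandal', '$45',
          'Comfortable beach sandal.']

ORIGINAL = ('<div class="product"><img src="sandal1.jpg"><h2>Beach Sandal</h2>'
            '<span class="price">$45</span><p>Comfortable beach sandal.</p>'
            '<a class="btn">Add to Cart</a></div>')


def reconstruct(template, record):
    """Apply the template to a record by filling holes left to right."""
    holes = sum(part is HOLE for part in template)
    if holes != len(record):
        raise ValueError(f"template has {holes} holes, record has {len(record)} values")
    values = iter(record)
    return "".join(next(values) if part is HOLE else part for part in template)


page = reconstruct(TEMPLATE, RECORD)
print(page == ORIGINAL)
```

The equality check is byte for byte, which is what "lossless" requires: any fragment of the page that is missing from both the template and the record would make the comparison fail.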

Results

The method accurately separates template from data on real-world websites, achieving high precision and recall across a diverse set of web domains. It handles complex template structures including nested loops, optional sections, and pages with varying numbers of data items. The lossless guarantee ensures no information is discarded, making it suitable for downstream tasks like knowledge base construction and data integration.

@inproceedings{omari2016lossless,
  title={Lossless Separation of Web Pages into Layout Code and Data},
  author={Omari, Adi and Kimelfeld, Benny and Shoham, Sharon and Yahav, Eran},
  booktitle={Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)},
  year={2016}
}