contributed articles
DOI: 10.1145/1897816.1897839
Google’s WebTables and Deep Web Crawler
identify and deliver this otherwise inaccessible
resource directly to end users.
By Michael J. Cafarella, Alon Halevy, and Jayant Madhavan
Structured Data on the Web
developed at Google over the past
five years. The first, WebTables, compiles a huge collection of databases
by crawling the Web to find small relational databases expressed using
the HTML table tag. By performing
data mining on the resulting extracted information, WebTables is able to
introduce new data-centric applications (such as schema completion
and synonym finding). The second,
the Google Deep Web Crawler, attempts to surface information from
the Deep Web, meaning data on
the Web that is available only by filling out
Web forms and so cannot be crawled by
traditional crawlers. We describe how
this data is crawled by automatically
submitting relevant queries to a vast
number of Web forms. The two projects are just the first steps toward exposing and managing structured Web
data largely ignored by Web search
engines.
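
As a rough illustration of the first step described above (finding small relational databases expressed with the HTML table tag), the following Python sketch pulls the top-level tables out of a page and keeps only those that look relational. It is not the WebTables implementation: the class `TableExtractor`, the helpers `looks_relational` and `extract_relational_tables`, and the filtering rule (at least two rows, at least two columns, every row the same width) are assumptions made for this example, and a production system would need a far more careful filter.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each top-level <table> as a list of rows (lists of cell strings).

    Hypothetical helper for illustration; nested tables are ignored.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished top-level tables
        self._depth = 0    # nesting depth of <table> tags
        self._rows = None  # rows of the table being read
        self._row = None   # cells of the row being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self._rows = []
        elif self._depth == 1:
            if tag == "tr":
                self._row = []
            elif tag in ("td", "th") and self._row is not None:
                self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            if self._depth == 1:
                self.tables.append(self._rows)
                self._rows = None
            self._depth -= 1
        elif self._depth == 1:
            if tag in ("td", "th") and self._cell is not None:
                # Join the text fragments and collapse runs of whitespace.
                self._row.append(" ".join("".join(self._cell).split()))
                self._cell = None
            elif tag == "tr" and self._row is not None:
                self._rows.append(self._row)
                self._row = None

    def handle_data(self, data):
        if self._depth == 1 and self._cell is not None:
            self._cell.append(data)


def looks_relational(rows):
    """Assumed heuristic: >= 2 rows, >= 2 columns, all rows the same width."""
    if len(rows) < 2:
        return False
    width = len(rows[0])
    return width >= 2 and all(len(row) == width for row in rows)


def extract_relational_tables(html):
    """Return the top-level tables in `html` that pass the heuristic."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return [table for table in parser.tables if looks_relational(table)]
```

Calling `extract_relational_tables` on a page's HTML returns each surviving table as a list of rows, with the first row usually serving as the attribute names. The width check is what discards single-cell tables used purely for page layout.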
Though the Web is best known as a vast repository
of shared documents, it also contains a significant
amount of structured data covering a complete range
of topics, from product to financial, public-record,
scientific, hobby-related, and government. Structured
data on the Web shares many similarities with the
kind of data traditionally managed by commercial
database systems but also reflects some unusual
characteristics of its own; for example, it is embedded
in textual Web pages and must be extracted prior to
use; there is no centralized data design as there is in
a traditional database; and, unlike traditional
databases that focus on a single domain, it covers
everything. Existing data-management systems do
not address these challenges and assume their data
is modeled within a well-defined domain.
This article discusses the nature of Web-embedded
structured data and the challenges of managing it. To
begin, we present two relevant research projects
Web Data
Structured data on the Web exists in
several forms, including HTML tables, HTML lists, and back-end Deep
Web databases (such as the books
sold on Amazon.com). We estimate
in excess of one billion data sets as of
February 2011. More than 150 million
sources come from a subset of all English-language HTML tables,4,5 while
Elmeleegy et al.11 suggested an equal
number from HTML lists, a total that
does not account for the non-English
Web. Finally, our experience at Google
Key Insights
• Because data on the Web is about
everything, any approach that attempts
to leverage it cannot rely on building a
model of the data ahead of time but must rely on
domain-independent methods instead.
• The sheer quantity and heterogeneity of
structured data on the Web enables new
approaches to problems involving data
integration from multiple sources.
• While the content of structured data is
typically different from what is found in
text on the Web, each content collection
can be leveraged to better understand
other collections.