example, if a year-founded string
is in the first row, there is nothing to
prevent the string macchiatone from
appearing beneath it. Any useful application making use of Web data
must also be able to address uncertain data design and quality.
Vast number of topics. A traditional database typically focuses on
a particular domain (such as products or proteins) and therefore can
be modeled in a coherent schema. On
the Web, data covers everything, which
is also one of its appeals. The breadth
and cultural variations of data on the
Web make it inconceivable that any
manual effort would be able to create
a clean model of all of it.
Figure 1. Typical use of the table tag to describe relational data that has structure never
explicitly declared by the author, including metadata consisting of several typed and labeled
columns, but that is obvious to human observers. The navigation bars at the top of the page
are also implemented through the table tag but do not contain relational-style data.
Figure 2. Results of a keyword query search for “city population,” returning a relevance-ranked list of databases. The top result contains a row for each of the 125 most populous
cities and columns for “City/Urban Area,” “Country,” “Population,” and “Rank” (by population
among all the cities in the world). The system automatically generated the image at right,
showing the result of clicking on the “Paris” row. The title (“City Mayors…”) links to the page
where the original HTML table is located.
Before addressing the challenges
associated with accessing structured
data on the Web, it is important to ask
what users might do with such data.
Our work is inspired by the following
example benefits:
Improve Web search. Structured
Web data can help improve Web
search in a number of ways; for example, Deep Web databases are not
generally available to search engines,
and, by surfacing this data, a Deep
Web exploration tool can expand the
scope and quality of the Web-search
index. Moreover, the layout structure
can be used as a relevance signal to
the search ranker; for example, an
HTML table-embedded database with
a column calories and a row latte
should be ranked fairly high in response to the user query latte calories. Traditionally, search engines
use the proximity of terms on a page
as a signal of relatedness; in this case,
the two terms are highly related, even
though they may be distant from each
other on the page.
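The layout signal described above can be illustrated with a minimal sketch. It assumes a table has already been extracted into a header list plus rows whose first cell serves as the row label. That representation, the function name structure_match, and the sample values are all hypothetical and are not taken from the system discussed in this article.

```python
from typing import List

# Toy table; the calorie values are invented placeholders, not real data.
TABLE_HEADER = ["drink", "calories"]
TABLE_ROWS = [["espresso", "5"], ["latte", "190"]]


def structure_match(query: str, header: List[str], rows: List[List[str]]) -> bool:
    """Return True if one query term names a column and a different term names a row."""
    terms = {t.lower() for t in query.split()}
    columns = {h.strip().lower() for h in header}
    # Assumption: the first cell of each row acts as the row label.
    labels = {r[0].strip().lower() for r in rows if r}
    col_hits = terms & columns
    row_hits = terms & labels
    # The row term and the column term must be distinct query terms.
    return any(c != r for c in col_hits for r in row_hits)


if __name__ == "__main__":
    print(structure_match("latte calories", TABLE_HEADER, TABLE_ROWS))
```

A real ranker would presumably treat such a match as one feature among many rather than as a yes/no decision. The point of the sketch is only that the match depends on table structure, not on how close the two terms are on the page.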
Enable question answering. A long-standing goal for Web search is to
return answers in the form of facts;
for example, in the latte calories
query, rather than return a URL, a
search engine might return an actual
numerical value extracted from the
HTML table. Web search engines return actual answers for very specific
query domains (such as weather and
flight conditions), but doing so in a
domain-independent way is a much
greater challenge.
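For the latte calories example, returning a fact instead of a URL amounts to a cell lookup once the table has been recovered. The sketch below assumes the same hypothetical representation (a header list, with each row's first cell as its label). The function name answer_from_table and the sample values are illustrative only. It sidesteps the hard part, which is doing this reliably across arbitrary domains.

```python
from typing import List, Optional

# Toy table; the calorie values are invented placeholders, not real data.
TABLE_HEADER = ["drink", "calories"]
TABLE_ROWS = [["espresso", "5"], ["latte", "190"]]


def answer_from_table(
    query: str, header: List[str], rows: List[List[str]]
) -> Optional[str]:
    """Return the cell where a row named by the query meets a column named by the query."""
    terms = [t.lower() for t in query.split()]
    col_index = {h.strip().lower(): i for i, h in enumerate(header)}
    for row in rows:
        # Assumption: the first cell of each row is its label.
        if not row or row[0].strip().lower() not in terms:
            continue
        for term in terms:
            i = col_index.get(term)
            # Skip column 0: it holds the row label, not an answer.
            if i is not None and 0 < i < len(row):
                return row[i]
    return None


if __name__ == "__main__":
    print(answer_from_table("latte calories", TABLE_HEADER, TABLE_ROWS))
```

Exact string matching is the simplest possible choice here. Queries phrased differently from the table's own labels would need some form of normalization or synonym handling, which this sketch does not attempt.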
Enable data integration from multiple Web sources. With all the data
sets available on the Web, the idea
of combining and integrating them
in ad hoc ways is immensely appealing. In a traditional database setting,
this task is called data integration;
on the Web, combining two disparate
data sets is often called a “mashup.”
While a traditional database administrator might integrate two employee
databases with great precision and at
great cost, most combinations of Web
data should be akin to Web search—
relatively imprecise and inexpensive;
for example, a user might combine
the set of coffeehouses with a database of WiFi hotspots, where speed