ble includes labels for each column.
When inspecting tables by hand, we
found 70% of good relational-style tables contain such a metadata row. As
with relational filtering, we used a set
of trained classifiers to automatically
determine whether or not the schema
row is present.
The two techniques together allowed WebTables to recover 125 million high-quality databases from a
large general Web crawl (several billion Web pages). The tables in this
corpus contained more than 2.6 million unique “schemas,” or unique sets
of attribute strings. This enormous
data set is a unique resource we explore in the following paragraphs.
Leveraging extracted data.
By aggregating over the extracted
WebTables data, we can create new
applications that were previously difficult or
impossible through other techniques.
One such application is structured
data search. Traditional search engines are tuned to return relevant documents, not data sets, so users searching for data are generally ill-served.
Using the extracted WebTables data,
we implemented a search engine that
takes a keyword query and returns a
ranked list of databases instead of
URLs; Figure 2 is a screenshot of the
prototype system. Because WebTables extracted structural information
for each object in the search engine’s
index, the results page can be more
interesting than in a standard search
engine. Here, the page of search results contains an automatically drawn
map reflecting the cities listed in the
data set; imagine the system being
used by knowledge workers who want
to find data to add to a spreadsheet.
In addition to the data in the tables,
we found significant value in the
tabular schemata we collected.
We created the Attribute
Correlation Statistics Database
(ACSDb), consisting of simple frequency
counts for each unique piece
of metadata WebTables extracts; for
example, the database of presidents
mentioned earlier adds a single count
to the four-element set president,
party, term-as-president, vice-president.
By summing individual
attribute counts over all entries in the
ACSDb, WebTables is able to compute
various attribute probabilities, given a
randomly chosen database; for example,
the probability of seeing the name
attribute is far higher than that of seeing the
roaster attribute.
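
The counting scheme described above can be illustrated with a minimal sketch. It is not the WebTables implementation: the function names build_acsdb and attribute_probability are hypothetical, the four toy schemas are invented for illustration (only the presidents schema comes from the text), and we assume each extracted table contributes exactly one count to its attribute set.

```python
from collections import Counter


def build_acsdb(schemas):
    """Count how often each unique set of attribute strings occurs.

    Assumption: every extracted table adds a single count to its schema.
    """
    acsdb = Counter()
    for attributes in schemas:
        acsdb[frozenset(attributes)] += 1
    return acsdb


def attribute_probability(acsdb, attribute):
    """Estimate the probability that a randomly chosen database
    has `attribute` in its schema, by summing counts over all entries."""
    total = sum(acsdb.values())
    if total == 0:
        return 0.0
    hits = sum(count for schema, count in acsdb.items() if attribute in schema)
    return hits / total


# Toy input: one schema per extracted table (hypothetical data).
schemas = [
    ["president", "party", "term-as-president", "vice-president"],
    ["name", "party"],
    ["name", "roaster"],
    ["name", "address"],
]

acsdb = build_acsdb(schemas)
p_name = attribute_probability(acsdb, "name")
p_roaster = attribute_probability(acsdb, "roaster")
```

In this toy corpus the name attribute appears in three of the four schemas and roaster in only one, which mirrors the kind of comparison the ACSDb makes possible at Web scale.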