is more important than flawless accuracy. Unlike most existing mashup
tools, we do not want users to be limited to data that has been prepared for
integration (such as data already available
in XML).
The Web is home to many kinds of
structured data, including data embedded
in text, socially created objects, HTML
tables, and Deep Web databases. We
have developed systems that focus on
HTML tables and Deep Web databases. WebTables extracts relational data
from crawled HTML tables, thereby
creating a collection of structured databases several orders of magnitude
larger than any other we know of. The
other project surfaces data obtained
from the Deep Web, almost all hidden
behind Web forms and thus inaccessible. We have also constructed a tool
(not discussed here) called Octopus
that allows users to extract, clean,
and integrate Web-embedded data.3
Finally, we built a third system, called
Google Fusion Tables,13 a cloud-based
service that facilitates creation and
publication of structured data on the
Web, therefore complementing the
two other projects.
WebTables
The WebTables system4,5 is designed
to extract relational-style data from
the Web expressed using the HTML
table tag. Figure 1 is a table listing
American presidents (http://www.enchantedlearning.com/history/us/pres/list.shtml) with four columns,
each with a topic-specific label and type
(such as President, or Term as President as a date range);
each row contributes a tuple of data. Although
most of the structured-data metadata
is implicit, this Web page essentially
contains a small relational database
anyone can crawl.
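
To make the idea of a table tag as a small relational database concrete, the following is a minimal sketch, not the WebTables implementation, that reads the rows of each table tag on a page into lists of cell strings using Python's standard html.parser module. The names TableCollector and extract_tables and the sample markup are hypothetical and chosen only for illustration; nested tables and malformed markup are deliberately not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects each (non-nested) <table> as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently open, if any
        self._row = None    # cells of the row currently open, if any
        self._cell = None   # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside an open cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = self._row = self._cell = None


def extract_tables(html):
    """Return every table found in the given HTML string."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables


# Illustrative markup only; not the page shown in Figure 1.
SAMPLE_HTML = """
<table>
  <tr><th>President</th><th>Party</th></tr>
  <tr><td>John Adams</td><td>Federalist</td></tr>
  <tr><td>Thomas Jefferson</td><td>Democratic-Republican</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_tables(SAMPLE_HTML)[0]:
        print(row)
```

Each extracted row corresponds to one tuple; whether the first row holds column labels is left undecided at this stage.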
Not all table tags carry relational
data. Many are used for page layout,
calendars, and other nonrelational
purposes; for example, in Figure 1,
the top of the page contains a table
tag used to lay out a navigation bar
with the letters A–Z. Based on a human-judged sample of raw tables, we
estimate up to 200 million true relational databases in English alone on
the Web. In general, less than 1% of
the content embedded in the HTML
table tags represents good tables. Indeed, the relational databases in the
WebTables corpus form the largest
database corpus we know of, by five
orders of decimal magnitude.a
WebTables focuses on two main
problems surrounding these databases: One, perhaps more obvious,
is how to extract them from the Web
in the first place, given that 98.9% of
tables carry no relational data. Once
we address this problem, we can move
to the second—what to do with the resulting huge collection of databases.
Table extraction. The WebTables
table-extraction process involves two
steps: First is an attempt to filter out
all the nonrelational tables. Unfortunately, automatically distinguishing a
relational table from a nonrelational
table can be difficult. To do so, the system uses a combination of handwritten and statistically trained classifiers
that use topic-independent features
of each table; for example, high-quality data tables often have relatively few
empty cells. Another useful feature is
whether each column contains a uniform data type (such as all dates or all
integers). Google Research has found
that the presence of a column toward the left
side of the table with values drawn
from the same semantic type (such as
country, species, or institution) is a
valuable signal for identifying high-quality relational tables.
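
The two topic-independent features named above, the share of empty cells and whether each column holds a uniform data type, can be sketched as follows. This is an illustrative approximation under stated assumptions, not the WebTables classifier: the coarse type detector, the function names, and the thresholds in looks_relational are all hypothetical, and a real system would combine such features in a statistically trained model rather than a fixed rule.

```python
import re

INTEGER_RE = re.compile(r"[+-]?\d+")
DATE_RANGE_RE = re.compile(r"\d{4}\s*-\s*\d{4}")  # e.g. "1797-1801"; a simplification


def cell_type(cell):
    """Assign a coarse type to one cell: empty, integer, date_range, or text."""
    text = cell.strip()
    if not text:
        return "empty"
    if INTEGER_RE.fullmatch(text):
        return "integer"
    if DATE_RANGE_RE.fullmatch(text):
        return "date_range"
    return "text"


def empty_cell_fraction(rows):
    """Fraction of cells that are empty; a table with no cells counts as all empty."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 1.0
    return sum(1 for cell in cells if not cell.strip()) / len(cells)


def uniform_type_column_fraction(rows):
    """Fraction of columns whose non-empty cells all share one coarse type."""
    width = max((len(row) for row in rows), default=0)
    if width == 0:
        return 0.0
    uniform = 0
    for j in range(width):
        types = {cell_type(row[j]) for row in rows if j < len(row)} - {"empty"}
        if len(types) == 1:
            uniform += 1
    return uniform / width


def looks_relational(rows, max_empty=0.2, min_uniform=0.8):
    """Toy hand-written rule; both thresholds are arbitrary assumptions."""
    return (empty_cell_fraction(rows) <= max_empty
            and uniform_type_column_fraction(rows) >= min_uniform)


# Body rows only (no header row); illustrative data.
relational = [
    ["John Adams", "Federalist", "1797-1801"],
    ["Thomas Jefferson", "Democratic-Republican", "1801-1809"],
]
layout = [
    ["A", "B", "C"],
    ["", "", ""],
    ["Home", "", "42"],
]

if __name__ == "__main__":
    print(looks_relational(relational), looks_relational(layout))
```

The functions expect body rows without a header row, since a text label at the top of an integer or date column would otherwise break that column's type uniformity.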
The second step is to recover metadata for each table passing through
the first filter. Metadata is information that describes the data in the database (such as number of columns,
types, and names). In the case of the
presidents, the metadata contains
the column labels President, Party, and so on. For coffeehouses, it
might contain Name, Speciality,
and Roaster. Although metadata
for a traditional relational database
can be complex, the goal for WebTables metadata is modest: determine
whether or not the first row of the ta-
a The second-largest collection we know of is due
to Wang and Hu,22 who also tried to gather data
from Web pages but with a relatively small and
focused set of input pages. Other research on
table extraction has not focused on large collections.10,12,23 Our discussion here refers to the
number of distinct databases, not the number
of tuples. Limaye et al.16 described techniques
for mapping entities and columns in tables to
an ontology.