signer or drawn automatically from
a pre-compiled linguistic resource
(such as a thesaurus). However, the
task of synonym finding is complicated by the fact that attribute names
are often acronyms or word combinations, and their meanings are highly
contextual. Unfortunately, manually
computing a set of synonyms is burdensome and error-prone.
WebTables uses probabilities from
the ACSDb to encode three observations about good synonyms:
• Two synonyms should not appear
together in any known schema, as it
would be repetitive on the part of the
database designer;
• Two synonyms should share
common co-attributes; for example,
phone-number and phone-# should
both appear along with name and address; and
• The most accurate synonyms are
popular in real-world use cases.
WebTables can encode each of
these observations in terms of attribute probabilities using ACSDb data.
Combining them, we obtain a formula for a synonym-quality score that WebTables uses to sort and rank every possible attribute pair; Table 2 lists a series
of input domains and the output pairs
of the synonym-finding system.
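
The three observations can be turned into a score in several ways. The following is a minimal sketch of one plausible combination, not necessarily the exact formula WebTables uses: the score is zero when two attributes ever co-occur, grows with the popularity of both attributes, and shrinks as their co-attribute distributions diverge. The toy `ACSDB` contents, the smoothing constant `eps`, and all function names are assumptions made for illustration.

```python
from itertools import combinations

# Toy stand-in for the ACSDb: each schema (a set of attribute names) mapped to
# the number of times it was observed. Schemas and counts are invented.
ACSDB = {
    frozenset({"name", "address", "phone-number"}): 4,
    frozenset({"name", "address", "phone-#"}): 3,
    frozenset({"name", "e-mail", "phone-number"}): 2,
    frozenset({"name", "address", "e-mail"}): 1,
}
TOTAL = sum(ACSDB.values())


def p(*attrs):
    """Fraction of observed schemas that contain every attribute in attrs."""
    wanted = set(attrs)
    return sum(count for schema, count in ACSDB.items() if wanted <= schema) / TOTAL


def p_given(z, a):
    """Conditional probability p(z | a): how often z appears alongside a."""
    pa = p(a)
    return p(z, a) / pa if pa else 0.0


def synonym_score(a, b, eps=0.01):
    """Hypothetical synonym-quality score built from the three observations."""
    # Observation 1: synonyms should never appear together in a schema.
    if p(a, b) > 0:
        return 0.0
    # Observation 2: synonyms should share co-attributes, so penalize the
    # squared difference between their co-attribute probabilities.
    others = {z for schema in ACSDB for z in schema} - {a, b}
    distance = sum((p_given(z, a) - p_given(z, b)) ** 2 for z in others)
    # Observation 3: popular attributes make more reliable synonyms.
    return p(a) * p(b) / (eps + distance)


def rank_pairs():
    """Score every attribute pair and return the nonzero ones, best first."""
    attrs = sorted({z for schema in ACSDB for z in schema})
    scored = [(synonym_score(a, b), a, b) for a, b in combinations(attrs, 2)]
    return sorted((t for t in scored if t[0] > 0), reverse=True)


if __name__ == "__main__":
    for score, a, b in rank_pairs():
        print(f"{a}|{b}: {score:.3f}")
```

On this toy data the pair phone-#|phone-number ranks first because the two attributes never co-occur yet both appear with name and address, while name and address score zero because they appear together.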
gines; few hyperlinks point to Web
pages resulting from form submissions, and Web crawlers did not have
the ability to automatically fill out
forms. Hence, the names “Deep,”
“Hidden,” and “Invisible Web” have
all been used to refer to the content
accessible only through forms. Bergman [2] and He et al. [14] have speculated
that the data in the Deep Web far exceeds the data indexed by contemporary search engines. We estimate
that there are at least 10 million potentially useful
distinct forms [18]; our previous work [17]
has a more thorough discussion of the
Deep Web literature and its relation to
the projects described here.
The goal of Google’s Deep Web
Crawl Project is to make Deep Web
content accessible to search-engine
users. There are two complementary approaches to offering access to
it: creating vertical search engines for
specific topics (such as coffee, presidents, cars, books, and real estate)
and surfacing Deep Web content. In the
first, for each vertical, a designer must
create a mediated schema visible to
users and create semantic mappings
from the Web sources to the mediated
schema. However, at Web scale, this
approach suffers from several drawbacks:
• A human must spend time and
effort building and maintaining each
mapping;
• When dealing with thousands of
domains, identifying the topic relevant to an arbitrary keyword query is
extremely difficult; and
• Data on the Web reflects every
topic in existence, and topic boundaries are not always clear.
The Deep Web Crawl Project followed the second approach, surfacing
Deep Web content by pre-computing the
most relevant form submissions for
all interesting HTML forms. The URLs
resulting from these submissions can
then be added to the crawl of a search
engine and indexed like any other
HTML page. This approach leverages
the existing search-engine infrastructure, allowing the seamless inclusion
of Deep Web pages into Web-search
results. The system currently surfaces
content for several million Deep Web
databases spanning more than 50 languages and several hundred domains,
and the surfaced pages contribute results to more than 1,000 Web-search
queries per second on Google.com.
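
To make the surfacing idea concrete, here is a minimal sketch of how form submissions can be turned into ordinary URLs for a crawler. It simply enumerates combinations of candidate values for a GET form; choosing which submissions are the most relevant is the hard part of the project and is not modeled here. The function name `surface_urls`, the `limit` parameter, and the example form at `bank.example` are hypothetical.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action_url, candidate_values, limit=1000):
    """Enumerate GET URLs for combinations of candidate form-input values.

    candidate_values maps each form input name to a list of values to try.
    The limit caps the number of URLs so a large form cannot flood the crawl.
    """
    names = sorted(candidate_values)
    urls = []
    for combo in product(*(candidate_values[name] for name in names)):
        query = urlencode(dict(zip(names, combo)))
        urls.append(f"{action_url}?{query}")
        if len(urls) >= limit:
            break
    return urls


# Hypothetical store-locator form with two inputs.
form_inputs = {"zip": ["94043", "10001"], "type": ["atm", "branch"]}
urls = surface_urls("https://bank.example/locator", form_inputs)

if __name__ == "__main__":
    for url in urls:
        print(url)
```

Each generated URL can be handed to the crawler like any other link, which is what lets the approach reuse the existing indexing pipeline.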
For example, as of the writing of this
article, a search query for citibank
atm 94043 will return in the first position a parameterized URL surfacing
Deep Web Databases
Not all structured data on the Web is
published in easily accessible HTML
tables. Large volumes of data stored
in back-end databases are often made
available to Web users only through
HTML form interfaces; for example, a
large chain of coffeehouses might have
a database of store locations that are
retrieved by zip code using the HTML
form on the company’s Web site, and
users retrieve data by performing valid
form submissions. On the back end,
HTML forms are processed by either
posing structured queries over relational databases or sending keyword
queries over text databases. The retrieved content is published on Web
pages in structured templates, often
including HTML tables.
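
As an illustration of the structured-query case, the sketch below shows how a back end might answer a zip-code form submission from a relational database. It uses an in-memory SQLite database; the table layout, the sample rows, and the names `make_store_db` and `handle_form_submission` are assumptions for illustration, not a description of any particular site.

```python
import sqlite3


def make_store_db():
    """Build a small in-memory database of store locations (invented data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stores (name TEXT, address TEXT, zip TEXT)")
    conn.executemany(
        "INSERT INTO stores VALUES (?, ?, ?)",
        [
            ("Store A", "1 Main St", "94043"),
            ("Store B", "2 Oak Ave", "94043"),
            ("Store C", "3 Pine Rd", "10001"),
        ],
    )
    return conn


def handle_form_submission(conn, form_fields):
    """Translate submitted form fields into a structured query.

    Returns the matching (name, address) rows, or an empty list when the
    submission is not valid (here: the zip field is not five digits).
    """
    zip_code = form_fields.get("zip", "").strip()
    if not (zip_code.isdigit() and len(zip_code) == 5):
        return []
    cursor = conn.execute(
        "SELECT name, address FROM stores WHERE zip = ? ORDER BY name",
        (zip_code,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = make_store_db()
    print(handle_form_submission(db, {"zip": "94043"}))
```

The rows returned by such a handler are what the site then renders into a structured template, often an HTML table.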
While WebTables-harvested tables
are potentially reachable by users
posing keyword queries on search
engines, the content behind HTML
forms was for a long time believed
to be beyond the reach of search en-
Table 1. Sample output from the schema autocomplete tool. To the left is a
user’s input attribute; to the right are sample schemas.

Input attribute   Auto-completer output
name              name, size, last-modified, type
instructor        instructor, time, title, days, room, course
elected           elected, party, district, incumbent, status, opponent, description
ab                ab, h, r, bb, so, rbi, avg, lob, hr, pos, batters
sqft              sqft, price, baths, beds, year, type, lot-sqft, days-on-market, stories

Table 2. Sample output from the synonym-finding tool. To the left are the input
context attributes; to the right are synonymous pairs generated by the system.

Input context     Synonym-finder outputs
name              e-mail|email, phone|telephone, e-mail address|email address, date|last-modified
instructor        course-title|title, day|days, course|course-#, course-name|course-title
elected           candidate|name, presiding-officer|speaker
ab                k|so, h|hits, avg|ba, name|player
sqft              bath|baths, list|list-price, bed|beds, price|rent

February 2011 | Vol. 54 | No. 2 | Communications of the ACM 77