results from a database of ATM locations—a very useful search result that
would not have appeared otherwise.
Pre-computing the set of relevant form submissions for any given form is the primary difficulty with surfacing; for example, a field with label "roaster" should not be filled in with value "toyota." Given the scale of a Deep Web crawl, it is crucial there be no human involvement in the process of pre-computing form submissions. Hence, previous work that either addressed the problem by constructing mediator systems one domain at a time8, 9, 21 or needed site-specific wrappers or extractors to extract documents from text databases1, 19 could not be applied.
Surfacing Deep Web content involves two main technical challenges:
• Values must be selected for each input in the form; value selection is trivial for select menus but very challenging for text boxes; and
• Forms have multiple inputs, and using a simple strategy of enumerating all possible form submissions can be wasteful; for example, the search form on cars.com has five inputs, and a cross product will yield more than 200 million URLs, even though cars.com lists only 650,000 cars for sale.7 (A rough sketch of this blow-up follows.)
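To see why full enumeration is wasteful, consider a back-of-the-envelope calculation. The per-input option counts below are invented purely for illustration (the article reports only the totals for cars.com), but counts of this magnitude multiply out to hundreds of millions of candidate submissions:

    from math import prod

    # Hypothetical option counts for a five-input search form; these numbers
    # are invented to match the order of magnitude cited in the text, not
    # taken from the actual cars.com menus.
    option_counts = {
        "make": 50,
        "model": 200,
        "price_range": 20,
        "zip_code": 100,   # a text box probed with a sample of candidate values
        "distance": 10,
    }

    naive_cross_product = prod(option_counts.values())
    print(f"Naive enumeration: {naive_cross_product:,} candidate URLs")  # 200,000,000
    print("Cars actually listed for sale: ~650,000")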
The full details on how we addressed these challenges are in Madhavan et al.18 Here, we outline how we approach the two problems:
Selecting input values. A large number of forms have text-box inputs and
require valid input values for the retrieval of any data. The system must
therefore choose a good set of values
to submit in order to surface useful
result pages. Interestingly, we found
it is not necessary to have a complete
understanding of the semantics of
the form to determine good candidate
text inputs. To understand why, first
note that text inputs fall into one of
two categories: generic search inputs
that accept most keywords and typed
text inputs that accept only values in a
particular topic area.
For search boxes, the system predicts an initial set of candidate keywords by analyzing text from the form site, using the text to bootstrap an iterative probing process. The system submits the form with candidate keywords; when valid form submissions result, the system extracts more keywords from the resulting pages. This iterative process continues until either there are no new candidate keywords or the system reaches a pre-specified target number of results. The set of all candidate keywords can then be pruned, choosing a small number that ensures diversity of the exposed database content. Similar iterative probing approaches have been used to extract text documents from specific databases.1, 6, 15, 19, 20
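The loop below is a minimal sketch of this probing process for a single generic search box. The helpers extract_candidate_keywords, submit_form, is_valid_result, and prune_for_diversity are placeholders for the site-analysis, form-submission, and pruning machinery; they are our names for exposition, not the actual system's interfaces.

    def probe_search_box(form, seed_text, target_results=500, max_probes=1000):
        """Sketch of iterative probing: bootstrap keywords from the form's site,
        submit them, and harvest new keywords from valid result pages."""
        candidates = extract_candidate_keywords(seed_text)  # placeholder: returns a set of words
        tried, valid_keywords, result_pages = set(), set(), []

        while candidates and len(result_pages) < target_results and len(tried) < max_probes:
            keyword = candidates.pop()
            tried.add(keyword)

            page = submit_form(form, {form.search_input: keyword})  # placeholder submission
            if not is_valid_result(page):                           # e.g., error or empty page
                continue

            valid_keywords.add(keyword)
            result_pages.append(page)
            # Feed words found on the surfaced result page back into the candidate pool.
            candidates |= extract_candidate_keywords(page.text) - tried

        # Keep only a small set of keywords that still exposes diverse content.
        return prune_for_diversity(valid_keywords)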
Next Steps
These two projects represent first steps in retrieving structured data on the Web and making it directly accessible to users. Searching this data is not a solved problem; in particular, search over large collections of data is still an area in need of significant research, as is its integration with the rest of Web search. An important lesson we learned is that there is significant value in analyzing collections of metadata on the Web, in addition to the data itself.
Specifically, from the collections we have worked with—forms and HTML tables—we have extracted several artifacts:
• A collection of forms (input names that appear together and values for select menus associated with input names);
• A collection of several million schemata for tables, or sets of column names appearing together; and
• A collection of columns, each with values in the same domain (such as city names, zip codes, and car makes).
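As a simplified illustration, these artifact collections might be represented along the following lines; the class and function names are ours, invented for exposition:

    from dataclasses import dataclass

    @dataclass
    class FormArtifact:
        input_names: list[str]                # input names that appear together on one form
        select_values: dict[str, list[str]]   # select-menu values keyed by input name

    @dataclass
    class TableSchema:
        column_names: list[str]               # column names appearing together in one table

    @dataclass
    class ValueColumn:
        domain: str                           # e.g., "city name", "zip code", "car make"
        values: set[str]                      # values drawn from that domain

    # Hypothetical lookup over the column collection: given an attribute name,
    # return candidate values for it (the kind of semantic service described next).
    def values_for_attribute(columns: dict[str, ValueColumn], attribute: str) -> set[str]:
        column = columns.get(attribute)
        return column.values if column else set()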
Semantic services. Generalizing from our synonym finder and schema auto-complete, we build from the schema artifacts a set of semantic services that form a useful infrastructure for many other tasks. An example of such a service is one that, given the name of an attribute, returns a set of values for its column; such a service can automatically fill out forms in order to surface Deep Web content. A second