Technology | DOI: 10.1145/1400181.1400187
Alex Wright
Searching the Deep Web
While the Semantic Web may be a long time coming,
Deep Web search strategies offer the promise of a semantic Web.
The Web is bigger than it looks. Beyond the billions of pages that populate the major search engines lies an even vaster, hidden Web of data: classified ads, library catalogs, airline reservation systems, phone books, scientific databases, and all kinds of other information that remains largely concealed from view behind a curtain of query forms. Some estimates have pegged the size of the Deep Web at up to 500 times larger than the Surface Web (also known as the Shallow Web) of static HTML pages.
Researchers have been trying to crack
the Deep Web for years, but most of those
efforts to date have focused on building specialized vertical applications like
comparison shopping portals, business
intelligence tools, or top-secret national
security projects that scour hard-to-crawl
overseas data sources. These projects
have succeeded largely by targeting narrow domains where a search application
can be fine-tuned to query a relatively
small number of databases and return
highly targeted results.
Bringing Deep Web search techniques to bear on the public Web poses a more difficult challenge. While a few high-profile sites like Amazon or YouTube provide public Web services or custom application programming interfaces that open their databases to search engines, many more sites do not. Multiply that problem by the millions of possible data sources now connected to the Web—all with different form-handling rules, languages, encodings, and an almost infinite array of possible results—and you have one tough assignment. “This is the most interesting data integration problem imaginable,” says Alon Halevy, a former University of Washington computer science professor who is now leading a Google team trying to solve the Deep Web search conundrum.
Deep Web Search 101
There are two basic approaches to
searching the Deep Web. To borrow a
fishing metaphor, these approaches
might be described as trawling and angling. Trawlers cast wide nets and pull
them to the surface, dredging up whatever they can find along the way. It’s a
brute-force technique that, while inelegant, often yields plentiful results. Angling, by contrast, requires more skill.
Anglers cast their lines with precise
techniques in carefully chosen locations. It’s a difficult art to master, but
when it works, it can produce more satisfying results.
The trawling strategy—also known
as warehousing or surfacing—involves
spidering as many Web forms as possible, running queries and stockpiling
the results in a searchable index. While
this approach allows a search engine to
retrieve vast stores of data in advance,
it also has its drawbacks. For one thing,
this method requires blasting sites with
uninvited queries that can tax unsuspecting servers. And the moment data is
retrieved, it instantly becomes
out of date. “You’re force-fitting dynamic
data into a static document model,” says
Anand Rajaraman, a former student of
Halevy’s and co-founder of search startup Kosmix. As a result, search queries
may return incorrect results.
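To make the surfacing idea concrete, the sketch below shows what such a crawler might do in miniature: submit candidate keywords to a site's query form and stockpile the result pages for later indexing. It is a minimal illustration in Python, not any engine's actual implementation; the URL, form field name, and seed terms are hypothetical.

```python
# Minimal sketch of the "trawling"/surfacing approach: submit candidate
# queries to a Web form and stockpile the result pages for later indexing.
# The URL, form field, and keyword list are hypothetical placeholders.
import urllib.parse
import urllib.request

FORM_URL = "http://example.com/search"            # hypothetical query form
FIELD = "q"                                       # hypothetical input field
candidate_terms = ["jazz", "physics", "recipes"]  # invented seed keywords

index = {}  # term -> raw HTML of the result page
for term in candidate_terms:
    query = urllib.parse.urlencode({FIELD: term})
    with urllib.request.urlopen(f"{FORM_URL}?{query}") as response:
        # Store a static snapshot; as noted above, it is out of date
        # the moment it is retrieved.
        index[term] = response.read().decode("utf-8", errors="replace")
```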
The angling approach—also known
as mediating—involves brokering a
search query in real time across multiple sites, then federating the results
for the end user. While mediating produces more timely results, it also has
some drawbacks. Chief among these
is determining where to plug a given
set of search terms into the range of
possible input fields on any given Web
form. Traditionally, mediated search
engines have relied on developing custom “wrappers” that serve as a kind of
Rosetta Stone for each data source. For
example, a wrapper might describe how
to query an online directory that accepts
inputs for first name and last name, and
returns a mailing address as a result. At
Vertica Systems, engineers create these
wrappers by hand, a process that usually takes about 20 minutes per site. The
wrappers are then added to a master ontology stored in a database table. When
users enter a search query, the engine
converts the output into Resource Description Framework (RDF), turning
each site into, effectively, a Web service.
By looking for subject-verb-object combinations in the data, engineers can
create RDF triples out of regular Web
search results. Vertica founder Mike Stonebraker freely admits, however, that this hands-on method has limitations. “The
problem with our approach is that there
are millions of Deep Web sites,” he says.
“It won’t scale.” Several search engines
are now experimenting with approaches for developing automated wrappers
that can scale to accommodate the vast
number of Web forms available across
the public Web.
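To make the wrapper idea concrete, here is a minimal sketch assuming a hypothetical online directory that accepts a first and last name and returns a mailing address, as in the example above. The wrapper maps the mediator's schema onto the source's form fields and expresses each result as a subject-verb-object (RDF-style) triple; every name and URL here is invented for illustration, not Vertica's actual implementation.

```python
# Minimal sketch of a hand-written "wrapper" for the mediating approach.
# It describes how to query one hypothetical directory source and how to
# turn its output into RDF-style (subject, predicate, object) triples.
# The endpoint, field names, and sample data are invented for illustration.
from dataclasses import dataclass

@dataclass
class DirectoryWrapper:
    endpoint: str = "http://directory.example.com/lookup"  # hypothetical
    # Map the mediator's common schema onto this source's form fields.
    field_map = {"first_name": "fname", "last_name": "lname"}

    def build_query(self, first_name: str, last_name: str) -> dict:
        """Translate a mediated query into this source's input fields."""
        return {self.field_map["first_name"]: first_name,
                self.field_map["last_name"]: last_name}

    def to_triples(self, person: str, address: str) -> list[tuple]:
        """Express one result as a subject-verb-object triple."""
        return [(person, "hasMailingAddress", address)]

wrapper = DirectoryWrapper()
params = wrapper.build_query("Ada", "Lovelace")   # {'fname': 'Ada', ...}
triples = wrapper.to_triples("Ada Lovelace", "12 Example St.")
```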
The other major problem confronting mediated search engines lies in determining which sources to query in the
first place. Since it would be impossible
to search every possible data source at
once, mediated search engines must
identify precisely which sites are worth
searching for any given query.
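One way to picture this routing step: keep a short description of each known source and dispatch the query only to the sources that score well against its terms. The toy sketch below illustrates the idea; the source descriptions and the overlap-based scoring are invented, and far simpler than what a real mediated engine would use.

```python
# Toy sketch of source selection for a mediated search engine: score each
# known Deep Web source against the query and route it only to the best
# matches. The source descriptions below are invented for illustration.
SOURCES = {
    "recipe-site": {"cooking", "recipe", "ingredient", "dinner"},
    "flight-db":   {"flight", "airline", "airport", "fare"},
    "phone-book":  {"name", "address", "phone", "directory"},
}

def route(query: str, top_k: int = 2) -> list[str]:
    """Rank sources by keyword overlap with the query's terms."""
    terms = set(query.lower().split())
    scored = sorted(SOURCES, key=lambda s: len(SOURCES[s] & terms),
                    reverse=True)
    # Keep only sources that matched at least one query term.
    return [s for s in scored[:top_k] if SOURCES[s] & terms]

print(route("cheap airline fare to boston"))  # -> ['flight-db']
```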
“You can’t indiscriminately scrub dynamic databases,” says former BrightPlanet CEO Mike Bergman. “You would
not want to go to a recipe site and ask