about nuclear physics.” To determine
which sites to target, a mediated search
engine has to run some type of textual
analysis on the original query, then use
that interpretation to select the appropriate sites. “Analyzing the query isn’t
hard,” says Halevy. “The hard part is figuring out which sites to query.”
At Kosmix, the team has developed
an algorithmic categorization technology that analyzes the contents of users’ queries, a step that requires heavy computation at runtime, and maps them against a taxonomy of millions of topics and the relationships between them, then uses that
analysis to determine which sites are
best suited to handle a particular query.
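
The routing step can be made concrete with a small sketch. The Python below scores a query against a toy taxonomy and returns the sources attached to the best-matching topics; the topics, keywords, and site names are invented for illustration, and Kosmix's production system works against millions of topics rather than three.

    # A toy taxonomy: each topic carries signal keywords and the Deep Web
    # sources best suited to queries on that topic (all names invented).
    TAXONOMY = {
        "nuclear physics": {
            "keywords": {"nuclear", "isotope", "reactor", "fission"},
            "sites": ["physics-preprint-search", "nuclear-data-portal"],
        },
        "fine art": {
            "keywords": {"painting", "sculpture", "museum", "gallery"},
            "sites": ["museum-collection-search", "auction-archive"],
        },
        "travel": {
            "keywords": {"flight", "hotel", "itinerary", "fare"},
            "sites": ["fare-aggregator", "hotel-inventory-db"],
        },
    }

    def route_query(query, top_n=2):
        # Score each topic by keyword overlap with the query, then return
        # the sites attached to the best-matching topics.
        terms = set(query.lower().split())
        scored = sorted(
            ((len(terms & topic["keywords"]), name, topic["sites"])
             for name, topic in TAXONOMY.items()),
            reverse=True,
        )
        sites = []
        for score, _, topic_sites in scored[:top_n]:
            if score > 0:
                sites.extend(topic_sites)
        return sites

    print(route_query("papers about nuclear reactor fission"))
    # ['physics-preprint-search', 'nuclear-data-portal']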
Similarly, at the University of Utah’s
School of Computing, assistant professor Juliana Freire is leading a project
team working on crawling and indexing the entire universe of Web forms.
To determine the subject domain of
a particular form, they fire off sample
queries to develop a better sense of the
content inside. “The naïve way would be
to query all the words in the dictionary,”
says Freire. “Instead we take a heuristic-based approach. We try to reverse-engineer the index, so we can then use that
to build up our understanding of the
databases and choose which words to
search.” Freire claims that her team’s
approach allows the crawler to retrieve
better than 90% of the content stored in
each targeted site.
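
The probing loop Freire describes can be sketched roughly as follows. The submit_form helper and its canned result pages are stand-ins for real HTTP requests to a site's search form, and the selection heuristic is deliberately simple; the point is only that each round of results suggests the vocabulary for the next round of probes.

    import re
    from collections import Counter

    # Simulated result pages so the sketch runs without network access;
    # a real crawler would submit each term to the site's search form.
    SIMULATED_RESULTS = {
        "picasso": "Results: Picasso cubism painting guernica museum exhibit",
        "painting": "Results: painting landscape portrait oil canvas museum",
        "museum": "Results: museum exhibit collection gallery curator",
    }

    def submit_form(form_url, term):
        # Placeholder for an HTTP request to the real form.
        return SIMULATED_RESULTS.get(term, "No results found")

    def probe(form_url, seeds, rounds=3, per_round=5):
        # Each round's result pages suggest the next round's query terms,
        # gradually approximating the vocabulary of the site's hidden index.
        tried, queue, vocabulary = set(), list(seeds), Counter()
        for _ in range(rounds):
            for term in queue:
                if term in tried:
                    continue
                tried.add(term)
                page = submit_form(form_url, term)
                vocabulary.update(re.findall(r"[a-z]{4,}", page.lower()))
            # Promote the most frequent words not yet tried as the next probes.
            queue = [w for w, _ in vocabulary.most_common() if w not in tried][:per_round]
        return tried

    print(probe("https://example.org/search", ["picasso"]))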
Google’s Deep Web search strategy grew out of a mediated search technique that originated in Halevy’s work at Transformic (which was acquired by Google in 2005), but it has since evolved toward a kind of smart warehousing model that tries to accommodate the sheer scale of the Web as a whole. “The approaches we had taken
before [at Transformic] wouldn’t work
because of all the domain engineering
required,” says Halevy.
Instead, Google now sends a spider to pull up individual query forms and index their contents,
analyzing each form for clues about
the topic it covers. For example, a page
that mentions terms related to fine art
would help the algorithm guess a subset
of terms to try, such as “Picasso,” “Rembrandt,” and so on. Once one of those
terms returns a hit, the search engine
can analyze the results and refine its
model of what the database contains.
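
In spirit, this turns form submissions into ordinary crawlable URLs, a strategy sometimes described as "surfacing" the Deep Web. The sketch below shows the idea for a simple GET-based form; the form address, field name, and candidate terms are invented, and a real pipeline would also have to handle POST forms, multiple fields, and analysis of the returned pages.

    from urllib.parse import urlencode

    def surface_urls(form_action, field, candidate_terms):
        # Turn each candidate query term into a plain URL that the ordinary
        # crawler can fetch, rank, and index like any other page.
        return [f"{form_action}?{urlencode({field: term})}"
                for term in candidate_terms]

    # Candidate terms guessed from text surrounding a museum's search form.
    candidates = ["picasso", "rembrandt", "vermeer"]
    for url in surface_urls("https://museum.example/search", "artist", candidates):
        print(url)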
“At Google we want to query any
form out there,” says Halevy, “whether
you’re interested in buying horses in
China, parking tickets in India, or researching museums in France.” When
Google adds the contents of each data
source to its search engine, it effectively
publishes them, enabling Google to assign a PageRank to each resource. Adding Deep Web search resources to its
index—rather than mediating the results in real time—allows Google to use
Deep Web search to augment its existing service. “Our goal is to put as much
interesting content as possible into our
index,” says Halevy. “It’s very consistent
with Google’s core mission.”
A Deep Semantic Web?
The first generation of Deep Web search
engines was focused on retrieving
documents. But as Deep Web search
engines continue to penetrate the far
reaches of the database-driven Web,
they will inevitably begin trafficking in
more structured data sets. As they do so,
the results may start to yield some of the
same benefits of structure and interoperability that are often touted for the
Semantic Web. “The manipulation of
the Deep Web has historically been at a
document level and not at the level of a
Web of data,” says Bergman. “But the retrieval part is indifferent to whether it’s
a document or a database.”
So far, the Semantic Web community
has been slow to embrace the challenges of the Deep Web, focusing primarily
on encouraging developers to embrace
languages and ontology definitions that
can be embedded into documents rather than incorporated at a database level.
“The Semantic Web has been focused
on the Shallow Web,” says Stonebraker,
“but I would be thrilled to see the Se-
mantic Web community focus more on
the Deep Web.”
Some critics have argued that the Semantic Web has been slow to catch on
because it hinges on persuading data
owners to structure their information
manually, often in the absence of a clear
economic incentive for doing so. While
the Semantic Web approach may work
well for targeted vertical applications
where there is a built-in economic incentive to support expensive mark-up
work (such as biomedical information),
such a labor-intensive platform, these critics argue, will never scale to the Web as a whole. “I’m not
a big believer in ontologies because they
require a lot of work,” says Freire. “But
by clustering the attributes of forms and
analyzing them, it’s possible to generate
something very much like an ontology.”
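
One simple flavor of that attribute analysis, assuming a handful of scraped forms with invented field labels, is to count which labels co-occur; attributes that repeatedly appear together begin to look like the schema of an underlying concept, without anyone writing an ontology by hand.

    from collections import defaultdict
    from itertools import combinations

    # Field labels scraped from several (invented) search forms in one domain.
    forms = [
        {"make", "model", "year", "price"},
        {"make", "model", "zip code", "price"},
        {"manufacturer", "model", "year", "color"},
    ]

    # Count how often pairs of labels appear on the same form; labels that
    # co-occur across many forms behave like attributes of one concept.
    cooccur = defaultdict(int)
    for form in forms:
        for a, b in combinations(sorted(form), 2):
            cooccur[(a, b)] += 1

    for (a, b), n in sorted(cooccur.items(), key=lambda kv: -kv[1]):
        if n > 1:
            print(f"{a} <-> {b}: co-occurs in {n} forms")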
While the Semantic Web may be a
long time coming, Deep Web search
strategies hold out hope for the possibility of a semantic Web. After all, Deep Web
search inherently involves structured data
sets. Rather than relying on Web site owners to mark up their data, couldn’t search
engines simply do it for them?
Google is exploring just this approach, creating a layer of automated
metadata based on analysis of the site’s
contents rather than relying on site
owners to take on the cumbersome task
of marking up their content. Bergman’s
startup, Zitgist, is exploring a concept
called Linked Data, predicated on the
notion that every bit of data available
over the Web could potentially be addressed by a Uniform Resource Identifier (URI). If that vision came to fruition, it
would effectively turn the entire Web
into a giant database. “For more than
30 years, the holy grail of IT has been to
eliminate stovepipes and federate data
across the enterprise,” says Bergman,
who thinks the key to joining Deep Web
search with the Semantic Web lies in
RDF. “Now we have a data model that’s
universally acceptable,” he says. “This
will let us convert legacy relational schemas to HTTP.”
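
As a toy illustration of that conversion (not Zitgist's implementation), the sketch below gives each row of a relational table its own URI and emits its columns as subject-predicate-object triples, roughly in N-Triples form; the base URI and the sample record are invented.

    BASE = "http://example.org/museum"

    def row_to_triples(table, pk, row):
        # Give the row a URI built from its primary key, then emit one
        # triple per remaining column.
        subject = f"{BASE}/{table}/{row[pk]}"
        return [(subject, f"{BASE}/{table}#{col}", f'"{val}"')
                for col, val in row.items() if col != pk]

    painting = {"id": 42, "title": "Guernica", "artist": "Picasso", "year": 1937}
    for s, p, o in row_to_triples("painting", "id", painting):
        print(f"<{s}> <{p}> {o} .")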
Will the Deep Web and Semantic
Web ever really coalesce in the real
world of public-facing Web applications? It’s too early to say. But when and
if that happens, the Web may just get a
whole lot deeper.
Alex Wright is a writer and information architect who
lives and works in New York City.