about nuclear physics.” To determine which sites to target, a mediated search engine has to run some type of textual analysis on the original query, then use that interpretation to select the appropriate sites. “Analyzing the query isn’t hard,” says Halevy. “The hard part is figuring out which sites to query.”
At Kosmix, the team has developed an algorithmic categorization technology that analyzes the contents of users’ queries, a step that requires heavy computation at runtime, and maps them against a taxonomy of millions of topics and the relationships between them, then uses that analysis to determine which sites are best suited to handle a particular query. Similarly, at the University of Utah’s School of Computing, assistant professor Juliana Freire is leading a project team working on crawling and indexing the entire universe of Web forms. To determine the subject domain of a particular form, they fire off sample queries to develop a better sense of the content inside. “The naïve way would be to query all the words in the dictionary,” says Freire. “Instead we take a heuristic-based approach. We try to reverse-engineer the index, so we can then use that to build up our understanding of the databases and choose which words to search.” Freire claims that her team’s approach allows the crawler to retrieve better than 90% of the content stored in each targeted site.
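To make the site-selection step concrete, here is a minimal sketch of how a mediated search engine might route a query to candidate sites. It is not Kosmix's system: the three-topic taxonomy, the keyword sets, and the `example` site names are all invented for illustration, and simple keyword overlap stands in for whatever categorization a production engine would use.

```python
import re

# Hypothetical toy taxonomy: topic -> (keywords, sites that cover the topic).
# A real taxonomy would hold millions of topics; these entries are invented.
TAXONOMY = {
    "physics": ({"nuclear", "physics", "quantum", "particle"}, ["physics-db.example"]),
    "art": ({"painting", "picasso", "museum", "sculpture"}, ["art-db.example"]),
    "health": ({"symptom", "drug", "disease", "treatment"}, ["health-db.example"]),
}


def select_sites(query, taxonomy=TAXONOMY):
    """Return candidate sites, best-matching topic first."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    scored = []
    for topic, (keywords, sites) in taxonomy.items():
        overlap = len(tokens & keywords)  # shared words between query and topic
        if overlap:
            scored.append((overlap, topic, sites))
    # Highest overlap first; ties are broken alphabetically by topic name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [site for _, _, sites in scored for site in sites]


print(select_sites("papers about nuclear physics"))
```

A query that matches no topic returns an empty list, which is the point at which a real mediator would have to fall back on some other strategy.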
Google’s Deep Web search strategy grew out of a mediated search technique that originated in Halevy’s work at Transformic (which was acquired by Google in 2005), but has since evolved toward a kind of smart warehousing model that tries to accommodate the sheer scale of the Web as a whole. “The approaches we had taken before [at Transformic] wouldn’t work because of all the domain engineering required,” says Halevy.
Instead, Google now sends a spider to pull up individual query forms and indexes the contents of the form, analyzing each form for clues about the topic it covers. For example, a page that mentions terms related to fine art would help the algorithm guess a subset of terms to try, such as “Picasso,” “Rembrandt,” and so on. Once one of those terms returns a hit, the search engine can analyze the results and refine its model of what the database contains.
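The probe-and-refine loop described above can be sketched in a few lines. This is an illustration only, not Google's or Freire's crawler: the four-record `DOCS` table and the `toy_search` function are stand-ins for a real site's query form, and the rule for choosing the next probe (try the most frequent unseen words from the latest results) is an assumed heuristic.

```python
import re
from collections import Counter, deque

# Invented stand-in for a database hidden behind a query form.
DOCS = {
    1: "picasso guernica painting",
    2: "rembrandt night watch painting",
    3: "monet water lilies",
    4: "horse auction listing",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def toy_search(term):
    """Simulate submitting one keyword to the form."""
    return [(doc_id, text) for doc_id, text in DOCS.items() if term in tokenize(text)]


def probe_form(search, seed_terms, max_queries=20):
    """Issue seed queries, then keep probing with words found in the results."""
    queue = deque(seed_terms)
    tried = set()
    retrieved = {}
    queries = 0
    while queue and queries < max_queries:
        term = queue.popleft()
        if term in tried:
            continue
        tried.add(term)
        queries += 1
        new_terms = Counter()
        for doc_id, text in search(term):
            if doc_id not in retrieved:  # only learn from records not seen before
                retrieved[doc_id] = text
                new_terms.update(tokenize(text))
        # Assumed heuristic: queue the most frequent new words first.
        for candidate, _count in new_terms.most_common():
            if candidate not in tried:
                queue.append(candidate)
    return retrieved


print(sorted(probe_form(toy_search, ["picasso"])))
```

Starting from the single seed "picasso", the loop reaches the Rembrandt record through the shared word "painting" but never finds the Monet or horse records, which share no vocabulary with the seed. Coverage in this toy depends entirely on how well the seed terms connect to the rest of the database.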
“At Google we want to query any form out there,” says Halevy, “whether you’re interested in buying horses in China, parking tickets in India, or researching museums in France.” When Google adds the contents of each data source to its search engine, it effectively publishes them, enabling Google to assign a PageRank to each resource. Adding Deep Web search resources to its index—rather than mediating the results in real time—allows Google to use Deep Web search to augment its existing service. “Our goal is to put as much interesting content as possible into our index,” says Halevy. “It’s very consistent with Google’s core mission.”
The first generation of Deep Web search engines was focused on retrieving documents. But as Deep Web search engines continue to penetrate the far reaches of the database-driven Web, they will inevitably begin trafficking in more structured data sets. As they do so, the results may start to yield some of the same benefits of structure and interoperability that are often touted for the Semantic Web. “The manipulation of the Deep Web has historically been at a document level and not at the level of a Web of data,” says Bergman. “But the retrieval part is indifferent to whether it’s a document or a database.”
So far, the Semantic Web community has been slow to take on the challenges of the Deep Web, focusing primarily on encouraging developers to embrace languages and ontology definitions that can be embedded into documents rather than incorporated at a database level. “The Semantic Web has been focused on the Shallow Web,” says Stonebraker, “but I would be thrilled to see the Semantic Web community focus more on the Deep Web.”
Some critics have argued that the Semantic Web has been slow to catch on because it hinges on persuading data owners to structure their information manually, often in the absence of a clear economic incentive for doing so. While the Semantic Web approach may work well for targeted vertical applications where there is a built-in economic incentive to support expensive mark-up work (such as biomedical information), such a labor-intensive platform will never scale to the Web as a whole. “I’m not a big believer in ontologies because they require a lot of work,” says Freire. “But by clustering the attributes of forms and analyzing them, it’s possible to generate something very much like an ontology.”
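Freire's remark about clustering form attributes can be illustrated with a rough sketch. It is not her team's method: the four forms and their field labels are invented, and a greedy single pass using Jaccard similarity with a hand-picked threshold of 0.3 stands in for a real clustering algorithm.

```python
def jaccard(a, b):
    """Share of attributes two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_forms(forms, threshold=0.3):
    """Greedily group forms whose field labels overlap enough."""
    clusters = []  # each cluster: {"attrs": set of labels, "members": [form names]}
    for name, attrs in forms.items():
        attrs = {label.lower() for label in attrs}
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(attrs, cluster["attrs"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"attrs": set(attrs), "members": [name]})
        else:
            best["attrs"] |= attrs  # the cluster's vocabulary grows with each member
            best["members"].append(name)
    return clusters


# Invented example forms and field labels.
FORMS = {
    "cars-a": ["Make", "Model", "Year", "Price"],
    "cars-b": ["make", "model", "mileage"],
    "jobs-a": ["title", "location", "salary"],
    "jobs-b": ["title", "location", "company"],
}

for cluster in cluster_forms(FORMS):
    print(cluster["members"], sorted(cluster["attrs"]))
```

Each resulting cluster carries a merged attribute set, such as make, model, year, price, and mileage for the two car forms. That merged set is the "something very much like an ontology" in miniature: a shared vocabulary for a domain that nobody had to mark up by hand.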
While the Semantic Web may be a long time coming, Deep Web search strategies hold out hope for the possibility of a semantic Web. After all, Deep Web search inherently involves structured data sets. Rather than relying on Web site owners to mark up their data, couldn’t search engines simply do it for them?
Google is exploring just this approach, creating a layer of automated metadata based on analysis of the site’s contents rather than relying on site owners to take on the cumbersome task of marking up their content. Bergman’s startup, Zitgist, is exploring a concept called Linked Data, predicated on the notion that every bit of data available over the Web could potentially be addressed by a Uniform Resource Identifier. If that vision came to fruition, it would effectively turn the entire Web into a giant database. “For more than 30 years, the holy grail of IT has been to eliminate stovepipes and federate data across the enterprise,” says Bergman, who thinks the key to joining Deep Web search with the Semantic Web lies in RDF. “Now we have a data model that’s universally acceptable,” he says. “This will let us convert legacy relational schemas to http.”
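As a rough picture of what converting a relational schema "to http" could mean, the sketch below turns one database row into RDF-style subject, predicate, object triples whose subject and predicates are HTTP URIs. It is a simplified illustration, not Zitgist's software and not a standards-compliant mapping: the `http://example.org/` namespace, the URI layout, and the sample row are all assumptions.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, chosen for illustration


def row_to_triples(table, key_column, row, base=BASE):
    """Map one relational row to (subject, predicate, object) triples."""
    # The row itself becomes an addressable resource: <base>/<table>/<key>.
    subject = f"{base}{quote(table)}/{quote(str(row[key_column]))}"
    triples = []
    for column, value in row.items():
        if column == key_column:
            continue  # the key is already encoded in the subject URI
        predicate = f"{base}{quote(table)}#{quote(column)}"
        triples.append((subject, predicate, str(value)))
    return triples


# Invented sample row from a hypothetical "painting" table.
row = {"id": 7, "artist": "Picasso", "title": "Guernica"}
for triple in row_to_triples("painting", "id", row):
    print(triple)
```

Once every row and column has a URI of this kind, records from different databases can refer to one another by address, which is the sense in which Linked Data would turn the Web into one large database.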
Will the Deep Web and Semantic Web ever really coalesce in the real world of public-facing Web applications? It’s too early to say. But when and if that happens, the Web may just get a whole lot deeper.
Alex Wright is a writer and information architect who lives and works in New York City.