Science | DOI: 10.1145/1629175.1629183
Neil Savage
New Search Challenges and Opportunities
If search engines can extract more meaning from text and better understand what
people are looking for, the Web’s resources could be accessed more effectively.
The Web is a huge, dynamic landscape of information, and navigating through it is not an easy task. There are billions of Web pages, and
the type of content is expanding dramatically, with blogs and Twitter feeds,
maps and videos, photos and podcasts.
People, typing on a computer in their
cubicle or using their smartphone on a
street corner, are trying to sift through
this growing morass of data, looking
for everything from car repair advice
to a nearby Thai restaurant that’s not
too expensive. For search engines, this
enormous variety of data and formats is
providing both new challenges and new
opportunities.
“The ability to produce information and store information has far outstripped human cognitive capacity,
which is basically fixed,” says Oren Etzioni, a professor of computer science
and engineering at the University of
Washington. “The haystack keeps getting bigger. Obviously we need better
and better tools to find the proverbial
needles.”
Today’s search engines do a fine job
of cataloging text, counting links, and
delivering lists of pages relevant to a user’s search topic. But in the coming decade, Etzioni believes, search will move
beyond keyword queries and automate
the time-consuming task of sifting
through those documents. With a better
understanding both of what documents
mean and what searchers are looking
for, he predicts, some tasks could be reduced from hours to minutes.
Etzioni is attempting to get more information out of text using a technique
called open information extraction,
which is built on a long-used technology that examines natural language
text and tries to derive data about the
relationships between words. An algorithm looks for triples, which follow the
structure of entity-relationship-entity,
such as “Beijing is the capital of China”
or “Franz Kafka was born in Prague.”
The system is open because it derives
the relations from the structure of the
language rather than relying on hand-labeled examples of relationships,
which would not be scalable to the Web
as a whole.
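The entity-relationship-entity structure can be illustrated with a toy pattern matcher. A real open-extraction system such as TextRunner uses a learned model of language rather than hand-written rules, so the regular expression below is purely illustrative, not Etzioni's actual method:

```python
import re

# Hypothetical pattern for illustration only: a capitalized entity,
# a relation phrase built around a copula and a preposition, then
# another capitalized entity.
TRIPLE = re.compile(
    r"(?P<e1>[A-Z][a-z]+(?: [A-Z][a-z]+)*)\s+"
    r"(?P<rel>(?:is|was|are|were)(?:\s+\w+)*?\s+(?:of|in|to|at)\b)\s+"
    r"(?P<e2>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

def extract_triples(text):
    """Return (entity, relation, entity) triples found in the text."""
    triples = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        m = TRIPLE.search(sentence)
        if m:
            triples.append((m.group("e1"), m.group("rel"), m.group("e2")))
    return triples

print(extract_triples("Beijing is the capital of China."))
# [('Beijing', 'is the capital of', 'China')]
```

A rule like this breaks down quickly on real Web text, which is exactly why open information extraction derives relations from general language structure instead of fixed patterns.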
Etzioni developed a program called
TextRunner that uses a general model
of language to assign labels to words in
a sentence, then to calculate the beginning and end of strings of words that
contain the entity-relationship-entity
structure. It extracts those triples so they
can be indexed and searched. A searcher who asks “Where was Kafka born?”
should quickly receive a precise answer,
not just a list of pages that contain the
words “Kafka” and “born.” Given the vast number of Web pages, Etzioni says, the search engine should also be able to notice errors: one page saying Kafka’s birthplace is Peking is less likely to be correct, for example, than the tens of thousands that say Prague.
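The redundancy argument can be sketched directly: pool the triples extracted from many pages and rank candidate answers by how often they are asserted. The function and data here are hypothetical, assumed only for illustration, not TextRunner's actual interface:

```python
from collections import Counter

def rank_answers(triples, entity, relation):
    """Rank candidate answers by how many extracted triples assert them.

    `triples` is a list of (entity, relation, entity) tuples pooled
    from many pages; frequent assertions outweigh rare, likely
    erroneous ones.
    """
    counts = Counter(
        obj for subj, rel, obj in triples
        if subj == entity and rel == relation
    )
    return counts.most_common()

# One stray page says Peking; several say Prague.
triples = [("Franz Kafka", "was born in", "Prague")] * 3 + \
          [("Franz Kafka", "was born in", "Peking")]
print(rank_answers(triples, "Franz Kafka", "was born in"))
# [('Prague', 3), ('Peking', 1)]
```

Simple vote counting like this is one way the scale of the Web becomes an asset rather than only a burden.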
It’s more challenging for a computer
to extract more subjective data from
text, such as judgments about hotels or
movies, but a well-designed algorithm
can figure out cues, such as which de-
Like many other computer scientists, the University of Washington’s Oren Etzioni is developing new tools for searching the Web’s growing morass of text, images, and other content.