each class, to train its learning algorithms for generating the subsumption graph among classes.
The Kylin/KOG project combines all
three knowledge-gathering paradigms:
Semantic-Web-oriented by being targeted at infoboxes; Social-Web-based
by leveraging the input of the large
Wikipedia community; and Statistical-Web-style through learning methods.
Searching and Ranking
YAGo with nAGA
The query language we designed for
YAGO adopts concepts from the standardized SPARQL Protocol and RDF
Query Language for RDF data but extends them through more expressive
pattern matching and ranking.
3, 18 The
prototype system that implements
these features is called NAGA (for Not
Another Google Answer, www.mpi-inf.
mpg.de/yago/). Viewing the knowledge
base as a graph, users and programmers alike can construct a query with
the help of subgraph templates; Figure
3 outlines three examples related to the
question scenarios discussed earlier.
The leftmost query in the Figure
about politicians who are also scientists shows two nodes matched by the
desired results and one node (labeled
$x) denoting a variable for which the
query must find all bindings. The edge
labels denote relationships and need
to be matched by the results. Here,
“isa” is shorthand notation for a composition of two connected edges that
correspond to the relationships
instanceOf between an entity and a class and
subclass between two classes. This way
the user also finds people who belong
to the classes “mayor” and “physicist.”
The query in the middle of Figure
3 (a simpler variant of the German-Nobel-laureate question) generalizes
the point about labels referring to
compositions of relations. The label
(bornIn|livesIn|citizenOf).locatedIn* is
a regular expression that allows users
to avoid overspecifying their information demand. We may be generous
when we call a person German, and the
locatedIn relationship often reflects
geographical hierarchies (such as with
cities, counties, states, and countries).
The rightmost query of Figure 3 is a
broad relatedness query that looks for
commonalities or other connections
among several entities. Here again, us-
ers or programmers would use regular
expressions as edge labels in the query’s graph template.
NAGA queries often return too many
results and so must rank these results.
For example, a query like “What is
known about Einstein?,” which may be
phrased as a single-edge graph pattern
isa(Einstein, $y), returns dozens if not
hundreds of results, including many
uninteresting ones like isa(Einstein,
Entity), isa(Einstein, Organism), and
isa(Einstein, Colleague). Ranking models for such results is much more difficult than for traditional search engines, as the system must consider the
graph structure in both queries and
results. Three general criteria must be
accommodated:
Informativeness. Users prefer informative answers—salient or interesting facts, as opposed to overly generic
facts or facts that are trivially known already. In the example query about Einstein the user would prefer the answers
isa(Einstein, Physicist) or isa(Einstein,
NobelLaureate) over the near-trivial results or a nontrivial but still less important fact like isa(Einstein, Vegetarian).
However, when asking a different query about noteworthy vegetarians, the
latter fact should be one of the highest-ranked results. Informativeness is a
query-dependent (and potentially user-and-situation-dependent) notion;
Confidence. Users may occasionally
find uncertain, dubious, or false statements in the YAGO knowledge base.
Each fact is annotated with a confidence value derived from the fact’s
original data sources and the extraction methods YAGO used. The quality
of the sources tapped by YAGO can be
quantified in terms of authority and
trust measures in the spirit of Google’s
PageRank, and the quality of different
extractors can be empirically assessed.
Various ways are available for combining these measures into a single confidence value, and high-confidence
7, 8
answers are preferred; and
Compactness. Whenever a query
returns paths or graphs rather than
individual nodes, we are interested in
compact graphs and short paths. For
example, a query about how Einstein
and Bohr are related should return a
short answer path that says something
like “both are physicists,” rather than
a convoluted answer like “Einstein
was a vegetarian like Tom Cruise who
was born in the same year Bohr died.”
A good ranking function is needed
to combine all three criteria. Here, we
sketch our approach for informativeness. We developed for NAGA a new
kind of statistical LM for graph-struc-tured data and queries. The parameters
of the model are estimated from corpus
or workload statistics. Consider the
simple queries isa(Einstein; $y) about
Einstein and bornIn($y; Frankfurt)
about people born in Frankfurt. For the
Einstein query YAGO estimates conditional co-occurrence probabilities
P[EinsteinÙPhysicist]
P[Einstein]
and
P[EinsteinÙVegetarian]
P[Einstein]
to compare and rank two possible answers. For the Frankfurt query YAGO
computes and compares
P[GoetheÙborn^Frankfurt]
P[bornÙFrankfurt]
against
P[WeikumÙborn^Frankfurt]
P[bornÙFrankfurt]
clearly favoring Goethe as a top result.
This LM-based ranking allows
NAGA to rank politicians who are
also scientists with high informativeness, with Benjamin Franklin, Angela
Merkel, and other prominent figures
showing up in the top ranks. So while
the YAGO knowledge base is primarily
a Semantic-Web endeavor, the ranking for its search engine is built on
Statistical-Web assets.
Despite good progress, these approaches face three notable challenges:
Scalable harvesting. Most new kno wl-edge is produced in textual form (such
as in scientific publications). Methods
for natural-language IE face inherent
trade-offs regarding training effort vs.
easy deployment and precision vs. recall of the results. Scaling up the IE machinery for higher throughput without
sacrificing quality is a formidable problem. For example, can IE tools process
all blog postings on the planet at the
same rate they are produced, without
missing relevant facts or producing too
many false positives?;
Expressive ranking. The LM-based
ranking models pursued by Libra and
APriL 2009 | voL. 52 | no. 4 | communicAtionS of the Acm
63