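Figure 3: Example queries for the YAGO knowledge base.
[Figure not reproduced. Recoverable labels: nodes Max Planck, Angela Merkel, Jim Gray, Germany, Politician, Scientist, Nobel Prize; variables $b, $c, $d, $x, $y; edges isa, hasWon, bornOn, diedOn, fatherOf; the path expression (bornIn|livesIn|citizenOf).locatedIn*; wildcard edges (*); and date comparisons (<, >) against July 28, 1914 and Sept. 2, 1945.]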
KnowItAll and TextRunner are examples of Statistical-Web methods for
large-scale knowledge acquisition.
YAGO for Large-Scale Semantic Knowledge
Our YAGO project [23, 24] shares the
KnowItAll and TextRunner goal of
large-scale knowledge harvesting but
emphasizes high accuracy and consistency rather than high recall
(coverage). YAGO is best characterized as
a Semantic-Web approach, gathering
its knowledge by (primarily) integrating information from Wikipedia and
WordNet. It also employs text-mining-based techniques. YAGO contains
close to two million entities and about
20 million facts about them, where
facts are instances of binary relations.
Extensive sampling has shown that
YAGO's accuracy is at least 95%, and
many of its errors (false positives) are
due to incorrect entries in Wikipedia
itself. YAGO is publicly available at
www.mpi-inf.mpg.de/yago/.
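
To make "facts are instances of binary relations" concrete, here is a minimal sketch of how such facts could be held in memory. The `Fact` and `FactStore` names are hypothetical and are not part of YAGO; the relation names follow the labels in Figure 3, and the sample facts are illustrative.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    """One instance of a binary relation: relation(subject, obj)."""
    subject: str
    relation: str
    obj: str


class FactStore:
    """Hypothetical in-memory store, indexed by subject (not YAGO's actual format)."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def add(self, subject, relation, obj):
        self._by_subject[subject].add(Fact(subject, relation, obj))

    def objects(self, subject, relation):
        """All objects o such that relation(subject, o) is a stored fact."""
        facts = self._by_subject.get(subject, ())
        return sorted(f.obj for f in facts if f.relation == relation)


store = FactStore()
store.add("Max_Planck", "bornIn", "Kiel")
store.add("Max_Planck", "hasWon", "Nobel_Prize")
store.add("Max_Planck", "isA", "Physicist")

print(store.objects("Max_Planck", "bornIn"))
```

Storing every fact as a (subject, relation, object) triple keeps the schema uniform, so new relations can be added without changing the store.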
Two Wikipedia assets—infoboxes
and the category system—are almost
structured data. Infoboxes are collections of attribute name-value pairs
often based on templates and reused
for important types of entities (such
as countries, companies, scientists,
music bands, and sports teams). For
example, the infobox for Max Planck
delivers such data as birth_date = April
23, 1858, birth_place = Kiel, death_date
= October 4, 1947, nationality = Germany, and alma_mater = Ludwig-Maximilians-Universität München. As for
the category system, the Max Planck
article is manually placed in such categories as German_Nobel_laureates,
Nobel_laureates_in_physics, quantum_physics, and University_of_Munich_alumni.
All give YAGO clues about instanceOf relations, so it can infer that
the entity Max Planck is an instance of
the classes GermanNobelLaureates,
NobelLaureatesInPhysics, and UniversityOfMunichAlumni. But YAGO
must be careful, as the placement in
category quantum_physics does not
mean that Max Planck is an instance
of QuantumPhysics. The YAGO extractors employ linguistic processing
(noun phrase parsing) and mapping
rules to achieve high accuracy in harvesting category information.
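
The distinction between class-like categories (German_Nobel_laureates) and topical ones (quantum_physics) can be illustrated with a deliberately crude sketch. It assumes, purely for illustration, that a category denotes a class when its head noun looks plural. The preposition list, the plural test, and all function names are hypothetical simplifications, not YAGO's actual extractor.

```python
# Toy heuristic: a category names a class if its head noun looks plural.
PREPOSITIONS = {"in", "of", "by", "from", "at", "for", "on"}
# Endings that finish in "s" but usually do not mark a plural (e.g. "physics").
NON_PLURAL_ENDINGS = ("ics", "ss", "us", "is")


def head_noun(category: str) -> str:
    """Guess the head: the token before the first preposition, else the last token."""
    tokens = category.split("_")
    for i, tok in enumerate(tokens):
        if i > 0 and tok.lower() in PREPOSITIONS:
            return tokens[i - 1]
    return tokens[-1]


def looks_plural(word: str) -> bool:
    """Very rough plural test; irregular plurals such as 'alumni' are missed."""
    w = word.lower()
    return w.endswith("s") and not w.endswith(NON_PLURAL_ENDINGS)


def category_to_class(category: str):
    """Return a CamelCase class name, or None if the category looks topical."""
    if not looks_plural(head_noun(category)):
        return None
    return "".join(tok[:1].upper() + tok[1:] for tok in category.split("_"))


for cat in ("German_Nobel_laureates", "Nobel_laureates_in_physics",
            "quantum_physics", "University_of_Munich_alumni"):
    print(cat, "->", category_to_class(cat))
```

The toy rule accepts the two laureate categories and rejects quantum_physics, but it also wrongly rejects University_of_Munich_alumni: it picks "University" as the head and does not recognize the irregular plural "alumni". Cases like this are why real noun phrase parsing and mapping rules are needed.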
These examples of YAGO information extraction indicate that relying
solely on Wikipedia infoboxes and categories may result in a large but incoherent collection of facts. For example,
we may know that Max Planck is an instance of GermanNobelLaureates but
be unable to automatically infer that
he is also an instance of Germans and
of Nobel Laureates. Likewise, the fact
that he was a physicist does not automatically tell us he was a scientist. To
address these shortcomings, YAGO
makes intensive use of the WordNet
thesaurus (lightweight ontology), integrating the facts it harvests from Wikipedia with the taxonomic backbone
provided by WordNet.
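
The kind of inference that a taxonomic backbone enables can be sketched as follows. This is a minimal illustration, assuming hand-written `SUBCLASS_OF` and `INSTANCE_OF` tables (both hypothetical). In YAGO the subclass edges would come from WordNet and the category mapping rather than from a literal dictionary.

```python
# Hypothetical toy taxonomy: class -> direct superclasses.
SUBCLASS_OF = {
    "GermanNobelLaureates": {"Germans", "NobelLaureates"},
    "Physicists": {"Scientists"},
}

# Hypothetical direct class memberships harvested for one entity.
INSTANCE_OF = {
    "Max_Planck": {"GermanNobelLaureates", "Physicists"},
}


def all_classes(entity: str) -> set:
    """Direct classes of the entity plus everything reachable via subclass edges."""
    seen = set()
    stack = list(INSTANCE_OF.get(entity, ()))
    while stack:
        cls = stack.pop()
        if cls not in seen:
            seen.add(cls)
            stack.extend(SUBCLASS_OF.get(cls, ()))
    return seen


print(sorted(all_classes("Max_Planck")))
```

With the subclass edges in place, membership in Germans, NobelLaureates, and Scientists follows from the two harvested facts alone. Without the taxonomy, none of the three could be inferred.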
While WordNet knows many abstract classes and the “is-a” and “part-of”
relationships among them, it has
only sparse information about individual entities that would populate
its classes. The wealth of entities in
Wikipedia complements WordNet
nicely; conversely, the rigor and extensive coverage of WordNet’s taxonomy
compensate for the gaps and noise in
the Wikipedia category system. Each
individual entity YAGO discovers must
be mapped into at least one existing
YAGO class. If this fails, the entity
and its related facts are not admitted
into the knowledge base. Analogously,
classes derived from Wikipedia category names (such as GermanNobelLaureates) must be mapped with a
subclass relationship to one or more
superclasses (such as NobelLaureates and Germans). These procedures
ensure that YAGO maintains a consistent knowledge base, where consistency eliminates dangling entities
and classes and guarantees that the
subclass relation is acyclic.
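
The admission rules described above can be sketched as a small gatekeeper. It rejects entities with no known class, rejects classes with no known superclass, and refuses any subclass edge that would close a cycle. The `Taxonomy` class, its method names, and the `Entity` root are hypothetical. This is one possible way to enforce the stated invariants, not YAGO's implementation.

```python
class Taxonomy:
    """Toy knowledge base that enforces the consistency rules on insertion."""

    def __init__(self, root="Entity"):
        self.superclasses = {root: set()}  # class -> direct superclasses
        self.instance_of = {}              # entity -> classes

    def _is_ancestor(self, candidate, cls):
        """True if candidate is cls itself or reachable from cls via superclass edges."""
        stack, seen = [cls], set()
        while stack:
            current = stack.pop()
            if current == candidate:
                return True
            if current not in seen:
                seen.add(current)
                stack.extend(self.superclasses.get(current, ()))
        return False

    def add_class(self, name, supers):
        """Admit a class only if it maps to a known superclass without a cycle."""
        known = {s for s in supers if s in self.superclasses}
        if not known:
            return False  # would be a dangling class
        if any(self._is_ancestor(name, s) for s in known):
            return False  # edge would make the subclass relation cyclic
        self.superclasses.setdefault(name, set()).update(known)
        return True

    def add_entity(self, entity, classes):
        """Admit an entity only if at least one of its classes already exists."""
        known = {c for c in classes if c in self.superclasses}
        if not known:
            return False  # would be a dangling entity
        self.instance_of.setdefault(entity, set()).update(known)
        return True


tax = Taxonomy()
tax.add_class("Person", ["Entity"])
tax.add_class("Germans", ["Person"])
tax.add_class("NobelLaureates", ["Person"])
tax.add_class("GermanNobelLaureates", ["Germans", "NobelLaureates"])
print(tax.add_entity("Max_Planck", ["GermanNobelLaureates"]))  # admitted
print(tax.add_entity("Orphan", ["NoSuchClass"]))               # rejected
print(tax.add_class("Person", ["GermanNobelLaureates"]))       # rejected: cycle
```

Checking at insertion time keeps the invariants true at every step, so no separate clean-up pass over the whole knowledge base is needed.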
Kylin/KOG. The “Intelligence in
Wikipedia” project also extracts information from Wikipedia through
its tools Kylin [27] and Kylin Ontology
Generator (KOG) [26]. Whenever an infobox type includes an attribute in
some articles but the attribute has
no value for a given article, Kylin
analyzes the full text of the article
to derive the most likely value. Like
KnowItAll and TextRunner (but unlike Libra, Cimple, and YAGO), Kylin
pursues open extraction by considering all potentially significant attributes, even if they occur only sparsely
in the entire Wikipedia corpus. KOG
builds on Kylin’s output, unifies attribute names, derives type signatures,
and (like YAGO) maps these entities
onto the WordNet taxonomy through
statistical relational learning [15]. KOG
goes beyond YAGO by discovering
new relationship types. It builds on
the class system of both YAGO and
DBpedia [4], along with the entities in
62 Communications of the ACM | April 2009 | Vol. 52 | No. 4