in a sentence. If phrase returns the same result for two relation mentions, they receive the same weight. We explain
weight tying in more detail in Section 3.3. In general, phrase
could be an arbitrary UDF that operates in a per-tuple fashion. This allows DeepDive to support common feature types
ranging from “bag-of-words” to context-aware NLP features
to feature sets incorporating domain-specific dictionaries
and ontologies. In addition to specifying sets of classifiers,
DeepDive inherits Markov Logic’s ability to specify rich correlations between entities via weighted rules. Such rules are
particularly helpful for data cleaning and data integration.
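To give a feel for the semantics of such weighted rules, the following minimal Python sketch scores possible worlds Markov Logic-style: each satisfied rule contributes its weight, and a world's probability is proportional to the exponentiated total. The rule, weight, and variable names here are all hypothetical, not DeepDive's API.

```python
import itertools
import math

# Minimal sketch (not DeepDive's actual implementation) of a
# Markov Logic-style model with one weighted correlation rule:
# each satisfied rule adds its weight, and P(world) is
# proportional to exp(total weight).

def labels_agree(world):
    # Hypothetical correlation rule: two mention pairs for the same
    # entities should receive the same Married label.
    return world["pair1_married"] == world["pair2_married"]

weighted_rules = [(2.0, labels_agree)]

def score(world):
    return math.exp(sum(w for w, rule in weighted_rules if rule(world)))

worlds = [{"pair1_married": a, "pair2_married": b}
          for a, b in itertools.product([True, False], repeat=2)]
Z = sum(score(w) for w in worlds)  # partition function

def prob(world):
    return score(world) / Z
```

With weight 2.0, worlds where the two labels agree are roughly 7x more likely than disagreeing ones; learning adjusts such weights from evidence.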
Supervision. Just as in Markov Logic, DeepDive can use
training data or evidence about any relation; in particular,
each user relation is associated with an evidence relation
with the same schema and an additional field that indicates
whether the entry is true or false. Continuing our example,
the evidence relation MarriedMentions_Ev could contain
mention pairs with positive and negative labels. Operationally,
two standard techniques generate training data: (1) hand-labeling and (2) distant supervision, which we illustrate here.
Example 3.4. Distant supervision19,30 is a popular technique to
create evidence in KBC systems. The idea is to use an incomplete KB
of married entity pairs to heuristically label (as True evidence)
all relation mentions that link to a pair of married entities:
(S1) MarriedMentions_Ev(m1, m2, true) :-
     MarriedCandidate(m1, m2), EL(m1, e1),
     EL(m2, e2), Married(e1, e2).
Here, Married is an (incomplete) list of married real-world
persons that we wish to extend. The relation EL is for “entity
linking” that maps mentions to their candidate entities. At first
blush, this rule seems incorrect. However, it generates noisy,
imperfect examples of sentences that indicate two people are
married. Machine learning techniques are able to exploit
redundancy to cope with the noise and learn the relevant
phrases (e.g., and his wife). Negative examples are generated
by relations that are largely disjoint (e.g., siblings). Similar to
DIPRE4 and Hearst patterns,18 distant supervision exploits the
“duality”4 between patterns and relation instances; furthermore,
it allows us to integrate this idea into DeepDive’s unified
probabilistic framework.
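The effect of rule (S1) can be sketched in a few lines of Python. All data below is invented: married_kb stands in for the Married KB, entity_links for EL, and a largely disjoint sibling relation supplies the negative examples, as described above.

```python
# Toy sketch of the distant-supervision rule (S1); names mirror the
# rules in the text but the data is made up.

married_kb = {("BarackObama", "MichelleObama")}   # incomplete KB to extend
sibling_kb = {("MaliaObama", "SashaObama")}       # disjoint relation
entity_links = {"M1": "BarackObama", "M2": "MichelleObama",
                "M3": "MaliaObama", "M4": "SashaObama"}
candidates = [("M1", "M2"), ("M3", "M4")]         # MarriedCandidate pairs

def label(m1, m2):
    e1, e2 = entity_links[m1], entity_links[m2]
    if (e1, e2) in married_kb or (e2, e1) in married_kb:
        return True        # noisy positive evidence
    if (e1, e2) in sibling_kb or (e2, e1) in sibling_kb:
        return False       # negative from a largely disjoint relation
    return None            # left unlabeled

evidence = [(m1, m2, label(m1, m2)) for m1, m2 in candidates]
```

Here the Obama spouse mention pair becomes a (noisy) positive example and the sibling pair a negative one; unmatched candidates stay unlabeled.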
DeepDive first loads input documents and prepares them with standard
NLP preprocessing tools, including HTML stripping, part-of-speech
tagging, and linguistic parsing. After this loading step, DeepDive
executes two types of queries: (1) candidate mappings, which are SQL
queries that produce possible mentions, entities, and relations, and
(2) feature extractors, which associate features to candidates, for
example, “. . . and his wife . . .” in Example 3.1.
Example 3.2. Candidate mappings are usually simple. Here,
we create a relation mention for every pair of candidate persons
in the same sentence (s):
(R1) MarriedCandidate(m1, m2) :-
     PersonCandidate(s, m1), PersonCandidate(s, m2).
Candidate mappings are simply SQL queries with UDFs
that look like low-precision but high-recall extract–transform–
load (ETL) scripts. Such rules must be high recall: if the
union of candidate mappings misses a fact, DeepDive has
no chance to extract it.
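A toy Python rendering of what rule (R1) computes may help; the data is invented, and a real candidate mapping would run as SQL over the database rather than in memory.

```python
from itertools import permutations

# Toy rendering of candidate mapping (R1): emit a MarriedCandidate
# pair for every ordered pair of person mentions in the same sentence.
# PersonCandidate rows are (sentence_id, mention_id); data is invented.

person_candidates = [("S1", "M1"), ("S1", "M2"), ("S2", "M3")]

def married_candidates(rows):
    by_sentence = {}
    for sid, mid in rows:
        by_sentence.setdefault(sid, []).append(mid)
    # permutations(..., 2) keeps m1 != m2 but applies no other filter,
    # so, like (R1), this stays low-precision but high-recall.
    return [(m1, m2)
            for mids in by_sentence.values()
            for m1, m2 in permutations(mids, 2)]
```

Sentence S1 yields both orderings of its two person mentions; S2, with a single mention, yields nothing.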
We also need to extract features, and we extend classical
Markov logic11 in two ways: (1) user-defined functions (UDFs)
and (2) weight tying, which we illustrate by example.
Example 3.3. Suppose that phrase(m1, m2, sent) returns
the phrase between two mentions in the sentence, for example,
“and his wife” in the above example. The phrase between two
mentions may indicate whether two people are married. We
would write this as:
(FE1) MarriedMentions(m1, m2) :-
      MarriedCandidate(m1, m2), Mention(s, m1),
      Mention(s, m2), Sentence(s, sent)
      weight = phrase(m1, m2, sent).
One can think about this as a classifier: This rule says that
whether the text indicates that the mentions m1 and m2 are
married is influenced by the phrase between those mention
pairs. The system will infer, based on training data, its confidence (by estimating the weight) that two mentions are indeed
indicated to be married.
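To make the weight-tying intuition concrete, here is a small hypothetical Python sketch: phrase returns the token span between two mentions, every mention pair with the same span shares a single weight, and a logistic link turns that weight into a confidence. None of these names or values come from DeepDive's actual implementation.

```python
import math

# Hypothetical sketch of the phrase UDF and weight tying in (FE1).
# Mention pairs with the same intervening phrase share one weight.

def phrase(m1_pos, m2_pos, tokens):
    # Identifier: the tokens strictly between the two mention positions.
    lo, hi = sorted((m1_pos, m2_pos))
    return " ".join(tokens[lo + 1:hi])

# One (learned) weight per phrase identifier, tied across mention pairs.
weights = {"and his wife": 1.5, "and his sister": -2.0}

def married_confidence(m1_pos, m2_pos, tokens):
    # Logistic link: an unseen phrase gets weight 0.0 -> confidence 0.5.
    w = weights.get(phrase(m1_pos, m2_pos, tokens), 0.0)
    return 1.0 / (1.0 + math.exp(-w))

sent = ["Barack", "and", "his", "wife", "Michelle"]
```

Every sentence containing ". . . and his wife . . ." between two person mentions reuses the same tied weight, which is what lets training data for one sentence inform predictions for all the others.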
Technically, phrase returns an identifier that determines
which weights should be used for a given relation mention
Figure 7. An example KBC system (see Section 3.2 for details). [Figure: example tables Sentence, PersonCandidate, Mentions, MarriedCandidate, MarriedMentions_Ev, EL, and Married, with (1b) the structured output relation HasSpouse, (3a) the candidate mapping and feature extraction rules (R1) and (FE1), and (3b) the supervision rule (S1).]