sume a wide range of weak supervision
techniques and effectively give non-machine-learning experts a simple
way to “program” ML models. Moreover, Snorkel automatically learns the
accuracies of the LFs and reweights
their outputs using statistical modeling techniques, effectively denoising
the training data, which can then be
used to supervise the KBC system. In
this paper, the authors demonstrate
that Snorkel improves over prior weak
supervision approaches by enabling
the easy use of many noisy sources,
and comes within several percentage
points of the performance achieved
with massive hand-labeled training sets, showing
the efficacy of weak supervision for
making high-performance KBC systems faster and easier to develop.
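To make the labeling-function abstraction concrete, the following is a minimal, framework-free sketch in Python. The toy task (labeling whether a sentence expresses a Likes(X, Y) relation) and the LF heuristics are invented for illustration, and a simple majority vote stands in for Snorkel's label model, which instead learns each LF's accuracy and outputs probabilistic labels.

```python
# A minimal, framework-free sketch of the labeling-function (LF) idea.
# The LF names and toy task are invented for illustration; Snorkel itself
# replaces the majority vote below with a generative model that learns
# each LF's accuracy and produces probabilistic training labels.
import re
import numpy as np

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_contains_likes(sentence: str) -> int:
    """Heuristic: the pattern 'X likes Y' suggests a positive example."""
    return POSITIVE if re.search(r"\b\w+ likes \w+\b", sentence) else ABSTAIN

def lf_contains_negation(sentence: str) -> int:
    """Heuristic: an explicit negation suggests a negative example."""
    return NEGATIVE if re.search(r"\b(never|doesn't|not)\b", sentence) else ABSTAIN

def lf_too_short(sentence: str) -> int:
    """Heuristic: very short strings rarely express a full relation."""
    return NEGATIVE if len(sentence.split()) < 3 else ABSTAIN

LFS = [lf_contains_likes, lf_contains_negation, lf_too_short]

def label_matrix(sentences):
    """Apply every LF to every sentence: an (n_examples, n_LFs) matrix."""
    return np.array([[lf(s) for lf in LFS] for s in sentences])

def majority_vote(L):
    """Stand-in for Snorkel's label model: ignore abstains, take the
    majority label, and abstain on ties or if every LF abstains."""
    labels = []
    for row in L:
        votes = row[row != ABSTAIN]
        if len(votes) == 0:
            labels.append(ABSTAIN)
            continue
        pos, neg = np.sum(votes == POSITIVE), np.sum(votes == NEGATIVE)
        labels.append(POSITIVE if pos > neg else NEGATIVE if neg > pos else ABSTAIN)
    return np.array(labels)

sentences = ["Alice likes Bob", "Alice never liked Bob", "Hi"]
L = label_matrix(sentences)
print(majority_vote(L))  # noisy training labels for a downstream KBC model
```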
Embeddings: Representation and
Incorporation of Distributed Knowledge
S. Riedel, L. Yao, A. McCallum, and B.M. Marlin
Relation extraction with matrix factorization
and universal schemas. In Proceedings of
the Conference of the North American Chapter
of the Association for Computational Linguistics–
Human Language Technologies, 2013, 74–84.
Finally, a critical decision in KBC is
how to represent data: both the input
unstructured data and the resulting
output constituting the knowledge
base. In both KBC and more general
ML settings, the use of dense vector
embeddings to represent input data,
especially text, has become an omnipresent tool.12 For example, word
embeddings, learned by applying
PCA (principal component analysis)
or some approximate variant to large
unlabeled corpora, can inherently represent meaningful semantics of text
data, such as synonymy, and serve as a
powerful but simple way to incorporate
statistical knowledge from large corpora. Increasingly sophisticated types
of embeddings, such as hyperbolic,14
multimodal, and graph5 embeddings,
can provide powerful boosts to end-system performance in an expanded
range of settings.
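To make the PCA-style route to word embeddings concrete, here is a small sketch: build a word–word co-occurrence matrix from a toy corpus, factor it with a truncated SVD (an approximate variant of PCA over corpus statistics), and compare words by cosine similarity. The corpus, context-window size, and embedding dimension are all invented for illustration.

```python
# A minimal sketch of the PCA-style route to word embeddings described in
# the text: factor a word-word co-occurrence matrix with a truncated SVD
# and use the resulting rows as embeddings. The toy corpus and dimension
# are invented for illustration; real systems use billions of tokens and
# refinements such as PPMI weighting or word2vec-style training.
import numpy as np

corpus = [
    "the cat sat on the mat",
    "the kitten sat on the rug",
    "stock prices rose on the market",
]

# Build a symmetric co-occurrence matrix with a +/-2-token window.
vocab = sorted({w for line in corpus for w in line.split()})
idx = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    words = line.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if j != i:
                C[idx[w], idx[words[j]]] += 1.0

# Truncated SVD: keep the top-k singular directions as k-dim embeddings.
k = 4
U, S, _ = np.linalg.svd(C, full_matrices=False)
E = U[:, :k] * S[:k]  # scale each direction by its singular value

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words used in similar contexts ('cat'/'kitten') end up closer than
# words from unrelated contexts ('cat'/'market').
print(cosine(E[idx["cat"]], E[idx["kitten"]]))
print(cosine(E[idx["cat"]], E[idx["market"]]))
```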
In their paper, Riedel et al. provide
an interesting perspective by showing
how embeddings can also be used to
represent the knowledge base itself.
In traditional KBC, an output schema
(that is, which types of relations are to
be extracted) is selected first and fixed,
which is necessarily a manual process.
Instead, Riedel et al. propose using
dense embeddings to represent the
KB itself and learning these from the
union of all available or potential target schemas.
Moreover, they argue that such
an approach unifies the traditionally separate tasks of extraction and
integration. Generally, extraction is
the process of going from input data
to an entry in the KB—for example,
mapping a text string X likes Y to a
KB relation Likes(X,Y)—while
integration is the task of merging or linking related entities and relations. In
their approach, however, both input
text and KB entries are represented
in the same vector space, so these operations become essentially equivalent. These embeddings can then be
learned jointly and queried for a variety of prediction tasks.
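The following sketch illustrates the universal-schema construction on toy data: a single matrix whose rows are entity pairs and whose columns mix textual surface patterns with KB relations, factored into low-dimensional embeddings. The facts, embedding size, and plain logistic-loss training loop here are invented for illustration; the paper itself trains a ranking objective with sampled negatives at much larger scale. The payoff is the same, though: predicting a KB relation from co-occurrence with a text pattern becomes a dot product in the shared space.

```python
# A minimal sketch of the universal-schema idea: one matrix whose rows are
# entity pairs and whose columns are the *union* of textual surface
# patterns and KB relations, factored into low-dimensional embeddings.
# The toy facts and logistic/SGD loop are invented for illustration.
import numpy as np

pairs = ["(Alice,Bob)", "(Carol,Dave)", "(Erin,Frank)"]
# One unified column space: text patterns and KB relations side by side.
cols = ["text:'X likes Y'", "kb:Likes", "text:'X married Y'", "kb:SpouseOf"]

# Training cells: 1 = fact observed, 0 = assumed negative. The two cells
# we want the model to *predict* for (Carol,Dave) are deliberately held out.
train = {
    (0, 0): 1, (0, 1): 1, (0, 2): 0, (0, 3): 0,  # Alice-Bob: likes
    (1, 0): 1, (1, 2): 0,                        # Carol-Dave: text only
    (2, 0): 0, (2, 1): 0, (2, 2): 1, (2, 3): 1,  # Erin-Frank: married
}

rng = np.random.default_rng(0)
k, lr = 2, 0.5                    # low rank is what forces generalization
P = rng.normal(scale=0.1, size=(len(pairs), k))  # entity-pair embeddings
R = rng.normal(scale=0.1, size=(len(cols), k))   # relation embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Logistic matrix factorization by SGD over the observed cells.
for _ in range(2000):
    for (i, j), y in train.items():
        g = sigmoid(P[i] @ R[j]) - y             # logistic-loss gradient
        P[i], R[j] = P[i] - lr * g * R[j], R[j] - lr * g * P[i]

# Extraction and integration collapse into the same dot-product query:
# having seen (Carol,Dave) only with the 'X likes Y' text pattern, the
# model assigns kb:Likes a high score and kb:SpouseOf a low one.
print(f"{pairs[1]} {cols[1]}: {sigmoid(P[1] @ R[1]):.2f}")
print(f"{pairs[1]} {cols[3]}: {sigmoid(P[1] @ R[3]):.2f}")
```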
KBC Becoming More Accessible
This article has reviewed approaches to three critical design points of
building a modern KBC system and
how they have the potential to accelerate the KBC process: coupling
multiple component models to learn
them jointly; using weak supervision
to supervise these models more efficiently and flexibly; and choosing a
dense vector representation for the
data. While ML-based KBC systems
are still large and complex, one practical benefit of today’s interest and
investment in ML is the plethora of
state-of-the-art models for various
KBC subtasks available in open
source, and well-engineered frameworks such as PyTorch and TensorFlow with which to run them. Together with techniques and systems
for putting all the pieces together,
like those reviewed here, high-performance KBC is becoming more accessible than ever.
1. Bunescu, R.C., Mooney, R.J. Learning to extract
relations from the Web using minimal supervision.
In Proceedings of the 45th Annual Meeting Assoc.
Computational Linguistics, 2007, 576–583.
2. Cafarella, M. J., Downey, D., Soderland, S., Etzioni, O.
KnowItNow: Fast, scalable information extraction
from the Web. In Proceedings of the Conf. Human
Language Technology and Empirical Methods in
Natural Language Processing, 2005, 563–570.
3. Caruana, R. Multitask learning: A knowledge-based
source of inductive bias. In Proceedings of the 10th
Intern. Conf. Machine Learning, 1993, 41–48.
4. Dong, X. et al. Knowledge Vault: A Web-scale approach
to probabilistic knowledge fusion. In Proceedings
of the 20th ACM SIGKDD Intern. Conf. Knowledge
Discovery and Data Mining, 2014, 601–610.
5. Grover, A. and Leskovec, J. node2vec: Scalable feature
learning for networks. In Proceedings of the 22nd ACM
SIGKDD Intern. Conf. Knowledge Discovery and Data
Mining, 2016, 855–864.
6. Hoffmann, R., Zhang, C., Ling, X., Zettlemoyer, L.,
Weld, D.S. Knowledge-based weak supervision for
information extraction of overlapping relations.
In Proceedings of the 49th Annual Meeting of the
Assoc. Computational Linguistics–Human Language
Technologies, 1, 2011, 541–550.
7. Lehmann, J. et al. DBpedia—A large-scale,
multilingual knowledge base extracted from
Wikipedia. Semantic Web 6, 2 (2014), 167–195.
8. Mahdisoltani, F., Biega, J. and Suchanek, F.M. YAGO3:
A knowledge base from multilingual Wikipedias. In
Proceedings of the 7th Biennial Conf. Innovative Data
Systems Research, 2015.
9. Mallory, E.K., Zhang, C., Ré, C. and Altman, R.B. Large-scale extraction of gene interactions from full-text
literature using DeepDive. Bioinformatics 32, 1 (2015).
10. Mann, G.S. and McCallum, A. Generalized expectation
criteria for semi-supervised learning with weakly
labeled data. J. Machine Learning Research 11 (Feb.
2010), 955–984.
11. Manning, C. Representations for language: From
word embeddings to sentence meanings. Presented
at the Simons Institute for the Theory of Computing,
UC Berkeley, 2017.
12. Mikolov, T., Chen, K., Corrado, G. and Dean, J. Efficient
estimation of word representations in vector space,
2013; arXiv preprint arXiv:1301.3781.
13. Mintz, M., Bills, S., Snow, R. and Jurafsky, D. Distant
supervision for relation extraction without labeled
data. In Proceedings of the Joint Conf. 47th Annual
Meeting of the Assoc. Computational Linguistics and
the 4th Conf. Asian Federation of Natural Language
Processing, 2009, 1003–1011.
14. Nickel, M. and Kiela, D. Poincaré embeddings for
learning hierarchical representations. Advances in
Neural Information Processing Systems 30 (2017).
15. Ratner, A., Bach, S., Varma, P. and Ré, C. Weak
supervision: the new programming paradigm for
machine learning. Hazy Research; https://hazyresearch.
16. Ren, X., He, W., Qu, M., Voss, C. R., Ji, H., Han, J. Label
noise reduction in entity typing by heterogeneous
partial-label embedding. In Proceedings of the 22nd
ACM SIGKDD Intern. Conf. Knowledge Discovery and
Data Mining, (2016), 1825–1834.
17. Ruder, S. An overview of multi-task learning in
deep neural networks, 2017; arXiv preprint arXiv:1706.05098.
18. Zhang, C., Ré, C., Cafarella, M., De Sa, C., Ratner,
A., Shin, J., Wang, F., Wu, S. DeepDive: Declarative
knowledge base construction. Commun. ACM 60, 5
(May 2017), 93–102.
19. Zhang, C., Shin, J., Ré, C., Cafarella, M. and Niu, F.
Extracting databases from dark data with DeepDive.
In Proceedings of the Intern. Conf. Management of
Data, 2016, 847–859.
Alex Ratner is a Ph.D. candidate in computer science
at Stanford University, advised by Chris Ré, where his
research focuses on weak supervision—using higher-level,
noisier input from domain experts to train complex state-of-the-art models where limited hand-labeled training
data is available. He leads the development of the Snorkel
framework for weakly supervised ML, which has been
applied to KBC problems in domains such as genomics,
clinical diagnostics, and political science. He is supported
by a Stanford Bio-X SIGF fellowship.
Christopher Ré is an associate professor of computer
science at Stanford University. His work focuses on
enabling users and developers to build applications that
more deeply understand and exploit data. Work from his
group has been incorporated into major scientific and
humanitarian efforts, including the IceCube neutrino
detector, PaleoDeepDive, and MEMEX in the fight against
human trafficking, and into commercial products from
major Web and enterprise companies.
Copyright held by owners/authors.
Publication rights licensed to ACM. $15.00.