reasoning required to successfully answer these example questions. Question-answering systems developed for
the message-understanding conferences [6] and text-retrieval conferences [13] have
historically focused on retrieving answers from text, the former from news-wire articles, the latter from various
large corpora (such as the Web, micro-blogs, and clinical data). More recent
work has focused on answer retrieval from structured data (such as answering “In which
city was Bill Clinton born?” from Freebase, a large, publicly available collaborative knowledge base) [4, 5, 15]. However,
these systems rely on the information being stated explicitly in the underlying data and are unable to perform the
reasoning steps that would be required to infer this information from indirect supporting evidence.
A few systems attempt some form of reasoning: Wolfram Alpha [14] answers
mathematical questions, provided they are stated either as equations or in
relatively simple English; Evi [10] is able to combine facts to answer simple
questions (such as “Who is older: Barack or Michelle Obama?”); and START [8]
likewise is able to answer simple inference questions (such as “What South
American country has the largest population?”) using Web-based databases.
However, none of them attempts the level of complex question processing
and reasoning required to successfully answer many of the science
questions in the Allen AI Challenge.
As the 2015 Allen AI Science Challenge
demonstrated, achieving a high score
on a science exam requires a system
that can do more than sophisticated
information retrieval. Project Aristo at
AI2 is focused on the problem of successfully demonstrating artificial intelligence using standardized science
exams, developing an assortment of approaches to address the challenge. AI2
plans to release additional datasets and
software for the wider AI research community in this effort [1].
1. Allen Institute for Artificial Intelligence. Datasets;
2. Aron, J. Software tricks people into thinking it is human.
New Scientist 2829 (Sept. 6, 2011).
3. BBC News. Computer AI passes Turing Test in ‘world
first.’ BBC News (June 9, 2014); http://www.bbc.com/
4. Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic
parsing on Freebase from question-answer pairs. In
Proceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing (Seattle, WA,
Oct. 18–21). Association for Computational Linguistics,
Stroudsburg, PA, 2013, 6.
5. Fader, A., Zettlemoyer, L., and Etzioni, O. Open question
answering over curated and extracted knowledge
bases. In Proceedings of the 20th ACM SIGKDD
International Conference on Knowledge Discovery and
Data Mining (New York, Aug. 24–27). ACM Press, New
6. Grishman, R. and Sundheim, B. Message Understanding
Conference-6: A brief history. In Proceedings of the 16th
Conference on Computational Linguistics (Copenhagen,
Denmark, Aug. 5–9). Association for Computational
Linguistics, Stroudsburg, PA, 1996, 466–471.
7. Kaggle. The Allen AI Science Challenge; https://www.
8. Katz, B., Borchardt, G., and Felshin, S. Natural language
annotations for question answering. In Proceedings
of the 19th International Florida Artificial Intelligence
Research Society Conference (Melbourne Beach, FL,
May 11–13). AAAI Press, Menlo Park, CA, 2006.
9. Marcus, G., Rossi, F., and Veloso, M., Eds. Beyond the
Turing Test. AI Magazine (Special Edition) 37, 1 (Spring
10. Simmons, J. True Knowledge: The natural language
question answering Wikipedia for facts. Semantic Focus
(Feb. 26, 2008); http://www.semanticfocus.com/blog/
11. Turing, A. M. Computing machinery and intelligence.
Mind 59, 236 (Oct. 1950), 433–460.
12. Turk, V. The plan to replace the Turing Test with a
‘Turing Olympics.’ Motherboard (Jan. 28, 2015); https://
13. Voorhees, E. and Ellis, A., Eds. In Proceedings of the
24th Text REtrieval Conference (Gaithersburg, MD, Nov.
17–20). Publication SP 500-319, National Institute of
Standards and Technology, Gaithersburg, MD, 2015.
14. Wolfram, S. Making the world’s data computable.
Stephen Wolfram Blog (Sept. 24, 2010); http://blog.
15. Yao, X. and Van Durme, B. Information extraction over
structured data: Question answering with Freebase.
In Proceedings of the 52nd Annual Meeting of the
Association for Computational Linguistics (Baltimore,
MD, June 22–27). Association for Computational
Linguistics, Stroudsburg, PA, 2014, 956–966.
Carissa Schoenick (firstname.lastname@example.org) is the senior
program manager for Project Aristo at the Allen Institute
for Artificial Intelligence in Seattle, WA.
Peter Clark (email@example.com) is the senior research
manager for Project Aristo at the Allen Institute for
Artificial Intelligence in Seattle, WA.
Oyvind Tafjord (firstname.lastname@example.org) is a senior research
scientist and engineer at the Allen Institute for Artificial
Intelligence in Seattle, WA.
Peter Turney (email@example.com) was a senior
research scientist for Project Aristo at the Allen Institute
for Artificial Intelligence in Seattle, WA, and is now retired.
Oren Etzioni (firstname.lastname@example.org) is the Chief Executive
Officer of the Allen Institute for Artificial Intelligence
in Seattle, WA, and a professor in the Allen School for
Computer Science at the University of Washington in
Copyright held by the authors.
Publication rights licensed to ACM. $15.00
key to achieving scores of 80% and higher and demonstrating what might be
considered true artificial intelligence.
A few other example questions that each
of the top three models got wrong highlight the more interesting, complex nuances of language and chains of reasoning
an AI system must be able to handle to answer correctly, and for which information-retrieval methods are not sufficient:
What do earthquakes tell scientists
about the history of the planet?
(A) Earth’s climate is constantly changing.
(B) The continents of Earth are continually moving.
(C) Dinosaurs became extinct about
65 million years ago.
(D) The oceans are much deeper today than millions of years ago.
This question involves the causes behind
earthquakes and the larger geographic
phenomenon of plate tectonics and is not
easily solved by looking up a single fact.
Additionally, other true facts appear in
the answer options (“Dinosaurs became
extinct about 65 million years ago.”) but
must be intentionally identified and
discounted as incorrect in the context
of the question.
Which statement correctly describes
a relationship between the distance
from Earth and a characteristic of a star?
(A) As the distance from Earth to the
star decreases, its size increases.
(B) As the distance from Earth to the
star increases, its size decreases.
(C) As the distance from Earth to the
star decreases, its apparent brightness
(D) As the distance from Earth to the
star increases, its apparent brightness
This question requires general commonsense knowledge of the physics of
distance and perception, as well as the
semantic ability to relate one statement
to another within each answer option to
find the right directional relationship.
While numerous question-answering
systems have emerged from the AI community, none has addressed the challenges of scientific and commonsense