The software got 29 out of 30 letters correct, missing only one “rare letter.” It recovered 60% of the Ugaritic words that have Hebrew cognates: words with a similar meaning that descend from a common ancestor and therefore have a similar pronunciation.
Snyder believes the project illustrates how the field of linguistics will undergo a shift toward greater use of computational methods, but those methods “will be guided by our knowledge of linguistics and what are the relevant features of language to look at. So, in a sense, the design of the algorithm still needs to be guided by human linguistic knowledge.”
Knight concurs that the future of linguistics will certainly depend on software, and on how much of the data about languages can be assembled online and made available to computers.
“The big enabler will be getting all
this data online, organizing the databases, and allowing computers to
analyze it all,” he says. “There’s a clear
parallel here to DNA sequencing and
biological data analysis. Computers
have totally taken over in that area of
biological classification and, I predict,
they’ll totally take over in the area of
linguistic reconstruction, for sure.”
As for Dan Klein and his NSF-funded reconstruction efforts, he is preparing for the next steps, which include further scaling up the current models so that he and his team can reconstruct even further back than the 7,000 years they’ve been able to reach so far. That’s a matter of gathering more data (larger collections of languages that are even more distantly related) and, at the same time, tweaking the software, since the further back you want to go, the better the models have to be.
For example, he would like to feed
in all the languages of the world and
then draw inferences about what their
roots looked like.
“Obviously, because so much data
is involved, it will require computation, not just manual work,” says
Klein. “But a historical linguist would
observe that what we are attempting
is not to automate what people have
been doing by hand, because people
are very good at the kind of research
they do by hand. They would say that
tools like ours give us a way to answer
new kinds of questions that are impractical to answer by hand.”
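To make the reconstruction task concrete, here is a deliberately simplified sketch, not the probabilistic model of sound change Klein’s group actually uses: given pre-aligned cognates from several hypothetical daughter languages, it guesses each position of the ancestral word by majority vote. Real systems model sound changes statistically along a family tree, but the inputs and outputs have the same flavor.

    # Toy protoform reconstruction: majority vote over pre-aligned cognates.
    # Purely illustrative; NOT the probabilistic sound-change model used by
    # Klein and his colleagues.
    from collections import Counter

    def reconstruct_protoform(aligned_cognates):
        """aligned_cognates: equal-length strings, with '-' marking a gap."""
        protoform = []
        for position in zip(*aligned_cognates):
            sounds = [s for s in position if s != "-"]  # ignore gaps
            if sounds:
                protoform.append(Counter(sounds).most_common(1)[0][0])
        return "".join(protoform)

    # Hypothetical, pre-aligned reflexes of one ancestral word.
    print(reconstruct_protoform(["pater", "pader", "fater", "pate-"]))  # -> "pater"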
Further Reading
Bouchard-Côté, A., Hall, D., Griffiths, T., and Klein, D.
“Automated Reconstruction of Ancient Languages Using Probabilistic Models of Sound Change,” March 12, 2013, Proceedings of the National Academy of Sciences of the United States of America, http://www.pnas.org/content/110/11/4224
Kim, Y., and Snyder, B.
“Unsupervised Consonant-Vowel Prediction Over Hundreds of Languages,” to be published at the summer 2013 Association for Computational Linguistics Conference, http://pages.cs.wisc.edu/~bsnyder/papers/consvowel-acl2013.pdf
Snyder, B., Barzilay, R., and Knight, K.
“A Statistical Model for Lost Language Decipherment,” July 13, 2010, the 2010 Association for Computational Linguistics Conference, http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
Bouchard-Côté, A., Griffiths, T., and Klein, D.
“Improved Reconstruction of Protolanguage Word Forms,” May 31, 2009, the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, http://www.aclweb.org/anthology-new/N/N09/N09-1008.pdf
Hall, D., and Klein, D.
“Large-Scale Cognate Recovery,” July 27, 2011, the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP ’11), http://www.aclweb.org/anthology-new/D/D11/D11-1032.pdf
“Kevin Knight: Language Translation
and Code-Building” (video), April
18, 2013, https://www.youtube.com/
watch?v=bcfOT-jFazc
Paul Hyman is a science and technology writer based in Great Neck, NY.
© 2013 ACM 0001-0782/13/10 $15.00
and seeing the patterns in that data.
Knight is a senior research scientist at
USC/Information Sciences Institute.
“On the one hand, computers are
much more thorough and much more
patient than people are at searching for
patterns,” he says, “but they only look
for what you tell them to look for. If the
text uses some other method of encoding that you didn’t tell the computer
about, it’s not going to find an answer.
Humans, on the other hand, are much
better at this kind of flexible pattern-
matching and adapting.”
For example, he says, there are multitudes of ways to write the letter ‘A’ in
English, including capital, lower-case,
cursive, and so on.
“I could show you 50 different ways
and you would look at them and say,
‘yeah, that’s right, they are all A’s,’ different from each other but recognizable as A’s,” he says. “But while humans
can do that naturally, it’s difficult to
program computers to do it—although
they are getting much better at it.”
He predicts that the decision whether to do linguistic work by hand or with software will depend on the specific issue under consideration, “although in much
of our work, a joint human-computer
team tends to be the best way to go.”
For instance, three years ago, Knight
teamed up with Ben Snyder and Regina Barzilay, an associate professor in
MIT’s Computer Science and Artificial
Intelligence Lab, to present the paper
“A Statistical Model for Lost Language
Decipherment” at ACL 2010. The paper demonstrated how some of the
logic and intuition of human linguists
can be successfully modeled, allowing
computational tools to be used in the
decipherment process.
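As a rough illustration of one such intuition, and not the actual model from the 2010 paper, consider a single statistical cue: symbols that are frequent in the lost script should tend to correspond to letters that are frequent in a related, known language. The sketch below (with made-up stand-in strings) simply pairs symbols by frequency rank.

    # Toy decipherment cue: match symbols of an unknown script to letters of
    # a related known language by frequency rank. The 2010 model is far more
    # sophisticated; this only illustrates the kind of regularity it exploits.
    from collections import Counter

    def frequency_rank_mapping(lost_text, known_text):
        lost_ranked = [c for c, _ in Counter(lost_text).most_common()]
        known_ranked = [c for c, _ in Counter(known_text.replace(" ", "")).most_common()]
        return dict(zip(lost_ranked, known_ranked))

    # Hypothetical stand-ins for an undeciphered corpus and a related corpus.
    print(frequency_rank_mapping("abacabadaba", "banana bread and bananas"))

Frequency alone is far too weak to decipher a real script; the point is only that regularities like these are the kind of evidence a statistical model can weigh systematically.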
“Through a painstaking trial-and-error process, scholars were able to decipher Ugaritic, a 3,000- to 5,000-year-old language from ancient Syria known
almost only in the form of writings
from ruins,” says Knight. “It took them
five years to do that by hand.”
In their 2010 paper, Knight and
his co-researchers described how it
took them six months to develop a
computational model to do the same
task—and about an hour to run it and
achieve results.
The team then evaluated their software by comparing those results to what
the linguists had achieved by hand.