by first doing the types of detailed language analysis employed in Sampark.
“What one can do in the future is to
first do monolingual analysis of one or
both sides in parallel corpora, and
then use that to improve the quality of
machine learning from the parallel corpora,” he says. “So what we have done
would also be useful if larger parallel
corpora became available tomorrow.”
Another advantage of the transfer
approach, says AT&T’s Bangalore, is its
generalizability. “If you give me a parallel corpus dealing with financial news,
and I train it up with millions of sentences of that sort, and two days later
you say, ‘Translate a sports article,’ it’s
not going to perform as well.”
But that kind of application domain
change has been explicitly anticipated
by Sampark’s developers. The first
version, rolling out now language-by-language, is general purpose and optimized for tourism-related uses, but
it will be made available to large users
who wish to customize it for other domains, says Dipti Sharma, an associate
professor at IIIT-H. That would involve
building a new domain dictionary, incorporating rules that handle domain-specific grammatical structures, and
perhaps retraining some modules, such
as the part-of-speech tagger and the
named-entity recognizer.
The effort required to make those
changes is minimized by building on
the existing multilingual dictionary,
Sharma says. It is sense- or meaning-based, so that in one domain or language “bank,” for example, would
most likely represent a financial institution, while in another it might refer
to the edge of a river. The
dictionary currently allows translation
among nine languages.
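The idea of a sense-based dictionary can be illustrated with a minimal Python sketch. It is not Sampark's actual dictionary format: the sense identifiers, the two-language coverage, the domain preference table, and the `translate_word` helper are all assumptions made for illustration. Each entry is keyed by a meaning rather than by a word, and the domain decides which meaning of an ambiguous word is tried first.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sense:
    """One meaning, with the word that expresses it in each language."""
    gloss: str
    words: Dict[str, str]  # language code -> word


# Hypothetical entries: the same English word appears under two senses.
SENSES: Dict[str, Sense] = {
    "bank.finance": Sense("financial institution", {"en": "bank", "hi": "बैंक"}),
    "bank.river": Sense("edge of a river", {"en": "bank", "hi": "किनारा"}),
}

# Hypothetical domain table: the order in which senses are tried.
DOMAIN_PREFERENCE: Dict[str, List[str]] = {
    "finance": ["bank.finance", "bank.river"],
    "tourism": ["bank.river", "bank.finance"],
}


def translate_word(word: str, src: str, tgt: str, domain: str) -> str:
    """Return the target-language word for the domain's preferred sense."""
    for sense_id in DOMAIN_PREFERENCE.get(domain, list(SENSES)):
        sense = SENSES[sense_id]
        if sense.words.get(src) == word and tgt in sense.words:
            return sense.words[tgt]
    return word  # unknown word: pass it through unchanged


print(translate_word("bank", "en", "hi", "finance"))
print(translate_word("bank", "en", "hi", "tourism"))
```

Under this layout, adapting to a new domain means adding senses and a preference order, while the existing entries are reused as they are.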
Sangal says the language-translation
system has two especially noteworthy
attributes. First, the linguistic analysis
based on Panini is “extremely good,”
he says. “It was initially chosen for Indian languages, but we find it is also
suitable for other languages.” Setting
it up takes hard work at first, he says:
developing standards for parts-of-speech tags and dependency tree labels, and figuring out ways to handle
unique language constructs.
The second attribute of special note
is the system’s software architecture.
It is an open architecture in which all
modules produce output in Shakti
Standard Format (SSF). The architecture allows modules written in different programming languages to be
plugged in. Readability of SSF helps in
development and debugging because
the input and output of any module
can be easily seen. Also, a dashboard
tool supports the architecture in a variety of ways. Custom written, it is
“extremely robust,” Sangal says. “If a module fails to perform a proper analysis,
the next module will still work, albeit
in a degraded mode. So the system never
gives up; it always tries to produce
something.”
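The behavior Sangal describes, in which every module reads and writes one shared format and a failed module does not stop the ones after it, can be sketched roughly as follows. This is an illustration only: a plain Python dictionary stands in for SSF, whose real syntax is not reproduced here, and the module names and the `run_pipeline` function are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]  # stand-in for a shared interchange format
Module = Callable[[Record], Record]


def run_pipeline(modules: List[Tuple[str, Module]], record: Record):
    """Run modules in order; a failing module is skipped, not fatal."""
    failed: List[str] = []
    for name, module in modules:
        try:
            # Each module gets a copy, so a crash cannot corrupt the record.
            record = module(dict(record))
        except Exception:
            failed.append(name)  # keep the last good record and move on
    return record, failed


def tokenize(rec: Record) -> Record:
    rec["tokens"] = str(rec["text"]).split()
    return rec


def broken_tagger(rec: Record) -> Record:
    raise RuntimeError("tagger model missing")  # simulated failure


def generate(rec: Record) -> Record:
    # Works with or without tags, but records the loss of quality.
    rec["output"] = " ".join(rec.get("tokens", []))
    rec["quality"] = "full" if "tags" in rec else "degraded"
    return rec


result, failed = run_pipeline(
    [("tokenizer", tokenize), ("tagger", broken_tagger), ("generator", generate)],
    {"text": "translate this sentence"},
)
print(result["output"], result["quality"], failed)
```

Because the intermediate record is an ordinary readable structure, it can be printed after any stage, which is the debugging benefit the article attributes to SSF's readability.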
Further Reading
Naskar, S. and Bandyopadhyay, S.
Use of machine translation in India:
current status. Machine Translation Review
15, Dec. 2005.
Bharati, A., Sangal, R., Mishra, D., V.,
Sriram, T., Papi Reddy
Handling multi-word expressions without explicit
linguistic rules in an MT system. Proceedings
of the Seventh International Conference on
Text, Speech and Dialogue, 2004.
Lavie, A., Vogel, S., Levin, L., Peterson, E.,
Probst, K., Llitjos, A.F., Reynolds, R., Carbonell,
J., Cohen, R.
Experiments with a Hindi-to-English
transfer-based MT system under a miserly
data scenario. ACM Trans. on Asian
Language Processing 2, 2, June 2003.
Bharati, A., Chaitanya, V., Kulkarni, A.,
Sangal, R., Umamaheshwara Rao, G.
Anusaaraka: overcoming the language
barrier in India. Anuvad: Approaches to
Translation, Sage, New Delhi, 2001.
Manning, C.
Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, MA,
1999.
Bharati, A. and Sangal, R.
Parsing free-order languages in the
Paninian framework. Proceedings of the
31st Annual Meeting of the Association for
Computational Linguistics, June, 1993.
Gary Anthes is a technology writer and editor based in
Arlington, VA.
© 2010 ACM 0001-0782/10/0100 $10.00
Milestones
SIGUCCS Hall of Fame and Other CS Awards
Select members of the computer
science community were recently
honored for their innovative
service and research.
SIGUCCS Hall of Fame
Recognized for their years of
service, the 2009 inductees to the
SIGUCCS Hall of Fame are Nancy
Bauer-Runyan, Ross University;
Jim Kerlin; Lynnell Lacy,
University of Illinois at Urbana-Champaign; Teresa (Terry)
Lockard, University of Virginia;
and Glenn Ricart.
Gordon Bell Prizes
The Gordon Bell Award was
presented at SC09 to recognize
outstanding achievement in
high-performance computing
applications. The purpose of the
award is to track the progress
over time of parallel computing,
with particular emphasis
on rewarding innovation in
applying high-performance
computing to applications in
science.
A team led by Tsuyoshi
Hamada of Nagasaki
University won for its paper
“42 TFlops Hierarchical
N-body Simulations on GPUs
with Applications in both
Astrophysics and Turbulence”
in the lower price/performance
category.
There were two winners in
the special category. A team led
by David E. Shaw of D.E. Shaw
Research won for its paper
“Millisecond-scale Molecular
Dynamics Simulations on Anton”
and a team led by Rajagopal
Ananthanarayanan of IBM
Almaden Research Center won
for its paper “The Cat is Out of the
Bag: Cortical Simulations with 10^9
Neurons, 10^13 Synapses.”
In the peak performance
category, a team led by Markus
Eisenbach of Oak Ridge National
Laboratory won for its paper “A
Scalable Method for Ab Initio
Computation of Free Energies in
Nanoscale Systems.”
Penny Crane Award
Robert Paterson, vice president for
information technology, planning
and research at Molloy College,
received the Penny Crane Award at
SIGUCCS in recognition of significant
contributions to SIGUCCS and
computing in higher education.
Diana Award
On behalf of Apple, Sandy
Korzenny, director of Apple
product documentation,
received the Diana Award, which
SIGDOC presents every two years
to an organization, institution,
or business for its long-term
contribution to the field of
communication design.