Technology | DOI: 10.1145/1629175.1629184
Gary Anthes
Automated Translation
of Indian Languages
India faces a daunting task trying to manually translate among 22 official languages, but
assistance, in the form of advanced technology enabled by a lot of hard work, is on the way.
The complexity and diversity of human languages make automated translation one of the hardest problems in computer science. Yet the job
is becoming more important as writing
and speech are increasingly digitized
and as the traditional separations between societies dissolve.
Few parts of the globe have as much
need to translate from one language to
another as does India. According to India’s 2001 census, the country has 122
languages, 22 of which are designated
as official languages by the government.
The top six—Hindi, Bengali, Telugu,
Marathi, Tamil, and Urdu—are spoken
by 850 million people worldwide.
Now a decades-long effort by researchers is about to bear fruit. A multi-part machine translation architecture,
Sampark, is nearing completion as
the combined effort of 11 institutions
led by the Language Technologies Research Center at the International Institute of Information Technology in
Hyderabad (IIIT-H).
Sampark combines both traditional
rules- and dictionary-based algorithms
with statistical machine learning, and
will be rolled out to the public at http://sampark.iiit.ac.in/.
By this month, systems for 12 out of 18 language pairs
(nine languages) will be online and
available for experimentation, with six
more to follow soon after.
Many Indian languages are derived
from Sanskrit, which is based on rules
set down by Panini, the 4th century
B.C. grammarian. Even those Indian
languages that are not derived from
Sanskrit are structurally similar to others in India. This common underpinning makes the translation from one
Indian language to another easier than
from, say, German to Chinese. Nevertheless, there are 462 pair-wise translations (counting each direction for a
pair) possible among the 22 official Indian languages, so clearly the researchers had to find a generalized approach
that could be easily adapted from one
language to another.
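The figure of 462 follows from counting ordered pairs: each of the 22 official languages can serve as the source for each of the other 21. A minimal Python sketch of that arithmetic, in which the function name is illustrative and not taken from the Sampark project:

```python
from itertools import permutations


def directed_pairs(n_languages: int) -> int:
    """Number of source -> target translation directions among n languages."""
    return n_languages * (n_languages - 1)


# The six most widely spoken languages named in the article.
top_six = ["Hindi", "Bengali", "Telugu", "Marathi", "Tamil", "Urdu"]

# Enumerating the ordered pairs explicitly agrees with the formula.
assert len(list(permutations(top_six, 2))) == directed_pairs(len(top_six))

print(directed_pairs(22))  # 22 * 21 = 462
```

The count grows roughly with the square of the number of languages, which is why building a separate system for every direction would scale poorly.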
The chosen method, a transfer-based approach, consists of three
major parts: analyze, transfer, and
generate. First, the source sentence is
analyzed, then the results are transferred in a standard format to a set of
modules that turn it into the target language. Each step consists of multiple
translation “modules.”
An advantage of the three-step approach, says Rajeev Sangal, director of
the Language Technologies Research
Center, is that a particular language
analyzer, one for Telugu, for example,
can be developed once, independent of
other languages, and then paired with
generators in various other languages,
such as Hindi.
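The reuse Sangal describes can be sketched in a few lines of Python. This is only an architectural illustration under stated assumptions: the `Analysis` class, the registries, and every function below are hypothetical stand-ins, the stages are stubs that do no real linguistic work, and none of it reflects Sampark's actual code or data formats.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Analysis:
    """Hypothetical 'standard format' handed from stage to stage."""
    source_lang: str
    tokens: List[str]


def analyze_telugu(sentence: str) -> Analysis:
    # Stub analyzer: a real one would do far more than split on whitespace.
    return Analysis(source_lang="Telugu", tokens=sentence.split())


def transfer(analysis: Analysis, target_lang: str) -> Analysis:
    # Stub transfer: a real one would map words and structures to the target.
    return analysis


def make_generator(target_lang: str) -> Callable[[Analysis], str]:
    def generate(analysis: Analysis) -> str:
        # Stub generator: only labels the output with the target language.
        return f"[{target_lang}] " + " ".join(analysis.tokens)
    return generate


# One analyzer, written once, is paired with several generators.
ANALYZERS: Dict[str, Callable[[str], Analysis]] = {"Telugu": analyze_telugu}
GENERATORS = {lang: make_generator(lang) for lang in ("Hindi", "Tamil")}


def translate(sentence: str, source: str, target: str) -> str:
    analysis = ANALYZERS[source](sentence)                 # analyze
    return GENERATORS[target](transfer(analysis, target))  # transfer, generate


for target in GENERATORS:
    print(translate("w1 w2 w3", "Telugu", target))
```

Adding a new target language in this sketch means registering one more generator; the Telugu analyzer is left untouched.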
The 13 major translation modules
together form a hybrid system that combines rules-based approaches—where
grammar and usage conventions are
codified—with statistical-based methods in which the software in essence
discovers its own rules through “training” on text tagged in various ways by
human language experts.
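To make the distinction concrete, here is a toy Python sketch of a hybrid tagger: one hand-written rule is tried first, and otherwise the tag is whatever was seen most often for that word in a small human-tagged sample. The rule, the tag names, and the sample data are all invented for illustration and are not drawn from Sampark's modules.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def train(tagged: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Statistical part: learn each word's most frequent tag from examples."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def rule_tag(word: str) -> Optional[str]:
    """Rules-based part: a codified convention (digit strings are numbers)."""
    return "NUM" if word.isdigit() else None


def hybrid_tag(word: str, model: Dict[str, str]) -> str:
    # Prefer the hand-written rule; fall back to the learned tag, then UNK.
    return rule_tag(word) or model.get(word, "UNK")


# Invented training sample standing in for expert-tagged text.
sample = [("book", "NOUN"), ("book", "NOUN"), ("book", "VERB"), ("run", "VERB")]
model = train(sample)

print([hybrid_tag(w, model) for w in ["book", "run", "2001", "zzz"]])
```

Real systems use far richer rules and models, but the division of labor is the same: some decisions are written down by people, and others are inferred from tagged examples.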
A Transfer-Based Approach
Translation systems for major languages today—from companies like Google
and Microsoft, for example—often use
statistical approaches based on parallel corpora, huge databases of corresponding sentences in two languages.
These systems use probability and statistics to learn by example which translation of a word or phrase is most likely
correct. And they move directly from