source language to target language
with no intermediate transfer step.
“The statistical direct translation
approach is, in a sense, the lazy man’s
approach, because all it requires is that
you go and hunt for parallel corpora
and you turn the crank and you get
what you get,” says Srinivas Bangalore,
a speech and language processing specialist at AT&T Research in Florham
Park, NJ. “But the transfer-based approach is much more linguistically motivated, because you are trying to analyze the sentence and trying to arrive at
something that is close to a representation of its meaning.”
Parallel corpora are specialized databases consisting of sentences very
carefully translated and then mapped
one-for-one to their translations. Moreover, to do a good job of training translation systems, the parallel corpora
must be very large—in the billions of
sentences. “People are coming to grips
with the fact that parallel data are not
easy to come by,” Bangalore says. “This
is a very specialized kind of data.”
Indeed, parallel corpora for many
Indian language pairs do not exist and
cannot easily be built, in part because
not much Indian language text has
been digitized. Nevertheless, developers at the Language Technologies Research Center were able to apply statistical machine learning in a limited way
by annotating small monolingual corpora and analyzing the tagged text with
statistical techniques, Sangal says.
So although machine learning techniques were employed in some of the
modules, developers painstakingly
developed multilanguage dictionaries
and codified rules in the Computational Paninian Grammar framework. They
also held workshops with experts in all
these languages to develop a standard
tag set, and then used those tags to annotate the monolingual corpora.
“Most machine translation is not inspiration, it’s perspiration,” Bangalore
says. “The hard part is building all the
resources required, like dictionaries,
morphological analyzers, parsers, and
generators. It’s a lot of grunt work.”
Sangal says the effort that Sampark
developers put into language analysis could have a broad impact beyond
translating Indian languages. He says
that even the best purely statistical
systems can be made more accurate
How Sampark Works
An automated system for translating one Indian language to another, Sampark
is a hybrid system consisting of traditional rules-based algorithms and
dictionaries and newer statistical machine-learning techniques. It consists of
three major parts and 13 modules arranged in a pipeline.
Source Analysis: Tokenizer, Morphological analyzer, Part of speech tagger, Chunker, Named entity recognizer, Simple parser, Word sense disambiguator
Transfer: Syntax transfer, Lexical transfer, Transliteration
Target Generation: Agreement, Insertion of Vibhakti (case markers), Word generator
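The three-stage, 13-module arrangement can be pictured as a simple chain of functions. The sketch below is only an illustration of that pipeline shape: the stub modules, the `state` dictionary, and the names `STAGES`, `build_pipeline`, and `run_pipeline` are assumptions made here, not Sampark's actual interfaces.

```python
# Stage and module names follow the sidebar; everything else is hypothetical.
STAGES = {
    "source analysis": [
        "tokenizer", "morphological analyzer", "part of speech tagger",
        "chunker", "named entity recognizer", "simple parser",
        "word sense disambiguator",
    ],
    "transfer": ["syntax transfer", "lexical transfer", "transliteration"],
    "target generation": ["agreement", "insertion of vibhakti", "word generator"],
}


def make_stub(stage, name):
    """Return a placeholder module that only records that it ran."""
    def run(state):
        state["trace"].append((stage, name))
        return state
    return run


def build_pipeline(stages):
    """Flatten the stages into one ordered list of modules."""
    return [make_stub(stage, name)
            for stage, names in stages.items()
            for name in names]


def run_pipeline(pipeline, text):
    """Pass a shared state through every module in order."""
    state = {"text": text, "trace": []}
    for module in pipeline:
        state = module(state)
    return state


if __name__ == "__main__":
    result = run_pipeline(build_pipeline(STAGES), "example sentence")
    for stage, name in result["trace"]:
        print(f"{stage}: {name}")
```

Each module takes the output of the one before it, so a real implementation could replace any stub without touching its neighbors.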
SOURCE ANALYSIS
Tokenizer: Converts text into a sequence
of tokens (words, punctuation
marks, etc.) in Shakti Standard Format.
Morphological analyzer: Uses rules
to identify the root and grammatical
features of a word. Splits the word into its
root and grammatical suffixes.
Part of speech tagger: Based on statistical
techniques, assigns a part of speech,
such as noun, verb or adjective, to each
word.
Chunker: Uses statistical methods to
identify parts of a sentence,
such as noun phrases, verb groups, and
adjectival phrases, and a rule base to give
each one a suitable chunk tag.
Named entity recognizer: Identifies and
tags entities such as names of persons
and organizations.
Simple parser: Identifies and names
relations between a verb and its
participants in the sentence, based on
the Computational Paninian Grammar
framework.
Word sense disambiguator: Identifies
the correct sense of a word, such as
whether “bank” refers to a financial
institution or a part of a river.
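The “bank” example can be made concrete with a toy disambiguator. The sidebar does not say which method Sampark uses for this step; the sketch below swaps in a simplified Lesk-style overlap count, and the sense inventory `SENSES` and its cue words are invented for illustration.

```python
# Hypothetical sense inventory: each sense lists cue words that tend to
# appear near it. These entries are illustrative, not from Sampark.
SENSES = {
    "bank": {
        "financial institution": {"money", "loan", "account", "deposit", "cash"},
        "river edge": {"river", "water", "shore", "boat", "fishing"},
    }
}


def disambiguate(word, context_tokens, senses=SENSES):
    """Pick the sense whose cue words overlap most with the context.

    Returns None for words with no listed senses. On a tie, the sense
    listed first wins.
    """
    candidates = senses.get(word)
    if not candidates:
        return None
    context = {token.lower() for token in context_tokens}
    best_sense, best_overlap = None, -1
    for sense, cues in candidates.items():
        overlap = len(context & cues)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


if __name__ == "__main__":
    print(disambiguate("bank", "she opened an account at the bank".split()))
    print(disambiguate("bank", "the boat drifted to the river bank".split()))
```

A counting scheme like this is easy to follow but brittle; it is meant only to show what kind of decision the module has to make.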
TRANSFER
Syntax transfer: Converts the parse
structure of the source sentence into the
corresponding target-language structure,
producing the correct word order and
applying any structural changes required.
Lexical transfer: Root words identified
by the morphological analyzer are looked
up in a bilingual dictionary for the target
language equivalent.
Transliteration: Allows a source word
to be rendered in the script of the
target language. Useful in cases where
translation fails for a word or a chunk.
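The interplay between lexical transfer and transliteration amounts to a dictionary lookup with a fallback. The sketch below assumes a Hindi-to-Punjabi direction purely for illustration; the one-entry dictionary, the three-letter character map, and the function names are hypothetical and are not drawn from Sampark's resources.

```python
# Illustrative bilingual dictionary: Hindi root -> Punjabi equivalent.
BILINGUAL_DICT = {"पानी": "ਪਾਣੀ"}  # "water"

# Tiny Devanagari -> Gurmukhi character map covering only the letters
# used in the example below.
CHAR_MAP = {"क": "ਕ", "म": "ਮ", "ल": "ਲ"}


def transliterate(word, char_map=CHAR_MAP):
    """Render a word in the target script, character by character.

    Characters without a mapping are copied through unchanged.
    """
    return "".join(char_map.get(ch, ch) for ch in word)


def lexical_transfer(root, dictionary=BILINGUAL_DICT):
    """Look up a root word; fall back to transliteration when it is missing."""
    if root in dictionary:
        return dictionary[root], "dictionary"
    return transliterate(root), "transliteration"


if __name__ == "__main__":
    print(lexical_transfer("पानी"))
    print(lexical_transfer("कमल"))
```

The second return value records which path produced the output, which makes it easy to see how often the dictionary fails.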
TARGET GENERATION
Agreement: Performs gender-number-person
agreement between related words
in the target sentence.
Insertion of Vibhakti: Adds postpositions
and other case markers that indicate the
meanings of words in the sentence.
Word generator: Takes root words and
their associated grammatical features,
generates the appropriate suffixes and
concatenates them. Combines the
generated words into a sentence.
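The word generator's job of turning roots plus grammatical features into surface words can be sketched in a few lines. English suffixes are used here only because they are easy to read; the `SUFFIXES` table and both function names are assumptions, and real Indian-language morphology is far richer than simple concatenation.

```python
# Hypothetical suffix table keyed by (word category, grammatical feature).
SUFFIXES = {
    ("noun", "plural"): "s",
    ("verb", "past"): "ed",
}


def generate_word(root, category, feature):
    """Attach the suffix for the given features; no suffix if none is listed."""
    return root + SUFFIXES.get((category, feature), "")


def generate_sentence(analyses):
    """Generate each word, then join the words into a sentence."""
    return " ".join(generate_word(*analysis) for analysis in analyses)


if __name__ == "__main__":
    print(generate_sentence([
        ("dog", "noun", "plural"),
        ("bark", "verb", "past"),
    ]))
```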