good problem-solving strategies, and
helps immensely in error analysis and
explainability. Many Indic languages do
not have a linguistics tradition.
Script complexity and non-standard input mechanisms. In an Indic
script such as Devanagari, there
are 13 vowels, 33 consonants, 12 vowel
marks or matras, complex conjunct
characters, and special symbols such as
anusvara, visarga, chandra bindu, and
nukta.f This makes input speed slow
(8–10 words per minute, compared to
20–30 w.p.m. in English). Though an
InScript keyboard layout has been mandated by the Government of India, there
are questions on its optimality and ease
of use. Suggestions for more efficient
keyboard layouts keep appearing. The
problem is compounded by the presence of 13 different scripts, which drives
people to resort to Roman input through
transliteration most of the time.
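To make the script complexity concrete, the following minimal sketch uses Python's standard `unicodedata` module to show how a single Devanagari conjunct is stored as several code points, and what Unicode calls the special signs mentioned above. The choice of the conjunct "ksha" and the helper name `describe` are assumptions made for illustration only.

```python
import unicodedata

# The conjunct "ksha" renders as one glyph but is stored as three code points.
conjunct = "\u0915\u094D\u0937"  # KA + VIRAMA + SSA

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(conjunct):
    print(code_point, name)

# The special signs named in the text, keyed by the spellings used there.
signs = {
    "anusvara": "\u0902",
    "visarga": "\u0903",
    "chandra bindu": "\u0901",
    "nukta": "\u093C",
}
for label, ch in signs.items():
    print(label, "->", unicodedata.name(ch))
```

A keyboard layout or input method has to let the user produce such multi-code-point sequences, which is one reason typing is slower than in a script with a small, flat alphabet.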
Non-standard transliteration.
Roman transliterations of the same
word vary from writer to writer. For example, the Hindi word for
“mango” (a fruit) can be transliterated
as “am,” “Am,” or “aam.” This creates
a challenge for processing, and does
not help the English-illiterate.
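One simple way to cope with such variation is to map every variant to a shared lookup key before further processing. The sketch below is a deliberately lossy heuristic, not an established standard: the function name `romanization_key` and its two rules (lowercasing, collapsing doubled vowels) are assumptions chosen only to cover the three spellings above.

```python
import re

def romanization_key(word):
    """Collapse common Roman spelling variants into one lookup key (lossy)."""
    key = word.lower()  # "Am" -> "am": some schemes use a capital for a long vowel
    key = re.sub(r"([aeiou])\1+", r"\1", key)  # "aam" -> "am": doubled vowel marks length
    return key

variants = ["am", "Am", "aam"]
keys = {romanization_key(v) for v in variants}
print(keys)
```

Because vowel length is distinctive in Hindi, this key also merges words that are genuinely different, so a practical system would need a lexicon or context to disambiguate.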
Non-standard storage. The appearance of Unicode for Indic languages
and its adoption as the standard
encoding of Indic language e-content
was rather slow. As a result, many proprietary fonts exist, and content stored in
those fonts requires downloading and
algorithmic adaptation.
Man-made problems. Problems are
further compounded by the fact that
ambient noise levels on the subcontinent average
about 70 dB, while the maximum permissible level is about 55 dB. This challenges
speech recognition technologies.
Some challenging language phenomena. A language phenomenon across
major Indian languages is compound
verbs (CVs), whose processing is a must
for Indic-language NLP (INLP). CVs are
composed of two verbs such that the
main information content of the action is carried by the first verb (called the
polar) and the Gender-Number-Person-Tense-Aspect-Modality (GNPTAM) information
is marked on the second verb (called
the vector). Elaborate machinery is
needed for computational processing of
CVs, starting from morphology, and up
to the pragmatic level.3 As an illustration,
consider the Hindi compound verb:g
H1: bol uthaa (Hindi string)
G1: speak rose (gloss)
T1: spoke up (English translation)

f These are diacritic marks.

There is a sense of abruptness/
urgency/letting-out-pent-up-feeling
that is an additional layer of meaning
carried by the vector verb on top of the
main action of speaking (the polar).
Catching such fine nuance is essential, for example, in sentiment and
emotion analysis.8
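As a rough illustration of the first step in CV processing, the sketch below flags a known verb stem followed by a form of a vector verb. The stem list `VECTOR_STEMS`, the function name `find_compound_verbs`, and the caller-supplied `verb_stems` set are all assumptions for this example; the list is illustrative, not exhaustive, and real systems rely on morphological analysis rather than prefix matching.

```python
# Illustrative (not exhaustive) Romanized stems of verbs that often act as vectors.
VECTOR_STEMS = ("uth", "jaa", "de", "le", "paD", "Daal", "baiTh")

def find_compound_verbs(tokens, verb_stems):
    """Return (polar, vector) pairs: a known verb stem followed by a vector verb form."""
    pairs = []
    for first, second in zip(tokens, tokens[1:]):
        # str.startswith accepts a tuple and matches any of its prefixes.
        if first in verb_stems and second.startswith(VECTOR_STEMS):
            pairs.append((first, second))
    return pairs

# The example from the text: polar "bol" (speak) + vector "uthaa" (rose).
pairs = find_compound_verbs(["bol", "uthaa"], verb_stems={"bol"})
print(pairs)
```

Prefix matching over-generates (for instance, a short stem such as "de" also matches unrelated words beginning with those letters), and it says nothing about the extra layer of meaning the vector contributes, which is why the fuller machinery described above is needed.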
Morpheme stacking. Many Indian
languages show heavy stacking of
morphemes (in the example below, the
subscript 2 denotes the second example sentence in
the document):
M2: gharaasamorchyaanii malaa saaMgitle (Marathi sentence).
P2: ghar+aa+samor+chyaa+nii+malaa+saaMgit+le (showing morphemes).
G2: house+<morpheme: oblique marker>+front+of+<ergative marker: agent> me told (gloss).
T2: The one in front of the house
told me (translation).
Such stacking is typical of
most Indic languages.
P2 (P denoting parts) shows the constituent morphemes of each word. Processing them requires
sophisticated word segmenters and
morphological analyzers.
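The segmentation shown in P2 can be imitated for the first word with a toy longest-match suffix stripper. This is only a sketch under strong assumptions: the suffix inventory `SUFFIXES` is hypothetical and just large enough for this one word, and the function name `segment` is invented for illustration. Real morphological analyzers also handle stem alternations and ambiguity.

```python
# Hypothetical suffix inventory, just large enough to cover the example word.
SUFFIXES = ["aa", "samor", "chyaa", "nii"]

def segment(word, suffixes=SUFFIXES):
    """Strip the longest matching suffix repeatedly; return morphemes left to right."""
    parts = []
    by_length = sorted(suffixes, key=len, reverse=True)
    while True:
        # Never strip the whole remaining string, so a non-empty root is always left.
        match = next(
            (s for s in by_length if word.endswith(s) and len(word) > len(s)),
            None,
        )
        if match is None:
            break
        parts.append(match)
        word = word[: -len(match)]
    return [word] + parts[::-1]

morphemes = segment("gharaasamorchyaanii")
print("+".join(morphemes))
```

Trying the longest suffix first matters here: the word at one point ends in both "chyaa" and "aa", and stripping the shorter one first would leave a remainder that no suffix in the list matches.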
State of the Art and Achievements
Despite the aforementioned challenges,
the Indic language computing community has taken notable strides forward.
This is seen on multiple fronts, such
as corpus creation, NLP tool-building,
end-user application development,
research funding, collaboration, and
standards and policy setting.
Fortunately for NLP, huge amounts
of text in electronic form have become
available in many walks of life (such
as customer interactions in banks,
reviews of online companies, judicial
documents, contracts, e-books, and
so on), paving the way for researchers
to think about and apply powerful machine
learning techniques to language
technology problems. A case in point
is the use of Europarl Parallel Corpus
g We use transliterated Roman script for universal
readability: H1 is sentence no. 1, which
is in Hindi; G1 is the word-for-word translation of
sentence no. 1, called the gloss; T1 is the English
translation of sentence no. 1.
There is no doubt that speech and natural language processing of Indic languages is hugely important and relevant, and has the potential to influence the lives and activity of at least 20% of the world’s population.