Other languages offer very little
language data. For example, available
parallel corpora for Sinhala-Tamil are
well below 50,000 sentences. Even raw,
clean corpora are of great value for
language computing. Modern-day deep
learning techniques start with word
embeddings (WEs). WEs, learned from large corpora (millions of words), capture the distribution of contexts in which words and phrases occur. Such a distribution captures semantics, which is, computationally speaking, an elusive entity. Many Indic languages do not have a processable, clean corpus from which word lists, WEs, and a rich lexicon can be built.
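As a minimal sketch of how WEs are learned from a corpus, the following uses the gensim library on a toy corpus; the tiny corpus and the hyperparameters are illustrative assumptions, since useful embeddings require millions of words:

    # Toy word-embedding training with gensim's Word2Vec.
    # The corpus here is far too small to be useful; real WEs
    # need corpora of millions of words, as noted above.
    from gensim.models import Word2Vec

    corpus = [
        ["the", "farmer", "harvested", "the", "rice"],
        ["the", "farmer", "planted", "the", "wheat"],
        ["rice", "and", "wheat", "are", "staple", "grains"],
    ]

    model = Word2Vec(sentences=corpus,
                     vector_size=50,  # dimensionality of each vector
                     window=3,        # context window size
                     min_count=1)     # keep every word in this toy corpus

    # Words with similar context distributions end up with nearby vectors.
    print(model.wv.most_similar("rice", topn=2))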
Another application area affected by the paucity of data is ASR-TTS (automatic speech recognition and text-to-speech). Speech signals must be clean and paired with the proper text units, and the transcriptions of spoken utterances must be accurate.
Although subtitled YouTube videos and lectures exist, they require curation, as their time alignments are quite poor. Moreover, the number of available hours of training data is small, which leads to poor alignments.
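To give a flavor of the curation involved, the following sketch flags subtitle segments whose implied speaking rate is implausible; the segment format and the thresholds are assumptions for illustration, not part of any standard tool:

    # Flag subtitle segments (start_sec, end_sec, text) whose
    # characters-per-second rate suggests a bad time alignment.
    # The thresholds are illustrative guesses, not standard values.
    def is_plausible(segment, min_cps=4.0, max_cps=25.0):
        start, end, text = segment
        duration = end - start
        if duration <= 0:
            return False  # corrupt or overlapping timestamps
        cps = len(text) / duration
        return min_cps <= cps <= max_cps

    segments = [
        (0.0, 2.5, "namaste, aap kaise hain"),           # ~9 cps: plausible
        (2.5, 2.6, "yah ek lambi udaharan pankti hai"),  # ~320 cps: misaligned
    ]

    curated = [s for s in segments if is_plausible(s)]
    print(f"kept {len(curated)} of {len(segments)} segments")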
Absence of basic speech and NLP
tools. The NLP pipeline starts with
word-level processing, and goes all the
way up to discourse computation (connecting many sentences together with attention to coherence and cohesion).2
The tools used at each stage of this
pipeline are affected by the accuracy
of tools in the preceding stages. For
English, since many groups across the world have worked on the computational processing of the language, a staged development of NLP tools for English occurred. NLTK,^d a GATE-like^e NLP framework, came into being, paving the way for large-scale application development in English. In contrast,
even basic morphological analyzers, which split words into their roots and suffixes, do not exist for most Indic languages; where they do exist, their accuracy is low.
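To illustrate the kind of basic tool that is missing, here is a naive root-and-suffix splitter using longest-suffix matching; the transliterated-Hindi suffix list is a hypothetical toy, and a real analyzer would need a lexicon and morphophonemic rules:

    # Naive morphological analysis: longest-suffix match.
    # The suffix inventory is a toy, transliterated-Hindi example.
    SUFFIXES = sorted(["iyan", "on", "en", "ta", "ti", "te", "kar"],
                      key=len, reverse=True)

    def analyze(word):
        for suffix in SUFFIXES:
            root = word[:-len(suffix)]
            if word.endswith(suffix) and len(root) >= 2:
                return root, suffix
        return word, ""  # no known suffix: treat the word as its own root

    for w in ["ladkiyan", "chalta", "ghar"]:
        print(w, "->", analyze(w))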
Absence of linguistics knowledge. Though speech processing and NLP are data-driven, linguistic insight and an understanding of language phenomena often help solve the problem of accuracy saturation. A deep understanding of language phenomena helps design
^d https://www.nltk.org/
^e https://gate.ac.uk/
Austro-Asiatic (Khasi in Meghalaya, and Munda in Chhotonagpur). These language families each have their own linguistic characteristics, whose richness and complexity have been delved into in multiple scholarly treatises.11
These complexities, along with techno-human constraints, give rise to the challenges of Indic-language computing, some of which are described here.
Scale and diversity. For Indic languages, solutions must be developed for many languages simultaneously. There are 22 major languages in India, written in 13 different scripts, with over 720 dialects. Approaches therefore need to be generic, so that scaling to a new language is mainly a task of adaptation; because the languages differ considerably, arriving at such common solutions takes substantial effort. Although E2E (end-to-end) is the buzzword today, the use of multiple scripts for Indian languages makes systems complex (as illustrated in the accompanying figure).
Long utterances. Indian-language utterances are much longer in duration than English ones, and rarely contain punctuation. A typical English sentence has about 70 characters, while a sentence in an Indian language averages about 130 characters. E2E systems perform poorly on such long sentences.
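The character-length claim is easy to measure on any sentence-per-line corpus; in this sketch the file name is a placeholder, not a real dataset:

    # Average characters per sentence in a sentence-per-line file.
    # "sentences.txt" is a placeholder path, not a real resource.
    def avg_chars_per_sentence(path):
        with open(path, encoding="utf-8") as f:
            lengths = [len(line.strip()) for line in f if line.strip()]
        return sum(lengths) / len(lengths)

    print("avg chars/sentence:", avg_chars_per_sentence("sentences.txt"))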
Code mixing. Code mixing is the use of more than one language in a text or utterance. Handling code switching from one language to another is a challenge in both ASR and TTS. In ASR, the language boundary can be an important cue for semantics (assuming the lexicon accounts for the vocabulary of both languages). Indian-language words are also embedded in English sentences, where gerundification of Indian-language verbs (as in "I'm chalaaoing a car," meaning "I am driving a car") is common. In TTS, producing code-switched systems requires that the prosodic characteristics of the language and the speaker be preserved, especially when code switching involves stress-timed and syllable-timed languages. The interplay between the languages' prosody needs to be understood to make the sentences sound natural.
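In written text, one simple cue for locating such language boundaries is the Unicode script of each token; the following sketch handles only Devanagari and Latin, as a simplifying assumption:

    # Tag tokens by Unicode script and report positions where the
    # script (and likely the language) switches. Only Devanagari
    # and Latin are handled here, as a simplifying assumption.
    def script_of(token):
        for ch in token:
            if "\u0900" <= ch <= "\u097F":  # Devanagari block
                return "devanagari"
            if ch.isascii() and ch.isalpha():
                return "latin"
        return "other"

    def switch_points(tokens):
        scripts = [script_of(t) for t in tokens]
        return [i for i in range(1, len(tokens))
                if "other" not in (scripts[i - 1], scripts[i])
                and scripts[i] != scripts[i - 1]]

    tokens = "I'm गाड़ी चला रहा हूँ today".split()
    print(switch_points(tokens))  # -> [1, 5]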
Resource scarcity. Indic-language computing is bogged down by a paucity of data. Language computing these days is primarily data-driven, with sophisticated machine learning techniques employed on the data. The success of these approaches depends crucially on the availability of large amounts of high-quality data. Take machine translation (MT), which is highly data-driven these days: the Hansard corpus for English-French contains 1.6 billion words; the Europarl Parallel Corpus for 21 European languages contains about 30 million words; the WMT15 data for English-Czech contains about 16 million parallel sentences; and the WMT14 data for English-German contains about 4.5 million parallel sentences. In comparison, one of the largest Indic-language resources, the CFILT-IITB English-Hindi corpus, contains about 800,000 parallel sentences.
Diversity is the name of the game for Indic-language computing; shown here are the Devanagari, Brahmi, Odia, Tamil, Telugu, Malayalam, and Sinhala scripts, among others.