and language technologies. This will
help pool low resources across various
languages to build robust ASR systems
for Indian languages.
˲ In the context of TTS, the major
issue to be addressed is the input
method. Text is available in multiple
Indian scripts, but digital resources in
terms of high-quality parallel corpora
are few and far between. In the context
of both ASR and TTS, generic acoustic
models across various languages, generic language models in the former, and a
generic Indic voice in the latter need to
be designed. This will also address the
issue of code switching.
˲ In TTS, code mixing must find ways
to preserve the speaker’s voice across
languages. Further, the influence of the
native tongue on a non-native tongue
must be preserved. For instance, there
are as many varieties of English as there
are native tongues. Replacing non-native English (which is syllable-timed)
with stress-timed English can make it
difficult for the listener to understand.
˲ Text in social media generally
includes code switching/mixing. Further, there are many words that have
a local cultural connotation. Building
language resources to address these requires the expertise of linguists, speech
scientists, natural language processing
engineers, and ethnographers.
˲ Data is the new oil, and NLP and
ILT are no exception. There is no doubt
that resources with quality and coverage need to be created, and created fast.
Thinking creatively on how to engage
even a small portion of 1 billion hands
for resource creation is a must. Crowdsourcing, despite criticism with
respect to quality, seems to be the way
forward. Providing attractive, helpful
interfaces and remuneration can go a
long way toward resource creation. In
this context, the Linguistic Data Consortium for Indian Languages (LDC-IL)p
initiative of the Central Institute of Indian
Languages (CIIL) is noteworthy.
˲ Evaluation is the key to actual use of
language resources and should be taken
very seriously. Like TRECq (USA), CLEFr
(Europe), and NTCIRs (CJK countries),
India’s Forum for Information Retrieval
Evaluation (FIRE) initiativet has taken up
the cause of evaluation in information
retrieval and allied tasks. A FIRE-like
initiative is needed for all areas of ILT.
p http://www.ldcil.org/
q https://trec.nist.gov/
r http://www.clef-initiative.eu/
s http://research.nii.ac.jp/ntcir/index-en.html
Conclusion
Indic Language Computing (ILC) is
too important a problem to be left in
oblivion. Given the spectacular advancements to date in computing science
and technology, the Internet, AI, machine
learning, and NLP, the time is ripe for
a concerted thrust toward the realization and
social penetration of ILC. The energy of
the start-up ecosystem has to be harnessed with government support and
guidance from academia. Language resource creation is a precondition for an ILC
revolution, and as in all cases of large
infrastructure building (roads, internet,
gas lines, waterways), government sponsorship is needed for resource building.
t http://fire.irsi.res.in/fire/2019/home
Pushpak Bhattacharyya (pb@cse.iitb.ac.in) is a professor
in the computer science and engineering department of IIT
Bombay, and director of IIT Patna.
Hema Murthy (hema@cse.iitm.ac.in) is a professor in
the computer science and engineering department of
IIT Madras.
Surangika Ranathunga (surangika@cse.mrt.ac.lk) is a
senior lecturer in the department of computer science and
engineering and a member of the faculty of engineering at
the University of Moratuwa.
Ranjiva Munasinghe (ranjiva@mindlanka.org) is chief
executive officer of MIND Analytics and Management in
Colombo, Sri Lanka.
© 2019 ACM 0001-0782/19/11 $15.00