big trends
DOI:10.1145/3343456
BY PUSHPAK BHATTACHARYYA, HEMA MURTHY,
SURANGIKA RANATHUNGA, AND RANJIVA MUNASINGHE
Indic Language Computing
In April 2019, following the Easter Sunday bomb
attacks, the Government of Sri Lanka had to shut
down Facebook and YouTube for nine days to stop
the spread of hate speech and false news, posted
mainly in the local languages Sinhala and Tamil.
The shutdown came about because these social media
platforms lacked the capability to detect and
warn about the provocative content.
India’s Ministry of Human Resource Development
(MHRD) wants lectures on Swayam [a] and NPTEL [b], the
online teaching platforms, to be translated into all
Indian languages. Approximately 2.5 million students
use the Swayam lectures on computer science alone.
The lectures are in English, which students find
difficult to understand. A large number of lectures
are manually subtitled in English. Automatic speech
recognition and machine translation into Indian
languages will be great enablers for the marginalized
sections of society.
Requirements like these are real and abundant.
[a] https://swayam.gov.in/
[b] https://nptel.ac.in/
These are social and commercial needs,
whose servicing requires user interaction and information dissemination
in languages other than English. Only
around 10% of India’s population, or
about 125 million people, can speak
English; only about half that number
is comfortable reading and writing in
that language. The social media activity
of the youth of the Indian subcontinent
(where 65% of the population is below
the age of 35) generates a huge amount
of e-content, much of which is in text
form, is multilingual, and even code-mixed (text in multiple languages at the
same time, often in Roman script). The
numbers are mind-boggling [c]:
• 462.1 million Internet users (34% of
the population; the global average is 53%).
• 430.3 million users access the Internet via mobile devices (79% of total
Web traffic).
• 250 million social media users
(19% of the population; the global average is 42%).
• 260 million WhatsApp users, and
53 million Instagram users.
Sri Lanka alone has seven million
Internet users (2018 data), which
equates to a penetration of 32%.
There is no doubt that speech and
natural language processing (NLP) of
Indic languages is hugely important
and relevant, and has the potential to
influence the lives and activity of at
least 20% of the world’s population.
Challenges of Indian Language Computing
The Indian subcontinent is divided
into seven independent countries:
India, Pakistan, Bangladesh, Nepal,
Bhutan, Sri Lanka, and the Maldives.
There are approximately 1,599
languages in India, out of which about
420–440 are in active use. Languages in
the region fall into four major linguistic
groups: Indo-Aryan (spoken mainly in
the northern part of South Asia and in
Sri Lanka), Dravidian (spoken mainly in
south India), Tibeto-Burman (spoken
mainly in northeast India), and
[c] India Today, April 2018 issue.
IMAGE BY JOAT