CENTRAL INSTITUTE OF INDIAN LANGUAGES
DEPARTMENT OF HIGHER EDUCATION
Ministry of Human Resource Development, Government of India
Manasagangotri, Mysore - 570 006
Linguistic Data Consortium for Indian Languages (LDC-IL)
Language data is the key ingredient for research and development in language technology. As time goes by, an increasing number of researchers are seeing the potential benefits of using an electronic corpus as a source of empirical language data for their research. The issues surrounding the collection, processing and annotation of large quantities of linguistic data make it necessary to involve a number of disciplines, such as linguistics, computer science, statistics and engineering. Corpus linguists often use computational methods when analyzing their data, whereas computational linguists depend on computer-readable linguistic data for their research and for building practical tools and programmes. The data thus collected from a large number of Indian languages will be of high quality and will conform to defined standards. Such a resource has long been in demand in India, and that demand will now be met.
In order to fulfil this long-pending need, the Central Institute of Indian Languages, Mysore, and several other like-minded institutions working on Indian language technology, such as the Indian Institute of Science, Bangalore, the Indian Institute of Technology, Bombay, the Indian Institute of Technology, Madras, and the International Institute of Information Technology, Hyderabad, have now been permitted by the Government of India to set up a Linguistic Data Consortium for Indian Languages (LDC-IL).
MAJOR AREAS OF LINGUISTIC RESOURCE DEVELOPMENT
I Speech Recognition and Synthesis
The primary objective of the data collection effort is to build speech recognition and synthesis systems for Indian languages. Although automatic speech recognition (ASR) and text-to-speech (TTS) systems are available around the world for a number of mainstream languages, commercially viable speech systems for Indian languages are not available.
Voice user interfaces for IT applications and services have become increasingly prevalent for languages like English, and are valued for their ease of access, especially in telephony-based applications. In a country like India, where the majority of the population is not comfortable using English and literacy rates are relatively low, local-language speech interfaces can give the masses access to IT applications and services through the Internet and/or telephones. If such technology is available in Indian languages, people in semi-urban and rural parts of India will be able to use telephones and the Internet to access a wide range of services and information on health, agriculture, travel, etc. For this to become a reality, however, a computer has to be able to accept speech input in the user’s language and provide speech output. Also, in multilingual India, if speech technology is coupled with translation systems between the various Indian languages, services and information can be provided across languages more easily.
Although speech technology has been a focus of research in India for a number of years and the technology itself has matured enough for real-world applications, the main obstacle to customizing it for the various Indian languages is the lack of appropriately annotated speech databases in these languages. The focus here is (i) to collect data that can be used for building speech-enabled systems in Indian languages and (ii) to develop tools that facilitate the collection of high-quality speech data.
2. Background - Speech Recognition
Automatic speech recognition is the task of converting a speech signal into its orthographic representation. There are two categories of speech recognition systems:
- Isolated word recognition and connected word systems as in command and control applications
- Continuous speech recognition systems. Continuous speech recognition is itself divided into two categories: read speech and spontaneous speech.
3. Background - Speech Synthesis
The task of speech synthesis is to convert written text (the orthographic representation) into speech. The vocabulary should not be restricted for speech synthesis, and the synthesized speech must be close to natural speech. To enable unrestricted speech synthesis, the sentence is normally converted to a sequence of basic units. Appropriate rules of synthesis are then applied to produce natural-sounding speech.
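The conversion of a sentence into a sequence of basic units can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function name `text_to_units`, the toy lexicon, the syllable-like units and the `<pause>` boundary marker are hypothetical and do not describe the units or rules used by LDC-IL. Words missing from the lexicon fall back to one unit per letter, which is one simple way of keeping the vocabulary unrestricted.

```python
# Toy illustration only: the lexicon, unit inventory and boundary marker
# are hypothetical and are not taken from any LDC-IL resource.
TOY_LEXICON = {
    "namaste": ["na", "ma", "ste"],
    "india": ["in", "di", "a"],
}

WORD_BOUNDARY = "<pause>"


def text_to_units(sentence, lexicon=TOY_LEXICON):
    """Convert a sentence into a flat sequence of basic units.

    Words found in the lexicon use their listed units. Unknown words
    fall back to one unit per letter, so any input can be handled.
    """
    units = []
    for raw_word in sentence.lower().split():
        # Keep alphabetic characters only (drops punctuation and digits).
        word = "".join(ch for ch in raw_word if ch.isalpha())
        if not word:
            continue
        if units:
            units.append(WORD_BOUNDARY)  # mark the gap between two words
        units.extend(lexicon.get(word, list(word)))
    return units


if __name__ == "__main__":
    print(text_to_units("Namaste India"))
```

A real system would apply synthesis rules (for example, duration and pitch rules) to the resulting unit sequence; that stage is outside the scope of this sketch.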
The focus is primarily on building (a) vocabulary-independent speech-to-speech translation (for a pair of Indian languages) and (b) vocabulary-dependent isolated word recognition in the Indian languages.