Google Summer of Code 2016 (GSoC)
I feel privileged to have been given the opportunity to work for Libindic organization under the Google Summer of Code 2016. Libindic is an open source library that supports many utilities for text processing of Indian languages. I would be contributing towards the automatic script transliteration between scheduled languages of India including English. For the project, I would be mentored by Riyaz Ahmad Bhat and Santhosh Thottingal.
During Community Bonding
During the community bonding period I explored some of the existing tools on Indic-transliteration. Created different test cases to evaluate these existing systems, to gauge the weaknesses and strengths of these systems.
Transliteration simply means conversion of a text from one script to another. Machine transliteration is the computer automated process of transcribing a character or word from one language script to another. Machine transliteration can play an important role in natural language application such as information retrieval and machine translation, especially for handling proper nouns and technical terms, cross-language applications, data mining and information retrieval system.
Since I am using ML approach to develop Roman-Indic transliteration systems, I need to create the training data first. As mentioned in the proposal, I use the sentence aligned ILCI parallel corpora and Indo-wordnet synsets to extract the transliteration pairs. Initially, the parallel corpus is word-aligned using GIZA++, and the alignments are refined using the grow-diag-final-and heuristic. I extract all word pairs which occur as 1-to-1 alignments in the word-aligned corpus as potential transliteration equivalents.
No matter how amazing the ML approach used to model a problem, it can never guarantee 100% results or even close. The accuracy of the model mostly depends on th number of training samples, the material impact of training samples on the variable being predicted and ML algorithm selected to train the model. We can only approach the best possible results by selecting a proper ML technique which itself depends on the problem definition.
Hindi and Urdu transliteration has always been treated special among Indic script transliterations. It has received a lot of attention from the NLP research community of South Asia. It has been seen to break the barrier that makes the two look different, although they are facets of the same language.
Among Indian languages including English, transliteration has mainly been explored for the following language pairs:
- Gurmukhi-Shahmukhi (Punjabi-Urdu)
Hindi and Urdu are the two most widely used languages in the Indian subcontinent and English is a global language. Thus the reason why Hindi-English and Urdu-English transliteration systems have been extensively researched.
It was a nice and fulfilling experience for me to work with Libindic organization under GSoC 2016. I thoroughly enjoyed the work and learned a lot in the process. I contributed towards the automatic script transliteration between scheduled languages of India including English. I am thankful to Riyaz Ahmad Bhat and Santhosh Thottingal for mentoring me through the programme.