Among Indian languages, and between Indian languages and English, transliteration has mainly been explored for the following language pairs:
- Gurmukhi-Shahmukhi (Punjabi-Urdu)
Hindi and Urdu are the two most widely used languages in the Indian subcontinent, and English is a global language. This is why Hindi-English and Urdu-English transliteration systems have been extensively researched.
Hindi-Urdu is another important language pair, which I have already discussed in my previous blog post.
After Hindi-Urdu, the Gurmukhi-Shahmukhi script pair is the most studied. Gurmukhi and Shahmukhi (a Perso-Arabic script) are the two scripts used for Punjabi, which makes this pair an important candidate for exploring transliteration. Gurmukhi is mainly used by Punjabi speakers in India, while Shahmukhi is used by Punjabi speakers in Pakistan.
Other than these language pairs, transliteration systems have been developed for some Indic-Roman pairs such as Malayalam-English, Tamil-English and Telugu-English. Transliteration among the remaining Indic scripts has not been explored much.
In this post I will discuss transliteration among the following scripts, which I will refer to as Indic scripts from here on:
- Devanagari (Hindi, Marathi, Konkani, Nepali and Bodo)
- Bengali and Assamese
The transliteration systems I have developed so far do not use any rules to transliterate a word from one script to another; instead, they use machine learning (ML) models trained on transliteration pairs. For Indic scripts, however, I have developed both rule-based and ML systems.
Indic scripts have a special property: their phonemes are aligned one-to-one across their Unicode tables. It may therefore seem that transliteration between any two Indic scripts can be achieved merely by using their Unicode tables, but it is not that simple. There are several issues with this approach. The first is handling phonemes that are missing in some Indic scripts (already discussed here). The second is ambiguous character mappings (see example table) in some Indic scripts such as Tamil and Bengali. These ambiguous mappings arise because these scripts lack letters for some phonemes. For example, Bengali has no character for Va, so Va is represented by Ba (ব). This makes the Bengali Ba ambiguous when we transliterate it to another Indic script, since it may correspond to either Va or Ba in that script.
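To make the Unicode-table idea and its first pitfall concrete, here is a minimal Python sketch. It is not the rule-based system described in this post; it only assumes that each Indic script occupies a 128-code-point Unicode block with parallel layout, and it shifts code points from one block to another. The function name `offset_transliterate` and the `BLOCK_START` table are hypothetical names chosen for illustration.

```python
import unicodedata

# Start of each script's Unicode block (each block spans 0x80 code points).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def offset_transliterate(text, source, target):
    """Shift every character of `source` script into the `target` block.

    Returns the converted string and the list of source characters whose
    shifted code point is unassigned in the target script (the "missing
    phoneme" problem). Such characters are left unchanged in the output.
    """
    src, tgt = BLOCK_START[source], BLOCK_START[target]
    out, unmapped = [], []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            mapped = chr(cp - src + tgt)
            if unicodedata.name(mapped, None) is None:
                # No letter at this position in the target script.
                unmapped.append(ch)
                out.append(ch)
            else:
                out.append(mapped)
        else:
            # Spaces, punctuation, digits etc. pass through untouched.
            out.append(ch)
    return "".join(out), unmapped


# Devanagari "bharat" converts cleanly to Telugu by pure offset arithmetic.
print(offset_transliterate("\u092d\u093e\u0930\u0924", "devanagari", "telugu"))

# Devanagari Va has no counterpart at the same position in the Bengali block.
print(offset_transliterate("\u0935\u0928", "devanagari", "bengali"))
```

The second call reports the Devanagari Va as unmapped, which is exactly the kind of gap that forces Bengali to reuse Ba and makes the reverse direction ambiguous. A plain offset shift cannot resolve that ambiguity.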
I have used WX notation for rule-based transliteration among Indic scripts. WX notation itself is a transliteration scheme for representing Indian languages in ASCII (Roman). This scheme originated at IIT Kanpur for computational processing of Indian languages, and is widely used among the natural language processing (NLP) community in India. The main advantages of WX notation are:
- Memory and computational efficiency: we work with ASCII rather than Unicode (UTF-8, UTF-16, etc.).
- Readability: WX allows one to read a Tamil or Telugu string without knowing the original script. This helps when analyzing the developed system.
- Common representation: WX tries to give a common representation to all Indic scripts. For example, Hyderabad in Telugu (హైదరాబాద్), Malayalam (ഹൈദരാബാദ്) and Kannada (ಹೈದರಾಬಾದ್) all map to the common representation hExarAbAx in WX. This does not hold for all Indic scripts, however: Hyderabad in Tamil (ஹைதெராபாத்) maps to hEweVrApAw in WX.
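As a rough illustration of the memory point, the following Python sketch compares the encoded size of the Telugu spelling of Hyderabad with its WX form. The helper name `encoded_sizes` is hypothetical, and byte counts are only a proxy for the overall efficiency gain.

```python
def encoded_sizes(text):
    """Return how many bytes `text` occupies in UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # utf-16-le is used so that no byte-order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


telugu = "\u0c39\u0c48\u0c26\u0c30\u0c3e\u0c2c\u0c3e\u0c26\u0c4d"  # Hyderabad in Telugu
wx = "hExarAbAx"  # its WX form, one ASCII byte per symbol

print(encoded_sizes(telugu))  # {'utf-8': 27, 'utf-16': 18}
print(encoded_sizes(wx))      # {'utf-8': 9, 'utf-16': 18}
```

Each Telugu code point takes three bytes in UTF-8, so the WX string is a third of the size when stored as ASCII or UTF-8.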
Internally, WX maps the letters of Indic scripts to a common representation in ISCII (refer to pages 15-17 of iscii91.pdf) and then maps this ISCII to ASCII, which we call WX. There is a single ISCII-to-ASCII table for WX conversion and a separate Unicode-to-ISCII table for each Indic script. See Indic-WX-Converter for WX conversions of various Indic scripts.
Using WX as a bridge, we convert a string from the source script to the target script: the source script is first converted to WX, and the WX is then converted to the target script.
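The bridge idea can be sketched with a toy converter in Python. This is an illustrative simplification, not the Indic-WX-Converter implementation: it skips the ISCII step, covers only the handful of letters needed for the Hyderabad example, and handles only single-character WX symbols. All names here (`make_table`, `to_wx`, `from_wx`, `bridge`) are hypothetical.

```python
# Offsets of a few letters inside any Indic Unicode block, with WX symbols.
CONSONANT_OFFSETS = {0x39: "h", 0x26: "x", 0x30: "r", 0x2C: "b"}
MATRA_OFFSETS = {0x3E: "A", 0x48: "E"}  # dependent vowel signs
VIRAMA_OFFSET = 0x4D                    # suppresses the inherent vowel


def make_table(block_start):
    """Build a tiny script <-> WX table for the block starting at `block_start`."""
    return {
        "consonants": {chr(block_start + o): w for o, w in CONSONANT_OFFSETS.items()},
        "matras": {chr(block_start + o): w for o, w in MATRA_OFFSETS.items()},
        "virama": chr(block_start + VIRAMA_OFFSET),
    }


TELUGU = make_table(0x0C00)
KANNADA = make_table(0x0C80)
MALAYALAM = make_table(0x0D00)


def to_wx(text, table):
    """Convert script text to WX, writing the inherent vowel as an explicit 'a'."""
    out = []
    for i, ch in enumerate(text):
        if ch in table["consonants"]:
            out.append(table["consonants"][ch])
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt not in table["matras"] and nxt != table["virama"]:
                out.append("a")
        elif ch in table["matras"]:
            out.append(table["matras"][ch])
        elif ch != table["virama"]:
            out.append(ch)  # anything outside the toy table passes through
    return "".join(out)


def from_wx(wx, table):
    """Convert WX back to script text, restoring matras and viramas."""
    consonants = {w: c for c, w in table["consonants"].items()}
    matras = {w: c for c, w in table["matras"].items()}
    out, i = [], 0
    while i < len(wx):
        ch = wx[i]
        if ch in consonants:
            out.append(consonants[ch])
            nxt = wx[i + 1] if i + 1 < len(wx) else ""
            if nxt == "a":
                i += 1                       # inherent vowel: nothing to write
            elif nxt in matras:
                out.append(matras[nxt])
                i += 1
            else:
                out.append(table["virama"])  # bare consonant
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def bridge(text, source_table, target_table):
    """Source script -> WX -> target script."""
    return from_wx(to_wx(text, source_table), target_table)


hyderabad_telugu = "\u0c39\u0c48\u0c26\u0c30\u0c3e\u0c2c\u0c3e\u0c26\u0c4d"
print(to_wx(hyderabad_telugu, TELUGU))            # hExarAbAx
print(bridge(hyderabad_telugu, TELUGU, KANNADA))  # Hyderabad in Kannada script
```

Because the intermediate string is plain ASCII, it is easy to inspect where a conversion went wrong, which is the readability benefit of WX in practice.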
Data Extraction and Training
For data extraction and learning the model parameters, I followed the same approach as described in the Roman-Indic Transliteration post. Statistics of the extracted transliteration pairs are given in the table below:
ML Vs Rule-Based System
For closely related language pairs without ambiguous letter mappings, such as Hindi-Gujarati, Hindi-Punjabi, Tamil-Malayalam, Oriya-Bengali and Telugu-Kannada, the rule-based system might work better. For less closely related pairs, such as Hindi-Tamil, Hindi-Malayalam, Tamil-Oriya, Malayalam-Gujarati and Kannada-Punjabi, the ML system might work better.
Given below is a small demo of how one can use the indic-trans library to transliterate from one script to another using the rule-based and ML systems. Here we first transliterate from English to Hindi, and then from Hindi to Telugu, Malayalam and Tamil using both systems. Note the -r flag, which selects the rule-based system.
$ echo 'indictrans libindic hyderabad university bhagyalakshmi bharat morocco' |\
  indictrans -s eng -t hin | indictrans -s hin -t tel -r # RULE-BASED
ఇండిక్ట్రాంస లిబిందిక హైదరాబాద యూనివర్సిటీ భాగ్యాలక్ష్మీ భారత మోరోక్కో

$ echo 'indictrans libindic hyderabad university bhagyalakshmi bharat morocco' |\
  indictrans -s eng -t hin | indictrans -s hin -t tel # ML
ఇండిక్ట్రాంస్ లిబిందిక హైదరాబాద్ యూనివర్శిటీ భాగ్యాలక్ష్మి భారత్ మోరోక్కో

$ echo 'indictrans libindic hyderabad university bhagyalakshmi bharat morocco' |\
  indictrans -s eng -t hin | indictrans -s hin -t mal -r # RULE-BASED
ഇംഡിക്ട്രാംസ ലിബിംദിക ഹൈദരാബാദ യൂനിവര്സിടീ ഭാഗ്യാലക്ഷ്മീ ഭാരത മോരോക്കോ

$ echo 'indictrans libindic hyderabad university bhagyalakshmi bharat morocco' |\
  indictrans -s eng -t hin | indictrans -s hin -t mal # ML
ഇന്ഡിക്ട്രാംസ് ലിബിന്ദിക ഹൈദരാബാദ് യൂനിവര്സിടി ഭാഗ്യാലക്ഷ്മി ഭാരത മോരോക്കോ

$ echo 'indictrans libindic hyderabad university bhagyalakshmi bharat morocco' |\
  indictrans -s eng -t hin | indictrans -s hin -t tam -r # RULE-BASED
இங்டிக்ட்ராங்ஸ லிபிங்திக ஹைதராபாத யூநிவர்ஸிடீ பாக்யாலக்ஷ்மீ பாரத மோரோக்கோ

$ echo 'indictrans libindic hyderabad university bhagyalakshmi bharat morocco' |\
  indictrans -s eng -t hin | indictrans -s hin -t tam # ML
இண்டிக்ட்ராங்ஸ் லிபிந்திக் ஹைதராபாத் யூனிவர்சிடி பாக்யாலக்ஷ்மி பாரதப் மோரோக்கோ
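If you want to script the same pipeline rather than type it in a shell, one option is a thin Python wrapper around the command-line tool. The sketch below assumes the `indictrans` executable is installed and on the PATH, and it uses only the flags shown in the demo above (`-s`, `-t`, `-r`). The function names `indictrans_cmd` and `transliterate` are hypothetical and not part of the library.

```python
import subprocess


def indictrans_cmd(source, target, rule_based=False):
    """Build the argument list for one indictrans invocation."""
    cmd = ["indictrans", "-s", source, "-t", target]
    if rule_based:
        cmd.append("-r")  # rule-based system, as in the demo above
    return cmd


def transliterate(text, source, target, rule_based=False):
    """Pipe `text` through the indictrans CLI and return its output.

    Assumes the CLI reads from stdin and writes to stdout, as in the demo.
    """
    result = subprocess.run(
        indictrans_cmd(source, target, rule_based),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if the tool exits with an error
    )
    return result.stdout.strip()


def english_to_indic(text, target, rule_based=False):
    """English -> Hindi (ML), then Hindi -> target, mirroring the shell pipeline."""
    hindi = transliterate(text, "eng", "hin")
    return transliterate(hindi, "hin", target, rule_based)
```

Calling `english_to_indic("hyderabad university", "tel", rule_based=True)` would then correspond to the first command in the demo, and dropping `rule_based=True` to the second.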