Roman-Indic Transliteration

Since I am using an ML approach to develop Roman-Indic transliteration systems, I first need to create the training data. As mentioned in the proposal, I use the sentence-aligned ILCI parallel corpora and IndoWordNet synsets to extract the transliteration pairs. The parallel corpus is first word-aligned using GIZA++, and the alignments are refined using the grow-diag-final-and heuristic. I extract all word pairs that occur as 1-to-1 alignments in the word-aligned corpus as potential transliteration equivalents. I will explain the transliteration pair extraction process on the Hindi-Roman dataset; all the other datasets follow the same procedure.

Word alignments with GIZA++

  1. Download, compile and install GIZA++.
  2. Run the following bash script to word-align the parallel text files (say hin.txt and eng.txt):

    src=hin.txt
    trg=eng.txt
    # convert the parallel text files to GIZA++ snt format and build vocabularies
    plain2snt.out $src $trg
    # induce word classes, used by GIZA++'s HMM alignment model
    mkcls -n10 -p$src -V$src.vcb.classes
    mkcls -n10 -p$trg -V$trg.vcb.classes
    # build the co-occurrence file and run GIZA++
    snt2cooc.out $src.vcb $trg.vcb $src"_"$trg".snt" > $src"_"$trg".cooc"
    GIZA++ -ml 101 -S $src.vcb -T $trg.vcb -C $src"_"$trg".snt" -CoocurrenceFile $src"_"$trg".cooc"
    

    The resulting word-aligned file (ending with A3.final) will look like this:

    # Sentence pair (1) source length 6 target length 6 alignment score : 2.28771e-11
    लगातार बुखार से पीड़ित हो ?
    NULL ({ }) I ({ }) suffer ({ 4 5 }) from ({ 3 }) fever ({ 2 }) continuously ({ 1 }) . ({ 6 })
    # Sentence pair (2) source length 10 target length 11 alignment score : 2.70124e-12
    कालाजार , मलेरिया या फिर हाई फीवर हो सकता है ।
    NULL ({ }) Kalajar ({ 1 }) , ({ 2 }) Malaria ({ 3 }) or ({ 4 }) high ({ }) fiver ({ 5 6 7 }) may ({ 9 }) have ({ 10 }) happened ({ 8 }) . ({ 11 })
    # Sentence pair (3) source length 8 target length 9 alignment score : 8.33816e-16
    हर छह माह में करें आँखों की जाँच ।
    NULL ({ 4 }) Get ({ 5 }) eyes ({ 6 }) checked ({ 7 8 }) up ({ }) every ({ 1 }) six ({ 2 }) months ({ 3 }) . ({ 9 })
    # Sentence pair (4) source length 4 target length 4 alignment score : 1.89822e-07
    नियमित आँखें धोना ।
    NULL ({ }) Washing ({ 3 }) eyes ({ 2 }) regularly ({ 1 }) . ({ 4 })
    .
    .
    .
    

    These alignments can be further improved using the grow-diag-final-and heuristic. Note that the numbers in curly braces are the indices of the Hindi words aligned to each English word. From this alignment file, the one-to-one alignments are taken as potential transliteration equivalents, as sketched below. To further augment the transliteration pairs, I also added word pairs from Hindi WordNet synset mappings.
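
    To make the extraction step concrete, here is a minimal sketch that pulls 1-to-1 word pairs out of an A3.final file with the three-line record format shown above. The function name and details are illustrative, not taken from the actual codebase:

    import re

    def one_to_one_pairs(a3_file):
        pairs = []
        with open(a3_file, encoding='utf-8') as fp:
            lines = fp.read().splitlines()
        for k in range(0, len(lines), 3):     # each record spans 3 lines
            hin_words = lines[k + 1].split()  # the Hindi sentence
            # (english_word, aligned_indices) tuples from the third line
            tokens = re.findall(r'(\S+) \(\{([\d ]*)\}\)', lines[k + 2])
            # count how many English words claim each Hindi position
            claims = {}
            for _, idx in tokens:
                for i in idx.split():
                    claims[i] = claims.get(i, 0) + 1
            for eng, idx in tokens:
                idx = idx.split()
                if eng != 'NULL' and len(idx) == 1 and claims[idx[0]] == 1:
                    pairs.append((hin_words[int(idx[0]) - 1], eng))
        return pairs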

Removal of Noisy Pairs

Obviously there will be many word pairs which are not transliteration equivalents, but rather translation pairs or noisy alignments. To filter out the noisy word pairs, I use the Levenshtein edit distance measure. In order to compute the edit distances, we need a character mapping table between the two scripts. A sample mapping table between the Hindi and Roman scripts is shown below:

Hindi-Roman Mapping Table
अ  a,e        ल  l          ट  t,th     झ  jh
ब  b          म  m          ड  d,dh     ख  kh
च  c,ch       न  n          य  y,i,e    ञ  n
द  d,dh       ओ  o,u        आ  a,aa     ऊ  o,u
ए  e,i,y,a    प  p          भ  bh       फ  ph,f
ं  n          ऋ  r,ri,ru    छ  ch       श  sh
ग  g,gh       र  r          ध  dh       ष  sh
ह  h,gh       स  s,c        ई  e,i,y    थ  th
इ  e,i,y      त  t,th       ण  n        औ  o,u
ज  j,jh,z,g   उ  o,u        घ  gh       ठ  th
क  k,q,c      व  v,w        ऐ  e,i,y    ढ  dh
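
In code, this table is just a Python dictionary from a Hindi letter to its candidate Roman mappings. A small illustrative subset (the full table covers every letter and vowel sign):

map_table = {
    'क': ['k', 'q', 'c'],
    'ख': ['kh'],
    'च': ['c', 'ch'],
    'फ': ['ph', 'f'],
    'ऋ': ['r', 'ri', 'ru'],
    'व': ['v', 'w'],
}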

A straightforward Python implementation is given below: the function LevenshteinDistance takes two strings, src of length m and trg of length n, along with a mapping table map_table, and returns the Levenshtein distance between the two strings:

import numpy as np

def LevenshteinDistance(src, trg, map_table):
  m = len(src) + 1
  n = len(trg) + 1
  d = np.zeros((m, n), np.uint16)
  # source prefixes can be transformed into the empty string
  # by dropping all characters
  d[:, 0] = range(m)
  # target prefixes can be reached from the empty source prefix
  # by inserting every character
  d[0] = range(n)
  for j in range(1, n):
    for i in range(1, m):
      # the letters "match" if the target letter is one of the
      # Roman mappings of the source letter
      if trg[j-1] in map_table.get(src[i-1], ''):
        d[i, j] = d[i-1, j-1]
      else:
        d[i, j] = min(d[i-1, j] + 1,    # deletion
                      d[i, j-1] + 1,    # insertion
                      d[i-1, j-1] + 1)  # substitution
  return d[m-1, n-1]

Note that map_table is a Python dictionary with Hindi letters as keys and their Roman mappings as values. The edit distance returned is bounded between 0 and the length of the longer of the two strings. These edit-distance scores are normalized to the range 0 to 1 simply by dividing the returned score by the length of the longer string. For example, an edit distance of 1 between two strings whose longer string has length 2 gives a normalized edit distance of 0.5, whereas the same edit distance with a longer-string length of 10 gives a normalized edit distance of 0.1. Word pairs with a normalized score below a small threshold of ~0.3 are considered transliteration pairs. Note that the threshold is chosen by a rough inspection of the translation pairs and can vary for different language pairs.
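
Putting the pieces together, the filtering step might look like the following sketch. The names normalized_distance and candidate_pairs are illustrative; candidate_pairs is assumed to hold the 1-to-1 word pairs extracted earlier:

def normalized_distance(src, trg, map_table):
    # normalize by the longer string so the score falls in [0, 1]
    return LevenshteinDistance(src, trg, map_table) / max(len(src), len(trg))

THRESHOLD = 0.3  # rough value, per the text; tune per language pair

translit_pairs = [(s, t) for s, t in candidate_pairs
                  if normalized_distance(s, t, map_table) <= THRESHOLD]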

Character alignments with GIZA++

In order to train the transliteration system, we need to character-align the transliteration pairs obtained above. We again use GIZA++ for this task, with the Hindi strings as the source and their Roman transliterations as the target. The resulting character alignments obtained using GIZA++ will look like:

# Sentence pair (1) source length 6 target length 10 alignment score : 3.91859e-13
m i s s i o n a r y
NULL ({ 5 6 8 }) म ({ 1 })  ि ({ 2 }) श ({ 3 4 }) न ({ 7 }) र ({ 9 }) ी ({ 10 })
# Sentence pair (2) source length 8 target length 6 alignment score : 1.5955e-05
u n i t e d
NULL ({ }) य ({ }) ू ({ 1 }) न ({ 2 }) ा ({ }) इ ({ 3 }) ट ({ 4 }) े ({ 5 }) ड ({ 6 })
# Sentence pair (3) source length 6 target length 8 alignment score : 9.23754e-05
a i sh w a r y a
NULL ({ 1 5 }) ऐ ({ 2 }) श ({ 3 }) व् ({ 4 }) र ({ 6 }) य ({ 7 }) ा ({ 8 })
# Sentence pair (4) source length 6 target length 8 alignment score : 0.0111937
n a r e n d r a
NULL ({ 2 8 }) न ({ 1 }) र ({ 3 }) े ({ 4 }) ं ({ 5 }) द ({ 6 }) र् ({ 7 })
.
.
.

GIZA++ produces four types of alignments: 1-to-1, 1-to-Many, 1-to-None and None-to-1. Of these, the None-to-1 alignments, where a target character is aligned to NULL, need special handling: each such target character is merged with the preceding aligned pair, unless it is the first character of the word, in which case it is merged with the succeeding aligned pair (a sketch of this merging step follows the examples below). Source characters with 1-to-None alignments simply receive an empty target, shown as "_" below. The final training data for the Hindi-to-Roman transliteration system will look like:

म   m
ि   i
श   ssio
न   na
र   r
ी   y

य   _
ू   u
न   n
ा   _
इ   i
ट   t
े   e
ड   d

ऐ   ai
श   sh
व्   wa
र   r
य   y
ा   a

न   na
र   r
े   e
ं   n
द   d
र्   ra

.
.
.
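
For concreteness, here is a minimal sketch of that merging step (the function name and the input representation are mine, not from the actual codebase). It walks over (source unit, target characters) pairs in target order, where the source unit is None for NULL-aligned characters:

def merge_null_alignments(aligned):
    # aligned: (source_unit, target_chars) pairs in target order;
    # source_unit is None for characters GIZA++ aligned to NULL
    pairs, pending = [], ''
    for src, trg in aligned:
        if src is None:
            if pairs:
                pairs[-1][1] += trg   # merge with the previous aligned pair
            else:
                pending += trg        # word-initial: hold for the next pair
        else:
            pairs.append([src, pending + trg])
            pending = ''
    return pairs

# the "missionary" example above:
# merge_null_alignments([('म', 'm'), ('ि', 'i'), ('श', 'ss'), (None, 'io'),
#                        ('न', 'n'), (None, 'a'), ('र', 'r'), ('ी', 'y')])
# -> [['म', 'm'], ['ि', 'i'], ['श', 'ssio'], ['न', 'na'], ['र', 'r'], ['ी', 'y']]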

In order to create the training data for the Roman-to-Hindi system, we simply pass the Roman strings as the source and their Hindi transliterations as the target to the same character alignment process.

Training the Transliteration System

Now that we have the training data, we can start the final training process. In principle any ML algorithm can be applied here; I model transliteration as a structured prediction problem with a global feature representation. My transliteration model is basically a second-order Hidden Markov Model (HMM), formally represented in the equation below. I denote the sequence of letters of a word in the source script as boldface s, and the sequence of hidden states, which correspond to letter sequences in the target script, as boldface t. The model decodes the best target sequence using two kinds of parameters, transition and emission probabilities:

\hat{\textbf{t}} = \underset{t_{1} \cdots t_{n}}{\arg\max}
{\overset{n}{\underset{i=1}{\prod}}}
\underbrace{P(t_{i} \mid t_{i-1}, t_{i-2})}_{\text{Transition Probabilities}}\;
\underbrace{P(s_{i} \mid t_{i})}_{\text{Emission Probabilities}}

\text{where } s_{1} \cdots s_{n} \text{ is the letter sequence in the source script, and } t_{1} \cdots t_{n} \text{ is the corresponding letter sequence in the target script.}

I implemented the structured perceptron of Collins (2002) to learn the model parameters. Given input training data of aligned character sequences D = d_1 \cdots d_n, a vector feature function \vec{f}(d), and an initial weight vector \vec{w}, the algorithm performs two steps for each training example d_i \in D:

\text{\textbf{Decode: }}\hat{t} = \underset{t_{1} \cdots t_{n}}{\arg\max}\ (\vec{w} \cdot \vec{f}(d_{i}))
\text{\textbf{Update: }}\vec{w} = \vec{w} + \vec{f}(d_{i}) - \vec{f}(\hat{t})

With the structured perceptron, local contextual features can be made relevant to the whole sequence of target letters, and it also allows us to use feature-based emissions: I replace the basic multinomial emissions P(s_{i}|t_{i}) with feature-based emissions \vec{w} \cdot \vec{f}(d). The feature template used to learn the emissions is shown in the table below. I use Viterbi search to decode the best letter sequence in the target script, both while learning the parameters and at test time.

Ngram Features
Unigrams [li-4], [li-3], [li-2], [li-1], [li], [li+1], [li+2], [li+3], [li+4]
Bigrams [li-4 li-3], [li-3 li-2], [li-2 li-1], [li-1 li], [li li+1], [li+1 li+2], [li+2 li+3], [li+3 li+4]
Trigrams [li-4 li-3 li-2], [li-3 li-2 li-1], [li-2 li-1 li], [li li+1 li+2], [li+1 li+2 li+3], [li+2 li+3 li+4]
Tetragrams [li-4 li-3 li-2 li-1], [li-3 li-2 li-1 li], [li-2 li-1 li li+1], [li-1 li li+1 li+2], [li li+1 li+2 li+3], [li+1 li+2 li+3 li+4]

Note that li-n and li+n represent the nth preceding and succeeding letters of a given letter li.
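
To make the template and the training loop concrete, here is a hedged sketch, not the actual indictrans implementation: ngram_features enumerates all n-grams of orders 1 to 4 inside the ±4 window around li (the trigram row above omits one window-internal trigram, which this sketch keeps), and greedy_decode is an emission-only stand-in for the full second-order Viterbi search. All function names are illustrative:

from collections import defaultdict

def ngram_features(letters, i, n_max=4, window=4):
    # pad so the +/-window context around position i always exists
    seq = ['_'] * window + list(letters) + ['_'] * window
    i += window
    feats = []
    for n in range(1, n_max + 1):                  # unigrams ... tetragrams
        for s in range(i - window, i + window - n + 2):
            feats.append((n, s - i, ''.join(seq[s:s + n])))
    return feats

def greedy_decode(src, w, alphabet):
    # emission-only stand-in for the Viterbi search: picks the best
    # target unit for each source letter independently
    return [max(alphabet,
                key=lambda t: sum(w[f, t] for f in ngram_features(src, i)))
            for i in range(len(src))]

def train_perceptron(data, epochs=10):
    # data: (source_letters, target_units) pairs, one target unit per
    # source letter, as in the aligned training data above
    alphabet = sorted({t for _, trg in data for t in trg})
    w = defaultdict(float)
    for _ in range(epochs):
        for src, trg in data:
            pred = greedy_decode(src, w, alphabet)   # Decode
            for i in range(len(src)):                # Update
                if pred[i] != trg[i]:
                    for f in ngram_features(src, i):
                        w[f, trg[i]] += 1.0          # promote gold features
                        w[f, pred[i]] -= 1.0         # demote predicted ones
    return w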

Till Mid-term Evaluation

My original proposal up to the mid-term evaluation was to develop Indic-to-Roman and Roman-to-Indic transliteration systems for all scheduled Indian languages. The major challenge in the process was extracting and cleaning the training data for these systems. The extracted data for each model is split using the 80-10-10 rule: 80% of the data for training the model, 10% for tuning the model parameters, and the remaining 10% for testing system accuracy. I am glad to report that I completed the work on time and the final systems are ready to use. The work summary up to the mid-term evaluation is given in the table below. Note that the training time is measured on an Intel Core i5 1.2GHz CPU with 8GB RAM, and it does not include data extraction or data loading time. Also note that Hindi, Nepali, Marathi, Konkani and Bodo all use the Devanagari script and thus share the Devanagari-Roman model, and Bengali and Assamese both use the Bengali model. Devanagari-Roman stands for both the Devanagari-to-Roman and the Roman-to-Devanagari systems, and both take approximately equal training time; the same holds for the rest of the models, hence the *2 notation in the table.

Model               Transliteration pairs extracted    Training time (hours)
Devanagari-Roman    87,520                             (7-9)*2
Bengali-Roman       44,939                             (4-6)*2
Gujarati-Roman      48,429                             (4-6)*2
Punjabi-Roman       33,841                             (4-5)*2
Malayalam-Roman     28,555                             (3-5)*2
Kannada-Roman       37,877                             (4-5)*2
Tamil-Roman         36,478                             (4-5)*2
Telugu-Roman        34,717                             (4-5)*2
Oriya-Roman         29,542                             (3-5)*2

How to use

from indictrans import Transliterator
# Hindi (Devanagari) to Roman
trn = Transliterator(source='hin', target='eng')

hin = """कांग्रेस पार्टी अध्यक्ष सोनिया गांधी, तमिलनाडु की मुख्यमंत्री जयललिता और रिज़र्व बैंक के गवर्नर रघुराम राजन के बीच एक समानता है.
ये सभी अलग-अलग कारणों से भारतीय जनता पार्टी के राज्यसभा सांसद सुब्रमण्यम स्वामी के निशाने पर हैं.
उनके जयललिता और सोनिया गांधी के पीछे पड़ने का कारण कथित भ्रष्टाचार है."""

eng = trn.transform(hin)
print(eng)
congress party adhyaksh sonia gandhi, tamilnadu kii mukhyamantri jayalalita our reserve baink ke governor raghuram rajan ke beech ek samanta hai.
ye sabi alag-alag carnon se bharatiya janata party ke rajyasabha saansad subramanyam swami ke nishane par hain.
unke jayalalita our sonia gandhi ke peeche padane ka kaaran kathith bhrashtachar hai.

# Roman back to Hindi (Devanagari)
trn = Transliterator(source='eng', target='hin')
hin_ = trn.transform(eng)
print(hin_)
कांगरेस पारटी अधयक सोनिया गांधी, तमिलनाडु की मुखयमांतरी जयललिता और रिज बैंक के गवरनर रघुराम राजन के बीच एक समानता है.
ये सभी अलग-अलग कारनों से भारतीय जनता पारटी के राजयसभा Mसद सुबरमणयम वामी के निशाने पर हैं.
उनके जयललिता और सोनिया गांधी के पीछे पडने का कारण कथित रषटाचार है.