Leveraging monolingual Hindi-English parallel corpus for translating Hinglish

Hinglish, a mix of Hindi and English, is one of the most widely spoken code-switched languages. Code-switching (or code-mixing) is the juxtaposition, within the same speech utterance, of grammatical units such as words, phrases, and clauses belonging to two or more languages. This is challenging for NLP algorithms, which are typically not designed for code-switched data, and their performance often degrades when applied to it. To improve performance, additional processing steps such as language identification, normalization, and back-transliteration may be necessary.

Here, we present a neural machine translation (NMT) model that translates Hinglish text directly, without requiring any pre-processing steps. We achieve this by leveraging monolingual Hindi-English parallel data: we use a transliteration model to romanize the Hindi side of our Hindi-English training data and thereby generate Hinglish-English translation data, and we then train a transformer-based translation model on the generated Hinglish data.


Challenges

  • Romanized Hindi:

    Hindi is usually written in Roman script when used in Hinglish, rather than in its original Devanagari script. However, most Hindi-English parallel corpora are available only in the original scripts, and even today there is hardly any Hinglish-English parallel corpus available for training. One way to overcome this issue is to generate a Hinglish-English parallel corpus by romanizing the Hindi side of existing parallel corpora. This requires a transliteration model to convert Hindi words into their romanized forms.

  • Spelling Variations:

    Since Roman is not the original script for Hindi, romanized Hindi words can take many spelling variations depending on user preference. For example, the word “मैं” can be written as “main”, “mein”, “mien”, “men”, “me”, etc. To handle all these variations at inference time, the model needs to see them during training. Thus, on top of romanizing the Hindi data, we also need to create multiple variations of each word. This in turn leads to an exploding vocabulary, since the vocabulary size grows by 3 to 5 times.

  • Code Mixing:

    Another challenge, and the most difficult one to handle, is translating code-mixed texts accurately. Since code-mixed patterns in monolingual Hindi-English data are next to non-existent, it would be very difficult for the model to learn to translate code-mixed texts. To translate these texts correctly, we have to add some code-mixed Hinglish data. Since translation models need huge amounts of training data (on the order of millions of sentence pairs), creating this data manually is impractical. One way around this is to synthetically generate code-mixed data by mixing words and phrases between parallel texts using various linguistic insights; a toy sketch of this idea follows below.
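
    As a toy illustration of this idea (not something we actually do in this post, since no synthetic code-mixed data is added here), the sketch below swaps some romanized Hindi words for English counterparts using a small hand-written lexicon. The lexicon, the mixing probability, and the word-level substitution strategy are purely illustrative assumptions; a real pipeline would derive substitutions from word alignments between the parallel sentences.

```python
import random

# Illustrative romanized-Hindi -> English lexicon; in practice such mappings
# would come from word alignments, not a hand-written dictionary.
LEXICON = {
    "madad": "help",
    "pehli": "first",
    "mausam": "weather",
    "sahayata": "help",
}

def synthesize_code_mix(hinglish_sentence, mix_prob=0.5, seed=0):
    """Randomly switch some Hindi tokens to English to create a synthetic
    code-mixed variant of a romanized Hindi sentence."""
    rng = random.Random(seed)
    tokens = hinglish_sentence.split()
    mixed = [LEXICON[t] if t in LEXICON and rng.random() < mix_prob else t
             for t in tokens]
    return " ".join(mixed)

# With mix_prob=1.0 every lexicon word is switched:
# "delhi jane wali sabse first flight kon si hai"
print(synthesize_code_mix("delhi jane wali sabse pehli flight kon si hai",
                          mix_prob=1.0))
```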


Creating Hinglish Data

As mentioned above, there is hardly any existing work on creating Hinglish-English parallel data or mining it from the web. Hindi-English parallel data, however, is available in large quantities, especially from resources such as CCMatrix, CCAligned and WikiMatrix. We use these Hindi-English parallel corpora to generate the Hinglish-English dataset with the help of a transliteration model.

Transliteration Model: We use a transformer-based sequence-to-sequence transliteration model trained on 87,520 Hindi-English transliteration pairs provided by Bhat et al. The training data and model architecture for the transliteration task are structured exactly as in a translation model. The only difference is the training data: we use word pairs instead of sentence pairs, and each pair is a sequence of characters rather than a sequence of words or byte-pair tokens. We used the predefined “transformer_iwslt_de_en” architecture from fairseq to train our transliteration model.
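
As a minimal sketch of this setup, the snippet below turns word-level transliteration pairs into character-level parallel files that a standard fairseq recipe can consume. The input file name and the fairseq-preprocess/fairseq-train commands shown in the comments are assumptions about the setup, not the exact configuration we used.

```python
# Convert word-level transliteration pairs into character-level "sentences"
# so a standard seq2seq translation recipe can be reused for transliteration.
# Assumed input: a TSV file with one "devanagari_word<TAB>roman_word" per line.
def to_char_sequence(word: str) -> str:
    return " ".join(word)  # e.g. "main" -> "m a i n"

with open("translit_pairs.tsv", encoding="utf-8") as pairs, \
        open("train.hi", "w", encoding="utf-8") as src, \
        open("train.rom", "w", encoding="utf-8") as tgt:
    for line in pairs:
        devanagari, roman = line.rstrip("\n").split("\t")
        src.write(to_char_sequence(devanagari) + "\n")
        tgt.write(to_char_sequence(roman) + "\n")

# The character-level files can then be binarized and trained with fairseq,
# for example:
#   fairseq-preprocess --source-lang hi --target-lang rom \
#       --trainpref train --destdir data-bin/translit
#   fairseq-train data-bin/translit --arch transformer_iwslt_de_en \
#       --optimizer adam --lr 5e-4 --max-tokens 4096 \
#       --criterion label_smoothed_cross_entropy
```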

Using the transliteration model, we romanize the Hindi part of our Hindi-English parallel corpora. To capture the multiple spelling variations found in Hinglish, we generate the top 5 transliterations of each Hindi word using beam decoding and randomly pick one of them to form the Hinglish sentence. More specifically, for each Hindi sentence we generate three Hinglish sentences. For the first sentence we use only the best transliteration of each word. For the second and third sentences, we randomly pick one of the top 5 transliterations of each word, giving two more Hinglish variants of the Hindi sentence. An example is shown in the table below.

Hindi:
दिल्ली से मुंबई का टिकट बुक करने में मेरी सहायता करें

Hinglish:
dilli se mumbai ka ticket book karne main meri sahaita karen
delhi se mumbay ka ticat buk karne men meri sahaayata karein
deli se mumbi ka tecket bok karney mein meri sahaitaa kare
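
A minimal sketch of this sampling step is shown below. The top_k callable stands in for the beam-decoded n-best output of the transliteration model (e.g. fairseq generation with beam size 5); it is passed in as a parameter because the exact inference interface is an implementation detail, and the stub used in the usage example is hypothetical.

```python
import random

def hinglish_variants(hindi_sentence, top_k, n_variants=3, seed=0):
    """Generate Hinglish variants of a Hindi sentence.

    top_k: callable mapping a Devanagari word to its top-k romanizations,
           e.g. the beam-decoded n-best list from the transliteration model.
    The first variant uses the 1-best romanization of every word; the
    remaining variants sample uniformly from each word's top-k list.
    """
    rng = random.Random(seed)
    candidates = [top_k(word) for word in hindi_sentence.split()]
    variants = [" ".join(c[0] for c in candidates)]  # 1-best variant
    for _ in range(n_variants - 1):
        variants.append(" ".join(rng.choice(c) for c in candidates))
    return variants

# Toy usage with a stub in place of the real transliteration model:
stub = {"मैं": ["main", "mein", "me", "men", "mien"]}
print(hinglish_variants("मैं", top_k=lambda w: stub[w]))
```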


At this stage, we have not added any code-mixed phrases or patterns to the training data. The idea is to explore how much code-mixing the model can handle without explicitly adding any code-mixed training data. The only signal the model has is the English borrowings present in the Hindi texts. We hypothesize that the model will work well on simple romanized Hindi texts but will perform poorly on complex code-mixed structures.


Model

Now that we have the training data, we train a transformer-based MT model. Before training, we use subword-nmt to learn and apply byte-pair encoding on the training data. We set a vocabulary size of 30,000 and learn a joint vocabulary for Hinglish and English; the joint vocabulary allows the model to better understand the usage of English words and phrases in a Hindi context. As with the transliteration model, we train the translation model with the predefined “transformer_iwslt_de_en” architecture from fairseq.
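
A rough sketch of this preprocessing step is shown below, assuming the standard subword-nmt command-line interface; the file names are illustrative, and learning the codes on the concatenation of both sides is one common way to obtain a joint vocabulary (subword-nmt also offers learn-joint-bpe-and-vocab for the same purpose). The exact options behind our model may differ.

```python
import subprocess

# Learn joint BPE codes over the Hinglish and English sides so that shared
# (English) tokens get the same segmentation in both languages.
subprocess.run(
    "cat train.hing train.en | subword-nmt learn-bpe -s 30000 -o bpe.codes",
    shell=True, check=True,
)

# Apply the learned codes to each side of the parallel corpus.
for lang in ("hing", "en"):
    subprocess.run(
        f"subword-nmt apply-bpe -c bpe.codes < train.{lang} > train.bpe.{lang}",
        shell=True, check=True,
    )
```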


Translation Examples

Romanized Hindi

Given below are some unseen romanized Hindi samples and their translations from the trained model. As expected, the model works well on simple romanized texts with no code-mixed patterns, even though they contain many borrowed English words.

Input: meri flight cancel ho gayi
Translation: my flight was cancelled .
Input: delhi jane wali sabse pehli flight kon si hai
Translation: which is the first flight to delhi ?
Input: muje 500 rupee transfer karne hai
Translation: i have to transfer 500 rupees .
Input: mujhy 500 rupye transfar karney hain
Translation: i have to transfer rs 500
Input: delhi se mumbai flight book karne main mujhe guide karen
Translation: guide me to book mumbai flight from delhi
Input: deli se mumbi flight book karne men madad karen
Translation: help to book a mumbai flight from deli
Input: best offers ke liye hamari site pe account banaye
Translation: create our site account for the best offers
Input: best offers pane ke liye hamari site pe register karen
Translation: register our site to get the best offers
Input: otp nahi aa raha hai
Translation: otp is not coming .
Input: mera tv on nahi ho raha hai
Translation: my tv is n't getting on .


Code-Mixing

Below are some code-mixed samples and their translations. The model is less accurate on these texts than on simple romanized Hindi, but the results are much better than we expected.

Input: pehli baar linkedin pe job offering ka message aaya hai
Translation: this is the first time linkedin pay job offering message has come .
Input: first time linkedin pe job offering ka message aaya hai
Translation: the first time linkedin pay job offer message has come .
Input: warm up match kon si site pe live aa raha hai
Translation: warm up match which site is coming live ?
Input: i thought mosam different hoga bas fog hai
Translation: i think the weather will be different .
Input: thand bhi odd even formula follow kar rahi hai
Translation: the cold is also following formula .
Input: tum kitne fake account banao ge
Translation: how many fake accounts are you created ?
Input: enjoying dilli ki sardi after a long time
Translation: enjoying delhi winter after a long time
Input: our life revolves around log kya kahenge
Translation: what life revolves around people
Input: some people are double standards ki dukaan
Translation: some people are double standards shops
Input: sunday is the weekly ghar ka saaf safai day
Translation: sunday is the weekly home .