Leveraging monolingual Hindi-English parallel corpus for translating Hinglish

Hinglish, a hybrid language combining elements of Hindi and English, is one of the most widely spoken code-switched languages. Code-switching, also known as code-mixing, is the use of grammatical units from two or more languages within a single utterance. This poses a challenge for Natural Language Processing (NLP) algorithms, which are typically not designed to handle code-switched data, and their performance often degrades when applied to it. To remain effective, they may require additional processing steps such as language identification, normalization, and back-transliteration.

In this study, we propose a neural machine translation (NMT) model capable of translating Hinglish text directly, eliminating the need for any pre-processing steps. To accomplish this, we exploit monolingual Hindi-English parallel data and employ a transliteration model to convert the Hindi side of our Hindi-English training data, yielding a Hinglish-English translation dataset. We then train a transformer-based translation model on the synthesized Hinglish data.


Challenges

  • Romanized Hindi:

    In Hinglish, Hindi is commonly written in the Roman script rather than its native Devanagari script. Most Hindi-English parallel corpora exist only in the original scripts, and Hinglish-English parallel corpora suitable for training are scarce. One way to address this limitation is to generate a Hinglish-English parallel corpus by romanizing the Hindi text in existing parallel corpora, which in turn requires a transliteration model to convert Hindi words into their romanized forms.

  • Spelling Variations:

    Romanized Hindi has no standardized spelling, so the same word can be written in multiple ways depending on user preference. For instance, the word “मैं” can be written as “main”, “mein”, “mien”, “men”, “me”, and so on. For the model to account for these variations at inference time, it must be exposed to them during training. Therefore, in addition to transliterating the Hindi data into the Roman script, we need to generate multiple variants of each word. This, however, gives rise to vocabulary explosion: the vocabulary grows by roughly a factor of three to five.

  • Code Mixing:

    One of the most daunting challenges is accurately translating code-mixed texts. This task becomes even more difficult when working with monolingual Hindi-English data, as the amount of code-mixed patterns is likely to be minimal. Consequently, teaching a model to correctly translate these texts becomes a significant challenge. To address this issue, it is necessary to incorporate code-mixed Hinglish data into the training process. However, due to the large amount of training data required by translation models (typically in the order of millions), creating this data manually is not a practical solution. To circumvent this limitation, one possible approach is to generate code-mixed data synthetically by combining words and phrases from parallel texts, leveraging various linguistic insights.
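As a rough illustration of that last idea, the sketch below substitutes aligned English words into a romanized Hindi sentence to synthesize a code-mixed variant. The alignment dictionary here is hand-made purely for illustration; in practice, word alignments would come from an alignment tool run over the parallel corpus.

```python
import random

# Hypothetical word-level alignment between romanized Hindi words and their
# English translations; a real pipeline would derive these automatically.
ALIGNMENT = {
    "ticket": "ticket",
    "book": "book",
    "sahayata": "help",
}

def synthesize_code_mixed(hinglish_tokens, alignment, mix_prob=0.5, seed=0):
    """Randomly replace aligned Hindi words with their English counterparts
    to produce a synthetic code-mixed sentence."""
    rng = random.Random(seed)
    mixed = []
    for tok in hinglish_tokens:
        if tok in alignment and rng.random() < mix_prob:
            mixed.append(alignment[tok])
        else:
            mixed.append(tok)
    return " ".join(mixed)

sentence = "mumbai ka ticket book karne mein meri sahayata karen".split()
print(synthesize_code_mixed(sentence, ALIGNMENT))
```

Varying `mix_prob` controls how much English is mixed in, so a range of code-mixing intensities can be generated from the same parallel sentence.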


Creating Hinglish Data

Previous research indicates that little Hinglish-English parallel data has been created or mined from the web. However, a substantial amount of Hindi-English parallel data is available through resources such as CCMatrix, CCAligned, and WikiMatrix. We therefore leverage these existing Hindi-English parallel corpora, together with a transliteration model, to create a Hinglish-English dataset.

Transliteration Model: We used a transformer-based sequence-to-sequence transliteration model trained on 87,520 Hindi-English transliteration pairs provided by Bhat et al. The setup closely follows that of a translation model, with two distinctions: the training pairs are words rather than sentences, and each pair is a sequence of characters rather than a sequence of words or byte-pair tokens. We trained the model with the predefined “transformer_iwslt_de_en” architecture from the fairseq library.
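The character-level data preparation this describes can be sketched in a few lines: each (Devanagari, Roman) word pair is split into space-separated characters, producing source/target lines in the format fairseq's preprocessing expects. This is a minimal stand-in for the actual preparation script, which is not shown in the text.

```python
def to_char_sequence(word):
    """Split a word into space-separated characters, the unit the
    transliteration model is trained on (in place of BPE tokens)."""
    return " ".join(word)

def prepare_pairs(pairs):
    """Turn (Devanagari, Roman) word pairs into parallel source/target
    lines, one word pair per line."""
    src_lines = [to_char_sequence(hi) for hi, _ in pairs]
    tgt_lines = [to_char_sequence(en) for _, en in pairs]
    return src_lines, tgt_lines

pairs = [("दिल्ली", "dilli"), ("टिकट", "ticket")]
src, tgt = prepare_pairs(pairs)
print(src[0])  # द ि ल ् ल ी
print(tgt[0])  # d i l l i
```

The resulting files would then be binarized with fairseq-preprocess and trained with fairseq-train using --arch transformer_iwslt_de_en, as described above.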

In our study, we utilize the transliteration model to convert the Hindi segment of our Hindi-English parallel corpora into romanized form. To accommodate the various possible spelling variations that exist in Hinglish, we employ beam decoding to derive the top 5 transliterations for each Hindi word. From this pool of transliterations, one is randomly selected to construct the Hinglish sentence. Specifically, for every Hindi sentence, we generate three Hinglish sentences. The first sentence solely incorporates the best transliteration for each word, while the second and third sentences randomly choose one transliteration from the top 5 possibilities for each word, thereby producing two additional Hinglish variations of the original Hindi sentence. An illustrative example is provided in the following table.

Hindi:     दिल्ली से मुंबई का टिकट बुक करने में मेरी सहायता करें
Hinglish:  dilli se mumbai ka ticket book karne main meri sahaita karen
           delhi se mumbay ka ticat buk karne men meri sahaayata karein
           deli se mumbi ka tecket bok karney mein meri sahaitaa kare
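The variant-generation procedure above can be sketched as follows. The top-5 candidate lists here are hypothetical stand-ins for the transliteration model's beam outputs:

```python
import random

# Hypothetical top-5 beam candidates per Hindi word; in the study these
# come from beam decoding with the trained transliteration model.
TOP5 = {
    "मैं": ["main", "mein", "mien", "men", "me"],
    "घर": ["ghar", "ghur", "gher", "garh", "ghaar"],
}

def make_hinglish_variants(hindi_tokens, top5, n_variants=3, seed=0):
    """Produce one best-transliteration sentence plus (n_variants - 1)
    sentences that pick randomly from each word's top-5 candidates."""
    rng = random.Random(seed)
    best = " ".join(top5[t][0] for t in hindi_tokens)
    variants = [best]
    for _ in range(n_variants - 1):
        variants.append(" ".join(rng.choice(top5[t]) for t in hindi_tokens))
    return variants

for v in make_hinglish_variants(["मैं", "घर"], TOP5):
    print(v)
```

The first variant always uses the single best transliteration per word; the remaining two inject spelling diversity, matching the three-sentences-per-source scheme described above.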


In the current phase of the study, code-mixed phrases or patterns have not been incorporated into the training data. The objective is to investigate the model’s ability to handle code-mixing without explicit code-mixed training data, relying only on the English borrowings that occur naturally within Hindi texts. We hypothesize that the model will perform well on straightforward romanized Hindi texts but poorly on intricate code-mixed structures.


Model

Once the training data is obtained, a transformer-based machine translation model is trained. Prior to training, subword-nmt is used to learn and apply byte-pair encoding on the training data, with a vocabulary size of 30,000 and a joint vocabulary shared between Hinglish and English. The joint vocabulary helps the model understand English terms and expressions used within a Hindi context. As with the transliteration model, the translation model is trained using the predefined “transformer_iwslt_de_en” architecture provided by fairseq.
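The effect of learning one joint vocabulary can be shown at toy scale. The sketch below is a plain-Python stand-in for subword-nmt (not the actual tool): it learns BPE merge operations over a combined word list, so frequent character sequences shared by English words and their romanized-Hindi borrowings end up as common subword units.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the word vocabulary."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(vocab, pair):
    """Apply one merge: replace each occurrence of the pair with the joined symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(words, num_merges):
    """Learn BPE merges over a joint corpus (Hinglish and English words
    pooled together), mirroring the joint-vocabulary setup at toy scale."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges

# Joint toy corpus: English spellings and Hinglish spelling variants pooled.
corpus = ["ticket", "ticket", "ticat", "book", "buk", "booking"]
print(learn_bpe(corpus, 5))
```

Because merges are learned over the pooled corpus, a borrowed word like “ticket” segments identically whichever side it appears on, which is what lets the translation model recognize English material inside Hinglish input.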


Translation Examples

Romanized Hindi

The following table presents a collection of previously unseen romanized Hindi examples along with their translations generated by the trained model. As anticipated, the model performs well on straightforward romanized texts without code-mixed patterns, even though these samples contain a substantial number of borrowed English words.

Input: meri flight cancel ho gayi
Translation: my flight was cancelled .
Input: delhi jane wali sabse pehli flight kon si hai
Translation: which is the first flight to delhi ?
Input: muje 500 rupee transfer karne hai
Translation: i have to transfer 500 rupees .
Input: mujhy 500 rupye transfar karney hain
Translation: i have to transfer rs 500
Input: delhi se mumbai flight book karne main mujhe guide karen
Translation: guide me to book mumbai flight from delhi
Input: deli se mumbi flight book karne men madad karen
Translation: help to book a mumbai flight from deli
Input: best offers ke liye hamari site pe account banaye
Translation: create our site account for the best offers
Input: best offers pane ke liye hamari site pe register karen
Translation: register our site to get the best offers
Input: otp nahi aa raha hai
Translation: otp is not coming .
Input: mera tv on nahi ho raha hai
Translation: my tv is n't getting on .


Code-Mixing

The following examples present instances of code-mixed text alongside their translations generated by the trained model. The current model is clearly less accurate on these texts than on straightforward romanized Hindi; however, the outcomes surpass our initial expectations.

Input: pehli baar linkedin pe job offering ka message aaya hai
Translation: this is the first time linkedin pay job offering message has come .
Input: first time linkedin pe job offering ka message aaya hai
Translation: the first time linkedin pay job offer message has come .
Input: warm up match kon si site pe live aa raha hai
Translation: warm up match which site is coming live ?
Input: i thought mosam different hoga bas fog hai
Translation: i think the weather will be different .
Input: thand bhi odd even formula follow kar rahi hai
Translation: the cold is also following formula .
Input: tum kitne fake account banao ge
Translation: how many fake accounts are you created ?
Input: enjoying dilli ki sardi after a long time
Translation: enjoying delhi winter after a long time
Input: our life revolves around log kya kahenge
Translation: what life revolves around people
Input: some people are double standards ki dukaan
Translation: some people are double standards shops
Input: sunday is the weekly ghar ka saaf safai day
Translation: sunday is the weekly home .