Leveraging monolingual Hindi-English parallel corpus for translating Hinglish
Hinglish, a hybrid language combining elements of Hindi and English, is one of the most widely spoken code-switched languages. Code-switching, also known as code-mixing, is the use of grammatical units from two or more languages within a single speech utterance. Code-switching poses a challenge for Natural Language Processing (NLP) algorithms, which are typically not tailored to code-switched data, so their performance often degrades when applied to it. Making them effective usually requires additional processing steps such as language identification, normalization, and back-transliteration.
In this study, we propose a neural machine translation (NMT) model capable of translating Hinglish text directly, eliminating the need for any pre-processing steps. To accomplish this, we exploit monolingual Hindi-English parallel data: we employ a transliteration model to convert the Hindi side of our Hindi-English training data, producing a synthetic Hinglish-English translation dataset. We then train a transformer-based translation model on the synthesized Hinglish data.
Challenges
- Romanized Hindi: In Hinglish, Hindi is commonly written in the Roman script rather than its native Devanagari script. Most Hindi-English parallel corpora exist only in the original scripts, and Hinglish-English parallel corpora suitable for training are noticeably scarce. One way to address this limitation is to generate a Hinglish-English parallel corpus by romanizing the Hindi text in the existing parallel corpora, which in turn requires a transliteration model to convert Hindi words into their romanized forms.
- Spelling Variations: Hindi written in the Roman script has no standardized spelling, so a word may be spelled in several ways depending on user preference. For instance, the word “मैं” can be written as “main”, “mein”, “mien”, “men”, “me”, and so on. For the model to handle these variations at inference time, it must be exposed to them during training. Therefore, in addition to transliterating the Hindi data into the Roman script, we must generate multiple variants of each word. This, however, gives rise to a vocabulary-explosion problem, since the vocabulary grows by roughly a factor of three to five.
- Code Mixing: The most daunting challenge is accurately translating code-mixed texts. This becomes even harder when working with monolingual Hindi-English data, where code-mixed patterns are likely to be rare, making it difficult to teach a model to translate such texts correctly. Addressing this requires incorporating code-mixed Hinglish data into the training process; however, because translation models typically require training data on the order of millions of sentence pairs, creating this data manually is impractical. One way around this limitation is to generate code-mixed data synthetically by combining words and phrases from parallel texts, guided by linguistic insights.
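The synthetic-generation idea above can be sketched in a few lines. This is a minimal illustration, not a procedure used in this study: the alignment dictionary, the function name, and the example words are all hypothetical stand-ins for word alignments that would be extracted from a real parallel corpus.

```python
import random

def synthesize_code_mixed(hindi_words, aligned_english, swap_prob=0.5, rng=None):
    """Replace each Hindi word that has an aligned English counterpart
    with that counterpart, with probability swap_prob, producing a
    synthetic code-mixed sentence."""
    rng = rng or random.Random(0)
    mixed = []
    for word in hindi_words:
        if word in aligned_english and rng.random() < swap_prob:
            mixed.append(aligned_english[word])
        else:
            mixed.append(word)
    return " ".join(mixed)

# Toy alignment for "meri udaan radd ho gayi" ("my flight got cancelled"):
alignment = {"udaan": "flight", "radd": "cancel"}
print(synthesize_code_mixed("meri udaan radd ho gayi".split(), alignment))
```

With `swap_prob=1.0` every aligned word is replaced, yielding "meri flight cancel ho gayi"; varying the probability controls how heavily mixed the synthetic sentences are.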
Creating Hinglish Data
Previous research has produced little Hinglish-English parallel data, whether created manually or extracted from the web. However, a substantial amount of Hindi-English parallel data is available through resources such as CCMatrix, CCAligned, and WikiMatrix. We therefore leverage these existing Hindi-English parallel corpora to create a Hinglish-English dataset using a transliteration model.
Transliteration Model: We utilized a transformer-based sequence-to-sequence transliteration model trained on a dataset of 87,520 Hindi-English transliteration pairs provided by Bhat et al. The transliteration task closely mirrors translation in both data format and model architecture; the distinctions are that the model is trained on word pairs rather than sentence pairs, and that each training pair is a sequence of characters rather than a sequence of words or byte-pair tokens. To train our transliteration model, we employed the predefined “transformer_iwslt_de_en” architecture from the fairseq library.
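The character-level data preparation can be sketched as follows. This is a minimal sketch under the assumption that each Unicode code point (including Devanagari vowel signs) is treated as one token, since toolkits like fairseq expect whitespace-separated tokens; the actual preprocessing pipeline may differ.

```python
def to_char_sequence(word: str) -> str:
    """Split a word into space-separated characters, the token format a
    seq2seq toolkit expects when training on character sequences."""
    return " ".join(word)

# A Hindi-Roman transliteration pair becomes two character sequences:
src = to_char_sequence("सहायता")    # source side, one Devanagari char per token
tgt = to_char_sequence("sahayata")  # target side: "s a h a y a t a"
```

Each word pair in the 87,520-pair dataset would be converted this way before the usual preprocessing and training steps.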
In our study, we utilize the transliteration model to convert the Hindi segment of our Hindi-English parallel corpora into romanized form. To accommodate the various possible spelling variations that exist in Hinglish, we employ beam decoding to derive the top 5 transliterations for each Hindi word. From this pool of transliterations, one is randomly selected to construct the Hinglish sentence. Specifically, for every Hindi sentence, we generate three Hinglish sentences. The first sentence solely incorporates the best transliteration for each word, while the second and third sentences randomly choose one transliteration from the top 5 possibilities for each word, thereby producing two additional Hinglish variations of the original Hindi sentence. An illustrative example is provided in the following table.
| Hindi | दिल्ली से मुंबई का टिकट बुक करने में मेरी सहायता करें |
|---|---|
| Hinglish | dilli se mumbai ka ticket book karne main meri sahaita madad karen |
| | delhi se mumbay ka ticat buk karne men meri sahaayata karein |
| | deli se mumbi ka tecket bok karney mein meri sahaitaa kare |
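The variant-generation step described above can be sketched as follows; the `top5` table is a hypothetical stand-in for the transliteration model's beam-decoded candidates, and the function name is illustrative.

```python
import random

# Hypothetical top-5 beam outputs per Hindi word (a stand-in for the
# transliteration model's actual candidates):
top5 = {
    "मैं": ["main", "mein", "mien", "men", "me"],
    "सहायता": ["sahaita", "sahaayata", "sahaitaa", "sahayta", "sahayata"],
}

def hinglish_variants(hindi_words, candidates, n_extra=2, rng=None):
    """Build one 'best' romanization (top beam hypothesis per word) plus
    n_extra variants that sample uniformly from each word's candidates."""
    rng = rng or random.Random(0)
    best = " ".join(candidates[w][0] for w in hindi_words)
    extras = [
        " ".join(rng.choice(candidates[w]) for w in hindi_words)
        for _ in range(n_extra)
    ]
    return [best] + extras

variants = hinglish_variants(["मैं", "सहायता"], top5)  # 3 sentences total
```

The first returned sentence uses only the top transliteration per word; the other two sample from the top 5, matching the three-variant scheme used to build the training data.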
In the current phase of the study, no code-mixed phrases or patterns have been incorporated into the training data. The objective is to investigate the model's ability to handle code-mixing without any explicit code-mixed training data; the model relies solely on the English borrowings that occur naturally within Hindi texts. Our hypothesis is that the model will perform well on straightforward romanized Hindi texts but poorly on intricate code-mixed structures.
Model
Once the training data is obtained, a transformer-based machine translation model is trained. Before training, subword-nmt is used to learn and apply byte-pair encoding to the training data, with a vocabulary size of 30,000 and a joint vocabulary shared between Hinglish and English. The joint vocabulary helps the model understand the use of English terms and expressions within a Hindi context. As with the transliteration model, the translation model uses the predefined “transformer_iwslt_de_en” architecture provided by fairseq.
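The merge-learning at the heart of byte-pair encoding can be illustrated with a toy sketch. This is a simplified version of the algorithm behind subword-nmt, written for this post: the real tool also tracks word frequencies, uses end-of-word markers, and learns the joint vocabulary from the concatenated Hinglish and English sides.

```python
from collections import Counter

def merge_pair(word, a, b):
    """Merge every adjacent occurrence of symbols (a, b) in a word."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent
    symbol pair across the (toy) corpus of words."""
    vocab = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in vocab:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        vocab = [merge_pair(w, a, b) for w in vocab]
    return merges

merges = learn_bpe(["sahaita", "sahaayata", "sahayata"], 5)
```

Because spelling variants such as “sahaita” and “sahaayata” share long prefixes, early merges produce subword units common to all variants, which is one way BPE mitigates the vocabulary explosion noted earlier.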
Translation Examples
Romanized Hindi
The following table presents a collection of previously unseen romanized Hindi examples along with the translations generated by the trained model. As anticipated, the model performs effectively on straightforward romanized texts that do not involve code-mixed patterns, even though these samples contain a substantial number of borrowed English words.
| Input | Translation |
|---|---|
| meri flight cancel ho gayi | my flight was cancelled . |
| delhi jane wali sabse pehli flight kon si hai | which is the first flight to delhi ? |
| muje 500 rupee transfer karne hai | i have to transfer 500 rupees . |
| mujhy 500 rupye transfar karney hain | i have to transfer rs 500 |
| delhi se mumbai flight book karne main mujhe guide karen | guide me to book mumbai flight from delhi |
| deli se mumbi flight book karne men madad karen | help to book a mumbai flight from deli |
| best offers ke liye hamari site pe account banaye | create our site account for the best offers |
| best offers pane ke liye hamari site pe register karen | register our site to get the best offers |
| otp nahi aa raha hai | otp is not coming . |
| mera tv on nahi ho raha hai | my tv is n't getting on . |
Code-Mixing
The following examples present instances of code-mixed text alongside the translations generated by the trained model. The current model is clearly less accurate on these texts than on straightforward romanized Hindi; however, the outcomes surpass our initial expectations.
| Input | Translation |
|---|---|
| pehli baar linkedin pe job offering ka message aaya hai | this is the first time linkedin pay job offering message has come . |
| first time linkedin pe job offering ka message aaya hai | the first time linkedin pay job offer message has come . |
| warm up match kon si site pe live aa raha hai | warm up match which site is coming live ? |
| i thought mosam different hoga bas fog hai | i think the weather will be different . |
| thand bhi odd even formula follow kar rahi hai | the cold is also following formula . |
| tum kitne fake account banao ge | how many fake accounts are you created ? |
| enjoying dilli ki sardi after a long time | enjoying delhi winter after a long time |
| our life revolves around log kya kahenge | what life revolves around people |
| some people are double standards ki dukaan | some people are double standards shops |
| sunday is the weekly ghar ka saaf safai day | sunday is the weekly home . |