Leveraging monolingual Hindi-English parallel corpus for translating Hinglish

Hinglish, a hybrid of Hindi and English, is considered one of the most commonly spoken code-switched languages. Code-switching, also known as code-mixing, is the use of grammatical units from two or more languages within a single utterance. Natural Language Processing (NLP) algorithms are typically not designed for code-switched data, so their performance may degrade when they are applied to it. Improving their effectiveness may require additional processing steps such as language identification, normalization, and back-transliteration.

In this study, we propose a neural machine translation (NMT) model that translates Hinglish text directly, without any pre-processing steps. To this end, we start from Hindi-English parallel data in which each side is monolingual, and use a transliteration model to convert the Hindi side into Hinglish, yielding a synthetic Hinglish-English translation dataset. We then train a transformer-based translation model on the synthesized data.
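
The data-synthesis step can be illustrated with a minimal sketch. It assumes that transliteration here means rendering Devanagari Hindi in Roman script, and it replaces the transliteration model with a tiny rule-based lookup table. The tables and the function names (`transliterate_word`, `transliterate_sentence`, `synthesize_hinglish_pairs`) are hypothetical and are not the system described in this study.

```python
from typing import Callable, Iterable, List, Tuple

# Toy Devanagari-to-Roman tables. These are an illustrative assumption:
# they cover only a few characters and stand in for a learned model.
CONSONANTS = {
    "क": "k", "ग": "g", "ज": "j", "त": "t", "द": "d", "न": "n",
    "प": "p", "ब": "b", "म": "m", "य": "y", "र": "r", "ल": "l",
    "व": "v", "स": "s", "ह": "h",
}
VOWELS = {"अ": "a", "आ": "aa", "इ": "i", "ई": "ee", "उ": "u", "ए": "e", "ओ": "o"}
MATRAS = {"ा": "aa", "ि": "i", "ी": "ee", "ु": "u", "ू": "oo",
          "े": "e", "ै": "ai", "ो": "o", "ौ": "au"}
VIRAMA = "्"    # suppresses the inherent vowel of the preceding consonant
ANUSVARA = "ं"  # nasalisation mark, romanised here as "n"


def transliterate_word(word: str) -> str:
    """Romanise one whitespace-delimited Devanagari word with the toy tables."""
    out: List[str] = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # Emit the inherent "a" only when another consonant or an
            # anusvara follows; it is dropped before a vowel sign, before
            # a virama, and at the end of the word.
            if nxt in CONSONANTS or nxt == ANUSVARA:
                out.append("a")
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch == ANUSVARA:
            out.append("n")
        elif ch == VIRAMA:
            continue
        else:
            out.append(ch)  # pass through anything the tables do not cover
    return "".join(out)


def transliterate_sentence(sentence: str) -> str:
    """Romanise a sentence word by word."""
    return " ".join(transliterate_word(w) for w in sentence.split())


def synthesize_hinglish_pairs(
    pairs: Iterable[Tuple[str, str]],
    transliterate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Turn (Hindi, English) pairs into (romanised Hindi, English) pairs.

    The English side is left untouched; only the Hindi side is rewritten.
    """
    return [(transliterate(hindi), english) for hindi, english in pairs]


hindi_english = [("मेरा नाम राम है", "My name is Ram")]
hinglish_english = synthesize_hinglish_pairs(hindi_english, transliterate_sentence)
print(hinglish_english)  # [('meraa naam raam hai', 'My name is Ram')]
```

Passing the transliterator as a callable keeps the pairing logic independent of how transliteration is done, so a trained model could be substituted for the toy rules without changing `synthesize_hinglish_pairs`. The resulting pairs would then serve as training data for the translation model.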