Transliteration: Background & Approaches
Transliteration simply means conversion of a text from one script to another. Machine transliteration is the computer automated process of transcribing a character or word from one language script to another. Machine transliteration can play an important role in natural language application such as information retrieval and machine translation, especially for handling proper nouns and technical terms, cross-language applications, data mining and information retrieval system.
A thorough description of the project goals can be seen in the original proposal submitted to GSoC 2016. Here I will briefly list down our contribution towards machine transliteration module of Libindic.
- Indic to Indic Script Transliteration
- Indic to Roman Transliteration
A rule-based transliteration system uses character mappings defined between two scripts. There are five types of character mappings possible between two natural language scripts: One-to-One, One-to-Many, Many-to-One, One-to-None and None-to-One. A simple One-to-One character mapping would lead to a more accurate and easy to develop system. However, natural languages are inherently ambiguous. There are many cases of ambiguity in orthographic representation of languages world over. Distribution of different types of character mappings is not generally uniform, though. The distribution is usually skewed; unambiguous characters occur more often then their ambiguous counterparts. Nonetheless, to resolve the ambiguous mappings, heuristics are designed which take the surrounding context of an ambiguous letter into consideration.
- Rule-based transliteration systems between Indic scripts is essentially trivial. Indic scripts have a special property that their phonemes are one-to-one aligned between their Unicode tables. It essentially means that the transliteration between any two Indic scripts can be achieved by merely using their unicode tables.
- Rule-based systems do not require any kind of training data to develop the system.
- Requires domain expertise: To develop a rule-based system between two natural language scripts a domain expert is required.
- Missing phonemes in Indic scripts: The major concern with rule-based system for Indic-to-Indic transliteration is the missing phonemes in Indic scripts. For example, in Tamil there are no characters for d, dh, b, bh etc. d is pronounced as t, dh as th, b and bh are pronounced as p. Similarly, in Bengali there is no character for v, it is pronounced as b.
- Ambiguous Character Mappings: Another major issue with rule-based system for is ambiguous character mappings between two natural language scripts. For example, Tamil has the following ambiguous characters:
Character Mapping k k, kh, g, gh, h c c, ch, j, jh, s t t, th, d, dh T T, Th, D, Dh p p, ph, b, bh
Arthur Samuel defined machine learning as a “Field of study that gives computers the ability to learn without being explicitly programmed”. ML is a method of teaching computers to make and improve predictions or behaviors based on some training data. In case of transliteration, the training data are the transliteration pairs between the language scripts to be transliterated.
- A major benefit with transliteration system is that one is not required to be a domain expert of all the languages.
- ML systems are compact, more generic and less hectic to develop than the rule-based ones.
- A strong list of transliteration pairs is required to train the model. However, such lists are not readily available and are expensive to create manually.
- Selection of proper ML technique, data representation for training the models are some minor challenges.