K-Best Transliterations

No matter how amazing the ML approach used to model a problem, it can never guarantee 100% results or even close. The accuracy of the model mostly depends on th number of training samples, the material impact of training samples on the variable being predicted and ML algorithm selected to train the model. We can only approach the best possible results by selecting a proper ML technique which itself depends on the problem definition. Machine transliteration has many applications like information retrieval, machine translation (handling named-entities), cross-language applications, information retrieval systems, input tools etc. Some of these applications can benefit if we can generate k-best transliterations for an input string rather than a single one. For example, in machine translation if we generate some k transliterations for a named-entity, we can select the best transliteration for the word using any decoding technique like viterbi, A* search etc. In case we use transliteration for input tools, it would highly benefit the user if we provide some k-best transliterations, so the user can pick one that suits the best.

Viterbi Decoding

So far the machine transliteration system returns a single best output by applying viterbi decoding (exact search) on the learned model parameters (emission and transition matrices). The Viterbi algorithm requires knowledge of the parameters of the HMM model and given an observation sequence (source string in this case), it finds the state sequence (target string) that is most likely to have generated that observation sequence. It works by finding a maximum over all possible state sequences using the following two steps:

  • Forward step, calculate path to the best state given all the previously observed states and observations.
  • Backward step, reproduce the path.

Can we use Viterbi search for selecting k-best output sequences? The answer is no. The reason lies in the forward step of the algorithm, viterbi does not keep track of the 2nd best, 3rd best or in general nth best state given previously observed states and observations. So we can not reproduce the n-best paths.

Beam search decoding

Even though beam search is most often used to maintain tractability in large systems with insufficient amount of memory to store the entire search tree, it can also be used to select the k-best output sequences given an HMM model and the observation sequence. Beam search is exactly like viterbi search except that it keeps track of only the k-best hypotheses (k is the beam size) at each step and drop the remaining low scoring hypotheses. This allows us to reproduce all these k-best paths.

from indictrans import transliterator
r2i = transliterator(
                    decode='beam search',

['इंडिक्ट्रांस', 'इंडिक्ट्राँस', 'इंडिक्ट्रास', 'इंडिक्ट्रान्स', 'इण्डिक्ट्रांस']

['लिबिंदिक', 'लीबिंदिक', 'लिबिन्दिक', 'लिबिंदिय', 'लिबिंदिक्']

['दिल्चस्प', 'दिलचस्प', 'दिल्चेस्प', 'दिलचेस्प', 'दिल्चसप']

['श्रुति', 'श्रुती', 'शरुति', 'श्युति', 'श्रुथि']

Effect of Beam Size

In many problems, particularly with generative models, training is done exactly, but decoding is done using an inexact search. For k-best transliterations I followed the same approach. But it has been shown that if learning and decoding are done in the same search framework, the accuracy of the results improve. So there is a possibility of improvement in system performance if we use beam search decoding at training time as well. The only issue with this approach is that we need to find an optimal beam width to train the system i.e., we need to cross validate the system by training it for multiple beam widths, which is quite time consuming.

Another possible improvement for k-best transliterations is that if we optimize for k-best outputs at training time, while in the current scenario we learn the model parameters by optimizing for the single best output whether decoding being exact or inexact.

Resolving ambiguities

K-best transliterations can also help in resolving ambiguities which is not possible in case of single best output. For example, the Hindi word “main” is ambiguous in Roman script. It can either mean “मैं (I)” or “में (in)” based on its context. The single best transliteration system will always transliterate this word to either of the two but if we generate k-best transliterations for the word, we can resolve the ambiguity by using the context of the word. One simplest approach is to train a language model on a raw Hindi corpus and perform sentence-level decoding on the n-best transliterations from the perceptron model.