Module References

`indictrans.transliterator` — Transliterator

class indictrans.Transliterator(source='hin', target='eng', decode='viterbi', build_lookup=False)

Transliterator for Indic scripts, with additional support for English and Urdu.

Parameters:

source : str, default: hin

Source language (three-letter ISO 639 code).

target : str, default: eng

Target language (three-letter ISO 639 code).

decode : str, default: viterbi

Decoding algorithm, either 'viterbi' or 'beamsearch'.

build_lookup : bool, default: False

Whether to build a lookup table. This speeds up transliteration when the input text contains repeated words.
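The benefit of a lookup table for repeated words can be pictured as memoization at the word level. The following is a minimal sketch under that assumption, not the library's implementation: `CachedTransliterator`, `toy_word_fn`, and the `calls` counter are hypothetical names introduced only for illustration.

```python
class CachedTransliterator:
    """Wraps a word-level transliteration function with a lookup table."""

    def __init__(self, word_fn, build_lookup=False):
        self.word_fn = word_fn            # hypothetical per-word transliterator
        self.build_lookup = build_lookup
        self.lookup = {}                  # word -> transliteration
        self.calls = 0                    # how often word_fn actually ran

    def _convert_word(self, word):
        # Reuse a stored result when the lookup table is enabled.
        if self.build_lookup and word in self.lookup:
            return self.lookup[word]
        self.calls += 1
        out = self.word_fn(word)
        if self.build_lookup:
            self.lookup[word] = out
        return out

    def transform(self, text):
        return " ".join(self._convert_word(w) for w in text.split())


def toy_word_fn(word):
    # Placeholder for an expensive model call; it only upper-cases the word.
    return word.upper()


plain = CachedTransliterator(toy_word_fn, build_lookup=False)
cached = CachedTransliterator(toy_word_fn, build_lookup=True)
text = "ke beech ke peeche ke"
plain.transform(text)
cached.transform(text)
print(plain.calls, cached.calls)  # 5 3
```

With the table enabled, the word-level function runs once per distinct word rather than once per token, which is where the saving on repetitive text comes from.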

Examples

>>> from indictrans import Transliterator
>>> trn = Transliterator(source='hin', target='eng', build_lookup=True)
>>> hin = '''कांग्रेस पार्टी अध्यक्ष सोनिया गांधी, तमिलनाडु की मुख्यमंत्री
... जयललिता और रिज़र्व बैंक के गवर्नर रघुराम राजन के बीच एक
... समानता है. ये सभी अलग-अलग कारणों से भारतीय जनता पार्टी के
... राज्यसभा सांसद सुब्रमण्यम स्वामी के निशाने पर हैं. उनके
... जयललिता और सोनिया गांधी के पीछे पड़ने का कारण कथित
... भ्रष्टाचार है.'''
>>> eng = trn.transform(hin)
>>> print(eng)
congress party adhyaksh sonia gandhi, tamilnadu kii mukhyamantri
jayalalita our reserve baink ke governor raghuram rajan ke beech ek
samanta hai. ye sabi alag-alag carnon se bharatiya janata party ke
rajyasabha saansad subramanyam swami ke nishane par hain. unke
jayalalita our sonia gandhi ke peeche padane ka kaaran kathith
bhrashtachar hai.

Methods

`convert`

`indictrans.base.BaseTransliterator` — BaseTransliterator

class indictrans.base.BaseTransliterator(source, target, decoder, build_lookup=False)

Base class for transliterators.

Attributes

vectorizer_ : instance

OneHotEncoder instance for converting categorical features to one-hot features.

classes_ : dict

Dictionary mapping unique ids to tags ({id: tag}).

coef_ : array

HMM coefficient array.

intercept_init_ : array

HMM intercept array for the first layer of the trellis.

intercept_trans_ : array

HMM intercept/transition array for the middle layers of the trellis.

intercept_final_ : array

HMM intercept array for the last layer of the trellis.

wx_process : method

The wx2utf or utf2wx method of a WX instance.

nu : instance

UrduNormalizer instance for normalizing Urdu text.

Methods

`base_fit`
`convert_to_wx`
`load_mappings`
`load_models`
`predict`
`top_n_trans`
`transliterate`

convert_to_wx(text): Converts Indic scripts to WX.

load_models(): Loads the transliteration models.

predict(word, k_best=5): Given an encoded word matrix and the HMM parameters, predicts the output sequence (the target word).

top_n_trans(text, k_best=5)

Returns the k-best transliterations using beam search decoding.

Parameters:

k_best : int, default: 5, optional

Number of transliterations returned by the beam search decoder.
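To illustrate what k-best beam search decoding over a trellis looks like, here is a minimal pure-Python sketch. It is not the library's decoder. The arguments `init`, `trans`, and `final` are hypothetical stand-ins that loosely mirror the `intercept_init_`, `intercept_trans_`, and `intercept_final_` attributes, and `emissions` is an assumed per-position score table.

```python
import heapq


def beam_search(emissions, init, trans, final, k_best=5):
    """Return up to k_best (score, tag_path) pairs, best first.

    emissions[t][j]: score of tag j at position t
    init[j], final[j]: scores for starting / ending in tag j
    trans[i][j]: score of moving from tag i to tag j
    """
    n_tags = len(init)
    # Each hypothesis is (score, path); start with every possible first tag.
    beam = [(init[j] + emissions[0][j], [j]) for j in range(n_tags)]
    beam = heapq.nlargest(k_best, beam)
    for t in range(1, len(emissions)):
        candidates = [
            (score + trans[path[-1]][j] + emissions[t][j], path + [j])
            for score, path in beam
            for j in range(n_tags)
        ]
        beam = heapq.nlargest(k_best, candidates)  # keep only the k best
    # Add the final-layer score before the last ranking.
    finished = [(score + final[path[-1]], path) for score, path in beam]
    return heapq.nlargest(k_best, finished)


# Toy trellis with two tags and three positions (made-up numbers).
emissions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
init = [0.5, 0.0]
trans = [[0.0, 0.2], [0.2, 0.0]]
final = [0.0, 0.1]

results = beam_search(emissions, init, trans, final, k_best=3)
for score, path in results:
    print(path)
```

Because hypotheses are pruned at every position, the result is an approximation of the true k-best list unless the beam is wide enough to hold every partial path.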

transliterate(text, k_best=None): Returns the single best transliteration using Viterbi decoding.
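Viterbi decoding finds the single highest-scoring tag path through the trellis by dynamic programming. The sketch below shows the idea in pure Python under stated assumptions: `init`, `trans`, and `final` are hypothetical score tables that loosely correspond to the first, middle, and last trellis layers, and `emissions` is an assumed per-position score table. How the library derives these from `coef_` is not shown here.

```python
def viterbi(emissions, init, trans, final):
    """Return (best_score, best_path) for one sequence.

    emissions[t][j]: score of tag j at position t
    init[j], final[j]: scores for starting / ending in tag j
    trans[i][j]: score of moving from tag i to tag j
    """
    n_tags = len(init)
    # score[t][j]: best score of any path that ends in tag j at position t
    score = [[init[j] + emissions[0][j] for j in range(n_tags)]]
    back = []  # back[t - 1][j]: best previous tag for tag j at position t
    for t in range(1, len(emissions)):
        row, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[-1][i] + trans[i][j])
            row.append(score[-1][best_i] + trans[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score.append(row)
        back.append(ptr)
    # Close the trellis with the final-layer scores.
    last = max(range(n_tags), key=lambda j: score[-1][j] + final[j])
    best_score = score[-1][last] + final[last]
    # Follow the back-pointers from the last tag to the first.
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return best_score, path


# Toy trellis with two tags and three positions (made-up numbers).
emissions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
init = [0.5, 0.0]
trans = [[0.0, 0.2], [0.2, 0.0]]
final = [0.0, 0.1]

best_score, best_path = viterbi(emissions, init, trans, final)
print(best_path)  # [0, 1, 0]
```

Unlike a pruned beam, this recursion keeps the best predecessor for every tag at every position, so the returned path is exact for the given scores.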

`indictrans._utils.WX` — WXConverter

class indictrans._utils.WX(order=u'utf2wx', lang=u'hin')

WX converter for converting Indic scripts from UTF to WX and vice versa.

Parameters:

lang : str, default: hin

Language of the input script, for example 'hin' or 'mal'.

order : str, default: utf2wx

Direction of conversion, either 'utf2wx' or 'wx2utf'.

Examples

>>> from indictrans._utils import WX
>>> wxc = WX(lang='hin', order='utf2wx')
>>> hin_utf = u'''बीजेपी के सांसद सुब्रमण्यम स्वामी ने कुछ ही दिन पहले
... अपनी ही सरकार को कठघरे में खड़ा करते हुए जीडीपी आंकड़ों पर
... सवाल उठाए हैं.'''
>>> hin_wx = wxc.utf2wx(hin_utf)
>>> print(hin_wx)
bIjepI ke sAMsaxa subramaNyama svAmI ne kuCa hI xina pahale
apanI hI sarakAra ko kaTaGare meM KadZA karawe hue jIdIpI AMkadZoM para
savAla uTAe hEM.
>>> wxc = WX(lang='hin', order='wx2utf')
>>> hin_utf_ = wxc.wx2utf(hin_wx)
>>> print(hin_utf_)
बीजेपी के सांसद सुब्रमण्यम स्वामी ने कुछ ही दिन पहले
अपनी ही सरकार को कठघरे में खड़ा करते हुए जीडीपी आंकड़ों पर
सवाल उठाए हैं.
>>> wxc = WX(lang='mal', order='utf2wx')
>>> mal_utf = u'''വിപണിയിലെ ശുഭാപ്തിവിശ്വാസക്കാരായ കാളകള്‍ക്ക് അനുകൂലമായ
... രീതിയിലാണ് ബി എസ് ഇയില്‍ വ്യാപാരം നടക്കുന്നത്.'''
>>> mal_wx = wxc.utf2wx(mal_utf)
>>> print(mal_wx)
vipaNiyileV SuBApwiviSvAsakkArAya kAlYakalYkk anukUlamAya
rIwiyilAN bi eVs iyil vyApAraM natakkunnaw.
>>> wxc = WX(lang='mal', order='wx2utf')
>>> mal_utf_ = wxc.wx2utf(mal_wx)
>>> print(mal_utf_)
വിപണിയിലെ ശുഭാപ്തിവിശ്വാസക്കാരായ കാളകള്ക്ക് അനുകൂലമായ
രീതിയിലാണ് ബി എസ് ഇയില് വ്യാപാരം നടക്കുന്നത്.

Methods

`fit`
`initialize_utf2wx_hash`
`initialize_wx2utf_hash`
`iscii2unicode`
`iscii2unicode_ben`
`iscii2unicode_guj`
`iscii2unicode_hin`
`iscii2unicode_kan`
`iscii2unicode_mal`
`iscii2unicode_ori`
`iscii2unicode_pan`
`iscii2unicode_tam`
`iscii2unicode_tel`
`iscii2wx`
`map_EY`
`map_EY2`
`map_OY`
`map_OY2`
`map_Z`
`map_ZeV`
`map_ZoV`
`map_a`
`map_eV`
`map_eV2`
`map_lY`
`map_lYY`
`map_nY`
`map_oV`
`map_oV2`
`map_q`
`map_rY`
`normalize`
`unicode2iscii`
`unicode2iscii_ben`
`unicode2iscii_guj`
`unicode2iscii_hin`
`unicode2iscii_kan`
`unicode2iscii_mal`
`unicode2iscii_ori`
`unicode2iscii_pan`
`unicode2iscii_tam`
`unicode2iscii_tel`
`utf2wx`
`wx2iscii`
`wx2utf`

iscii2unicode(iscii): Converts ISCII to Unicode.

iscii2wx(my_string): Converts ISCII to WX.

normalize(text): Performs common normalization, which includes:

- Removal of the byte order mark, word joiner, etc.
- Removal of ZERO_WIDTH_NON_JOINER and ZERO_WIDTH_JOINER.
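The kind of clean-up listed above can be sketched with `str.translate`. This is only an illustration: the set of code points in `REMOVABLE` and the name `strip_invisible` are assumptions, and the library's `normalize` may handle more characters than these.

```python
# Code points to delete; this particular set is an illustrative assumption.
REMOVABLE = {
    0xFEFF: None,  # BYTE ORDER MARK (ZERO WIDTH NO-BREAK SPACE)
    0x2060: None,  # WORD JOINER
    0x200C: None,  # ZERO WIDTH NON-JOINER
    0x200D: None,  # ZERO WIDTH JOINER
}


def strip_invisible(text):
    """Delete the invisible formatting characters listed in REMOVABLE."""
    return text.translate(REMOVABLE)


# A BOM, a conjunct written with ZWJ, a stray ZWNJ, and a trailing word joiner.
sample = "\ufeff\u0915\u094d\u200d\u0937 \u200cka\u2060"
cleaned = strip_invisible(sample)
print([hex(ord(ch)) for ch in cleaned])
```

Mapping a code point to `None` in the translation table deletes it, so the visible letters and spaces pass through unchanged.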

unicode2iscii(unicode_): Converts Unicode to ISCII.

utf2wx(unicode_): Converts a UTF string to WX-Roman.

wx2iscii(my_string): Converts WX to ISCII.

wx2utf(wx): Converts WX-Roman to UTF.

`indictrans._utils.OneHotEncoder` — OneHotEncoder

class indictrans._utils.OneHotEncoder

Transforms categorical features into binary (one-hot) numeric features.

Examples

>>> from indictrans._utils import OneHotEncoder
>>> enc = OneHotEncoder()
>>> sequences = [list('bat'), list('cat'), list('rat')]
>>> enc.fit(sequences)
<one_hot_encoder.OneHotEncoder instance at 0x7f346d71c200>
>>> enc.transform(sequences, sparse=False).astype(int)
array([[0, 1, 0, 1, 1],
       [1, 0, 0, 1, 1],
       [0, 0, 1, 1, 1]])
>>> enc.transform(list('cat'), sparse=False).astype(int)
array([[1, 0, 0, 1, 1]])
>>> enc.transform(list('bat'), sparse=True)
<1x5 sparse matrix of type '<type 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>

Methods

`fit`
`transform`

fit(X)

Fit OneHotEncoder to X.

Parameters:

X : array-like, shape [n_samples, n_features]

Input array of categorical features.

Returns:

self

transform(X, sparse=True)

Transform X using one-hot encoding.

Parameters:

X : array-like, shape [n_samples, n_features]

Input array of categorical features.

sparse : bool, default: True

If True, return a sparse matrix; otherwise return a dense array.

Returns:

X_out : sparse matrix if sparse=True else a 2-d array, dtype=int

Transformed input.
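The fit/transform pair can be pictured with a small per-column encoder. This is a simplified sketch, not the library's class: `TinyOneHotEncoder` is a hypothetical name, it returns plain lists rather than sparse matrices, and it sorts each column's categories, so its column order differs from the order shown in the example output above.

```python
class TinyOneHotEncoder:
    """Per-column one-hot encoder for equal-length categorical rows."""

    def fit(self, X):
        # One sorted vocabulary per column keeps the output order deterministic.
        columns = list(zip(*X))
        self.vocab_ = [sorted(set(col)) for col in columns]
        return self

    def transform(self, X):
        out = []
        for row in X:
            encoded = []
            for value, vocab in zip(row, self.vocab_):
                # A value not seen during fit encodes as all zeros.
                encoded.extend(1 if value == v else 0 for v in vocab)
            out.append(encoded)
        return out


sequences = [list("bat"), list("cat"), list("rat")]
enc = TinyOneHotEncoder().fit(sequences)
print(enc.transform(sequences))
# [[1, 0, 0, 1, 1], [0, 1, 0, 1, 1], [0, 0, 1, 1, 1]]
```

The first column has three observed values (b, c, r) and the other two have one each, which is why every row is five columns wide.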

`indictrans._utils.UrduNormalizer` — UrduNormalizer

class indictrans._utils.UrduNormalizer

Normalizer for Urdu script. Maps different canonically equivalent Unicode forms to a single Unicode code point.

Examples

>>> from indictrans._utils import UrduNormalizer
>>> text = u'''ﺎﻧ کﻭ ﻍیﺮﻗﺎﻧﻮﻧی ﺝگہ کﺱ ﻥے ﺩی؟
... ﻝﻭگﻭں کﻭ ﻖﺘﻟ کیﺍ ﺝﺍﺭ ہﺍ ہے ۔
... ﺏڑے ﻡﺎﻣﻭں ﺎﻧ ﺪﻧﻭں ﻢﺤﻟہ ﺥﺩﺍﺩﺍﺩ ﻡیں ﺭہﺕے ﺕھے۔
... ﻉﻭﺎﻣی یﺍ ﻑﻼﺣی ﺥﺪﻣﺎﺗ ﺍیک ﺎﻟگ ﺩﺎﺋﺭہ ﻊﻤﻟ ہے۔'''
>>> nu = UrduNormalizer()
>>> print(nu.normalize(text))
ان کو غیرقانونی جگہ کس نے دی؟
لوگوں کو قتل کیا جار ہا ہے ۔
بڑے ماموں ان دنوں محلہ خداداد میں رہتے تھے۔
عوامی یا فلاحی خدمات ایک الگ دائرہ عمل ہے۔

Methods

`cnorm`
`normalize`

cnorm(text): Normalizes NO_BREAK_SPACE, SOFT_HYPHEN, WORD_JOINER, H_SPACE, ZERO_WIDTH[SPACE, NON_JOINER, JOINER], MARK[LEFT_TO_RIGHT, RIGHT_TO_LEFT, BYTE_ORDER, BYTE_ORDER_2].

normalize(text): Normalizes the text.
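Folding positional presentation forms onto their base letters, as in the UrduNormalizer example, can be approximated with the standard library. The sketch below uses Unicode NFKC normalization as a stand-in; it is not the library's mapping table, and NFKC may change more (or fewer) characters than UrduNormalizer does. The function name `fold_presentation_forms` is hypothetical.

```python
import unicodedata


def fold_presentation_forms(text):
    # NFKC replaces compatibility characters, including Arabic presentation
    # forms, with their base letters.
    return unicodedata.normalize("NFKC", text)


# U+FE8E (ALEF FINAL FORM) followed by U+FEE7 (NOON INITIAL FORM)
raw = "\ufe8e\ufee7"
folded = fold_presentation_forms(raw)
print([f"U+{ord(ch):04X}" for ch in folded])  # ['U+0627', 'U+0646']
```

After folding, both characters are the plain ALEF and NOON code points, so text that looks identical on screen also compares equal as a string.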

`indictrans.trunk.StructuredPerceptron` — StructuredPerceptron

class indictrans.trunk.StructuredPerceptron(lr_exp=0.1, n_iter=15, random_state=None, verbose=0)

Structured perceptron for sequence classification.

The implementation is based on the averaged structured perceptron algorithm of M. Collins.

Parameters:

lr_exp : float, default: 0.1

Exponent used for inverse scaling of the learning rate. Given iteration number t, the effective learning rate is 1. / (t ** lr_exp).

n_iter : int, default: 15

Maximum number of epochs of the structured perceptron algorithm.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

verbose : int, default: 0 (quiet mode)

Verbosity mode.
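The inverse-scaling formula given for lr_exp can be evaluated directly. In this short sketch the function name `effective_learning_rate` is hypothetical, and t is assumed to start at 1.

```python
def effective_learning_rate(t, lr_exp=0.1):
    """Inverse-scaling learning rate for iteration t (assumed 1-based)."""
    return 1.0 / (t ** lr_exp)


# The rate starts at 1.0 and decays slowly for a small exponent.
for t in (1, 5, 15):
    print(t, round(effective_learning_rate(t), 3))
```

A larger lr_exp makes the rate fall faster; with lr_exp=0.5, for instance, the rate at t=16 is 0.25.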

References

M. Collins (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP.

Methods

`fit`
`predict`

fit(X, y)

Fit the model to the given set of sequences.

Parameters:

X : {array-like, sparse matrix}, shape (n_sequences, sequence_length, n_features)

Feature matrix of train sequences.

y : list of arrays, shape (n_sequences, sequence_length)

Target labels.

Returns:

self : object

Returns self.

predict(X)

Predict output sequences for input sequences in X.

Parameters:

X : {array-like, sparse matrix}, shape (n_sequences, sequence_length, n_features)

Feature matrix of test sequences.

Returns:

y : array, shape (n_sequences, sequence_length)

Labels per sequence in X.
