Module References
indictrans.transliterator — Transliterator
class indictrans.Transliterator(source='hin', target='eng', decode='viterbi', build_lookup=False)
Transliterator for Indic scripts including English and Urdu.
Parameters: source : str, default: hin
Source Language (3 letter ISO-639 code)
target : str, default: eng
Target Language (3 letter ISO-639 code)
decode : str, default: viterbi
Decoding algorithm, either “viterbi” or “beamsearch”.
build_lookup : bool, default: False
Flag to build a lookup table. Speeds up transliteration when the input text contains repeated words.
Examples
>>> from indictrans import Transliterator
>>> trn = Transliterator(source='hin', target='eng', build_lookup=True)
>>> hin = '''कांग्रेस पार्टी अध्यक्ष सोनिया गांधी, तमिलनाडु की मुख्यमंत्री
... जयललिता और रिज़र्व बैंक के गवर्नर रघुराम राजन के बीच एक
... समानता है. ये सभी अलग-अलग कारणों से भारतीय जनता पार्टी के
... राज्यसभा सांसद सुब्रमण्यम स्वामी के निशाने पर हैं. उनके
... जयललिता और सोनिया गांधी के पीछे पड़ने का कारण कथित
... भ्रष्टाचार है.'''
>>> eng = trn.transform(hin)
>>> print(eng)
congress party adhyaksh sonia gandhi, tamilnadu kii mukhyamantri
jayalalita our reserve baink ke governor raghuram rajan ke beech ek
samanta hai. ye sabi alag-alag carnon se bharatiya janata party ke
rajyasabha saansad subramanyam swami ke nishane par hain. unke
jayalalita our sonia gandhi ke peeche padane ka kaaran kathith
bhrashtachar hai.
Methods
convert
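The effect of the build_lookup flag can be pictured as word-level memoization. The following is a minimal sketch of that idea only; make_cached_transliterator and fake_word_model are hypothetical names introduced for illustration and are not part of indictrans, whose internal lookup table may be organised differently.

```python
from typing import Callable, Dict, List


def make_cached_transliterator(
    transliterate_word: Callable[[str], str]
) -> Callable[[str], str]:
    """Wrap a word-level function with a lookup table (hypothetical helper)."""
    lookup: Dict[str, str] = {}

    def transform(text: str) -> str:
        out = []
        for word in text.split():
            if word not in lookup:
                # Run the expensive decoder only the first time a word is seen.
                lookup[word] = transliterate_word(word)
            out.append(lookup[word])
        return " ".join(out)

    return transform


calls: List[str] = []


def fake_word_model(word: str) -> str:
    # Stand-in for a real decoder: records the call and upper-cases the word.
    calls.append(word)
    return word.upper()


transform = make_cached_transliterator(fake_word_model)
result = transform("ke beech ke peeche ke")
```

In this sketch the stand-in decoder is called once per distinct word, so a text with many repeated words triggers far fewer decoding calls than it has tokens.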
indictrans.base.BaseTransliterator — BaseTransliterator
class indictrans.base.BaseTransliterator(source, target, decoder, build_lookup=False)
Base class for transliterator.
Attributes
vectorizer_ : instance
OneHotEncoder instance for converting categorical features to one-hot features.
classes_ : dict
Dictionary of set of tags with unique ids ({id: tag}).
coef_ : array
HMM coefficient array.
intercept_init_ : array
HMM intercept array for first layer of trellis.
intercept_trans_ : array
HMM intercept/transition array for middle layers of trellis.
intercept_final_ : array
HMM intercept array for last layer of trellis.
wx_process : method
wx2utf/utf2wx method of WX instance.
nu : instance
UrduNormalizer instance for normalizing Urdu scripts.
Methods
base_fit
convert_to_wx
load_mappings
load_models
predict
top_n_trans
transliterate
predict(word, k_best=5)
Given encoded word matrix and HMM parameters, predicts the output sequence (target word).
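Decoding over a trellis with separate first-layer, transition, and last-layer parameters can be sketched with a plain Viterbi pass. This is an illustrative sketch, not the indictrans implementation: the function name viterbi and its argument names are assumptions, and the per-position scores are taken as given here, whereas in the library they would presumably be derived from the encoded word matrix and coef_. It returns only the single best path and does not cover k_best.

```python
import numpy as np


def viterbi(scores, init, trans, final):
    """Return the highest-scoring tag sequence as a list of tag ids.

    scores: (n_steps, n_tags) per-position tag scores
    init, final: (n_tags,) scores added at the first / last position
    trans: (n_tags, n_tags) score of moving from tag i to tag j
    """
    n_steps, n_tags = scores.shape
    best = np.zeros((n_steps, n_tags))
    back = np.zeros((n_steps, n_tags), dtype=int)

    best[0] = scores[0] + init
    for t in range(1, n_steps):
        # candidates[i, j]: best path ending in tag i at t-1, then moving to j.
        candidates = best[t - 1][:, None] + trans
        back[t] = candidates.argmax(axis=0)
        best[t] = candidates.max(axis=0) + scores[t]
    best[-1] += final

    # Follow the back-pointers from the best final tag.
    path = [int(best[-1].argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


scores = np.array([[2.0, 0.0], [0.0, 1.0], [1.5, 0.0]])
init = np.zeros(2)
final = np.zeros(2)
trans = np.array([[0.0, -5.0], [-5.0, 0.0]])  # switching tags is penalised
path = viterbi(scores, init, trans, final)
```

With these toy numbers a position-by-position argmax would pick tag 1 in the middle, but the transition penalty makes staying on tag 0 the best overall path.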
indictrans._utils.WX — WXConverter
class indictrans._utils.WX(order=u'utf2wx', lang=u'hin')
WX-converter for UTF to WX conversion of Indic scripts and vice-versa.
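Parameters: lang : str, default: hin
Language of the input script, given as a 3-letter code (for example hin or mal).
order : str, default: utf2wx
Direction of conversion, either “utf2wx” or “wx2utf”.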
Examples
>>> from wx import WX
>>> wxc = WX(lang='hin', order='utf2wx')
>>> hin_utf = u'''बीजेपी के सांसद सुब्रमण्यम स्वामी ने कुछ ही दिन पहले
... अपनी ही सरकार को कठघरे में खड़ा करते हुए जीडीपी आंकड़ों पर
... सवाल उठाए हैं.'''
>>> hin_wx = wxc.utf2wx(hin_utf)
>>> print(hin_wx)
bIjepI ke sAMsaxa subramaNyama svAmI ne kuCa hI xina pahale
apanI hI sarakAra ko kaTaGare meM KadZA karawe hue jIdIpI AMkadZoM para
savAla uTAe hEM.
>>> wxc = WX(lang='hin', order='wx2utf')
>>> hin_utf_ = wxc.wx2utf(hin_wx)
>>> print(hin_utf_)
बीजेपी के सांसद सुब्रमण्यम स्वामी ने कुछ ही दिन पहले
अपनी ही सरकार को कठघरे में खड़ा करते हुए जीडीपी आंकड़ों पर
सवाल उठाए हैं.
>>> wxc = WX(lang='mal', order='utf2wx')
>>> mal_utf = u'''വിപണിയിലെ ശുഭാപ്തിവിശ്വാസക്കാരായ കാളകള്ക്ക് അനുകൂലമായ
... രീതിയിലാണ് ബി എസ് ഇയില് വ്യാപാരം നടക്കുന്നത്.'''
>>> mal_wx = wxc.utf2wx(mal_utf)
>>> print(mal_wx)
vipaNiyileV SuBApwiviSvAsakkArAya kAlYakalYkk anukUlamAya
rIwiyilAN bi eVs iyil vyApAraM natakkunnaw.
>>> wxc = WX(lang='mal', order='wx2utf')
>>> mal_utf_ = wxc.wx2utf(mal_wx)
>>> print(mal_utf_)
വിപണിയിലെ ശുഭാപ്തിവിശ്വാസക്കാരായ കാളകള്ക്ക് അനുകൂലമായ
രീതിയിലാണ് ബി എസ് ഇയില് വ്യാപാരം നടക്കുന്നത്.
Methods
fit
initialize_utf2wx_hash
initialize_wx2utf_hash
iscii2unicode
iscii2unicode_ben
iscii2unicode_guj
iscii2unicode_hin
iscii2unicode_kan
iscii2unicode_mal
iscii2unicode_ori
iscii2unicode_pan
iscii2unicode_tam
iscii2unicode_tel
iscii2wx
map_EY
map_EY2
map_OY
map_OY2
map_Z
map_ZeV
map_ZoV
map_a
map_eV
map_eV2
map_lY
map_lYY
map_nY
map_oV
map_oV2
map_q
map_rY
normalize
unicode2iscii
unicode2iscii_ben
unicode2iscii_guj
unicode2iscii_hin
unicode2iscii_kan
unicode2iscii_mal
unicode2iscii_ori
unicode2iscii_pan
unicode2iscii_tam
unicode2iscii_tel
utf2wx
wx2iscii
wx2utf
indictrans._utils.OneHotEncoder — OneHotEncoder
class indictrans._utils.OneHotEncoder
Transforms categorical features to continuous numeric features.
Examples
>>> from one_hot_encoder import OneHotEncoder
>>> enc = OneHotEncoder()
>>> sequences = [list('bat'), list('cat'), list('rat')]
>>> enc.fit(sequences)
<one_hot_encoder.OneHotEncoder instance at 0x7f346d71c200>
>>> enc.transform(sequences, sparse=False).astype(int)
array([[0, 1, 0, 1, 1],
       [1, 0, 0, 1, 1],
       [0, 0, 1, 1, 1]])
>>> enc.transform(list('cat'), sparse=False).astype(int)
array([[1, 0, 0, 1, 1]])
>>> enc.transform(list('bat'), sparse=True)
<1x5 sparse matrix of type '<type 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
Methods
fit
transform
fit(X)
Fit OneHotEncoder to X.
Parameters: X : array-like, shape [n_samples, n_features]
Input array of type int.
Returns: self
transform(X, sparse=True)
Transform X using one-hot encoding.
Parameters: X : array-like, shape [n_samples, n_features]
Input array of categorical features.
sparse : bool, default: True
Return sparse matrix if set True else return an array.
Returns: X_out : sparse matrix if sparse=True else a 2-d array, dtype=int
Transformed input.
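The fit/transform contract described here can be illustrated with a small dense encoder. This is a sketch under stated assumptions, not the indictrans code: the class name PositionalOneHot is hypothetical, it always returns a dense array, and it orders the columns within each feature alphabetically, so the column order may differ from the output shown in the Examples section.

```python
import numpy as np


class PositionalOneHot:
    """Illustrative one-hot encoder; not the indictrans implementation."""

    def fit(self, X):
        n_features = len(X[0])
        # One sorted vocabulary per column (feature position).
        self.vocab_ = [sorted({row[j] for row in X}) for j in range(n_features)]
        sizes = [len(v) for v in self.vocab_]
        # Column offset at which each feature's block of indicators starts.
        self.offsets_ = np.concatenate(([0], np.cumsum(sizes)[:-1]))
        self.n_columns_ = int(sum(sizes))
        return self

    def transform(self, X):
        out = np.zeros((len(X), self.n_columns_), dtype=int)
        for i, row in enumerate(X):
            for j, value in enumerate(row):
                if value in self.vocab_[j]:  # unseen values stay all-zero
                    out[i, self.offsets_[j] + self.vocab_[j].index(value)] = 1
        return out


sequences = [list("bat"), list("cat"), list("rat")]
enc = PositionalOneHot().fit(sequences)
encoded = enc.transform(sequences)
```

Each input row sets exactly one indicator per feature position, which is why every encoded row above contains three ones across five columns.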
indictrans._utils.UrduNormalizer — UrduNormalizer
class indictrans._utils.UrduNormalizer
Normalizer for Urdu scripts. Normalizes different Unicode canonical equivalences to a single Unicode code point.
Examples
>>> from script_normalizer import UrduNormalizer
>>> text = u'''ﺎﻧ کﻭ ﻍیﺮﻗﺎﻧﻮﻧی ﺝگہ کﺱ ﻥے ﺩی؟
... ﻝﻭگﻭں کﻭ ﻖﺘﻟ کیﺍ ﺝﺍﺭ ہﺍ ہے ۔
... ﺏڑے ﻡﺎﻣﻭں ﺎﻧ ﺪﻧﻭں ﻢﺤﻟہ ﺥﺩﺍﺩﺍﺩ ﻡیں ﺭہﺕے ﺕھے۔
... ﻉﻭﺎﻣی یﺍ ﻑﻼﺣی ﺥﺪﻣﺎﺗ ﺍیک ﺎﻟگ ﺩﺎﺋﺭہ ﻊﻤﻟ ہے۔'''
>>> nu = UrduNormalizer()
>>> print(nu.normalize(text))
ان کو غیرقانونی جگہ کس نے دی؟
لوگوں کو قتل کیا جار ہا ہے ۔
بڑے ماموں ان دنوں محلہ خداداد میں رہتے تھے۔
عوامی یا فلاحی خدمات ایک الگ دائرہ عمل ہے۔
Methods
cnorm
normalize
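The kind of mapping involved can be illustrated with Python's standard unicodedata module, which folds Arabic presentation forms back to their base letters under NFKC. This is only an analogy using the standard library; it is not how UrduNormalizer is implemented, and the library's own mapping may cover cases that NFKC does not.

```python
import unicodedata

# U+FE8E is ARABIC LETTER ALEF FINAL FORM and
# U+FEE7 is ARABIC LETTER NOON INITIAL FORM (both are presentation forms).
presentation = "\ufe8e\ufee7"

# NFKC replaces each presentation form with its base letter:
# U+0627 ARABIC LETTER ALEF and U+0646 ARABIC LETTER NOON.
normalized = unicodedata.normalize("NFKC", presentation)

code_points = [f"U+{ord(ch):04X}" for ch in normalized]
```

After normalization, text typed with different code points for the same letter compares equal, which is what makes downstream lookup and model features consistent.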
indictrans.trunk.StructuredPerceptron — StructuredPerceptron
class indictrans.trunk.StructuredPerceptron(lr_exp=0.1, n_iter=15, random_state=None, verbose=0)
Structured perceptron for sequence classification.
The implementation is based on the averaged structured perceptron algorithm of M. Collins.
Parameters: lr_exp : float, default: 0.1
The exponent used for inverse scaling of the learning rate. Given iteration number t, the effective learning rate is 1. / (t ** lr_exp).
n_iter : int, default: 15
Maximum number of epochs of the structured perceptron algorithm
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
verbose : int, default: 0 (quiet mode)
Verbosity mode.
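The inverse-scaling formula given for lr_exp can be evaluated directly to see how quickly the step size decays. The helper name effective_learning_rate below is hypothetical, and the sketch assumes the iteration number t starts at 1.

```python
def effective_learning_rate(t: int, lr_exp: float = 0.1) -> float:
    """Learning rate at iteration t (t >= 1) under inverse scaling."""
    return 1.0 / (t ** lr_exp)


# Rates at the first, second, and fifteenth iteration with the default exponent.
rates = [effective_learning_rate(t) for t in (1, 2, 15)]

# A larger exponent makes the rate fall off faster.
fast_decay = effective_learning_rate(4, lr_exp=1.0)
```

With the default exponent of 0.1 the rate starts at 1.0 and shrinks slowly across epochs; an exponent of 1.0 gives the familiar 1/t schedule.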
References
M. Collins (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP.
Methods
fit
predict