Model Setup & Training
======================

.. _example-train:

Train and Test
--------------

Assuming your data is in ``tnt`` format, you can encode the data and train a
:class:`indictrans.trunk.StructuredPerceptron` classifier.

.. code:: python

    from indictrans import trunk

    # load training data
    X, y = trunk.load_data('indictrans/trunk/tests/hin2rom.tnt')
    # build ngram-context
    X = trunk.build_context(X, ngram=4)
    # fit encoder
    enc, X = trunk.fit_encoder(X)
    # train structured-perceptron model
    clf = trunk.train_sp(X, y, n_iter=5, verbose=2)
    Iteration 1 ...
    Train-set error = 1.5490
    Iteration 2 ...
    Train-set error = 1.0040
    Iteration 3 ...
    Train-set error = 0.8030
    Iteration 4 ...
    Train-set error = 0.6900
    Iteration 5 ...

This will train the perceptron for 5 epochs (specified via the ``n_iter``
parameter). Then you can use the trained classifier as follows:

.. code:: python

    # load testing data
    X_test, y = trunk.load_data('indictrans/trunk/tests/hin2rom.tnt')
    # build ngram-context for testing data; the ngram value
    # should be the same as for the train-set
    X_test = trunk.build_context(X_test, ngram=4)
    # encode test-set
    X_test = [enc.transform(x) for x in X_test]
    # predict output sequences
    y_ = clf.predict(X_test)

    >>> y[10]    # True
    [u'c', u'l', u'a', u'ne', u'_']
    >>> y_[10]   # Predicted
    [u'c', u'l', u'a', u'n', u'_']
    >>> y[100]   # True
    [u'p', u'a', u'r', u'aa', u'n', u'd', u'e']
    >>> y_[100]  # Predicted
    [u'p', u'a', u'r', u'aa', u'n', u'd', u'e']

Note that you need to ``build_context`` for the test data using the same
``ngram`` value as was used for the training data. You also need to encode
the test data with the encoder ``enc`` fitted on the training data.

.. _example-train-from-console:

Train directly from Console
---------------------------

`indictrans-trunk` provides a much easier way to train, test and save models
directly from the console.

.. parsed-literal::

    user@indic-trans$ indictrans-trunk --help

    -d , --data-file      training data-file: set of sequences
    -o , --output-dir     output directory to dump trained models
    -n , --ngrams         ngram context for feature extraction: default 4
    -e , --lr-exp         the exponent used for inverse scaling of learning rate: default 0.1
    -m , --max-iter       maximum number of iterations for training: default 15
    -r , --random-state   random seed for shuffling sequences within each iteration
    -l , --verbosity      verbosity level: default 0 (quiet mode)
    -t , --test-file      testing data-file: optional: stores output sequences in `test_file.out`

    user@indic-trans$ indictrans-trunk -d hin2rom.tnt -o /tmp/rom-ind/ -n 4 -e 0.1 -m 5 -l 3 -t hin2rom.tnt
    Iteration 1 ...
    First sequence comparison:
        0-27 0-95 0-30 0-10 ... loss: 4
    Train-set error = 1.8090
    Iteration 2 ...
    First sequence comparison:
        120-46 86-86 63-63 120-120 95-95 123-123 10-10 ... loss: 1
    Train-set error = 0.6560
    Iteration 3 ...
    First sequence comparison:
        123-123 110-110 40-40 46-46 ... loss: 0
    Train-set error = 0.3820
    Iteration 4 ...
    First sequence comparison:
        2-2 95-95 86-86 77-77 64-64 31-31 120-120 80-80 10-10 ... loss: 0
    Train-set error = 0.2240
    Iteration 5 ...
    First sequence comparison:
        40-40 120-120 31-31 120-120 125-125 120-120 123-123 117-117 31-31 120-120 ... loss: 0
    Train-set error = 0.1540
    Testing ...

Assuming ``hin2rom.tnt`` was given as the ``test-file``, the output file will
be generated with the name ``hin2rom.tnt.out``.
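
Once you have predictions, you will usually want a summary measure rather than
spot-checking individual sequences. Below is a minimal evaluation sketch in
plain Python, assuming ``y`` (gold) and ``y_`` (predicted) are the label
sequences from the :ref:`example-train` example above; the per-token error
shown here is an illustration and is not necessarily computed the same way as
the ``Train-set error`` reported during training.

.. code:: python

    # evaluation sketch: assumes `y` and `y_` are lists of label
    # sequences, as produced by `trunk.load_data` and `clf.predict`
    # in the example above

    def token_error_rate(gold, pred):
        """Fraction of mismatched tokens over all sequences."""
        errors, total = 0, 0
        for g_seq, p_seq in zip(gold, pred):
            for g, p in zip(g_seq, p_seq):
                errors += (g != p)
                total += 1
        return errors / float(total)

    def sequence_accuracy(gold, pred):
        """Fraction of sequences predicted exactly right."""
        correct = sum(1 for g, p in zip(gold, pred) if g == p)
        return correct / float(len(gold))

    print('Token error rate : %.4f' % token_error_rate(y, y_))
    print('Sequence accuracy: %.4f' % sequence_accuracy(y, y_))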
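
The console tool dumps trained models to ``output-dir`` for you. When training
programmatically, remember that prediction needs both the classifier ``clf``
and the encoder ``enc``, so they should be persisted together. The following
is one possible sketch using the standard ``pickle`` module; that ``clf`` and
``enc`` are picklable is an assumption for illustration, not a documented
guarantee of `indictrans-trunk`.

.. code:: python

    import pickle

    from indictrans import trunk

    # persistence sketch: save the classifier and the fitted encoder
    # together, along with the ngram value used to build the context
    # (assumes both objects are picklable -- an assumption, not a
    # documented API of `indictrans-trunk`)
    with open('/tmp/rom-ind/hin2rom.model', 'wb') as fp:
        pickle.dump({'clf': clf, 'enc': enc, 'ngram': 4}, fp)

    # ... later, in another session ...
    with open('/tmp/rom-ind/hin2rom.model', 'rb') as fp:
        saved = pickle.load(fp)
    clf, enc = saved['clf'], saved['enc']

    # the restored objects are used exactly as before
    X_new, _ = trunk.load_data('indictrans/trunk/tests/hin2rom.tnt')
    X_new = trunk.build_context(X_new, ngram=saved['ngram'])
    X_new = [enc.transform(x) for x in X_new]
    y_new = clf.predict(X_new)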