| trainmodel | -data train.data \ | data file containing document boundaries |
| | -vocab train.vocab \ | vocabulary file; each word in the vocabulary must appear in the data file |
| | -lap2 lap2file \ | lap2 file containing the parameters for the SVD; neither of its two integer values may exceed the number of documents |
| | -docbound thisisdocumentboundary \ | the document boundary string; if it is ###, pass it in quotes as "###". Each document boundary must be on a line of its own |
| | -output lsa-train | basename for output files |
To see all parameters of trainmodel, call trainmodel -help.
To train an N-gram model, use the ngram-count program, which is documented on the SRILM website.
train.data example:
###
they didn't want an objective study
###
they just wanted to prevent the onslaught of the bums
###
boston and similar markets are exceptions to the rule though
train.vocab example:
an
didn't
exceptions
prevent
rule
similar
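The two constraints stated for trainmodel (each document boundary on a line of its own, every vocabulary word present in the data file) can be checked before training. The following is a minimal Python sketch, not part of the toolkit: the function names read_documents and missing_vocab_words are made up for illustration, and it assumes whitespace tokenization and that any text before the first boundary is ignored.

```python
def read_documents(lines, docbound="###"):
    """Split the data lines into documents; each boundary line starts a new one."""
    docs, current = [], None
    for line in lines:
        line = line.strip()
        if line == docbound:
            current = []
            docs.append(current)
        elif line and current is not None:
            # Assumption: words are separated by whitespace.
            current.extend(line.split())
    return docs


def missing_vocab_words(vocab, docs):
    """Return the vocabulary words that never occur in any document."""
    seen = {word for doc in docs for word in doc}
    return [word for word in vocab if word not in seen]


data = """###
they didn't want an objective study
###
they just wanted to prevent the onslaught of the bums
###
boston and similar markets are exceptions to the rule though
""".splitlines()
vocab = ["an", "didn't", "exceptions", "prevent", "rule", "similar"]

docs = read_documents(data)
print(len(docs))                         # number of documents: 3
print(missing_vocab_words(vocab, docs))  # empty list: the vocabulary is covered
```

The document count printed here is also the upper bound for the two integer values in the lap2 file.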
| lsalm | -lsamodel lsa-train.term.entrop \ | term-entropy lsa model file |
| | -interpolate infg \ | interpolation method |
| | -docbound thisisdocumentboundary \ | the document boundary string; if it is ###, pass it in quotes as "###". If not set, the sentence boundary is used |
| | -ppl evalfile \ | file to calculate perplexities on |
| | -lm train-ngram.gz \ | language model |
| | -order 4 \ | language model order |
| | -debug 2 > lsa-train-infg-evalfile | debug level and output file |
To test an N-gram model, omit the first three parameters (-lsamodel, -interpolate and -docbound) of the lsalm call in the example above. The ngram program is used in the same way. A good intro to ngram and ngram-count is this .
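The difference between the calls can be made explicit by assembling them from the same flag lists. This is only a sketch in Python that builds and prints the command lines without running anything; the flags are the ones from the example, and the file names are the placeholders used there.

```python
import shlex

# Flags copied from the example; the file names are placeholders.
lsa_args = ["-lsamodel", "lsa-train.term.entrop",
            "-interpolate", "infg",
            "-docbound", "###"]
ngram_args = ["-ppl", "evalfile",
              "-lm", "train-ngram.gz",
              "-order", "4",
              "-debug", "2"]

lsalm_cmd = ["lsalm", *lsa_args, *ngram_args]  # lsa model plus n-gram
lsalm_ngram_only_cmd = ["lsalm", *ngram_args]  # first three parameters omitted
ngram_cmd = ["ngram", *ngram_args]             # SRILM ngram, same parameters

for cmd in (lsalm_cmd, lsalm_ngram_only_cmd, ngram_cmd):
    # shlex.join quotes ### so that the shell does not read it as a comment.
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell and redirected to an output file as in the example.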
To see all parameters of lsalm, call lsalm -help. Besides the last four parameters above, the output also lists all other parameters of ngram. Only the following parameters are lsalm-specific.
| -lsamodel: | term-entropy lsa model file | |
| -lsamodel1: | term-entropy lsa model file 1 | |
| -lsamodel2: | term-entropy lsa model file 2 | |
| -lsamodel3: | term-entropy lsa model file 3 | |
| -lsamodel4: | term-entropy lsa model file 4 | |
| -lsamodel5: | term-entropy lsa model file 5 | |
| -lsamodel6: | term-entropy lsa model file 6 | |
| -lsamodel7: | term-entropy lsa model file 7 | |
| -lsamodel8: | term-entropy lsa model file 8 | |
| -modelname: | modelname for html output | Default value: "m0" |
| -modelname1: | modelname1 for html output | Default value: "m1" |
| -modelname2: | modelname2 for html output | Default value: "m2" |
| -modelname3: | modelname3 for html output | Default value: "m3" |
| -modelname4: | modelname4 for html output | Default value: "m4" |
| -modelname5: | modelname5 for html output | Default value: "m5" |
| -modelname6: | modelname6 for html output | Default value: "m6" |
| -modelname7: | modelname7 for html output | Default value: "m7" |
| -modelname8: | modelname8 for html output | Default value: "m8" |
| -lambdangram: | ngram model weight for linear, loglinear and infg interpolation. Default value has no effect for infg interpolation! | Default value: 1.0 |
| -lambdalsa: | lsamodel weight for linear, loglinear and infg interpolation. Default value has no effect for infg interpolation! | Default value: 1.0 |
| -lambdalsa1: | lsamodel1 weight for linear, loglinear and infg interpolation | Default value: 1.0 |
| -lambdalsa2: | lsamodel2 weight for linear, loglinear and infg interpolation | Default value: 1.0 |
| -lambdalsa3: | lsamodel3 weight for linear, loglinear and infg interpolation | Default value: 1.0 |
| -lambdalsa4: | lsamodel4 weight for linear, loglinear and infg interpolation | Default value: 1.0 |
| -lambdalsa5: | lsamodel5 weight for linear, loglinear and infg interpolation | Default value: 1.0 |
| -lambdalsa6: | lsamodel6 weight for linear, loglinear and infg interpolation | Default value: 1.0 |
| -lambdalsa7: | lsamodel7 weight for linear, loglinear and infg interpolation | Default value: 1.0 |
| -lambdalsa8: | lsamodel8 weight for linear, loglinear and infg interpolation | Default value: 1.0 |
| -wordcluster: | file of wordcluster centers | NOT YET IMPLEMENTED! |
| -binmodel: | binmodel file | |
| -nonorm: | don't use normalization for loglin and infg interpolation | |
| -1besthist: | add 1-best of last utterance to pseudodoc history, for rescoring | |
| -lsaslice: | maximum probability part for lsa models for infg interpolation | Default value: 0.5 |
| -nosqrts: | don't use the square root of the singular values | |
| -initsent: | init pseudodoc at beginning of sentence, default if docbound is not set | |
| -html: | output html comparison file | use with -debug 2 |
| -probsum: | calculate sum of probs for debugging | NOT YET IMPLEMENTED! |
| -docbound: | document boundary | |
| -decay: | weight for lambda decay in (0,1] | Default value: 0.9 |
| -exp: | exp for similarity smoothing | Default value: 5 |
| -interpolate: | interpolation method for lsa and n-gram [infg, loglin, lin] | Default value: "infg" LIN NOT IMPLEMENTED! |
| -zentropmin: | update zero entropies to this value | Default value: 0 |
| -skipzentrop: | delete zero entropy words from lsa model | |
| -trainlambda: | does training of lambda values for lsa-ngram interpolation | |
| -doccluster: | document cluster center file | |
| -dumpfeat: | dump similarity features, needs doccluster | |
| -mindumpsim: | minimum similarity for counting as feature | Default value: 0.5 |
If the similarity is greater than the last interval boundary, the last interval's bin probability is used.
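The rule for the last interval can be pictured with a small lookup. This is a hypothetical Python sketch and does not reflect the actual binmodel file format, which is not described here: the boundaries, the probabilities and the function name bin_probability are invented for illustration, and it assumes that each bin is identified by its upper interval boundary.

```python
from bisect import bisect_left


def bin_probability(similarity, upper_bounds, bin_probs):
    """Return the probability of the bin that the similarity falls into.

    upper_bounds must be sorted in ascending order and have one entry
    per bin (hypothetical layout, see the assumptions above).
    """
    # Index of the first interval whose upper boundary is >= similarity.
    index = bisect_left(upper_bounds, similarity)
    # Beyond the last boundary: fall back to the last interval's bin.
    index = min(index, len(bin_probs) - 1)
    return bin_probs[index]


upper_bounds = [0.25, 0.5, 0.75]  # invented interval boundaries
bin_probs = [0.1, 0.2, 0.4]       # invented bin probabilities

print(bin_probability(0.3, upper_bounds, bin_probs))  # second bin
print(bin_probability(0.9, upper_bounds, bin_probs))  # above the last boundary: last bin
```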
we did not talk about pipes
p( we | <s> ) = [2gram] [0.0115263] 0.0131275 [ -1.88182 ] meet-th: 0.005114 fisher: 0.008618 web: 0.025500
p( did | we ...) = [3gram] [0.0129977] 0.0147011 [ -1.83265 ] meet-th: 0.007520 fisher: 0.009937 web: 0.035061
p( not | did ...) = [4gram] [0.108605] 0.103255 [ -0.986088 ] meet-th: 0.004505 fisher: 0.004057 web: 0.018205
p( talk | not ...) = [3gram] [0.000583654] 0.000588433 [ -3.2303 ] meet-th: 0.014041 fisher: 0.016797 web: 0.039907
p( about | talk ...) = [3gram] [0.299694] 0.267013 [ -0.573467 ] meet-th: 0.005322 fisher: 0.004451 web: 0.021566
p( pipes | about ...) = [2gram] [4.84618e-06] 4.07627e-06 [ -5.38974 ] fisher: 0.168330 web: 0.122102
p( </s> | pipes ...) = [2gram] [0.0664626] 0.0723812 [ -1.14037 ]
1 sentences, 6 words, 0 OOVs
0 zeroprobs, logprob= -15.0344 ppl= 140.532 ppl1= 320.435
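The summary lines can be reproduced from the per-word values. The sketch below, in Python, assumes the usual SRILM definitions: logprob is the sum of the log10 probabilities in the second pair of brackets, ppl normalizes by the number of words plus sentence ends, and ppl1 normalizes by the number of words only. The numbers are copied from the output above.

```python
# log10 probabilities from the second pair of brackets, one per line of output
log10_probs = [-1.88182, -1.83265, -0.986088, -3.2303,
               -0.573467, -5.38974, -1.14037]
sentences, words, oovs, zeroprobs = 1, 6, 0, 0

logprob = sum(log10_probs)

# ppl counts the end-of-sentence token of each sentence, ppl1 does not.
scored_words = words - oovs - zeroprobs
ppl = 10 ** (-logprob / (scored_words + sentences))
ppl1 = 10 ** (-logprob / scored_words)

print(f"logprob= {logprob:.4f} ppl= {ppl:.1f} ppl1= {ppl1:.1f}")
```

Up to rounding of the printed per-word values, this gives the logprob, ppl and ppl1 shown in the summary.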
Light green means that the lsa model is better than the n-gram (VERY GOOD). Dark green means that the lsa model is better, but the word already appears in the context, i.e. it is a cache model effect (GOOD). If the whole document is taken as the context (not -initsent), there are a lot of repetitions, although words that are far away have no effect due to the -decay.
Dark red means that the lsa model is worse than the n-gram (BAD). Light red means that the lsa model is worse than the n-gram and that the word already appears in the context (VERY BAD).
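The four colors follow from two yes/no questions, which the following Python sketch spells out. It is an illustration of the rule as described, not the toolkit's code: the function name html_color is made up, "better" is assumed to mean a higher probability for the word, and a tie is treated as not better. The probabilities in the usage lines are taken from the debug output, assuming the first bracketed value is the plain n-gram probability and the value after it is the probability with the lsa model.

```python
def html_color(lsa_prob, ngram_prob, word, context):
    """Map a word to its color according to the four cases described above."""
    lsa_is_better = lsa_prob > ngram_prob  # assumption: ties count as worse
    in_context = word in context           # word already seen in the history
    if lsa_is_better:
        return "dark green" if in_context else "light green"
    return "light red" if in_context else "dark red"


# "we" at the start of the sentence, empty context
print(html_color(0.0131275, 0.0115263, "we", []))
# "not" with the preceding words as context
print(html_color(0.103255, 0.108605, "not", ["we", "did"]))
```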
meet-th: 0.005114 fisher: 0.008618 web: 0.025500 are the lambdas (for infg interpolation) of the different models, which are named meet-th, fisher and web respectively.
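To collect the lambdas from a debug file, the name and value pairs at the end of each line can be read with a regular expression. This is a minimal Python sketch with a made-up function name, parse_lambdas; it assumes the line layout shown in the example output, with the lambdas following the last closing bracket.

```python
import re

# A "name: value" pair such as "meet-th: 0.005114"
LAMBDA_PAIR = re.compile(r"(\S+): ([0-9.]+(?:[eE][-+]?[0-9]+)?)")


def parse_lambdas(line):
    """Return {model name: lambda} for one line of the debug output."""
    # Only look at the text after the bracketed log probability.
    tail = line.rsplit("]", 1)[-1]
    return {name: float(value) for name, value in LAMBDA_PAIR.findall(tail)}


line = ("p( did | we ...) = [3gram] [0.0129977] 0.0147011 [ -1.83265 ] "
        "meet-th: 0.007520 fisher: 0.009937 web: 0.035061")
print(parse_lambdas(line))
```

Lines without lambdas, such as the end-of-sentence line, yield an empty dictionary.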