1. Installation

    Change the variable

        TCLLIB=/usr/lib/libtcl

    in ../src/Makefile to point to your Tcl library.
    Go to ../src and type make.
    This will create the programs lsalm, trainmodel and ngram-count.
    Either copy them to /usr/local/bin or add ../src to your path.
    In this version only the i686 architecture is supported. For other architectures (e.g. Solaris), get the SRILM libraries.

2. Usage

2.1. Training

    To train an LSA model, use:

        trainmodel -data train.data \
                   -vocab train.vocab \
                   -lap2 lap2file \
                   -docbound thisisdocumentboundary \
                   -output lsa-train

    The parameters are:

        -data train.data                  data file containing document boundaries
        -vocab train.vocab                vocabulary file; each word in the vocabulary must appear in the data file
        -lap2 lap2file                    lap2 file containing the parameters for the SVD; the two integer values must not exceed the number of documents
        -docbound thisisdocumentboundary  the document boundary string; to use ###, put it in quotes ("###"). Each document boundary must be on a line of its own.
        -output lsa-train                 basename for the output files

    To see all parameters of trainmodel, call trainmodel -help.

    To train an N-gram model, use the ngram-count program documented on the SRILM website.

2.1.1. train.data and train.vocab examples

    train.data example:

    ###
    they didn't want an objective study
    ###
    they just wanted to prevent the onslaught of the bums
    ###
    boston and similar markets are exceptions to the rule though

    train.vocab example:

    an
    didn't
    exceptions
    prevent
    rule
    similar
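
The two constraints stated in 2.1 (every vocabulary word must appear in the data, and the lap2 integers must not exceed the number of documents) can be checked before training. The following is a minimal sketch, not part of the toolkit: the helper names split_documents and missing_vocab are hypothetical, and it assumes whitespace tokenization and "###" as the boundary string.

```python
def split_documents(lines, boundary="###"):
    """Group lines into documents; a line equal to `boundary` opens a new one."""
    docs = []
    for raw in lines:
        line = raw.strip()
        if line == boundary:
            docs.append([])            # boundary on its own line starts a document
        elif line:
            if not docs:               # text before the first boundary
                docs.append([])
            docs[-1].extend(line.split())   # assumption: whitespace tokenization
    return [d for d in docs if d]      # drop empty documents


def missing_vocab(docs, vocab):
    """Return the vocabulary words that never occur in any document."""
    seen = {word for doc in docs for word in doc}
    return sorted(word for word in vocab if word not in seen)


data = """###
they didn't want an objective study
###
they just wanted to prevent the onslaught of the bums
###
boston and similar markets are exceptions to the rule though
""".splitlines()
vocab = ["an", "didn't", "exceptions", "prevent", "rule", "similar"]

docs = split_documents(data)
print(len(docs), "documents")
print("vocabulary words missing from the data:", missing_vocab(docs, vocab))
```

For the example files above this reports three documents and no missing words, so the two integers in the lap2 file could be at most 3.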

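2.2. Testing

    To test an LSA model, use:

        lsalm -lsamodel lsa-train.term.entrop \
              -interpolate infg \
              -docbound thisisdocumentboundary \
              -ppl evalfile \
              -lm train-ngram.gz \
              -order 4 \
              -debug 2 > lsa-train-infg-evalfile

    The parameters are:

        -lsamodel lsa-train.term.entrop   term-entropy LSA model file
        -interpolate infg                 interpolation method
        -docbound thisisdocumentboundary  the document boundary string; to use ###, put it in quotes ("###"). If not set, the sentence boundary is used.
        -ppl evalfile                     file to calculate perplexities on
        -lm train-ngram.gz                language model
        -order 4                          language model order
        -debug 2 > lsa-train-infg-evalfile  debug level, with the output redirected to a file

    To test an N-gram model, omit the first three parameters of the lsalm program (-lsamodel, -interpolate and -docbound) in the above example. The ngram program is used in the same way. For an introduction to ngram and ngram-count, see the documentation on the SRILM website.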

    To see all parameters of lsalm, call lsalm -help. The output also lists all parameters of ngram, including the last four above. Only the following parameters are lsalm-specific:

    -lsamodel: term-entropy lsa model file
    -lsamodel1: term-entropy lsa model file 1
    -lsamodel2: term-entropy lsa model file 2
    -lsamodel3: term-entropy lsa model file 3
    -lsamodel4: term-entropy lsa model file 4
    -lsamodel5: term-entropy lsa model file 5
    -lsamodel6: term-entropy lsa model file 6
    -lsamodel7: term-entropy lsa model file 7
    -lsamodel8: term-entropy lsa model file 8
    -modelname: modelname for html output Default value: "m0"
    -modelname1: modelname1 for html output Default value: "m1"
    -modelname2: modelname2 for html output Default value: "m2"
    -modelname3: modelname3 for html output Default value: "m3"
    -modelname4: modelname4 for html output Default value: "m4"
    -modelname5: modelname5 for html output Default value: "m5"
    -modelname6: modelname6 for html output Default value: "m6"
    -modelname7: modelname7 for html output Default value: "m7"
    -modelname8: modelname8 for html output Default value: "m8"
    -lambdangram: ngram model weight for linear, loglinear and infg interpolation. Default value has no effect for infg interpolation! Default value: 1.0
    -lambdalsa: lsamodel weight for linear, loglinear and infg interpolation. Default value has no effect for infg interpolation! Default value: 1.0
    -lambdalsa1: lsamodel1 weight for linear, loglinear and infg interpolation Default value: 1.0
    -lambdalsa2: lsamodel2 weight for linear, loglinear and infg interpolation Default value: 1.0
    -lambdalsa3: lsamodel3 weight for linear, loglinear and infg interpolation Default value: 1.0
    -lambdalsa4: lsamodel4 weight for linear, loglinear and infg interpolation Default value: 1.0
    -lambdalsa5: lsamodel5 weight for linear, loglinear and infg interpolation Default value: 1.0
    -lambdalsa6: lsamodel6 weight for linear, loglinear and infg interpolation Default value: 1.0
    -lambdalsa7: lsamodel7 weight for linear, loglinear and infg interpolation Default value: 1.0
    -lambdalsa8: lsamodel8 weight for linear, loglinear and infg interpolation Default value: 1.0
    -wordcluster: file of wordcluster centers NOT YET IMPLEMENTED!
    -binmodel: binmodel file
    -nonorm: don't use normalization for loglin and infg interpolation
    -1besthist: add 1-best of last utterance to pseudodoc history, for rescoring
    -lsaslice: maximum probability part for lsa models for infg interpolation Default value: 0.5
    -nosqrts: don't use square root of Singular Values
    -initsent: init pseudodoc at beginning of sentence, default if docbound is not set
    -html: output an HTML comparison file; use with -debug 2
    -probsum: calculate sum of probs for debugging NOT YET IMPLEMENTED!
    -docbound: document boundary
    -decay: weight for lambda decay in (0,1] Default value: 0.9
    -exp: exp for similarity smoothing Default value: 5
    -interpolate: interpolation method for lsa and n-gram [infg, loglin, lin]. Default value: "infg". lin is NOT YET IMPLEMENTED!
    -zentropmin: value to which zero entropies are updated. Default value: 0
    -skipzentrop: delete zero entropy words from lsa model
    -trainlambda: does training of lambda values for lsa-ngram interpolation
    -doccluster: document cluster center file
    -dumpfeat: dump similarity features, needs doccluster
    -mindumpsim: minimum similarity for counting as feature Default value: 0.5
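
To make the roles of -lambdangram, -lambdalsa and -nonorm more concrete, here is a sketch of textbook log-linear interpolation of two word distributions. It is not taken from the lsalm sources and says nothing about how infg interpolation works; the function name loglinear_interpolate and the toy probabilities are made up for illustration.

```python
def loglinear_interpolate(p_ngram, p_lsa, lambda_ngram=1.0, lambda_lsa=1.0,
                          normalize=True):
    """Combine two distributions over the same vocabulary log-linearly.

    score(w) = p_ngram(w)**lambda_ngram * p_lsa(w)**lambda_lsa
    With normalize=True the scores are rescaled to sum to one; passing
    normalize=False is the analogue of what -nonorm is described as doing.
    """
    scores = {
        w: (p_ngram[w] ** lambda_ngram) * (p_lsa[w] ** lambda_lsa)
        for w in p_ngram
    }
    if not normalize:
        return scores
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}


# Toy distributions over a three-word vocabulary (illustrative values only).
p_ngram = {"pipes": 0.1, "talk": 0.6, "about": 0.3}
p_lsa = {"pipes": 0.5, "talk": 0.3, "about": 0.2}

combined = loglinear_interpolate(p_ngram, p_lsa)
only_ngram = loglinear_interpolate(p_ngram, p_lsa, lambda_lsa=0.0)
for word, prob in combined.items():
    print(f"{word}: {prob:.4f}")
```

Setting a weight to 0 removes that model from the product, which is why only_ngram reproduces the n-gram distribution.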

2.3. Data Format

2.3.1. Term.entrop File Format

    4 3 #words, #singular values
    0.002225 0.002332 0.002374 singularvalue1, singularvalue2, singularvalue3
    i 0.938480 0.000003 -0.000142 0.000042 word, entropy, wordvecvalue1, wordvecvalue2, wordvecvalue3
    think 0.000000 0.000003 -0.000002 0.000000
    this 0.938480 0.000003 -0.000142 0.000042
    works 0.000000 0.000003 -0.000002 0.000000
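
The layout above can be read with a few lines of code. This is a sketch under the assumption that fields are whitespace-separated and that a real file carries no explanatory text after the numbers (the parser only looks at the leading fields of each line); the names TermEntropModel and parse_term_entrop are hypothetical and not part of the toolkit.

```python
from dataclasses import dataclass


@dataclass
class TermEntropModel:
    singular_values: list   # one value per LSA dimension
    entropy: dict           # word -> term entropy
    vectors: dict           # word -> word vector (same length as singular_values)


def parse_term_entrop(text):
    """Parse the term.entrop layout: header, singular values, one line per word."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    n_words, n_values = int(lines[0][0]), int(lines[0][1])
    singular_values = [float(x) for x in lines[1][:n_values]]
    entropy, vectors = {}, {}
    for fields in lines[2:2 + n_words]:
        word = fields[0]
        vector = [float(x) for x in fields[2:2 + n_values]]
        if len(vector) != n_values:
            raise ValueError(f"word {word!r} has {len(vector)} values, expected {n_values}")
        entropy[word] = float(fields[1])
        vectors[word] = vector
    if len(entropy) != n_words:
        raise ValueError(f"header announces {n_words} words, found {len(entropy)}")
    return TermEntropModel(singular_values, entropy, vectors)


sample = """4 3
0.002225 0.002332 0.002374
i 0.938480 0.000003 -0.000142 0.000042
think 0.000000 0.000003 -0.000002 0.000000
this 0.938480 0.000003 -0.000142 0.000042
works 0.000000 0.000003 -0.000002 0.000000
"""
model = parse_term_entrop(sample)
print(len(model.vectors), "words,", len(model.singular_values), "singular values")
```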

2.3.2. Binmodel File Format

    3 #bins
    -0.184038 2.07123e-06 end of similarity interval (e.g. from -1 to -0.184038), bin probability
    0.234566 3.40712e-06
    0.999516 4.07123e-02

    If the similarity is bigger than the last interval boundary, the last interval's bin probability is used.
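
The lookup rule can be sketched as a binary search over the interval ends. This is an illustration, not the lsalm implementation: the names parse_binmodel and bin_probability are hypothetical, the interval ends are assumed to be sorted in ascending order, and a similarity exactly equal to an interval end is assumed to fall into that interval.

```python
from bisect import bisect_left


def parse_binmodel(text):
    """Return (interval_ends, probabilities) from the binmodel layout."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    n_bins = int(lines[0][0])
    ends = [float(fields[0]) for fields in lines[1:1 + n_bins]]
    probs = [float(fields[1]) for fields in lines[1:1 + n_bins]]
    if len(ends) != n_bins:
        raise ValueError(f"header announces {n_bins} bins, found {len(ends)}")
    return ends, probs


def bin_probability(similarity, ends, probs):
    """Probability of the first bin whose end is >= similarity (last bin if none)."""
    index = bisect_left(ends, similarity)
    return probs[min(index, len(probs) - 1)]   # clamp: beyond the last end, use the last bin


sample = """3
-0.184038 2.07123e-06
0.234566 3.40712e-06
0.999516 4.07123e-02
"""
ends, probs = parse_binmodel(sample)
for sim in (-0.5, 0.0, 0.9999):
    print(sim, "->", bin_probability(sim, ends, probs))
```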

2.3.3. Doccluster File Format

    3 2 #cluster centers, #docvecvalues
    0.000003 -0.000142 docvecvalue1, docvecvalue2
    0.000003 -0.000002
    0.000003 -0.000142
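
A doccluster file can be read in the same spirit. The sketch below also shows one plausible way such centers could be used together with a threshold like -mindumpsim. Note the assumptions: the document does not say which similarity measure lsalm uses, so cosine similarity (the usual choice in LSA) is assumed here, and the names parse_doccluster, cosine and similar_clusters are hypothetical.

```python
import math


def parse_doccluster(text):
    """Return the list of cluster-center vectors from the doccluster layout."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    n_centers, n_values = int(lines[0][0]), int(lines[0][1])
    centers = [[float(x) for x in fields[:n_values]] for fields in lines[1:1 + n_centers]]
    if len(centers) != n_centers or any(len(c) != n_values for c in centers):
        raise ValueError("file does not match the sizes announced in the header")
    return centers


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_clusters(doc_vector, centers, min_sim=0.5):
    """Indices and similarities of the centers with similarity >= min_sim."""
    result = []
    for index, center in enumerate(centers):
        sim = cosine(doc_vector, center)
        if sim >= min_sim:
            result.append((index, sim))
    return result


sample = """3 2
0.000003 -0.000142
0.000003 -0.000002
0.000003 -0.000142
"""
centers = parse_doccluster(sample)
# Assumption: the document vector lives in the same space as the centers.
print(similar_clusters(centers[0], centers, min_sim=0.9))
```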

2.4. HTML output

    When lsalm is used with the -html option it produces HTML output of the following style:

    we did not talk about pipes
    p( we | <s> ) = [2gram] [0.0115263] 0.0131275 [ -1.88182 ] meet-th: 0.005114 fisher: 0.008618 web: 0.025500
    p( did | we ...) = [3gram] [0.0129977] 0.0147011 [ -1.83265 ] meet-th: 0.007520 fisher: 0.009937 web: 0.035061
    p( not | did ...) = [4gram] [0.108605] 0.103255 [ -0.986088 ] meet-th: 0.004505 fisher: 0.004057 web: 0.018205
    p( talk | not ...) = [3gram] [0.000583654] 0.000588433 [ -3.2303 ] meet-th: 0.014041 fisher: 0.016797 web: 0.039907
    p( about | talk ...) = [3gram] [0.299694] 0.267013 [ -0.573467 ] meet-th: 0.005322 fisher: 0.004451 web: 0.021566
    p( pipes | about ...) = [2gram] [4.84618e-06] 4.07627e-06 [ -5.38974 ] fisher: 0.168330 web: 0.122102
    p( </s> | pipes ...) = [2gram] [0.0664626] 0.0723812 [ -1.14037 ]
    1 sentences, 6 words, 0 OOVs 0 zeroprobs, logprob= -15.0344 ppl= 140.532 ppl1= 320.435

    Light green means that the lsa model is better than the n-gram (VERY GOOD). Dark green means that the lsa model is better but the word already appears in the context, i.e. it is a cache model effect (GOOD). If the whole document is taken as the context (no -initsent), there are many repetitions, although words that are far away have no effect because of -decay.
    Dark red means that the lsa model is worse than the n-gram (BAD). Light red means that the lsa model is worse than the n-gram and that the word already appears in the context (VERY BAD).
    The entries meet-th: 0.005114 fisher: 0.008618 web: 0.025500 are the lambdas (for infg interpolation) of the models named meet-th, fisher and web, respectively.
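
The numbers in the summary line of the sample output can be reproduced from the per-word lines. In the sample, the last bracketed value on each line is the base-10 logarithm of the probability printed just before it, and these values add up to the reported logprob. The sketch below assumes the usual SRILM convention, in which ppl counts the end-of-sentence tokens and ppl1 does not; the function name perplexities is hypothetical.

```python
def perplexities(logprob, words, sentences, oovs=0, zeroprobs=0):
    """Return (ppl, ppl1) from a total base-10 log probability.

    ppl  normalizes by scored words plus end-of-sentence tokens,
    ppl1 normalizes by scored words only.
    """
    scored = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (scored + sentences))
    ppl1 = 10 ** (-logprob / scored)
    return ppl, ppl1


# Bracketed log10 probabilities copied from the sample output above.
word_logprobs = [-1.88182, -1.83265, -0.986088, -3.2303,
                 -0.573467, -5.38974, -1.14037]
logprob = sum(word_logprobs)

ppl, ppl1 = perplexities(logprob, words=6, sentences=1)
print(f"logprob= {logprob:.4f} ppl= {ppl:.3f} ppl1= {ppl1:.3f}")
```

Up to rounding of the printed log probabilities, this agrees with the values logprob= -15.0344, ppl= 140.532 and ppl1= 320.435 in the sample.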