org.carrot2.text.vsm
Class TermDocumentMatrixBuilder

java.lang.Object
  extended by org.carrot2.text.vsm.TermDocumentMatrixBuilder

public class TermDocumentMatrixBuilder
extends Object

Builds a term-document matrix based on the provided PreprocessingContext.


Field Summary
 int maximumMatrixSize
          Maximum matrix size.
 double maxWordDf
          Maximum word document frequency.
 ITermWeighting termWeighting
          Term weighting.
 double titleWordsBoost
          Title word boost.
 
Constructor Summary
TermDocumentMatrixBuilder()
           
 
Method Summary
 void buildTermDocumentMatrix(VectorSpaceModelContext vsmContext)
          Builds a term-document matrix from data provided in the context and stores the result in that context.
 void buildTermPhraseMatrix(VectorSpaceModelContext context)
          Builds a term-phrase matrix in the same space as the main term-document matrix.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

titleWordsBoost

public double titleWordsBoost
Title word boost. Gives more weight to words that appeared in Document.TITLE fields.

Attribute label:
Title word boost
Attribute level:
Medium
Attribute group:
Labels

maximumMatrixSize

public int maximumMatrixSize
Maximum matrix size. The maximum number of elements in the term-document matrix. A larger matrix makes clustering more accurate, but also more time- and memory-consuming.

Attribute label:
Maximum matrix size
Attribute level:
Medium
Attribute group:
Matrix model
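
Because the limit is expressed in matrix elements, the number of term rows that fits under it shrinks as the number of documents grows. The sketch below shows only that arithmetic. The class name, the method name and the sample value 37500 are made up for illustration; how the builder actually chooses which terms to keep is not described on this page.

```java
final class MatrixSizeSketch {
    /**
     * Largest number of term rows such that rows * documentCount does not
     * exceed maximumMatrixSize. Hypothetical helper, not part of Carrot2.
     */
    static int maxTerms(int maximumMatrixSize, int documentCount) {
        if (documentCount <= 0) {
            throw new IllegalArgumentException("documentCount must be positive");
        }
        return maximumMatrixSize / documentCount; // integer division rounds down
    }

    public static void main(String[] args) {
        // 37500 is an arbitrary sample limit, not a documented default.
        System.out.println(maxTerms(37500, 100)); // prints 375
        System.out.println(maxTerms(37500, 500)); // prints 75
    }
}
```

With the same limit, five times as many documents leaves room for one fifth of the terms.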

maxWordDf

public double maxWordDf
Maximum word document frequency. The maximum document frequency allowed for words, as a fraction of all documents. Words with a document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, regardless of how many documents they appear in.

This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such a case, setting maxWordDf to a value lower than 1.0, e.g. 0.9, may improve the clusters.

Another useful application of this attribute is when there is a need to generate only very specific clusters, i.e. clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, e.g. 0.1 or 0.05.

Attribute label:
Maximum word document frequency
Attribute level:
Advanced
Attribute group:
Matrix model
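
The threshold rule described for maxWordDf can be illustrated with a minimal, standalone sketch. It is not the builder's code: the class name, the method name and the array-of-counts input are assumptions made for the example. It only applies the rule stated above, namely that a word is ignored when the fraction of documents containing it is larger than maxWordDf.

```java
import java.util.ArrayList;
import java.util.List;

final class MaxWordDfSketch {
    /**
     * Returns the indices of words that pass the filter. wordDf[i] is the
     * number of documents containing word i. Hypothetical helper, not part
     * of Carrot2.
     */
    static List<Integer> retainedWords(int[] wordDf, int documentCount, double maxWordDf) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < wordDf.length; i++) {
            double fraction = (double) wordDf[i] / documentCount;
            if (fraction <= maxWordDf) { // only words above the threshold are dropped
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        int[] wordDf = {2, 4, 5, 10}; // document counts for four words, 10 documents in total
        System.out.println(retainedWords(wordDf, 10, 0.4)); // prints [0, 1]
        System.out.println(retainedWords(wordDf, 10, 1.0)); // prints [0, 1, 2, 3]
    }
}
```

With maxWordDf set to 0.4, the words found in 5 and 10 of the 10 documents are dropped, while the word found in exactly 40% of them is kept.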

termWeighting

public ITermWeighting termWeighting
Term weighting. The method for calculating the weight of words in the term-document matrices.

Attribute label:
Term weighting
Attribute level:
Advanced
Attribute group:
Matrix model
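
To make the idea of a pluggable weighting method concrete, here is a small sketch of two common schemes: raw term frequency and tf-idf. The Weighting interface below is a hypothetical stand-in written for this example. It is not the ITermWeighting interface, whose methods are not listed on this page, and the formulas are the textbook definitions rather than necessarily the ones Carrot2 uses.

```java
final class TermWeightingSketch {
    /** Hypothetical strategy interface for the example; not ITermWeighting. */
    interface Weighting {
        double weight(int termFrequency, int documentFrequency, int documentCount);
    }

    /** Weight equals the number of occurrences of the word in the document. */
    static final Weighting TF = (tf, df, n) -> tf;

    /** Occurrences scaled by log(n / df), so words found in many documents weigh less. */
    static final Weighting TF_IDF = (tf, df, n) -> tf * Math.log((double) n / df);

    public static void main(String[] args) {
        // A word occurring 3 times in a document and present in 10 of 100 documents.
        System.out.println(TF.weight(3, 10, 100));     // prints 3.0
        System.out.println(TF_IDF.weight(3, 10, 100)); // 3 * ln(10), roughly 6.9
        // A word present in every document gets zero weight under tf-idf.
        System.out.println(TF_IDF.weight(3, 100, 100)); // prints 0.0
    }
}
```

Keeping the weighting behind a single interface lets the matrix-building code stay the same while the scheme is swapped, which is the role the termWeighting field appears to play here.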
Constructor Detail

TermDocumentMatrixBuilder

public TermDocumentMatrixBuilder()
Method Detail

buildTermDocumentMatrix

public void buildTermDocumentMatrix(VectorSpaceModelContext vsmContext)
Builds a term-document matrix from data provided in the context and stores the result in that context.


buildTermPhraseMatrix

public void buildTermPhraseMatrix(VectorSpaceModelContext context)
Builds a term-phrase matrix in the same space as the main term-document matrix. If the processing context contains no phrases, VectorSpaceModelContext.termPhraseMatrix will remain null.



Copyright (c) Dawid Weiss, Stanislaw Osinski