org.carrot2.text.vsm
Class TermDocumentMatrixBuilderDescriptor.AttributeBuilder

java.lang.Object
  extended by org.carrot2.text.vsm.TermDocumentMatrixBuilderDescriptor.AttributeBuilder
Enclosing class:
TermDocumentMatrixBuilderDescriptor

public static class TermDocumentMatrixBuilderDescriptor.AttributeBuilder
extends Object

Attribute map builder for the TermDocumentMatrixBuilder component. You can use this builder as a type-safe alternative to populating the attribute map using attribute keys.
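The builder pattern described above can be sketched as follows. Because the real constructor is protected and this page does not show how an instance is obtained, the example uses a small hypothetical stand-in class, Builder, rather than the Carrot2 class itself. The attribute key strings (for example "TermDocumentMatrixBuilder.maxWordDf") and the numeric values are illustrative assumptions. The setters are assumed to return the builder for chaining, as the documented return types suggest.

```java
import java.util.HashMap;
import java.util.Map;

public class AttributeBuilderSketch {
    /** Hypothetical stand-in for TermDocumentMatrixBuilderDescriptor.AttributeBuilder. */
    static final class Builder {
        /** The attribute map populated by this builder. */
        public final Map<String, Object> map;

        Builder(Map<String, Object> map) {
            this.map = map;
        }

        // Each setter writes one typed value under a string key (keys are assumed)
        // and returns the builder so that calls can be chained.
        Builder titleWordsBoost(double value) {
            map.put("TermDocumentMatrixBuilder.titleWordsBoost", value);
            return this;
        }

        Builder maximumMatrixSize(int value) {
            map.put("TermDocumentMatrixBuilder.maximumMatrixSize", value);
            return this;
        }

        Builder maxWordDf(double value) {
            map.put("TermDocumentMatrixBuilder.maxWordDf", value);
            return this;
        }
    }

    /** Populates a fresh attribute map through the type-safe setters. */
    static Map<String, Object> configure() {
        Map<String, Object> attributes = new HashMap<>();
        new Builder(attributes)
            .titleWordsBoost(2.0)
            .maximumMatrixSize(37500)
            .maxWordDf(0.9);
        return attributes;
    }

    public static void main(String[] args) {
        Map<String, Object> attributes = configure();
        System.out.println(attributes.get("TermDocumentMatrixBuilder.maxWordDf"));
    }
}
```

Compared with calling map.put directly, the typed setters let the compiler reject a misspelled attribute name or a value of the wrong type.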


Field Summary
 Map<String,Object> map
          The attribute map populated by this builder.
 
Constructor Summary
protected TermDocumentMatrixBuilderDescriptor.AttributeBuilder(Map<String,Object> map)
          Creates a builder backed by the provided map.
 
Method Summary
 TermDocumentMatrixBuilderDescriptor.AttributeBuilder maximumMatrixSize(int value)
          Maximum matrix size.
 TermDocumentMatrixBuilderDescriptor.AttributeBuilder maxWordDf(double value)
          Maximum word document frequency.
 TermDocumentMatrixBuilderDescriptor.AttributeBuilder termWeighting(Class<? extends ITermWeighting> clazz)
          Term weighting.
 TermDocumentMatrixBuilderDescriptor.AttributeBuilder termWeighting(ITermWeighting value)
          Term weighting.
 TermDocumentMatrixBuilderDescriptor.AttributeBuilder titleWordsBoost(double value)
          Title word boost.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

map

public final Map<String,Object> map
The attribute map populated by this builder.

Constructor Detail

TermDocumentMatrixBuilderDescriptor.AttributeBuilder

protected TermDocumentMatrixBuilderDescriptor.AttributeBuilder(Map<String,Object> map)
Creates a builder backed by the provided map.

Method Detail

titleWordsBoost

public TermDocumentMatrixBuilderDescriptor.AttributeBuilder titleWordsBoost(double value)
Title word boost. Gives more weight to words that appear in Document.TITLE fields.

See Also:
TermDocumentMatrixBuilder.titleWordsBoost

maximumMatrixSize

public TermDocumentMatrixBuilderDescriptor.AttributeBuilder maximumMatrixSize(int value)
Maximum matrix size. The maximum number of elements in the term-document matrix. A larger matrix makes clustering more accurate, but also more time- and memory-consuming.

See Also:
TermDocumentMatrixBuilder.maximumMatrixSize

maxWordDf

public TermDocumentMatrixBuilderDescriptor.AttributeBuilder maxWordDf(double value)
Maximum word document frequency. The maximum document frequency allowed for words, as a fraction of all documents. Words with a document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, regardless of how many documents they appear in.

This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such cases, setting maxWordDf to a value lower than 1.0, e.g. 0.9, may improve the clusters.

Another useful application of this attribute is when there is a need to generate only very specific clusters, i.e. clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, e.g. 0.1 or 0.05.

See Also:
TermDocumentMatrixBuilder.maxWordDf
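The effect of the maxWordDf threshold can be illustrated with a minimal, self-contained sketch. It is not the Carrot2 implementation: the class MaxWordDfSketch, its methods, and the sample documents are hypothetical, and each document is simplified to a set of words. A word is dropped only when its document frequency fraction is strictly larger than the threshold, matching the description above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class MaxWordDfSketch {
    /** Returns the words whose document frequency fraction does not exceed maxWordDf. */
    static Set<String> keptWords(List<Set<String>> documents, double maxWordDf) {
        // Count in how many documents each word occurs.
        Map<String, Integer> documentFrequency = new TreeMap<>();
        for (Set<String> document : documents) {
            for (String word : document) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }
        // Keep a word unless its fraction of documents is larger than the threshold.
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            double fraction = (double) entry.getValue() / documents.size();
            if (fraction <= maxWordDf) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    /** Five toy documents: "acme" occurs in 5, "data" in 3, "mining" in 2, "graph" in 1. */
    static List<Set<String>> sampleDocuments() {
        return Arrays.asList(
            new HashSet<>(Arrays.asList("acme", "data", "mining")),
            new HashSet<>(Arrays.asList("acme", "data", "mining")),
            new HashSet<>(Arrays.asList("acme", "data", "graph")),
            new HashSet<>(Arrays.asList("acme")),
            new HashSet<>(Arrays.asList("acme")));
    }

    public static void main(String[] args) {
        List<Set<String>> documents = sampleDocuments();
        System.out.println(keptWords(documents, 1.0)); // [acme, data, graph, mining]
        System.out.println(keptWords(documents, 0.9)); // [data, graph, mining]
        System.out.println(keptWords(documents, 0.4)); // [graph, mining]
    }
}
```

In this toy collection, 0.9 removes only the word present in every document, while 0.4 leaves just the rarer words, which is the behavior the two use cases above rely on.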

termWeighting

public TermDocumentMatrixBuilderDescriptor.AttributeBuilder termWeighting(ITermWeighting value)
Term weighting. The method for calculating the weight of words in the term-document matrices.

See Also:
TermDocumentMatrixBuilder.termWeighting

termWeighting

public TermDocumentMatrixBuilderDescriptor.AttributeBuilder termWeighting(Class<? extends ITermWeighting> clazz)
Term weighting, given as an implementation class. The method for calculating the weight of words in the term-document matrices.

See Also:
TermDocumentMatrixBuilder.termWeighting
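To make the idea of a pluggable term weighting concrete, here is a hedged sketch. The interface Weighting is a hypothetical stand-in for ITermWeighting, whose actual method signature is not shown on this page. TfIdf is a textbook tf-idf formula chosen for illustration, not necessarily one that Carrot2 ships. The two static termWeighting methods mirror the two overloads above: one stores an instance, the other stores a class. The key string is likewise an assumption.

```java
import java.util.HashMap;
import java.util.Map;

public class TermWeightingSketch {
    /** Hypothetical stand-in for ITermWeighting; the real signature may differ. */
    interface Weighting {
        double weight(int termFrequency, int documentFrequency, int documentCount);
    }

    /** Textbook tf-idf: tf * ln(N / df). Used here only as an illustration. */
    static final class TfIdf implements Weighting {
        @Override
        public double weight(int termFrequency, int documentFrequency, int documentCount) {
            if (documentFrequency == 0) {
                return 0.0; // the word occurs in no document
            }
            return termFrequency * Math.log((double) documentCount / documentFrequency);
        }
    }

    // Assumed attribute key, for illustration only.
    static final String KEY = "TermDocumentMatrixBuilder.termWeighting";

    /** Mirrors the overload that takes a ready-made instance. */
    static void termWeighting(Map<String, Object> map, Weighting value) {
        map.put(KEY, value);
    }

    /** Mirrors the overload that takes an implementation class. */
    static void termWeighting(Map<String, Object> map, Class<? extends Weighting> clazz) {
        map.put(KEY, clazz);
    }

    public static void main(String[] args) {
        Map<String, Object> attributes = new HashMap<>();
        termWeighting(attributes, new TfIdf());  // pass an instance
        termWeighting(attributes, TfIdf.class);  // or pass only the class
        // A word occurring 3 times, present in 10 of 100 documents.
        System.out.println(new TfIdf().weight(3, 10, 100));
    }
}
```

Under this formula a word that occurs in every document gets weight zero, since ln(N / N) = 0.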


Copyright (c) Dawid Weiss, Stanislaw Osinski