org.carrot2.clustering.stc
Class STCClusteringAlgorithm

java.lang.Object
  extended by org.carrot2.core.ProcessingComponentBase
      extended by org.carrot2.clustering.stc.STCClusteringAlgorithm
All Implemented Interfaces:
IClusteringAlgorithm, IProcessingComponent

public final class STCClusteringAlgorithm
extends ProcessingComponentBase
implements IClusteringAlgorithm

Suffix Tree Clustering (STC) algorithm, implemented largely as described in: Oren Zamir, Oren Etzioni, Grouper: A Dynamic Clustering Interface to Web Search Results, 1999. The implementation departs from the paper where STC's description was not clear enough or where an improvement seemed possible.

Attribute label:
STC Clustering

Field Summary
 List<Cluster> clusters
          Clusters created by the algorithm.
 double documentCountBoost
          Document count boost.
 List<Document> documents
          Documents to cluster.
 int ignoreWordIfInFewerDocs
          Minimum word-document recurrences.
 double ignoreWordIfInHigherDocsPercent
          Maximum word-document ratio.
 int maxBaseClusters
          Maximum base clusters count.
 int maxClusters
          Maximum final clusters.
 int maxDescPhraseLength
          Maximum words per label.
 double maxPhraseOverlap
          Maximum cluster phrase overlap.
 int maxPhrases
          Maximum phrases per label.
 double mergeThreshold
          Base cluster merge threshold.
 double minBaseClusterScore
          Minimum base cluster score.
 int minBaseClusterSize
          Minimum documents per base cluster.
 double mostGeneralPhraseCoverage
          Minimum general phrase coverage.
 MultilingualClustering multilingualClustering
          A helper for performing multilingual clustering.
 int optimalPhraseLength
          Optimal label length.
 double optimalPhraseLengthDev
          Phrase length tolerance.
 BasicPreprocessingPipeline preprocessingPipeline
          Common preprocessing tasks handler.
 String query
          Query that produced the documents.
 double singleTermBoost
          Single term boost.
 
Constructor Summary
STCClusteringAlgorithm()
           
 
Method Summary
 void afterProcessing()
          Memory cleanups.
 void process()
          Performs STC clustering of documents.
 
Methods inherited from class org.carrot2.core.ProcessingComponentBase
beforeProcessing, dispose, getContext, getSharedExecutor, init
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.carrot2.core.IProcessingComponent
beforeProcessing, dispose, init
 

Field Detail

query

public String query
Query that produced the documents. Providing the query is optional but desirable, because it helps the algorithm create better clusters.


documents

public List<Document> documents
Documents to cluster.


clusters

public List<Cluster> clusters
Clusters created by the algorithm.


ignoreWordIfInFewerDocs

public int ignoreWordIfInFewerDocs
Minimum word-document recurrences.

Attribute level:
Medium
Attribute group:
Word filtering

ignoreWordIfInHigherDocsPercent

public double ignoreWordIfInHigherDocsPercent
Maximum word-document ratio. A number between 0 and 1. If a word occurs in a larger fraction of snippets than this ratio, it is ignored.

Attribute level:
Medium
Attribute group:
Word filtering
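Read together, the two word-filtering attributes describe a band of acceptable document frequencies. The following is a minimal sketch of that reading, not the class's actual code: the class name WordFilterSketch, the method keptWords, and the exact boundary handling (strictly fewer, strictly more) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration; not part of Carrot2.
final class WordFilterSketch {
    /**
     * Returns the words that survive both filters. Each document is
     * represented by the set of distinct words it contains.
     */
    static Set<String> keptWords(List<Set<String>> docs,
                                 int ignoreWordIfInFewerDocs,
                                 double ignoreWordIfInHigherDocsPercent) {
        // Count in how many documents each word occurs.
        Map<String, Integer> docFrequency = new TreeMap<>();
        for (Set<String> doc : docs) {
            for (String word : doc) {
                docFrequency.merge(word, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFrequency.entrySet()) {
            int count = e.getValue();
            // Assumption: "fewer" and "more" are both strict comparisons.
            boolean tooRare = count < ignoreWordIfInFewerDocs;
            boolean tooCommon = count > ignoreWordIfInHigherDocsPercent * docs.size();
            if (!tooRare && !tooCommon) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
            new HashSet<>(Arrays.asList("data", "mining", "tools")),
            new HashSet<>(Arrays.asList("data", "mining")),
            new HashSet<>(Arrays.asList("data", "warehouse")),
            new HashSet<>(Arrays.asList("data", "lake")));
        System.out.println(keptWords(docs, 2, 0.9));
    }
}
```

In this sample, "data" occurs in every document and exceeds the 0.9 ratio, while "tools", "warehouse" and "lake" occur in only one document each, so only "mining" remains.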

minBaseClusterScore

public double minBaseClusterScore
Minimum base cluster score.

Attribute level:
Advanced
Attribute group:
Base clusters

maxBaseClusters

public int maxBaseClusters
Maximum base clusters count. Trims the base cluster array after the N-th position before the merging phase, where N is this value.

Attribute level:
Advanced
Attribute group:
Base clusters

minBaseClusterSize

public int minBaseClusterSize
Minimum documents per base cluster.

Attribute level:
Advanced
Attribute group:
Base clusters
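The three base-cluster attributes can be pictured as a filter followed by a trim. The sketch below is one plausible reading, not the class's actual code: the BaseCluster type, the select method, the inclusive comparisons, and the assumption that the array is ordered by descending score before trimming are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of Carrot2.
final class BaseClusterSelectionSketch {
    /** Minimal stand-in for a base cluster: one phrase, its documents, a score. */
    static final class BaseCluster {
        final String phrase;
        final int documentCount;
        final double score;

        BaseCluster(String phrase, int documentCount, double score) {
            this.phrase = phrase;
            this.documentCount = documentCount;
            this.score = score;
        }
    }

    static List<BaseCluster> select(List<BaseCluster> candidates,
                                    double minBaseClusterScore,
                                    int minBaseClusterSize,
                                    int maxBaseClusters) {
        List<BaseCluster> kept = new ArrayList<>();
        for (BaseCluster c : candidates) {
            // Assumption: both minimums are inclusive.
            if (c.score >= minBaseClusterScore && c.documentCount >= minBaseClusterSize) {
                kept.add(c);
            }
        }
        // Assumption: best-scoring clusters come first, so trimming drops the weakest.
        kept.sort((a, b) -> Double.compare(b.score, a.score));
        if (kept.size() > maxBaseClusters) {
            return new ArrayList<>(kept.subList(0, maxBaseClusters));
        }
        return kept;
    }

    public static void main(String[] args) {
        List<BaseCluster> candidates = Arrays.asList(
            new BaseCluster("data mining", 5, 8.0),
            new BaseCluster("web", 1, 9.0),
            new BaseCluster("search results", 3, 6.0),
            new BaseCluster("clustering", 6, 7.0));
        for (BaseCluster c : select(candidates, 2.0, 2, 2)) {
            System.out.println(c.phrase + " " + c.score);
        }
    }
}
```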

maxClusters

public int maxClusters
Maximum final clusters.

Attribute level:
Basic
Attribute group:
Merging and output

mergeThreshold

public double mergeThreshold
Base cluster merge threshold.

Attribute level:
Advanced
Attribute group:
Merging and output
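The cited STC paper merges two base clusters when the documents they share make up more than half of each cluster. The sketch below generalises that fixed 0.5 to a configurable threshold. Whether this class applies mergeThreshold in exactly this way is an assumption, and the names MergeSketch, shouldMerge and docs are hypothetical.

```java
import java.util.BitSet;

// Hypothetical illustration; not part of Carrot2.
final class MergeSketch {
    /**
     * Merge test in the style of the STC paper: the shared documents must
     * exceed the threshold as a fraction of BOTH clusters. Each BitSet holds
     * the indices of the documents in one base cluster.
     */
    static boolean shouldMerge(BitSet a, BitSet b, double mergeThreshold) {
        BitSet common = (BitSet) a.clone();
        common.and(b);
        double shared = common.cardinality();
        // An empty cluster yields NaN or 0 here, so the comparison is false.
        return shared / a.cardinality() > mergeThreshold
            && shared / b.cardinality() > mergeThreshold;
    }

    /** Builds a document set from a list of document indices. */
    static BitSet docs(int... indices) {
        BitSet set = new BitSet();
        for (int i : indices) {
            set.set(i);
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(shouldMerge(docs(0, 1, 2, 3), docs(1, 2, 3, 4), 0.5));
        System.out.println(shouldMerge(docs(0, 1, 2, 3), docs(3, 9, 10, 11), 0.5));
    }
}
```

In the first call the clusters share three of four documents each, so they pass a 0.5 threshold. In the second they share only one, so they do not.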

maxPhraseOverlap

public double maxPhraseOverlap
Maximum cluster phrase overlap.

Attribute level:
Advanced
Attribute group:
Label creation

mostGeneralPhraseCoverage

public double mostGeneralPhraseCoverage
Minimum general phrase coverage. The minimum coverage a phrase must have to appear in the cluster description.

Attribute level:
Advanced
Attribute group:
Label creation

maxDescPhraseLength

public int maxDescPhraseLength
Maximum words per label. Base clusters formed by phrases with more words than this limit are trimmed.

Attribute level:
Basic
Attribute group:
Label creation

maxPhrases

public int maxPhrases
Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label.

Attribute level:
Basic
Attribute group:
Label creation
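As a rough illustration of how maxDescPhraseLength and maxPhrases bound a label, the sketch below takes phrases already ranked best-first, skips those with too many words, and keeps at most maxPhrases of them. This is a simplification under stated assumptions: the class LabelSketch and the method label are hypothetical, the overlap and coverage attributes are left out, and the real class may apply the word limit earlier, at the base cluster stage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of Carrot2.
final class LabelSketch {
    /**
     * Picks label phrases from a best-first list: phrases longer than
     * maxDescPhraseLength words are skipped, and at most maxPhrases are kept.
     */
    static List<String> label(List<String> rankedPhrases,
                              int maxDescPhraseLength,
                              int maxPhrases) {
        List<String> chosen = new ArrayList<>();
        for (String phrase : rankedPhrases) {
            if (chosen.size() >= maxPhrases) {
                break;
            }
            // Words are assumed to be separated by whitespace.
            int wordCount = phrase.trim().split("\\s+").length;
            if (wordCount <= maxDescPhraseLength) {
                chosen.add(phrase);
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList(
            "data mining",
            "knowledge discovery in large databases",
            "text mining",
            "clustering");
        System.out.println(label(ranked, 4, 2));
    }
}
```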

singleTermBoost

public double singleTermBoost
Single term boost. A factor in calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.

Attribute level:
Medium
Attribute group:
Base cluster boosts

optimalPhraseLength

public int optimalPhraseLength
Optimal label length. A factor in calculation of the base cluster score.

Attribute level:
Basic
Attribute group:
Base cluster boosts

optimalPhraseLengthDev

public double optimalPhraseLengthDev
Phrase length tolerance. A factor in calculation of the base cluster score.

Attribute level:
Medium
Attribute group:
Base cluster boosts

documentCountBoost

public double documentCountBoost
Document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.

Attribute level:
Medium
Attribute group:
Base cluster boosts
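The four boost attributes all feed the base cluster score. The sketch below shows one way they could combine, and it rests on assumptions not stated on this page: the penalty function is taken to be a Gaussian centred on optimalPhraseLength with width optimalPhraseLengthDev, and the document count enters as a simple linear factor. The class BaseClusterScoreSketch and the method score are hypothetical.

```java
// Hypothetical illustration; not part of Carrot2.
final class BaseClusterScoreSketch {
    /**
     * Scores a base cluster from the length of its phrase (in words) and
     * the number of documents it contains.
     */
    static double score(int phraseLength,
                        int documentCount,
                        double singleTermBoost,
                        int optimalPhraseLength,
                        double optimalPhraseLengthDev,
                        double documentCountBoost) {
        final double lengthFactor;
        if (phraseLength == 1 && singleTermBoost > 0) {
            // Single-term clusters bypass the penalty function.
            lengthFactor = singleTermBoost;
        } else {
            // Assumed penalty: Gaussian falloff around the optimal length.
            double distance = phraseLength - optimalPhraseLength;
            lengthFactor = Math.exp(-(distance * distance)
                / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
        }
        // Assumed document-count boost: linear in the number of documents.
        return lengthFactor * documentCount * documentCountBoost;
    }

    public static void main(String[] args) {
        // Same document count, different phrase lengths.
        for (int length = 1; length <= 6; length++) {
            System.out.println(length + " words: " + score(length, 10, 0.5, 3, 2.0, 1.0));
        }
    }
}
```

Under these assumptions a phrase of exactly optimalPhraseLength words gets a length factor of 1, longer or shorter phrases get progressively less, and a larger optimalPhraseLengthDev makes the falloff gentler.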

preprocessingPipeline

public final BasicPreprocessingPipeline preprocessingPipeline
Common preprocessing tasks handler.


multilingualClustering

public final MultilingualClustering multilingualClustering
A helper for performing multilingual clustering.

Constructor Detail

STCClusteringAlgorithm

public STCClusteringAlgorithm()
Method Detail

process

public void process()
             throws ProcessingException
Performs STC clustering of documents.

Specified by:
process in interface IProcessingComponent
Overrides:
process in class ProcessingComponentBase
Throws:
ProcessingException - when processing failed. If thrown, the IProcessingComponent.afterProcessing() method will be called and the component will be ready to accept further requests or to be disposed of. Finally, the exception will be rethrown from the controller method that caused the component to perform processing.
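The contract above, where afterProcessing() runs even when process() fails and the exception still reaches the caller, corresponds to a try/finally in the calling code. The sketch below illustrates that pattern with a hypothetical Component interface and runOnce method. It is not the Carrot2 controller's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of Carrot2.
final class LifecycleSketch {
    /** Stand-in for the processing part of a component's life cycle. */
    interface Component {
        void beforeProcessing();
        void process() throws Exception;
        void afterProcessing();
    }

    /**
     * Runs one request. afterProcessing() is called whether or not
     * process() throws, and any exception propagates to the caller.
     */
    static void runOnce(Component component) throws Exception {
        component.beforeProcessing();
        try {
            component.process();
        } finally {
            component.afterProcessing();
        }
    }

    public static void main(String[] args) {
        final List<String> log = new ArrayList<>();
        Component failing = new Component() {
            public void beforeProcessing() { log.add("before"); }
            public void process() throws Exception {
                log.add("process");
                throw new Exception("processing failed");
            }
            public void afterProcessing() { log.add("after"); }
        };
        try {
            runOnce(failing);
        } catch (Exception e) {
            log.add("rethrown: " + e.getMessage());
        }
        System.out.println(log);
    }
}
```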

afterProcessing

public void afterProcessing()
Memory cleanups.

Specified by:
afterProcessing in interface IProcessingComponent
Overrides:
afterProcessing in class ProcessingComponentBase


Copyright (c) Dawid Weiss, Stanislaw Osinski