org.carrot2.text.preprocessing
Class PhraseExtractor

java.lang.Object
  extended by org.carrot2.text.preprocessing.PhraseExtractor

public class PhraseExtractor
extends Object

Extracts frequent phrases from the provided documents. A frequent phrase is a sequence of words that appears in the documents more than once. This phrase extractor aggregates different inflection variants of phrase words into one phrase and returns the most frequent variant. For example, if the phrase computing science appears 2 times and computer sciences appears 4 times, the latter will be returned with an aggregated frequency of 6.
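
The variant aggregation described above can be illustrated with a small, self-contained sketch. It is not the Carrot2 implementation: the class name PhraseAggregationSketch, the methods stemKey and aggregate, and the hard-coded stem table are all hypothetical stand-ins chosen only to reproduce the computing science / computer sciences example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class PhraseAggregationSketch {
    // Toy stem table standing in for a real stemmer (an assumption for this example).
    private static final Map<String, String> STEMS = Map.of(
        "computing", "comput",
        "computer", "comput",
        "science", "scienc",
        "sciences", "scienc");

    /** Builds a grouping key by lower-casing each word and replacing it with its stem. */
    public static String stemKey(String phrase) {
        StringBuilder key = new StringBuilder();
        for (String word : phrase.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (key.length() > 0) {
                key.append(' ');
            }
            key.append(STEMS.getOrDefault(word, word));
        }
        return key.toString();
    }

    /** Maps the most frequent variant of each phrase to the summed frequency of all its variants. */
    public static Map<String, Integer> aggregate(Map<String, Integer> variantCounts) {
        Map<String, String> bestVariant = new LinkedHashMap<>();
        Map<String, Integer> bestCount = new LinkedHashMap<>();
        Map<String, Integer> total = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : variantCounts.entrySet()) {
            String key = stemKey(e.getKey());
            total.merge(key, e.getValue(), Integer::sum);
            // Remember the variant with the highest individual count seen so far.
            if (e.getValue() > bestCount.getOrDefault(key, 0)) {
                bestCount.put(key, e.getValue());
                bestVariant.put(key, e.getKey());
            }
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : bestVariant.entrySet()) {
            result.put(e.getValue(), total.get(e.getKey()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("computing science", 2);
        counts.put("computer sciences", 4);
        System.out.println(aggregate(counts)); // prints {computer sciences=6}
    }
}
```

Both variants share the stem key "comput scienc", so their counts are summed and the variant with the higher individual count is used as the label.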

This class saves the following results to the PreprocessingContext:

This class requires that Tokenizer, CaseNormalizer and LanguageModelStemmer be invoked first.


Field Summary
 int dfThreshold
          Phrase Document Frequency threshold.
 
Constructor Summary
PhraseExtractor()
           
 
Method Summary
 void extractPhrases(PreprocessingContext context)
          Performs phrase extraction and saves the results to the provided context.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

dfThreshold

public int dfThreshold
Phrase Document Frequency threshold. Phrases appearing in fewer than dfThreshold documents will be ignored.

Attribute label:
Phrase Document Frequency threshold
Attribute level:
Advanced
Attribute group:
Phrase extraction
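
The effect of the threshold can be sketched as a simple filter over per-phrase document frequencies. This is only an illustration of the rule stated above, not the class's actual code; DfThresholdSketch, filterByDf, and the sample phrases are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DfThresholdSketch {
    /** Keeps only phrases that appear in at least dfThreshold documents. */
    public static Map<String, Integer> filterByDf(Map<String, Integer> phraseDf, int dfThreshold) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : phraseDf.entrySet()) {
            // A phrase seen in fewer than dfThreshold documents is ignored.
            if (e.getValue() >= dfThreshold) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new LinkedHashMap<>();
        df.put("data mining", 3);      // appears in 3 documents
        df.put("text clustering", 1);  // appears in 1 document
        System.out.println(filterByDf(df, 2)); // prints {data mining=3}
    }
}
```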
Constructor Detail

PhraseExtractor

public PhraseExtractor()
Method Detail

extractPhrases

public void extractPhrases(PreprocessingContext context)
Performs phrase extraction and saves the results to the provided context.



Copyright (c) Dawid Weiss, Stanislaw Osinski