org.carrot2.text.preprocessing.pipeline
Class BasicPreprocessingPipeline

java.lang.Object
  extended by org.carrot2.text.preprocessing.pipeline.BasicPreprocessingPipeline
Direct Known Subclasses:
CompletePreprocessingPipeline

public class BasicPreprocessingPipeline
extends Object

Performs basic preprocessing steps on the provided documents. The preprocessing consists of the following steps:

  1. Tokenizer.tokenize(PreprocessingContext)
  2. CaseNormalizer.normalize(PreprocessingContext)
  3. LanguageModelStemmer.stem(PreprocessingContext)
  4. StopListMarker.mark(PreprocessingContext)
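
The order of the four steps above can be pictured with a minimal plain-Java sketch. This is not Carrot2 code: the class PipelineSketch, its Token record, the toy stop list and the "strip a trailing s" stemmer are all hypothetical stand-ins, and case normalization is reduced to lower-casing for brevity. The real components read from and write to a shared PreprocessingContext instead of a token list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative stand-in for the four preprocessing steps; not Carrot2 code. */
public class PipelineSketch {

    /** Hypothetical per-token record filled in step by step. */
    public static final class Token {
        public final String image;  // token text as it appeared in the input
        public String normalized;   // lower-cased form, set in step 2
        public String stem;         // toy stem, set in step 3
        public boolean stopWord;    // stop-list flag, set in step 4

        Token(String image) {
            this.image = image;
        }
    }

    /** Assumed toy stop list; a real one comes from the language model. */
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and");

    /** Step 1: split the text on runs of non-letter, non-digit characters. */
    public static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(new Token(part));
            }
        }
        return tokens;
    }

    /** Step 2: case normalization, simplified here to lower-casing. */
    public static void normalize(List<Token> tokens) {
        for (Token t : tokens) {
            t.normalized = t.image.toLowerCase(Locale.ROOT);
        }
    }

    /** Step 3: toy stemmer that strips one trailing "s" from words longer than 3 characters. */
    public static void stem(List<Token> tokens) {
        for (Token t : tokens) {
            String w = t.normalized;
            t.stem = (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
        }
    }

    /** Step 4: flag tokens whose normalized form is on the stop list. */
    public static void mark(List<Token> tokens) {
        for (Token t : tokens) {
            t.stopWord = STOP_WORDS.contains(t.normalized);
        }
    }

    /** Runs the four steps in the documented order. */
    public static List<Token> preprocess(String text) {
        List<Token> tokens = tokenize(text);
        normalize(tokens);
        stem(tokens);
        mark(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : preprocess("The Clusters of Documents")) {
            System.out.println(t.image + " -> " + t.stem + (t.stopWord ? " [stop]" : ""));
        }
    }
}
```

Each later step depends on the output of an earlier one (stemming and stop-word marking both read the normalized form), which is why the sketch keeps the steps in a fixed sequence.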


Field Summary
 CaseNormalizer caseNormalizer
          Case normalizer used by the algorithm, contains bindable attributes.
 ILanguageModelFactory languageModelFactory
          Language model factory.
 LanguageModelStemmer languageModelStemmer
          Stemmer used by the algorithm, contains bindable attributes.
 StopListMarker stopListMarker
          Stop list marker used by the algorithm, contains bindable attributes.
 Tokenizer tokenizer
          Tokenizer used by the algorithm, contains bindable attributes.
 
Constructor Summary
BasicPreprocessingPipeline()
           
 
Method Summary
 PreprocessingContext preprocess(List<Document> documents, String query, LanguageCode language)
          Performs preprocessing on the provided list of documents.
 void preprocess(PreprocessingContext context)
          Performs preprocessing on the provided PreprocessingContext.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

tokenizer

public final Tokenizer tokenizer
Tokenizer used by the algorithm, contains bindable attributes.


caseNormalizer

public final CaseNormalizer caseNormalizer
Case normalizer used by the algorithm, contains bindable attributes.


languageModelStemmer

public final LanguageModelStemmer languageModelStemmer
Stemmer used by the algorithm, contains bindable attributes.


stopListMarker

public final StopListMarker stopListMarker
Stop list marker used by the algorithm, contains bindable attributes.


languageModelFactory

public ILanguageModelFactory languageModelFactory
Language model factory. Creates the language model to be used by the clustering algorithm. The language model provides the lexical resources required to perform clustering, including stop words and a word stemming algorithm.

Attribute level:
Advanced
Attribute group:
Preprocessing
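
The role of a language model factory, as described for languageModelFactory above, can be sketched in plain Java. Everything in this block is a hypothetical illustration: the LanguageModel and LanguageModelFactory interfaces, the SimpleFactory class, the string language codes and the toy resources are assumptions made for the example and do not reproduce the actual ILanguageModelFactory contract.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch of a language model factory; not Carrot2 code. */
public class LanguageModelSketch {

    /** Hypothetical bundle of lexical resources for one language. */
    public interface LanguageModel {
        boolean isStopWord(String word);

        String stem(String word);
    }

    /** Hypothetical factory keyed by a language code string. */
    public interface LanguageModelFactory {
        LanguageModel create(String languageCode);
    }

    /** Toy factory: English gets a tiny stop list and a trivial stemmer, other codes get neither. */
    public static final class SimpleFactory implements LanguageModelFactory {
        @Override
        public LanguageModel create(String languageCode) {
            final boolean english = "en".equals(languageCode);
            final Set<String> stopWords;
            if (english) {
                stopWords = Set.of("the", "of", "and");
            } else {
                stopWords = Collections.emptySet();
            }
            return new LanguageModel() {
                @Override
                public boolean isStopWord(String word) {
                    return stopWords.contains(word.toLowerCase(Locale.ROOT));
                }

                @Override
                public String stem(String word) {
                    // Assumed toy rule: strip one trailing "s" for English, identity otherwise.
                    if (english && word.length() > 3 && word.endsWith("s")) {
                        return word.substring(0, word.length() - 1);
                    }
                    return word;
                }
            };
        }
    }

    public static void main(String[] args) {
        LanguageModel en = new SimpleFactory().create("en");
        System.out.println(en.isStopWord("The") + " " + en.stem("clusters"));
    }
}
```

Keeping the stop list and the stemmer behind one factory means the rest of a pipeline only needs a language code to obtain both resources together.
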
Constructor Detail

BasicPreprocessingPipeline

public BasicPreprocessingPipeline()
Method Detail

preprocess

public PreprocessingContext preprocess(List<Document> documents,
                                       String query,
                                       LanguageCode language)
Performs preprocessing on the provided list of documents. Results can be obtained from the returned PreprocessingContext.


preprocess

public void preprocess(PreprocessingContext context)
Performs preprocessing on the provided PreprocessingContext.



Copyright (c) Dawid Weiss, Stanislaw Osinski