org.carrot2.text.preprocessing.pipeline
Class BasicPreprocessingPipeline

java.lang.Object
  extended by org.carrot2.text.preprocessing.pipeline.BasicPreprocessingPipeline
All Implemented Interfaces:
IPreprocessingPipeline
Direct Known Subclasses:
CompletePreprocessingPipeline

public class BasicPreprocessingPipeline
extends Object
implements IPreprocessingPipeline

Performs basic preprocessing steps on the provided documents. The preprocessing consists of the following steps, applied in this order:

  1. Tokenizer.tokenize(PreprocessingContext)
  2. CaseNormalizer.normalize(PreprocessingContext)
  3. LanguageModelStemmer.stem(PreprocessingContext)
  4. StopListMarker.mark(PreprocessingContext)
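
The ordering of the four steps above can be illustrated with a minimal, self-contained sketch. It does not use Carrot2's classes: SketchContext and SketchPipeline are hypothetical stand-ins, and the whitespace tokenizer, the trailing-"s" stemmer and the four-word stop list are toy assumptions chosen only to show how each stage reads what the previous stage wrote into a shared context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a shared context; NOT Carrot2's PreprocessingContext.
class SketchContext {
    final String text;
    final List<String> tokens = new ArrayList<>();      // written by step 1
    final List<String> normalized = new ArrayList<>();  // written by step 2
    final List<String> stems = new ArrayList<>();       // written by step 3
    final List<Boolean> stopWord = new ArrayList<>();   // written by step 4

    SketchContext(String text) {
        this.text = text;
    }
}

// Hypothetical stand-in for the pipeline; every stage below is a toy assumption.
class SketchPipeline {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and");

    // Step 1: split the raw text on runs of non-word characters.
    static void tokenize(SketchContext c) {
        for (String t : c.text.split("\\W+")) {
            if (!t.isEmpty()) {
                c.tokens.add(t);
            }
        }
    }

    // Step 2: lower-case every token produced by step 1.
    static void normalizeCase(SketchContext c) {
        for (String t : c.tokens) {
            c.normalized.add(t.toLowerCase(Locale.ROOT));
        }
    }

    // Step 3: toy stemmer that drops a trailing "s" from words longer than 3 characters.
    static void stem(SketchContext c) {
        for (String t : c.normalized) {
            boolean strip = t.length() > 3 && t.endsWith("s");
            c.stems.add(strip ? t.substring(0, t.length() - 1) : t);
        }
    }

    // Step 4: flag each normalized token that appears in the toy stop list.
    static void markStopWords(SketchContext c) {
        for (String t : c.normalized) {
            c.stopWord.add(STOP_WORDS.contains(t));
        }
    }

    // Runs the four stages in the documented order on one shared context.
    static SketchContext preprocess(String text) {
        SketchContext c = new SketchContext(text);
        tokenize(c);
        normalizeCase(c);
        stem(c);
        markStopWords(c);
        return c;
    }
}
```

In this sketch the order matters because each stage consumes the previous stage's output: stemming and stop-word marking both read the normalized tokens, which exist only after tokenization and case normalization have run.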


Field Summary
 CaseNormalizer caseNormalizer
          Case normalizer used by the algorithm, contains bindable attributes.
 LanguageModelStemmer languageModelStemmer
          Stemmer used by the algorithm, contains bindable attributes.
 ILexicalDataFactory lexicalDataFactory
          Lexical data factory.
 IStemmerFactory stemmerFactory
          Stemmer factory.
 StopListMarker stopListMarker
          Stop list marker used by the algorithm, contains bindable attributes.
 Tokenizer tokenizer
          Tokenizer used by the algorithm, contains bindable attributes.
 ITokenizerFactory tokenizerFactory
          Tokenizer factory.
 
Constructor Summary
BasicPreprocessingPipeline()
           
 
Method Summary
 PreprocessingContext preprocess(List<Document> documents, String query, LanguageCode language)
          Performs preprocessing on the provided list of documents.
 void preprocess(PreprocessingContext context)
          Performs preprocessing on the provided PreprocessingContext.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

tokenizer

public final Tokenizer tokenizer
Tokenizer used by the algorithm, contains bindable attributes.


caseNormalizer

public final CaseNormalizer caseNormalizer
Case normalizer used by the algorithm, contains bindable attributes.


languageModelStemmer

public final LanguageModelStemmer languageModelStemmer
Stemmer used by the algorithm, contains bindable attributes.


stopListMarker

public final StopListMarker stopListMarker
Stop list marker used by the algorithm, contains bindable attributes.


tokenizerFactory

public ITokenizerFactory tokenizerFactory
Tokenizer factory. Creates the tokenizers to be used by the clustering algorithm.

Attribute level:
Advanced
Attribute group:
Preprocessing

stemmerFactory

public IStemmerFactory stemmerFactory
Stemmer factory. Creates the stemmers to be used by the clustering algorithm.

Attribute level:
Advanced
Attribute group:
Preprocessing

lexicalDataFactory

public ILexicalDataFactory lexicalDataFactory
Lexical data factory. Creates the lexical data to be used by the clustering algorithm, including stop word and stop label dictionaries.

Attribute level:
Advanced
Attribute group:
Preprocessing

Constructor Detail

BasicPreprocessingPipeline

public BasicPreprocessingPipeline()

Method Detail

preprocess

public PreprocessingContext preprocess(List<Document> documents,
                                       String query,
                                       LanguageCode language)
Performs preprocessing on the provided list of documents. Results can be obtained from the returned PreprocessingContext.

Specified by:
preprocess in interface IPreprocessingPipeline

preprocess

public void preprocess(PreprocessingContext context)
Performs preprocessing on the provided PreprocessingContext. The method returns nothing, so results are obtained from the context that was passed in.

Specified by:
preprocess in interface IPreprocessingPipeline
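
The two preprocess overloads differ in who owns the context: one builds and returns it, the other fills a context supplied by the caller. The sketch below shows one plausible way such a pair can relate, with the convenience overload delegating to the context-based one. This delegation is an assumption made for illustration, not a statement about Carrot2's implementation. DemoContext and DemoPipeline are hypothetical names, and plain strings stand in for Document and LanguageCode.

```java
import java.util.List;

// Hypothetical stand-in for a preprocessing context; NOT Carrot2's class.
class DemoContext {
    final List<String> documents;
    final String query;
    final String language;
    boolean processed; // set once the pipeline has run over this context

    DemoContext(List<String> documents, String query, String language) {
        this.documents = documents;
        this.query = query;
        this.language = language;
    }
}

// Hypothetical pipeline exposing the same two entry-point shapes.
class DemoPipeline {
    // Context-based entry point: the caller owns the context and reads results from it.
    void preprocess(DemoContext context) {
        // A real pipeline would run its stages here; the sketch only records that it ran.
        context.processed = true;
    }

    // Convenience entry point: builds a context, fills it, and hands it back.
    DemoContext preprocess(List<String> documents, String query, String language) {
        DemoContext context = new DemoContext(documents, query, language);
        preprocess(context);
        return context;
    }
}
```

Under this assumption, a caller who already holds a context would use the void overload, while a caller starting from raw documents would use the three-argument overload and read results from the returned object.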


Copyright (c) Dawid Weiss, Stanislaw Osinski