Package org.carrot2.text.preprocessing

Contains the unified input preprocessing infrastructure (term indexing, stemming, label discovery).


Class Summary
CaseNormalizer Performs case normalization and calculates a number of frequency statistics for words.
DocumentAssigner Assigns documents to label candidates.
LabelFilterProcessor Applies basic filtering to words and phrases to produce candidates for cluster labels.
LabelFormatter Formats cluster labels for final rendering.
LanguageModelStemmer Applies stemming to words and calculates a number of frequency statistics for stems.
PhraseExtractor Extracts frequent phrases from the provided document.
PreprocessedDocumentScanner Iterates over tokenized documents in PreprocessingContext.
PreprocessingContext Document preprocessing context provides low-level (usually integer-coded) data structures useful for further processing.
PreprocessingContext.AllFields Information about all fields processed for the input PreprocessingContext.documents.
PreprocessingContext.AllLabels Information about words and phrases that might be good cluster label candidates.
PreprocessingContext.AllPhrases Information about all frequently appearing sequences of words found in the input PreprocessingContext.documents.
PreprocessingContext.AllStems Information about all unique stems found in the input PreprocessingContext.documents.
PreprocessingContext.AllTokens Information about all tokens of the input PreprocessingContext.documents.
PreprocessingContext.AllWords Information about all unique words found in the input PreprocessingContext.documents.
SparseArray Sparse array encoding utilities.
StopListMarker Marks stop words based on the current language model.
Tokenizer Performs tokenization of documents.
 

Package org.carrot2.text.preprocessing Description


Package Specification

The main output class resulting from preprocessing is PreprocessingContext. It contains several nested classes (AllTokens, AllWords, AllStems, AllPhrases, AllLabels, AllFields) that hold int-indexed arrays.
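
The idea of int-indexed arrays can be illustrated with a minimal, self-contained sketch. It does not use the Carrot2 API: the class IntCodedIndexSketch, its nested Words, Tokens and Index classes, and all field names are hypothetical stand-ins chosen for this example, and the whitespace tokenization is a deliberate simplification. Each unique word gets an integer id, per-word data lives in parallel arrays indexed by that id, and the token sequence is stored as an array of word ids.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy illustration of int-coded token and word tables. Not Carrot2 code. */
public class IntCodedIndexSketch {
    /** Hypothetical "unique words" table: parallel arrays indexed by word id. */
    public static final class Words {
        public final String[] image; // image[w] is the text of word w
        public final int[] tf;       // tf[w] is the total occurrence count of word w

        Words(String[] image, int[] tf) {
            this.image = image;
            this.tf = tf;
        }
    }

    /** Hypothetical "all tokens" table: one entry per token, in input order. */
    public static final class Tokens {
        public final int[] wordIndex;     // wordIndex[t] is an index into Words.image and Words.tf
        public final int[] documentIndex; // documentIndex[t] is the document that token t came from

        Tokens(int[] wordIndex, int[] documentIndex) {
            this.wordIndex = wordIndex;
            this.documentIndex = documentIndex;
        }
    }

    /** Bundles both tables, loosely analogous to a preprocessing context. */
    public static final class Index {
        public final Words words;
        public final Tokens tokens;

        Index(Words words, Tokens tokens) {
            this.words = words;
            this.tokens = tokens;
        }
    }

    /** Lower-cases and whitespace-splits each document, assigning word ids in order of first appearance. */
    public static Index build(String[] documents) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        List<Integer> tf = new ArrayList<>();
        List<Integer> wordIndex = new ArrayList<>();
        List<Integer> documentIndex = new ArrayList<>();

        for (int d = 0; d < documents.length; d++) {
            for (String raw : documents[d].toLowerCase(Locale.ROOT).split("\\s+")) {
                if (raw.isEmpty()) {
                    continue; // split() yields an empty string for blank or space-prefixed input
                }
                Integer known = ids.get(raw);
                int id;
                if (known == null) {
                    id = ids.size();
                    ids.put(raw, id);
                    tf.add(0);
                } else {
                    id = known;
                }
                tf.set(id, tf.get(id) + 1);
                wordIndex.add(id);
                documentIndex.add(d);
            }
        }

        // LinkedHashMap keeps insertion order, so position in the key set equals the word id.
        String[] image = ids.keySet().toArray(new String[0]);
        return new Index(
            new Words(image, toIntArray(tf)),
            new Tokens(toIntArray(wordIndex), toIntArray(documentIndex)));
    }

    private static int[] toIntArray(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        Index index = build(new String[] {"Data mining", "data clustering"});
        // Walk the token stream and resolve each int code back to its word and frequency.
        for (int t = 0; t < index.tokens.wordIndex.length; t++) {
            int w = index.tokens.wordIndex[t];
            System.out.println("doc " + index.tokens.documentIndex[t]
                + ": " + index.words.image[w] + " (tf=" + index.words.tf[w] + ")");
        }
    }
}
```

For the two toy documents in main, the token array holds the word ids 0, 1, 0, 2, so later stages can compare and count words with plain int operations instead of string comparisons. Whether the real PreprocessingContext arrays follow exactly this layout should be checked against the class documentation.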



Copyright (c) Dawid Weiss, Stanislaw Osinski