Package org.carrot2.text.preprocessing

Contains the unified input preprocessing infrastructure (term indexing, stemming, label discovery).


Class Summary
CaseNormalizer Performs case normalization and calculates a number of frequency statistics for words.
CaseNormalizerDescriptor Metadata and attributes of the CaseNormalizer component.
CaseNormalizerDescriptor.AttributeBuilder Attribute map builder for the CaseNormalizer component.
CaseNormalizerDescriptor.Attributes All attributes of the CaseNormalizer component.
CaseNormalizerDescriptor.Keys Constants for all attribute keys of the CaseNormalizer component.
DocumentAssigner Assigns documents to label candidates.
DocumentAssignerDescriptor Metadata and attributes of the DocumentAssigner component.
DocumentAssignerDescriptor.AttributeBuilder Attribute map builder for the DocumentAssigner component.
DocumentAssignerDescriptor.Attributes All attributes of the DocumentAssigner component.
DocumentAssignerDescriptor.Keys Constants for all attribute keys of the DocumentAssigner component.
LabelFilterProcessor Applies basic filtering to words and phrases to produce candidates for cluster labels.
LabelFilterProcessorDescriptor Metadata and attributes of the LabelFilterProcessor component.
LabelFilterProcessorDescriptor.AttributeBuilder Attribute map builder for the LabelFilterProcessor component.
LabelFilterProcessorDescriptor.Attributes All attributes of the LabelFilterProcessor component.
LabelFilterProcessorDescriptor.Keys Constants for all attribute keys of the LabelFilterProcessor component.
LabelFormatter Formats cluster labels for final rendering.
LabelFormatterDescriptor Metadata and attributes of the LabelFormatter component.
LabelFormatterDescriptor.AttributeBuilder Attribute map builder for the LabelFormatter component.
LabelFormatterDescriptor.Attributes All attributes of the LabelFormatter component.
LabelFormatterDescriptor.Keys Constants for all attribute keys of the LabelFormatter component.
LanguageModelStemmer Applies stemming to words and calculates a number of frequency statistics for stems.
LanguageModelStemmerDescriptor Metadata and attributes of the LanguageModelStemmer component.
LanguageModelStemmerDescriptor.AttributeBuilder Attribute map builder for the LanguageModelStemmer component.
LanguageModelStemmerDescriptor.Attributes All attributes of the LanguageModelStemmer component.
LanguageModelStemmerDescriptor.Keys Constants for all attribute keys of the LanguageModelStemmer component.
PhraseExtractor Extracts frequent phrases from the provided document.
PhraseExtractorDescriptor Metadata and attributes of the PhraseExtractor component.
PhraseExtractorDescriptor.AttributeBuilder Attribute map builder for the PhraseExtractor component.
PhraseExtractorDescriptor.Attributes All attributes of the PhraseExtractor component.
PhraseExtractorDescriptor.Keys Constants for all attribute keys of the PhraseExtractor component.
PreprocessedDocumentScanner Iterates over tokenized documents in PreprocessingContext.
PreprocessingContext The document preprocessing context; provides low-level (usually integer-coded) data structures useful for further processing.
PreprocessingContext.AllFields Information about all fields processed for the input PreprocessingContext.documents.
SparseArray Sparse array encoding utilities.
StopListMarker Marks stop words based on the current language model.
StopListMarkerDescriptor Metadata and attributes of the StopListMarker component.
StopListMarkerDescriptor.AttributeBuilder Attribute map builder for the StopListMarker component.
StopListMarkerDescriptor.Attributes All attributes of the StopListMarker component.
StopListMarkerDescriptor.Keys Constants for all attribute keys of the StopListMarker component.
Tokenizer Performs tokenization of documents.
TokenizerDescriptor Metadata and attributes of the Tokenizer component.
TokenizerDescriptor.AttributeBuilder Attribute map builder for the Tokenizer component.
TokenizerDescriptor.Attributes All attributes of the Tokenizer component.
TokenizerDescriptor.Keys Constants for all attribute keys of the Tokenizer component.

Package org.carrot2.text.preprocessing Description


Package Specification

The main output of preprocessing is PreprocessingContext. It contains several nested classes (for example PreprocessingContext.AllFields), each holding int-indexed arrays.
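
To make the idea of "int-indexed arrays" concrete, here is a minimal, self-contained sketch of integer-coded parallel arrays. It is not the Carrot2 implementation: the class IntCodedSketch, the nested class WordIndex, and the fields image, tf and tokens are hypothetical names chosen for illustration, and the whitespace tokenization and lower-casing are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of integer-coded data; not part of org.carrot2. */
public class IntCodedSketch {

    /** Parallel arrays: position i in image and tf describes word number i. */
    public static final class WordIndex {
        public final String[] image; // image[i]: text of word i
        public final int[] tf;       // tf[i]: number of occurrences of word i
        public final int[] tokens;   // tokens[p]: word index found at token position p

        WordIndex(String[] image, int[] tf, int[] tokens) {
            this.image = image;
            this.tf = tf;
            this.tokens = tokens;
        }
    }

    /** Splits on whitespace, lower-cases, and assigns each distinct word an int code. */
    public static WordIndex build(String text) {
        Map<String, Integer> codes = new LinkedHashMap<>(); // keeps first-seen order
        List<Integer> counts = new ArrayList<>();
        List<Integer> sequence = new ArrayList<>();

        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // leading whitespace yields an empty first element
            }
            String word = raw.toLowerCase(Locale.ROOT);
            Integer existing = codes.get(word);
            int code;
            if (existing == null) {
                code = codes.size();      // next free integer code
                codes.put(word, code);
                counts.add(0);
            } else {
                code = existing;
            }
            counts.set(code, counts.get(code) + 1);
            sequence.add(code);
        }

        String[] image = codes.keySet().toArray(new String[0]);
        int[] tf = counts.stream().mapToInt(Integer::intValue).toArray();
        int[] tokens = sequence.stream().mapToInt(Integer::intValue).toArray();
        return new WordIndex(image, tf, tokens);
    }

    public static void main(String[] args) {
        WordIndex index = build("data mining and Data clustering");
        for (int i = 0; i < index.image.length; i++) {
            System.out.println(i + " " + index.image[i] + " tf=" + index.tf[i]);
        }
        System.out.println(Arrays.toString(index.tokens));
    }
}
```

Once words are coded as integers, later stages can refer to a word by its index and look up any per-word attribute in the matching array, rather than comparing or hashing strings repeatedly.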



Copyright (c) Dawid Weiss, Stanislaw Osinski