org.carrot2.text.preprocessing
Class PreprocessingContext.AllStems

java.lang.Object
  extended by org.carrot2.text.preprocessing.PreprocessingContext.AllStems
Enclosing class:
PreprocessingContext

public class PreprocessingContext.AllStems
extends Object

Information about all unique stems found in the input PreprocessingContext.documents. Each entry in each array corresponds to one base form to which the IStemmer used during processing can reduce different words. For example, the English words mining and mine will be aggregated into one entry in these arrays, while they will have separate entries in PreprocessingContext.AllWords.

All arrays in this class have the same length, and values at the same index in different arrays describe the same stem.
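The parallel-array layout described above can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not the Carrot2 class: only the field names image, tf and mostFrequentOriginalWordIndex mirror the documented fields, wordImage is an assumed stand-in for an array of word images in PreprocessingContext.AllWords, and all values are invented sample data.

```java
// Hypothetical stand-in for AllStems; not the Carrot2 implementation.
final class ParallelStemArraysSketch {
    // Assumed stand-in for word images held in AllWords (sample data).
    static final char[][] wordImage = {
        "mining".toCharArray(), "mine".toCharArray(), "data".toCharArray()
    };

    // Parallel arrays: index i in every array refers to the same stem.
    static final char[][] image = { "mine".toCharArray(), "data".toCharArray() };
    static final int[] tf = { 5, 3 };
    static final int[] mostFrequentOriginalWordIndex = { 0, 2 };

    /** Collects everything known about one stem by reading each array at the same index. */
    static String describe(int stemIndex) {
        int wordIndex = mostFrequentOriginalWordIndex[stemIndex];
        return "stem=" + new String(image[stemIndex])
            + " tf=" + tf[stemIndex]
            + " mostFrequentWord=" + new String(wordImage[wordIndex]);
    }

    public static void main(String[] args) {
        for (int i = 0; i < image.length; i++) {
            System.out.println(describe(i));
        }
    }
}
```

In this sketch a stem is identified only by its index, so passing an int around is enough to reach every attribute of that stem.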


Field Summary
 byte[] fieldIndices
          Bit-packed indices of all fields in which this stem appears at least once.
 char[][] image
          Stem image as produced by the IStemmer; it may not correspond to any correct word.
 int[] mostFrequentOriginalWordIndex
          Pointer to the PreprocessingContext.AllWords arrays, to the most frequent original form of the stem.
 int[] tf
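          Term frequency of the stem, i.e. the sum of all PreprocessingContext.AllWords.tf values for which the PreprocessingContext.AllWords.stemIndex points to this stem.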
 int[][] tfByDocument
          Term frequency of the stem for each document.
 
Constructor Summary
PreprocessingContext.AllStems()
           
 
Method Summary
 String toString()
          For debugging purposes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

image

public char[][] image
Stem image as produced by the IStemmer; it may not correspond to any correct word.

This array is produced by LanguageModelStemmer.


mostFrequentOriginalWordIndex

public int[] mostFrequentOriginalWordIndex
Pointer to the PreprocessingContext.AllWords arrays, to the most frequent original form of the stem. Pointers to the less frequent variants are not available.

This array is produced by LanguageModelStemmer.


tf

public int[] tf
Term frequency of the stem, i.e. the sum of all PreprocessingContext.AllWords.tf values for which the PreprocessingContext.AllWords.stemIndex points to this stem.

This array is produced by LanguageModelStemmer.
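The relationship between word-level and stem-level data described for tf and mostFrequentOriginalWordIndex can be sketched as follows. This is an illustration under stated assumptions, not the LanguageModelStemmer code: the method names are hypothetical, the arrays wordTf and wordStemIndex stand in for PreprocessingContext.AllWords.tf and PreprocessingContext.AllWords.stemIndex, and the tie-breaking rule (the first word wins) is an assumption.

```java
// Hypothetical aggregation sketch; not the LanguageModelStemmer implementation.
final class StemAggregationSketch {
    /** Sums word frequencies per stem: result[s] is the total of wordTf[w] over all w with wordStemIndex[w] == s. */
    static int[] stemTf(int[] wordTf, int[] wordStemIndex, int stemCount) {
        int[] result = new int[stemCount];
        for (int w = 0; w < wordTf.length; w++) {
            result[wordStemIndex[w]] += wordTf[w];
        }
        return result;
    }

    /** For each stem, returns the index of its most frequent word, or -1 if no word maps to it. */
    static int[] mostFrequentWord(int[] wordTf, int[] wordStemIndex, int stemCount) {
        int[] best = new int[stemCount];
        for (int s = 0; s < stemCount; s++) {
            best[s] = -1;
        }
        for (int w = 0; w < wordTf.length; w++) {
            int s = wordStemIndex[w];
            // Assumption: on equal frequency the earlier word is kept.
            if (best[s] < 0 || wordTf[w] > wordTf[best[s]]) {
                best[s] = w;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Sample data: words "mining" (tf 4), "mine" (tf 1), "data" (tf 3).
        int[] wordTf = { 4, 1, 3 };
        int[] wordStemIndex = { 0, 0, 1 };
        int[] tf = stemTf(wordTf, wordStemIndex, 2);
        int[] best = mostFrequentWord(wordTf, wordStemIndex, 2);
        System.out.println("tf: " + tf[0] + ", " + tf[1]);
        System.out.println("most frequent word index: " + best[0] + ", " + best[1]);
    }
}
```

Only one word index is kept per stem, which matches the note that pointers to the less frequent variants are not available.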


tfByDocument

public int[][] tfByDocument
Term frequency of the stem for each document. For the encoding of this array, see PreprocessingContext.AllWords.tfByDocument.

This array is produced by LanguageModelStemmer.


fieldIndices

public byte[] fieldIndices
Bit-packed indices of all fields in which this stem appears at least once. Indexes (positions) of the set bits are pointers to the PreprocessingContext.AllFields arrays. Fast conversion between the bit-packed representation and a byte[] of index values is done by PreprocessingContext.toFieldIndexes(byte).

This array is produced by LanguageModelStemmer.
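The bit-packed encoding can be illustrated with a small sketch. It does not reproduce PreprocessingContext.toFieldIndexes(byte); the helper names pack and setBitPositions are hypothetical, and the sketch assumes that bit 0 (the least significant bit) corresponds to field index 0.

```java
// Hypothetical helpers showing one possible bit-packed encoding; not the Carrot2 code.
final class FieldBitsSketch {
    /** Packs field indexes in the range 0..7 into one byte by setting one bit per index. */
    static byte pack(int... fieldIndexes) {
        int bits = 0;
        for (int f : fieldIndexes) {
            bits |= 1 << f;
        }
        return (byte) bits;
    }

    /** Returns the positions of the set bits in ascending order. */
    static byte[] setBitPositions(byte packed) {
        int bits = packed & 0xFF; // read the byte as an unsigned value
        byte[] result = new byte[Integer.bitCount(bits)];
        int n = 0;
        for (byte pos = 0; pos < 8; pos++) {
            if ((bits & (1 << pos)) != 0) {
                result[n++] = pos;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A stem that appears in fields 0 and 2.
        byte packed = pack(0, 2);
        for (byte index : setBitPositions(packed)) {
            System.out.println("appears in field " + index);
        }
    }
}
```

A single byte per stem limits this representation to eight distinct fields, which is the trade-off for keeping the array compact.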

Constructor Detail

PreprocessingContext.AllStems

public PreprocessingContext.AllStems()
Method Detail

toString

public String toString()
For debugging purposes.

Overrides:
toString in class Object


Copyright (c) Dawid Weiss, Stanislaw Osinski