org.carrot2.text.preprocessing
Class PreprocessingContext.AllWords

java.lang.Object
  extended by org.carrot2.text.preprocessing.PreprocessingContext.AllWords
Enclosing class:
PreprocessingContext

public class PreprocessingContext.AllWords
extends Object

Information about all unique words found in the input PreprocessingContext.documents. An entry in each parallel array corresponds to one conflated form of a word. For example, data and DATA will most likely become a single entry in the words table. However, different grammatical forms of a single lemma (like computer and computers) will have different entries in the words table. See PreprocessingContext.AllStems for inflection-conflated versions.

All arrays in this class have the same length, and elements at the same index in different arrays describe the same word.
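The parallel-array layout described above can be illustrated with a minimal sketch. ParallelArraysSketch is a hypothetical stand-in, not the Carrot2 class, and the values in its IMAGE and TF arrays are made up for illustration; only the idea of reading one word by using the same index in every array comes from the description.

```java
/** Hypothetical stand-in for a parallel-array word table; not the Carrot2 class. */
class ParallelArraysSketch {
    // Two parallel arrays with made-up values: entry i of each describes word i.
    static final char[][] IMAGE = {
        "data".toCharArray(), "computer".toCharArray(), "computers".toCharArray()
    };
    static final int[] TF = {5, 3, 2};

    /** Reads the values for one word by using the same index in every array. */
    static String describe(int wordIndex) {
        return new String(IMAGE[wordIndex]) + " (tf=" + TF[wordIndex] + ")";
    }

    public static void main(String[] args) {
        for (int i = 0; i < IMAGE.length; i++) {
            System.out.println(i + ": " + describe(i));
        }
    }
}
```

Note that computer and computers occupy separate entries here, as the description states for different grammatical forms.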


Field Summary
 byte[] fieldIndices
          Bit-packed indices of all fields in which this word appears at least once.
 char[][] image
          The most frequently appearing variant of the word with respect to case.
 int[] stemIndex
          A pointer to the PreprocessingContext.AllStems arrays for this word.
 int[] tf
          Term Frequency of the word, aggregated across all variants with respect to case.
 int[][] tfByDocument
          Term Frequency of the word for each document.
 short[] type
          Token type of this word copied from PreprocessingContext.AllTokens.type.
 
Constructor Summary
PreprocessingContext.AllWords()
           
 
Method Summary
 String toString()
          For debugging purposes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

image

public char[][] image
The most frequently appearing variant of the word with respect to case. E.g. if a token MacOS appeared 12 times in the input and macos appeared 3 times, the image will be equal to MacOS.

This array is produced by CaseNormalizer.
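
The selection rule in the MacOS example can be sketched as follows. This is not the CaseNormalizer implementation: the class ImageSelectionSketch, the method mostFrequentVariant, and the tie-breaking rule (the first variant seen wins) are assumptions made for illustration. The input is assumed to contain only case variants of a single word.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of picking the most frequent case variant. */
class ImageSelectionSketch {
    /**
     * Returns the exact spelling that occurs most often among the tokens.
     * On ties the variant seen first wins (an assumption, not documented behavior).
     */
    static String mostFrequentVariant(String[] tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // 12 occurrences of "MacOS" and 3 of "macos", as in the example above.
        String[] tokens = new String[15];
        for (int i = 0; i < 15; i++) {
            tokens[i] = i < 12 ? "MacOS" : "macos";
        }
        System.out.println(mostFrequentVariant(tokens)); // prints MacOS
    }
}
```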


type

public short[] type
Token type of this word copied from PreprocessingContext.AllTokens.type. Additional flags are set for each word by CaseNormalizer and LanguageModelStemmer.

This array is produced by CaseNormalizer. This array is modified by LanguageModelStemmer.

See Also:
ITokenizer

tf

public int[] tf
Term Frequency of the word, aggregated across all variants with respect to case. Frequencies for each variant separately are not available.

This array is produced by CaseNormalizer.


tfByDocument

public int[][] tfByDocument
Term Frequency of the word for each document. The length of each word's array is equal to the number of documents the word appeared in (Document Frequency) multiplied by 2. Elements at even indices contain document indices pointing to PreprocessingContext.documents; elements at odd indices contain the frequency of the word in that document. For example, an array with 4 values: [2, 15, 138, 7] means that the word appeared 15 times in the document at index 2 and 7 times in the document at index 138.

This array is produced by CaseNormalizer.
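
The interleaved layout can be decoded as in the sketch below. The class TfByDocumentSketch and its methods are hypothetical helpers written for this example and are not part of Carrot2; they operate on the array for a single word, i.e. one row of tfByDocument.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helpers for reading one word's tfByDocument row. */
class TfByDocumentSketch {
    /** Document Frequency: the row holds two ints per document. */
    static int documentFrequency(int[] row) {
        return row.length / 2;
    }

    /** Decodes [docIndex, tf, docIndex, tf, ...] into docIndex -> tf. */
    static Map<Integer, Integer> decode(int[] row) {
        Map<Integer, Integer> tfByDoc = new LinkedHashMap<>();
        for (int i = 0; i < row.length; i += 2) {
            tfByDoc.put(row[i], row[i + 1]); // even slot: document, odd slot: frequency
        }
        return tfByDoc;
    }

    public static void main(String[] args) {
        int[] row = {2, 15, 138, 7}; // the example from the description
        System.out.println(documentFrequency(row)); // prints 2
        System.out.println(decode(row));            // prints {2=15, 138=7}
    }
}
```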


stemIndex

public int[] stemIndex
A pointer to the PreprocessingContext.AllStems arrays for this word.

This array is produced by LanguageModelStemmer.
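
As an illustration of how such a pointer can be followed, the sketch below sums word frequencies per stem. The class StemIndexSketch, the method tfPerStem, and all values are hypothetical; whether the PreprocessingContext.AllStems frequencies are actually computed this way is not stated here.

```java
import java.util.Arrays;

/** Hypothetical illustration of using stemIndex as a word-to-stem pointer. */
class StemIndexSketch {
    /** Adds each word's frequency to the slot of the stem it points to. */
    static int[] tfPerStem(int[] wordTf, int[] stemIndex, int stemCount) {
        int[] stemTf = new int[stemCount];
        for (int word = 0; word < wordTf.length; word++) {
            stemTf[stemIndex[word]] += wordTf[word];
        }
        return stemTf;
    }

    public static void main(String[] args) {
        // Made-up words: data (tf 5), computer (tf 3), computers (tf 2).
        int[] wordTf = {5, 3, 2};
        // computer and computers are assumed to share stem 1.
        int[] stemIndex = {0, 1, 1};
        System.out.println(Arrays.toString(tfPerStem(wordTf, stemIndex, 2))); // prints [5, 5]
    }
}
```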


fieldIndices

public byte[] fieldIndices
Bit-packed indices of all fields in which this word appears at least once. Indexes (positions) of the set bits are pointers to the PreprocessingContext.AllFields arrays. Fast conversion between the bit-packed representation and a byte[] of index values is done by PreprocessingContext.toFieldIndexes(byte).

This array is produced by CaseNormalizer.
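
The bit-packed representation can be unpacked by hand as in the sketch below. It is not the implementation of PreprocessingContext.toFieldIndexes(byte): the class FieldIndicesSketch and the method setBitPositions are hypothetical, and the sketch assumes that the least significant bit corresponds to field index 0.

```java
import java.util.Arrays;

/** Hypothetical illustration of unpacking a bit-packed field set. */
class FieldIndicesSketch {
    /** Returns the positions of the set bits, lowest bit first (bit 0 = field 0, an assumption). */
    static int[] setBitPositions(byte packed) {
        int bits = packed & 0xFF; // treat the byte as unsigned
        int[] positions = new int[Integer.bitCount(bits)];
        int next = 0;
        for (int bit = 0; bit < 8; bit++) {
            if ((bits & (1 << bit)) != 0) {
                positions[next++] = bit;
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Bits 0 and 2 set: the word appears in fields 0 and 2.
        System.out.println(Arrays.toString(setBitPositions((byte) 0b0000_0101))); // prints [0, 2]
    }
}
```

A single byte per word limits this layout to at most 8 distinct fields.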

Constructor Detail

PreprocessingContext.AllWords

public PreprocessingContext.AllWords()
Method Detail

toString

public String toString()
For debugging purposes.

Overrides:
toString in class Object


Copyright (c) Dawid Weiss, Stanislaw Osinski