org.carrot2.text.preprocessing
Class PreprocessingContext.AllTokens

java.lang.Object
  extended by org.carrot2.text.preprocessing.PreprocessingContext.AllTokens
Enclosing class:
PreprocessingContext

public class PreprocessingContext.AllTokens
extends Object

Information about all tokens of the input PreprocessingContext.documents. Each element of each array corresponds to one individual token from the input or to a synthetic separator inserted between documents, fields and sentences. The last element of each array is a special terminator entry.

All arrays in this class have the same length, and values at the same index in different arrays describe the same token.
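
The parallel-array layout can be illustrated with a small standalone sketch. It does not use the Carrot2 classes: the arrays are hand-built stand-ins for image and documentIndex, the class and method names are hypothetical, and the -1 document index on the terminator entry is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Carrot2: walks two parallel token arrays.
class ParallelTokenArrays {
    /** Collects the images of all tokens that belong to the given document. */
    public static List<String> tokensOfDocument(char[][] image, int[] documentIndex, int document) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < image.length; i++) {
            // The same index i addresses the same token in every array.
            // Entries with a null image (separators, terminator) are skipped.
            if (image[i] != null && documentIndex[i] == document) {
                result.add(new String(image[i]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in data: two single-field documents, "data mining" and "web mining",
        // followed by a document separator and a terminator respectively.
        char[][] image = {
            "data".toCharArray(), "mining".toCharArray(), null,
            "web".toCharArray(), "mining".toCharArray(), null
        };
        int[] documentIndex = {0, 0, -1, 1, 1, -1};
        System.out.println(tokensOfDocument(image, documentIndex, 1));
    }
}
```

In real use the arrays would come from a populated PreprocessingContext rather than being built by hand.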


Field Summary
 int[] documentIndex
          Index of the document this token came from, points to elements of PreprocessingContext.documents.
 byte[] fieldIndex
          Document field the token came from.
 char[][] image
          Token image as it appears in the input.
 int[] lcp
          The Longest Common Prefix for the adjacent suffix-sorted token sequences.
 int[] suffixOrder
          The suffix order of tokens.
 short[] type
          Token's ITokenizer bit flags.
 int[] wordIndex
          A pointer to PreprocessingContext.AllWords arrays for this token.
 
Constructor Summary
PreprocessingContext.AllTokens()
           
 
Method Summary
 String toString()
          For debugging purposes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

image

public char[][] image
Token image as it appears in the input. At positions where type is equal to one of ITokenizer.TF_TERMINATOR, ITokenizer.TF_SEPARATOR_DOCUMENT or ITokenizer.TF_SEPARATOR_FIELD, image is null.

This array is produced by Tokenizer.


type

public short[] type
Token's ITokenizer bit flags.

This array is produced by Tokenizer.


fieldIndex

public byte[] fieldIndex
Document field the token came from. The index points to arrays in PreprocessingContext.AllFields and is equal to -1 for document and field separators.

This array is produced by Tokenizer.


documentIndex

public int[] documentIndex
Index of the document this token came from, points to elements of PreprocessingContext.documents. Equal to -1 for document separators.

This array is produced by Tokenizer.

This array is accessed in CaseNormalizer and PhraseExtractor to compute by-document statistics, e.g. tf-by-document, which are then needed to build a VSM or assign documents to labels. An alternative to this representation would be to create an AllDocuments holder, keep there an array of start token indexes for each document, and refactor the model building code to do a binary search to determine the document index for a given token index. This is likely to be a significant performance hit because the model building code accesses the documentIndex array pretty much randomly (in the suffix order), so we'd be doing twice-the-number-of-tokens binary searches, unless there's some other data structure that can help us here.
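
For comparison, the binary-search alternative discussed above might look roughly like the sketch below. This is only an illustration of the trade-off: no AllDocuments holder exists in this class, and the documentStart array and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of the alternative representation; not part of Carrot2.
class DocumentStartLookup {
    /**
     * Returns the index of the document containing tokenIndex, given the
     * ascending start token index of each document, or -1 if tokenIndex
     * precedes the first document.
     */
    public static int documentOf(int[] documentStart, int tokenIndex) {
        int pos = Arrays.binarySearch(documentStart, tokenIndex);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the containing
        // document is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        int[] documentStart = {0, 3, 7}; // three documents
        System.out.println(documentOf(documentStart, 5));
    }
}
```

Each lookup costs O(log d) for d documents, whereas the documentIndex array answers the same question with a single array read. Note also that this scheme assigns separator positions to the preceding document instead of reporting -1.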


wordIndex

public int[] wordIndex
A pointer to PreprocessingContext.AllWords arrays for this token. Equal to -1 for document separators, field separators and ITokenizer.TT_PUNCTUATION tokens (including sentence separators).

This array is produced by CaseNormalizer.
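
As a rough illustration of how wordIndex and documentIndex combine into by-document statistics, the sketch below counts term frequencies per document. It is not the library's code: the class and method names are hypothetical, the arrays are passed in directly, and a dense matrix is used only for simplicity.

```java
// Hypothetical sketch, not part of Carrot2: term frequency by document
// computed from two parallel token arrays.
class TfByDocument {
    /** Returns tf[word][document], skipping entries where either index is -1. */
    public static int[][] count(int[] wordIndex, int[] documentIndex,
                                int wordCount, int documentCount) {
        int[][] tf = new int[wordCount][documentCount];
        for (int i = 0; i < wordIndex.length; i++) {
            // -1 marks separators and punctuation, which carry no word.
            if (wordIndex[i] >= 0 && documentIndex[i] >= 0) {
                tf[wordIndex[i]][documentIndex[i]]++;
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        // "data mining" | "web mining", with words data=0, mining=1, web=2.
        int[] wordIndex = {0, 1, -1, 2, 1, -1};
        int[] documentIndex = {0, 0, -1, 1, 1, -1};
        int[][] tf = count(wordIndex, documentIndex, 3, 2);
        System.out.println("tf(mining, doc 1) = " + tf[1][1]);
    }
}
```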


suffixOrder

public int[] suffixOrder
The suffix order of tokens. Suffixes starting with a separator come at the end of the array.

This array is produced by PhraseExtractor.


lcp

public int[] lcp
The Longest Common Prefix for the adjacent suffix-sorted token sequences.

This array is produced by PhraseExtractor.
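
To make the relationship between suffixOrder and lcp concrete, here is a naive standalone sketch that sorts token suffixes and measures the common prefix of neighbouring suffixes. It is not PhraseExtractor's algorithm. Several conventions are assumptions made for the example: tokens are represented by integer word codes, negative codes stand for separators, lcp[i] compares the suffixes at positions i and i + 1 of the order, and a common prefix never extends across a separator.

```java
import java.util.Arrays;

// Hypothetical, naive illustration; not part of Carrot2.
class NaiveSuffixOrder {
    // Separator codes (negative) are mapped above every word code so that
    // suffixes starting with a separator sort to the end.
    private static int key(int code) {
        return code < 0 ? Integer.MAX_VALUE : code;
    }

    private static int compareSuffixes(int[] codes, int a, int b) {
        while (a < codes.length && b < codes.length) {
            int cmp = Integer.compare(key(codes[a]), key(codes[b]));
            if (cmp != 0) return cmp;
            a++;
            b++;
        }
        // One suffix is a prefix of the other: the shorter one sorts first.
        return Integer.compare(b, a);
    }

    /** Returns suffix start positions in sorted order. */
    public static int[] suffixOrder(int[] codes) {
        Integer[] order = new Integer[codes.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> compareSuffixes(codes, a, b));
        int[] result = new int[order.length];
        for (int i = 0; i < result.length; i++) result[i] = order[i];
        return result;
    }

    /** lcp[i] is the common prefix length of suffixes order[i] and order[i + 1]. */
    public static int[] lcp(int[] codes, int[] order) {
        int[] lcp = new int[order.length];
        for (int i = 0; i + 1 < order.length; i++) {
            int a = order[i], b = order[i + 1], n = 0;
            while (a + n < codes.length && b + n < codes.length
                    && codes[a + n] >= 0 && codes[a + n] == codes[b + n]) {
                n++;
            }
            lcp[i] = n;
        }
        return lcp;
    }

    public static void main(String[] args) {
        // "data mining" | "web mining", with data=0, mining=1, web=2, separator=-1.
        int[] codes = {0, 1, -1, 2, 1, -1};
        int[] order = suffixOrder(codes);
        System.out.println(Arrays.toString(order));
        System.out.println(Arrays.toString(lcp(codes, order)));
    }
}
```

Adjacent suffixes with a non-zero lcp value share a leading run of words, which is what makes the two arrays useful for finding repeated phrases. The comparison-based sort here is quadratic in the worst case and is meant only to show what the arrays contain.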

Constructor Detail

PreprocessingContext.AllTokens

public PreprocessingContext.AllTokens()
Method Detail

toString

public String toString()
For debugging purposes.

Overrides:
toString in class Object


Copyright (c) Dawid Weiss, Stanislaw Osinski