org.carrot2.text.analysis
Interface ITokenizer

All Known Implementing Classes:
ChineseTokenizerAdapter, ExtendedWhitespaceTokenizer, ThaiTokenizerAdapter

public interface ITokenizer

Splits input characters into tokens such as words, numbers, acronyms and punctuation. The following information is available for each token:

token type
The kind of token: number, URI, punctuation, acronym and others. See the constants declared in this interface with the TT_ prefix, e.g. TT_TERM.
token flags
Additional token flags such as an indication whether a punctuation token is a sentence delimiter (TF_SEPARATOR_SENTENCE).
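
The split between token type and token flags can be illustrated with a few bit operations. This is only a sketch: it assumes the type sits in the low bits selected by TYPE_MASK and that each TF_ flag is a separate higher bit. The numeric values and the helper names typeOf and hasFlag are hypothetical stand-ins; the actual values are listed under Constant Field Values.

```java
public class TokenCodeSketch {
    // Hypothetical stand-in values, NOT the library's actual constants.
    static final int TYPE_MASK = 0x000F;
    static final int TT_TERM = 0x0001;
    static final int TT_PUNCTUATION = 0x0003;
    static final short TF_SEPARATOR_SENTENCE = 0x0100;

    /** Extracts the token type by masking off everything outside TYPE_MASK. */
    static int typeOf(short code) {
        return code & TYPE_MASK;
    }

    /** Checks whether the given flag bit is set in the token code. */
    static boolean hasFlag(short code, short flag) {
        return (code & flag) != 0;
    }

    public static void main(String[] args) {
        // A sentence-ending period: punctuation type plus the sentence separator flag.
        short period = (short) (TT_PUNCTUATION | TF_SEPARATOR_SENTENCE);
        short word = (short) TT_TERM;

        System.out.println(typeOf(period) == TT_PUNCTUATION);
        System.out.println(hasFlag(period, TF_SEPARATOR_SENTENCE));
        System.out.println(hasFlag(word, TF_SEPARATOR_SENTENCE));
    }
}
```

Comparing a raw token code directly against a TT_ constant would fail for any token that also carries a flag, which is why the sketch masks the code first.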

See Also:
TokenTypeUtils

Field Summary
static short TF_COMMON_WORD
          The current token is a common word.
static short TF_QUERY_WORD
          The current token is part of the query.
static short TF_SEPARATOR_DOCUMENT
          Current token is a document separator (never returned from parsing).
static short TF_SEPARATOR_FIELD
          Current token separates a document's logical fields.
static short TF_SEPARATOR_SENTENCE
          Current token is a sentence separator.
static short TF_TERMINATOR
          Current token terminates the input (never returned from parsing).
static int TT_ACRONYM
           
static int TT_BARE_URL
           
static int TT_EMAIL
           
static int TT_EOF
          Indicates the end of the token stream.
static int TT_FILE
           
static int TT_FULL_URL
           
static int TT_HYPHTERM
           
static int TT_NUMERIC
           
static int TT_PUNCTUATION
           
static int TT_TERM
           
static int TYPE_MASK
           
 
Method Summary
 short nextToken()
          Returns the next token from the input stream.
 void reset(Reader reader)
          Resets the tokenizer to process new data.
 void setTermBuffer(MutableCharArray array)
          Stores the current token image in the provided buffer.
 

Field Detail

TYPE_MASK

static final int TYPE_MASK
See Also:
Constant Field Values

TT_TERM

static final int TT_TERM
See Also:
Constant Field Values

TT_NUMERIC

static final int TT_NUMERIC
See Also:
Constant Field Values

TT_PUNCTUATION

static final int TT_PUNCTUATION
See Also:
Constant Field Values

TT_EMAIL

static final int TT_EMAIL
See Also:
Constant Field Values

TT_ACRONYM

static final int TT_ACRONYM
See Also:
Constant Field Values

TT_FULL_URL

static final int TT_FULL_URL
See Also:
Constant Field Values

TT_BARE_URL

static final int TT_BARE_URL
See Also:
Constant Field Values

TT_FILE

static final int TT_FILE
See Also:
Constant Field Values

TT_HYPHTERM

static final int TT_HYPHTERM
See Also:
Constant Field Values

TT_EOF

static final int TT_EOF
Indicates the end of the token stream.

See Also:
Constant Field Values

TF_SEPARATOR_SENTENCE

static final short TF_SEPARATOR_SENTENCE
Current token is a sentence separator.

See Also:
Constant Field Values

TF_SEPARATOR_DOCUMENT

static final short TF_SEPARATOR_DOCUMENT
Current token is a document separator (never returned from parsing).

See Also:
Constant Field Values

TF_SEPARATOR_FIELD

static final short TF_SEPARATOR_FIELD
Current token separates a document's logical fields.

See Also:
Constant Field Values

TF_TERMINATOR

static final short TF_TERMINATOR
Current token terminates the input (never returned from parsing).

See Also:
Constant Field Values

TF_COMMON_WORD

static final short TF_COMMON_WORD
The current token is a common word. This flag is not directly available from the tokenizer.

See Also:
PreprocessingContext.AllWords.type, StopListMarker, Constant Field Values

TF_QUERY_WORD

static final short TF_QUERY_WORD
The current token is part of the query. This flag is not directly available from the tokenizer.

See Also:
PreprocessingContext.AllWords.type, LanguageModelStemmer, Constant Field Values
Method Detail

reset

void reset(Reader reader)
           throws IOException
Resets the tokenizer to process new data.

Parameters:
reader - the input to tokenize. The reader will not be closed by the tokenizer when the end of stream is reached.
Throws:
IOException

nextToken

short nextToken()
                throws IOException
Returns the next token from the input stream.

Returns:
the type of the token, as defined by TT_TERM and the other TT_ constants, or TT_EOF when the end of the data stream has been reached.
Throws:
IOException
See Also:
TokenTypeUtils

setTermBuffer

void setTermBuffer(MutableCharArray array)
Stores the current token image in the provided buffer.

Parameters:
array - buffer in which the current token image should be stored
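
The three methods are typically combined into a reset, nextToken, setTermBuffer loop. The sketch below shows that call pattern without depending on the library: SketchTokenizer is a hypothetical local interface that mirrors the method signatures above, StringBuilder stands in for MutableCharArray, WhitespaceSketch is a toy implementation, and the constant values are made up. It illustrates the calling order only, not how the real implementations tokenize.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class TokenLoopSketch {
    // Hypothetical stand-in values, NOT the library's actual constants.
    static final short TT_TERM = 1;
    static final short TT_EOF = -1;

    /** Local stand-in mirroring the three ITokenizer methods. */
    interface SketchTokenizer {
        void reset(Reader reader) throws IOException;
        short nextToken() throws IOException;
        void setTermBuffer(StringBuilder buffer); // StringBuilder replaces MutableCharArray
    }

    /** Toy implementation: every whitespace-delimited run is a TT_TERM token. */
    static class WhitespaceSketch implements SketchTokenizer {
        private Reader reader;
        private final StringBuilder image = new StringBuilder();

        public void reset(Reader reader) {
            this.reader = reader;
            image.setLength(0);
        }

        public short nextToken() throws IOException {
            image.setLength(0);
            int c;
            // Skip leading whitespace.
            while ((c = reader.read()) != -1 && Character.isWhitespace(c)) { }
            // Collect characters up to the next whitespace or end of input.
            while (c != -1 && !Character.isWhitespace(c)) {
                image.append((char) c);
                c = reader.read();
            }
            return image.length() == 0 ? TT_EOF : TT_TERM;
        }

        public void setTermBuffer(StringBuilder buffer) {
            // Copy the current token image into the caller's buffer.
            buffer.setLength(0);
            buffer.append(image);
        }
    }

    /** Runs the reset / nextToken / setTermBuffer loop and collects token images. */
    static List<String> tokenize(SketchTokenizer tokenizer, String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        // The tokenizer does not close the reader, so the caller does.
        try (Reader reader = new StringReader(text)) {
            tokenizer.reset(reader);
            while (tokenizer.nextToken() != TT_EOF) {
                tokenizer.setTermBuffer(buffer);
                tokens.add(buffer.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(new WhitespaceSketch(), "data mining  rocks"));
    }
}
```

The sketch reuses one buffer across iterations and copies each image out with toString() before the next call to nextToken(), since the buffer contents are only meaningful for the current token.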


Copyright (c) Dawid Weiss, Stanislaw Osinski