Carrot2 v3.5.2 API Documentation
public interface ITokenizer
Splits input characters into tokens representing e.g. words, digits, acronyms, and punctuation. For each token, the following information is available:
- token type: one of the constants with the TT_ prefix, e.g. TT_TERM.
- token flags: constants with the TF_ prefix, e.g. TF_SEPARATOR_SENTENCE.
See also: TokenTypeUtils

| Field Summary | |
|---|---|
| static short | TF_COMMON_WORD: The current token is a common word. |
| static short | TF_QUERY_WORD: The current token is part of the query. |
| static short | TF_SEPARATOR_DOCUMENT: Current token is a document separator (never returned from parsing). |
| static short | TF_SEPARATOR_FIELD: Current token separates document's logical fields. |
| static short | TF_SEPARATOR_SENTENCE: Current token is a sentence separator. |
| static short | TF_TERMINATOR: Current token terminates the input (never returned from parsing). |
| static int | TT_ACRONYM |
| static int | TT_BARE_URL |
| static int | TT_EMAIL |
| static int | TT_EOF: Indicates the end of the token stream. |
| static int | TT_FILE |
| static int | TT_FULL_URL |
| static int | TT_HYPHTERM |
| static int | TT_NUMERIC |
| static int | TT_PUNCTUATION |
| static int | TT_TERM |
| static int | TYPE_MASK |
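
The following sketch illustrates one plausible way the TT_ type constants and TF_ flag constants could be combined in a single short value and separated again with TYPE_MASK. It is an assumption based only on the constant names and on the short return type of nextToken(); this page does not state the encoding. The numeric values below are hypothetical stand-ins, not the real values from ITokenizer, and the class TokenTypeSketch and its helper methods are invented for illustration.

```java
/** Illustrative only: shows type/flag packing with hypothetical constant values. */
class TokenTypeSketch {
    // Hypothetical stand-in values; consult ITokenizer for the real constants.
    static final int TYPE_MASK = 0x000F;
    static final int TT_TERM = 0x0001;
    static final int TT_PUNCTUATION = 0x0003;
    static final short TF_SEPARATOR_SENTENCE = 0x0100;
    static final short TF_COMMON_WORD = 0x0200;

    /** Extracts the token type by masking away the flag bits. */
    static int typeOf(short token) {
        return token & TYPE_MASK;
    }

    /** Checks whether the given flag bit is set on the token. */
    static boolean hasFlag(short token, short flag) {
        return (token & flag) != 0;
    }

    public static void main(String[] args) {
        // A sentence-ending period: punctuation type plus the sentence separator flag.
        short period = (short) (TT_PUNCTUATION | TF_SEPARATOR_SENTENCE);
        System.out.println(typeOf(period) == TT_PUNCTUATION);
        System.out.println(hasFlag(period, TF_SEPARATOR_SENTENCE));
        System.out.println(hasFlag(period, TF_COMMON_WORD));
    }
}
```

With real Carrot2 code, the TokenTypeUtils class referenced on this page is the more likely place to look for such helpers.
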
| Method Summary | |
|---|---|
| short | nextToken(): Returns the next token from the input stream. |
| void | reset(Reader reader): Resets the tokenizer to process new data. |
| void | setTermBuffer(MutableCharArray array): Sets the current token image to the provided buffer. |

| Field Detail |
|---|
static final int TYPE_MASK
static final int TT_TERM
static final int TT_NUMERIC
static final int TT_PUNCTUATION
static final int TT_EMAIL
static final int TT_ACRONYM
static final int TT_FULL_URL
static final int TT_BARE_URL
static final int TT_FILE
static final int TT_HYPHTERM
static final int TT_EOF
static final short TF_SEPARATOR_SENTENCE
static final short TF_SEPARATOR_DOCUMENT
static final short TF_SEPARATOR_FIELD
static final short TF_TERMINATOR
static final short TF_COMMON_WORD
See also: PreprocessingContext.AllWords.type, StopListMarker, Constant Field Values
static final short TF_QUERY_WORD
See also: PreprocessingContext.AllWords.type, LanguageModelStemmer, Constant Field Values

| Method Detail |
|---|
void reset(Reader reader) throws IOException
Parameters: reader - the input to tokenize. The reader will not be closed by the tokenizer when the end of stream is reached.
Throws: IOException

short nextToken() throws IOException
Returns: the token type, given by TT_TERM and other constants, or TT_EOF when the end of the data stream has been reached.
Throws: IOException
See also: TokenTypeUtils

void setTermBuffer(MutableCharArray array)
Parameters: array - buffer in which the current token image should be stored
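
The three methods suggest a simple consumption loop: reset the tokenizer on a Reader, call nextToken() until it reports TT_EOF, and after each token ask for its image via setTermBuffer. The sketch below is self-contained and does not use Carrot2 classes. SketchTokenizer is a hypothetical stand-in for ITokenizer, StringBuilder stands in for MutableCharArray, the constant values are assumed, and WhitespaceTokenizer is a toy implementation invented for this example. The choice to call setTermBuffer after nextToken() is one reading of the parameter description above, not confirmed behaviour.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in mirroring ITokenizer's three methods; not the Carrot2 type. */
interface SketchTokenizer {
    int TT_TERM = 1;  // assumed value
    int TT_EOF = -1;  // assumed value

    void reset(Reader reader) throws IOException;

    short nextToken() throws IOException;

    void setTermBuffer(StringBuilder buffer); // StringBuilder replaces MutableCharArray here
}

/** Toy implementation: whitespace-delimited tokens, all reported as TT_TERM. */
class WhitespaceTokenizer implements SketchTokenizer {
    private Reader reader;
    private final StringBuilder current = new StringBuilder();

    public void reset(Reader reader) {
        this.reader = reader; // the reader is never closed by this tokenizer
    }

    public short nextToken() throws IOException {
        current.setLength(0);
        int c;
        // Skip leading whitespace.
        while ((c = reader.read()) != -1 && Character.isWhitespace(c)) { }
        if (c == -1) {
            return (short) TT_EOF;
        }
        // Accumulate characters until whitespace or end of input.
        do {
            current.append((char) c);
        } while ((c = reader.read()) != -1 && !Character.isWhitespace(c));
        return (short) TT_TERM;
    }

    public void setTermBuffer(StringBuilder buffer) {
        // Copy the current token image into the caller's buffer.
        buffer.setLength(0);
        buffer.append(current);
    }
}

class TokenLoopSketch {
    /** Collects the image of every token until TT_EOF is returned. */
    static List<String> tokenize(SketchTokenizer tokenizer, String text) throws IOException {
        List<String> images = new ArrayList<>();
        StringBuilder image = new StringBuilder();
        tokenizer.reset(new StringReader(text));
        while (tokenizer.nextToken() != SketchTokenizer.TT_EOF) {
            tokenizer.setTermBuffer(image);
            images.add(image.toString());
        }
        return images;
    }

    public static void main(String[] args) throws IOException {
        // prints [data, mining, rocks]
        System.out.println(tokenize(new WhitespaceTokenizer(), "data mining  rocks"));
    }
}
```

Reusing a single buffer across iterations avoids allocating a new object per token, which is presumably the motivation for passing a mutable buffer into setTermBuffer rather than returning a String.
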

Please refer to project documentation at http://project.carrot2.org