org.carrot2.text.analysis
Class ExtendedWhitespaceTokenizer

java.lang.Object
  extended by org.carrot2.text.analysis.ExtendedWhitespaceTokenizer
All Implemented Interfaces:
ITokenizer

public final class ExtendedWhitespaceTokenizer
extends Object
implements ITokenizer

A tokenizer that splits input characters on whitespace but can also extract more complex tokens, such as URLs, e-mail addresses, and sentence delimiters.


Field Summary
 
Fields inherited from interface org.carrot2.text.analysis.ITokenizer
TF_COMMON_WORD, TF_QUERY_WORD, TF_SEPARATOR_DOCUMENT, TF_SEPARATOR_FIELD, TF_SEPARATOR_SENTENCE, TF_TERMINATOR, TT_ACRONYM, TT_BARE_URL, TT_EMAIL, TT_EOF, TT_FILE, TT_FULL_URL, TT_HYPHTERM, TT_NUMERIC, TT_PUNCTUATION, TT_TERM, TYPE_MASK
 
Constructor Summary
ExtendedWhitespaceTokenizer()

Method Summary
 short nextToken()
          Returns the next token from the input stream.
 void reset(Reader input)
          Resets this tokenizer to start parsing another stream.
 void setTermBuffer(MutableCharArray array)
          Sets the current token image to the provided buffer.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ExtendedWhitespaceTokenizer

public ExtendedWhitespaceTokenizer()

Method Detail

reset

public void reset(Reader input)
Resets this tokenizer to start parsing another stream.

Specified by:
reset in interface ITokenizer
Parameters:
input - the input to tokenize. The reader will not be closed by the tokenizer when the end of stream is reached.

nextToken

public short nextToken()
                throws IOException
Description copied from interface: ITokenizer
Returns the next token from the input stream.

Specified by:
nextToken in interface ITokenizer
Returns:
the type of the token, as defined by ITokenizer.TT_TERM and the other ITokenizer.TT_* constants, or ITokenizer.TT_EOF when the end of the data stream has been reached.
Throws:
IOException
See Also:
TokenTypeUtils
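
The constant names inherited from ITokenizer (TYPE_MASK, the TT_* types and the TF_* flags) suggest that the short returned by nextToken() packs a token type together with optional flag bits. The sketch below illustrates that reading only. It is an assumption, not something this page confirms: the numeric values and the helper names maskType and isSentenceDelimiter are made up for the example, and real code should rely on ITokenizer and TokenTypeUtils instead.

```java
public class TokenTypeSketch {
    // Hypothetical values, chosen only for illustration; the real constants
    // are defined in ITokenizer and may differ.
    static final short TYPE_MASK = 0x000F;
    static final short TT_TERM = 0x0001;
    static final short TT_PUNCTUATION = 0x0003;
    static final short TF_SEPARATOR_SENTENCE = 0x0400;

    /** Keeps only the (assumed) type bits of a raw token code. */
    static int maskType(short rawType) {
        return rawType & TYPE_MASK;
    }

    /** Tests whether the (assumed) sentence-separator flag bit is set. */
    static boolean isSentenceDelimiter(short rawType) {
        return (rawType & TF_SEPARATOR_SENTENCE) != 0;
    }

    public static void main(String[] args) {
        // A punctuation token that also ends a sentence, e.g. a full stop.
        short raw = (short) (TT_PUNCTUATION | TF_SEPARATOR_SENTENCE);
        System.out.println(maskType(raw) == TT_PUNCTUATION); // true
        System.out.println(isSentenceDelimiter(raw));        // true
        System.out.println(isSentenceDelimiter(TT_TERM));    // false
    }
}
```

Under this assumption, comparing the raw return value directly against a TT_* constant would miss tokens that carry a flag, which is why the type is masked first.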

setTermBuffer

public void setTermBuffer(MutableCharArray array)
Description copied from interface: ITokenizer
Sets the current token image to the provided buffer.

Specified by:
setTermBuffer in interface ITokenizer
Parameters:
array - buffer in which the current token image should be stored
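
The three methods above are typically used together: reset to attach a reader, nextToken in a loop until TT_EOF, and setTermBuffer to obtain the image of the current token. The following is a minimal, self-contained sketch of that calling pattern. It does not use the Carrot2 classes: SimpleTokenizer and WhitespaceOnlyTokenizer are hypothetical stand-ins, StringBuilder stands in for MutableCharArray, the constant values are invented, and the stand-in splits on whitespace only (no URL, e-mail or sentence detection).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class TokenLoopSketch {
    static final short TT_TERM = 1;  // hypothetical value
    static final short TT_EOF = -1;  // hypothetical value

    /** Hypothetical stand-in mirroring the three ITokenizer methods. */
    interface SimpleTokenizer {
        void reset(Reader input);
        short nextToken() throws IOException;
        void setTermBuffer(StringBuilder buffer);
    }

    /** Splits on whitespace only; every token is reported as TT_TERM. */
    static final class WhitespaceOnlyTokenizer implements SimpleTokenizer {
        private Reader input;
        private final StringBuilder image = new StringBuilder();

        public void reset(Reader input) {
            this.input = input;
            image.setLength(0);
        }

        public short nextToken() throws IOException {
            image.setLength(0);
            int c = input.read();
            // Skip leading whitespace.
            while (c != -1 && Character.isWhitespace(c)) c = input.read();
            if (c == -1) return TT_EOF;
            // Collect characters up to the next whitespace or end of stream.
            while (c != -1 && !Character.isWhitespace(c)) {
                image.append((char) c);
                c = input.read();
            }
            return TT_TERM;
        }

        public void setTermBuffer(StringBuilder buffer) {
            // Copy the current token image into the caller's buffer.
            buffer.setLength(0);
            buffer.append(image);
        }
    }

    static List<String> tokenize(String text) {
        SimpleTokenizer tokenizer = new WhitespaceOnlyTokenizer();
        StringBuilder buffer = new StringBuilder();
        List<String> tokens = new ArrayList<>();
        // The caller owns the reader: the tokenizer does not close it.
        try (Reader reader = new StringReader(text)) {
            tokenizer.reset(reader);
            while (tokenizer.nextToken() != TT_EOF) {
                tokenizer.setTermBuffer(buffer);
                tokens.add(buffer.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [see, http://example.com, now]
        System.out.println(tokenize("see  http://example.com now"));
    }
}
```

Because reset accepts a new reader, one tokenizer instance can be reused across inputs; the try-with-resources block reflects the note on reset that closing the reader is left to the caller.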


Copyright (c) Dawid Weiss, Stanislaw Osinski