org.carrot2.text.preprocessing
Class CaseNormalizer

java.lang.Object
  extended by org.carrot2.text.preprocessing.CaseNormalizer

public final class CaseNormalizer
extends Object

Performs case normalization and calculates a number of frequency statistics for words. Case normalization finds, for each word, the case variant that appears most frequently. For example, if MacOS appears 20 times in the input documents, Macos 5 times and macos 2 times, the case normalizer selects MacOS to represent all variants and assigns it the aggregated term frequency of 27.
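
The selection rule described above can be illustrated with a minimal, standalone sketch. It is not the Carrot2 implementation and does not use PreprocessingContext; the class CaseNormalizationSketch, its nested Normalized holder, and the tie-breaking rule (lexicographically smaller variant wins) are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of case normalization; not part of Carrot2. */
final class CaseNormalizationSketch {

    /** Hypothetical holder: the chosen case variant and the aggregated term frequency. */
    static final class Normalized {
        final String image;
        final int tf;

        Normalized(String image, int tf) {
            this.image = image;
            this.tf = tf;
        }
    }

    /** Maps each lower-cased word to its most frequent variant and summed frequency. */
    static Map<String, Normalized> normalize(List<String> tokens) {
        // Count every exact-case variant, grouped under its lower-cased key.
        Map<String, Map<String, Integer>> variantCounts = new HashMap<>();
        for (String token : tokens) {
            String key = token.toLowerCase(Locale.ROOT);
            variantCounts.computeIfAbsent(key, k -> new HashMap<>())
                         .merge(token, 1, Integer::sum);
        }

        Map<String, Normalized> result = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> group : variantCounts.entrySet()) {
            String best = null;
            int bestCount = -1;
            int total = 0;
            for (Map.Entry<String, Integer> variant : group.getValue().entrySet()) {
                int count = variant.getValue();
                total += count;
                // Assumption: ties go to the lexicographically smaller variant.
                if (count > bestCount
                        || (count == bestCount && variant.getKey().compareTo(best) < 0)) {
                    best = variant.getKey();
                    bestCount = count;
                }
            }
            result.put(group.getKey(), new Normalized(best, total));
        }
        return result;
    }
}
```

Given a token list containing MacOS 20 times, Macos 5 times and macos 2 times, the entry stored under the key macos carries the image MacOS and a term frequency of 27, matching the example in the class description.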

This class saves the following results to the PreprocessingContext:

This class requires that Tokenizer be invoked first.


Field Summary
 int dfThreshold
          Word Document Frequency threshold.
 
Constructor Summary
CaseNormalizer()
           
 
Method Summary
 void normalize(PreprocessingContext context)
          Performs normalization and saves the results to the context.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

dfThreshold

public int dfThreshold
Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
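
As a rough illustration of the threshold, the sketch below counts in how many documents each word occurs and keeps only the words that reach the threshold. It is a standalone approximation, not the Carrot2 code: the class DfThresholdSketch, the method wordsMeetingThreshold, and the choice to compare words case-insensitively are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of a document frequency threshold; not part of Carrot2. */
final class DfThresholdSketch {

    /** Returns the lower-cased words that occur in at least dfThreshold documents. */
    static Set<String> wordsMeetingThreshold(List<List<String>> documents, int dfThreshold) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> document : documents) {
            // Each word counts once per document, however often it repeats inside it.
            Set<String> seen = new HashSet<>();
            for (String token : document) {
                seen.add(token.toLowerCase(Locale.ROOT));
            }
            for (String word : seen) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }

        // Words appearing in fewer than dfThreshold documents are ignored.
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            if (entry.getValue() >= dfThreshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

With three documents containing (MacOS, macos, apple), (Macos, linux) and (apple), a threshold of 2 keeps apple and macos and drops linux, which occurs in only one document.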

Attribute label:
Word Document Frequency threshold
Attribute level:
Advanced
Attribute group:
Preprocessing

Constructor Detail

CaseNormalizer

public CaseNormalizer()

Method Detail

normalize

public void normalize(PreprocessingContext context)
Performs normalization and saves the results to the context.



Copyright (c) Dawid Weiss, Stanislaw Osinski