Carrot2 v3.4.0 API Documentation

Carrot2 is an Open Source Search Results Clustering Engine. It can automatically organize small collections of documents, such as search results, into thematic categories. See below for more information.


Carrot2 Core
org.carrot2.core Definitions of Carrot2 core interfaces and their implementations.
org.carrot2.core.attribute Attribute annotations for Carrot2 core interfaces.

 

Carrot2 Data Sources
org.carrot2.source Base classes for implementing Carrot2 document sources.
org.carrot2.source.ambient Serves documents from the Ambient test set.
org.carrot2.source.boss Fetches documents from the Yahoo BOSS API.
org.carrot2.source.etools Fetches documents from the eTools Metasearch Engine.
org.carrot2.source.google Fetches documents from a local instance of Google Desktop.
org.carrot2.source.lucene Fetches documents from a local Lucene index.
org.carrot2.source.microsoft Fetches documents from the Bing search engine using its publicly available API.
org.carrot2.source.opensearch Fetches documents from an OpenSearch-compliant search feed.
org.carrot2.source.pubmed Fetches documents from the PubMed medical abstracts database.
org.carrot2.source.solr Fetches documents from the Solr search engine.
org.carrot2.source.xml Fetches documents from XML files and streams.
org.carrot2.source.yahoo Fetches documents from the Yahoo Search APIs.

 

Carrot2 Clustering Algorithms
org.carrot2.clustering.lingo Implementation of the Lingo clustering algorithm.
org.carrot2.clustering.stc Implementation of the STC clustering algorithm.
org.carrot2.clustering.synthetic Synthetic clustering algorithms.

 

Carrot2 Results post-processing
org.carrot2.output.metrics Cluster quality metrics calculation utilities.

 

Carrot2 Text preprocessing utilities
org.carrot2.text.analysis Lexical analysis utilities.
org.carrot2.text.clustering Multilingual clustering utilities.
org.carrot2.text.linguistic Shallow linguistic processing utilities.
org.carrot2.text.preprocessing Contains the unified input preprocessing infrastructure (term indexing, stemming, label discovery).
org.carrot2.text.preprocessing.filter Text feature filtering utilities.
org.carrot2.text.preprocessing.pipeline Predefined preprocessing pipeline utilities.
org.carrot2.text.suffixtree Implementation of the suffix tree data structure.
org.carrot2.text.util Data structures for text preprocessing.
org.carrot2.text.vsm Vector Space Model utilities.

 

Carrot2 Attribute Binding
org.carrot2.util.attribute A framework for managing Carrot2 component attributes.
org.carrot2.util.attribute.constraint Constraints that can be imposed on attributes provided for Carrot2 components.
org.carrot2.util.attribute.metadata Human-readable information about components and their attributes.

 

Carrot2 Matrix utilities
org.carrot2.matrix NNI-backed implementation of Colt matrices.
org.carrot2.matrix.factorization Matrix factorization implementations.
org.carrot2.matrix.factorization.seeding Matrix seeding strategies.
org.carrot2.matrix.nni Native interfaces for matrix operations.

 

Carrot2 Utility classes
org.carrot2.util Common utility classes.
org.carrot2.util.httpclient Apache Commons HTTP client utilities.
org.carrot2.util.pool A very simple unbounded pool implementation.
org.carrot2.util.resource Resource location abstraction layer.
org.carrot2.util.simplexml Utilities for working with the Simple XML framework.
org.carrot2.util.xslt XSLT handling utilities.
org.carrot2.util.xsltfilter XSLT processor servlet filter.

 


Downloads & more information

Java API JAR, JavaDocs and example code
Other Carrot2 applications
User and Developer Manual
Instructions for Maven2 users
Carrot2 project website
Carrot2 on-line demo

Java API usage examples

You can use the Carrot2 Java API to fetch documents from various sources (public search engines, Lucene, Solr), perform clustering, serialize the results to JSON or XML, and much more. Below is example code for the most common use cases. Please see the examples/ directory in the Java API distribution archive for more examples.

Clustering documents from document sources

The most common way to use the Carrot2 Java API is to fetch a number of documents from an IDocumentSource and cluster them using an IClusteringAlgorithm. The general pattern for this kind of invocation is to put all input data required for processing (in this case, the query and the requested number of results) into a map and pass that map to a Controller, which performs all the processing. The code shown below retrieves 100 search results from BingDocumentSource and clusters them using the LingoClusteringAlgorithm.

        /* A controller to manage the processing pipeline. */
        Controller controller = ControllerFactory.createSimple();

        /* Input data for clustering, the query and number of results in this case. */
        Map<String, Object> attributes = new HashMap<String, Object>();
        attributes.put(AttributeNames.QUERY, "data mining");
        attributes.put(AttributeNames.RESULTS, 100);

        /* Perform processing */
        ProcessingResult result = controller.process(attributes,
            BingDocumentSource.class, LingoClusteringAlgorithm.class);
        
        /* Documents fetched from the document source, clusters created by Carrot2. */
        List<Document> documents = result.getDocuments();
        List<Cluster> clusters = result.getClusters();

Clustering arbitrary documents

You can also directly pass a list of Document instances for clustering:

        /* A few example documents. Normally you would need at least 20 for reasonable clusters. */
        final String [][] data = new String [] []
        {
            {
                "http://en.wikipedia.org/wiki/Data_mining",
                "Data mining - Wikipedia, the free encyclopedia",
                "Article about knowledge-discovery in databases (KDD), the practice of automatically searching large stores of data for patterns."
            },

            {
                "http://www.ccsu.edu/datamining/resources.html",
                "CCSU - Data Mining",
                "A collection of Data Mining links edited by the Central Connecticut State University ... Graduate Certificate Program. Data Mining Resources. Resources. Groups ..."
            },

            {
                "http://www.kdnuggets.com/",
                "KDnuggets: Data Mining, Web Mining, and Knowledge Discovery",
                "Newsletter on the data mining and knowledge industries, offering information on data mining, knowledge discovery, text mining, and web mining software, courses, jobs, publications, and meetings."
            },

            {
                "http://en.wikipedia.org/wiki/Data-mining",
                "Data mining - Wikipedia, the free encyclopedia",
                "Data mining is considered a subfield within the Computer Science field of knowledge discovery. ... claim to perform \"data mining\" by automating the creation ..."
            },

            {
                "http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm",
                "Data Mining: What is Data Mining?",
                "Outlines what knowledge discovery, the process of analyzing data from different perspectives and summarizing it into useful information, can do and how it works."
            },
        };
        ArrayList<Document> documents = new ArrayList<Document>();
        for (String [] row : data)
        {
            documents.add(new Document(row[1], row[2], row[0]));
        }

        /* A controller to manage the processing pipeline. */
        Controller controller = ControllerFactory.createSimple();

        /* Input data for clustering, list of Documents in this case. */
        Map<String, Object> attributes = new HashMap<String, Object>();
        attributes.put(AttributeNames.DOCUMENTS, documents);

        /* Perform clustering */
        ProcessingResult result = controller.process(attributes,
            LingoClusteringAlgorithm.class);
  
        /* Clusters created by Carrot2. */
        List<Cluster> clusters = result.getClusters();

Pooling of processing component instances, caching of processing results

The examples above used a simple controller to manage the clustering process. While the simple controller is enough for one-shot requests, for long-running applications, such as web applications, it's better to use a controller which supports pooling of processing component instances and caching of processing results.

        /*
         * Create the caching controller. You need only one caching controller instance
         * per application life cycle. This controller instance will cache the results
         * fetched from any document source and also clusters generated by the Lingo
         * algorithm.
         */
        Controller controller = ControllerFactory.createCachingPooling(
            IDocumentSource.class, LingoClusteringAlgorithm.class);

        /*
         * Before using the caching controller, you must initialize it. On initialization,
         * you can set default values for some attributes. In this example, we'll set 
         * the default results number to 50.
         */
        Map<String, Object> globalAttributes = new HashMap<String, Object>();
        globalAttributes.put(AttributeNames.RESULTS, 50);
        controller.init(globalAttributes);

        /*
         * The controller is now ready to perform queries. To show that the documents
         * fetched from the document source are cached, we will perform the same query
         * twice and measure the time taken by each query.
         */
        Map<String, Object> attributes;
        ProcessingResult result;
        long start, duration;
        
        start = System.currentTimeMillis();
        attributes = new HashMap<String, Object>();
        attributes.put(AttributeNames.QUERY, "data mining");
        result = controller.process(attributes,
            BingDocumentSource.class, LingoClusteringAlgorithm.class);
        duration = System.currentTimeMillis() - start;
        System.out.println(duration + " ms (empty cache)");
        
        start = System.currentTimeMillis();
        attributes = new HashMap<String, Object>();
        attributes.put(AttributeNames.QUERY, "data mining");
        result = controller.process(attributes,
            BingDocumentSource.class, LingoClusteringAlgorithm.class);
        duration = System.currentTimeMillis() - start;
        System.out.println(duration + " ms (documents and clusters from cache)");
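To see why the second, identical query above can return much faster, it may help to picture a cache keyed on the effective attribute map, that is, the initialization-time defaults overlaid with the request-time attributes. The following is a minimal sketch in plain Java under that assumption. It is not Carrot2 code: the class AttributeKeyedCache, its methods, and the string keys "query" and "results" are hypothetical names used for illustration only, and the actual controller may cache differently.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/* Hypothetical illustration only, not part of the Carrot2 API. */
final class AttributeKeyedCache<R>
{
    /* Defaults supplied once, playing the role of initialization-time attributes. */
    private final Map<String, Object> defaults;

    /* The expensive operation whose results are cached (for example fetch and cluster). */
    private final Function<Map<String, Object>, R> processor;

    /* Results keyed by the effective attribute map. */
    private final Map<Map<String, Object>, R> cache = new HashMap<Map<String, Object>, R>();

    private int misses;

    AttributeKeyedCache(Map<String, Object> defaults,
        Function<Map<String, Object>, R> processor)
    {
        this.defaults = new HashMap<String, Object>(defaults);
        this.processor = processor;
    }

    synchronized R process(Map<String, Object> requestAttributes)
    {
        /* Assumption: request-time attributes override the defaults. */
        Map<String, Object> effective = new HashMap<String, Object>(defaults);
        effective.putAll(requestAttributes);

        /* Read-only view, so the processor cannot change the cache key. */
        Map<String, Object> key = Collections.unmodifiableMap(effective);
        if (!cache.containsKey(key))
        {
            misses++;
            cache.put(key, processor.apply(key));
        }
        return cache.get(key);
    }

    synchronized int getMisses()
    {
        return misses;
    }

    public static void main(String [] args)
    {
        Map<String, Object> defaults = new HashMap<String, Object>();
        defaults.put("results", 50);

        AttributeKeyedCache<String> cache = new AttributeKeyedCache<String>(defaults,
            attrs -> "clusters for " + attrs.get("query") + " (" + attrs.get("results")
                + " results)");

        Map<String, Object> request = new HashMap<String, Object>();
        request.put("query", "data mining");

        System.out.println(cache.process(request)); /* computed */
        System.out.println(cache.process(request)); /* served from the cache */
        System.out.println("misses: " + cache.getMisses());
    }
}
```

Two requests hit the same cache entry only when their effective attribute maps are equal, so in this sketch changing the query or the number of results leads to a fresh computation.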

Clustering non-English content

This example shows how to cluster non-English content. By default, Carrot2 assumes that the documents provided for clustering are written in English. When clustering content written in a different language, it is important to indicate that language to Carrot2, so that it can use the lexical resources (stop words, tokenizer, stemmer) appropriate for it.

There are two ways to indicate the desired clustering language to Carrot2:

  1. By setting the language of each document in its Document.LANGUAGE field. The language does not have to be the same for all input documents; Carrot2 can handle multiple languages in one document set as well. Please see the MultilingualClustering.languageAggregationStrategy attribute for more details.
  2. By setting the fallback language. For documents with an undefined Document.LANGUAGE field, Carrot2 will assume a fallback language, which is English by default. You can change the fallback language by setting the MultilingualClustering.defaultLanguage attribute.

Additionally, a number of document sources automatically set the Document.LANGUAGE of the documents they produce based on their specific language-related attributes. Currently, three document sources support this scenario:
  1. BingDocumentSource through the BingDocumentSource.market attribute
  2. BossDocumentSource through the BossSearchService.languageAndRegion attribute
  3. EToolsDocumentSource through the EToolsDocumentSource.language attribute

For the document sources that do not set the documents' language automatically, the easiest way to set the clustering language is through the MultilingualClustering.defaultLanguage attribute.

The following example demonstrates both approaches:

/*
 * We use a caching, pooling controller to reuse instances of Carrot2 processing
 * components.
 */
Controller controller = ControllerFactory.createCachingPooling(IDocumentSource.class);

/*
 * No special initialization-time attributes in this example.
 */
final Map<String, Object> initAttributes = new HashMap<String, Object>();
controller.init(initAttributes);

/*
 * In the first call, we'll cluster a document list, setting the language for each
 * document separately.
 */
final List<Document> documents = Lists.newArrayList();
for (Document document : SampleDocumentData.DOCUMENTS_DATA_MINING)
{
  documents.add(new Document(document.getTitle(), document.getSummary(),
    document.getContentUrl(), LanguageCode.ENGLISH));
}

final Map<String, Object> attributes = new HashMap<String, Object>();
attributes.put(AttributeNames.DOCUMENTS, documents);
final ProcessingResult englishResult = controller.process(attributes,
  LingoClusteringAlgorithm.class);
ConsoleFormatter.displayResults(englishResult);

/*
 * In the second call, we will fetch results for a Chinese query from Bing,
 * explicitly setting Bing's specific language attribute. Based on that
 * attribute, the document source will set the appropriate language for each
 * document.
 */
attributes.clear();
attributes.put(AttributeNames.QUERY, "聚类"); // clustering?
attributes.put("BingDocumentSource.market", MarketOption.CHINESE_CHINA);
attributes.put(AttributeNames.RESULTS, 100);
final ProcessingResult chineseResult = controller.process(attributes,
  BingDocumentSource.class, LingoClusteringAlgorithm.class);
ConsoleFormatter.displayResults(chineseResult);

/*
 * In the third call, we will fetch results for the same Chinese query from
 * Google. As the Google document source has no attribute of its own for
 * setting the language, it will not set the documents' language for us. To make
 * sure the right lexical resources are used, we need to set the
 * MultilingualClustering.defaultLanguage attribute to Chinese ourselves.
 */
attributes.clear();
attributes.put(AttributeNames.QUERY, "聚类"); // clustering?
attributes.put("MultilingualClustering.defaultLanguage",
  LanguageCode.CHINESE_SIMPLIFIED);
attributes.put(AttributeNames.RESULTS, 100);
final ProcessingResult chineseResult2 = controller.process(attributes,
  GoogleDocumentSource.class, LingoClusteringAlgorithm.class);
ConsoleFormatter.displayResults(chineseResult2);
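The fallback rule described in this section can be summarized as: a document's own language wins, and the default language applies only when the document has none. The sketch below restates that rule in plain Java. It is a simplified illustration, not Carrot2 code: the Lang enum, the Doc class, and both methods are hypothetical stand-ins, and grouping documents by language is only one conceivable way of handling a mixed-language input (the actual behavior is governed by the MultilingualClustering.languageAggregationStrategy attribute).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/* Hypothetical illustration only, not part of the Carrot2 API. */
final class LanguageFallbackSketch
{
    /* Stand-in for a language code enumeration. */
    enum Lang
    {
        ENGLISH, GERMAN, CHINESE_SIMPLIFIED
    }

    /* Stand-in for a document; a null language means "not set". */
    static final class Doc
    {
        final String title;
        final Lang language;

        Doc(String title, Lang language)
        {
            this.title = title;
            this.language = language;
        }
    }

    /* The document's own language wins; otherwise the fallback is used. */
    static Lang effectiveLanguage(Doc doc, Lang fallback)
    {
        return doc.language != null ? doc.language : fallback;
    }

    /* Groups documents by effective language, preserving first-seen order. */
    static Map<Lang, List<Doc>> groupByLanguage(List<Doc> docs, Lang fallback)
    {
        Map<Lang, List<Doc>> groups = new LinkedHashMap<Lang, List<Doc>>();
        for (Doc doc : docs)
        {
            groups.computeIfAbsent(effectiveLanguage(doc, fallback),
                k -> new ArrayList<Doc>()).add(doc);
        }
        return groups;
    }

    public static void main(String [] args)
    {
        List<Doc> docs = Arrays.asList(
            new Doc("Clusteranalyse", Lang.GERMAN),
            new Doc("Document without a language", null));

        Map<Lang, List<Doc>> groups = groupByLanguage(docs, Lang.CHINESE_SIMPLIFIED);
        for (Map.Entry<Lang, List<Doc>> entry : groups.entrySet())
        {
            System.out.println(entry.getKey() + ": " + entry.getValue().size()
                + " document(s)");
        }
    }
}
```

In this sketch, changing the fallback affects only documents whose language is not set, which mirrors why setting MultilingualClustering.defaultLanguage is the simplest option for document sources that do not tag their documents.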

 



Copyright (c) Dawid Weiss, Stanislaw Osinski