
Carrot2 v3.16.0-SNAPSHOT API Documentation

Carrot2 is an Open Source Search Results Clustering Engine that can automatically organize small collections of documents, for example search results, into thematic categories. See below for more information.


Carrot2 Core 
Package Description
org.carrot2.core
Definitions of Carrot2 core interfaces and their implementations.
org.carrot2.core.attribute
Attribute annotations for Carrot2 core interfaces.
Carrot2 Data Sources 
Package Description
org.carrot2.source
Base classes for implementing Carrot2 document sources.
org.carrot2.source.ambient
Serves documents from the Ambient test set.
org.carrot2.source.etools
Fetches documents from the eTools Metasearch Engine.
org.carrot2.source.idol
Fetches documents from an Autonomy IDOL search engine with an OpenSearch-compliant feed.
org.carrot2.source.lucene
Fetches documents from a local Lucene index.
org.carrot2.source.microsoft.v5
Fetches documents from the Bing search engine (API version 5).
org.carrot2.source.opensearch
Fetches documents from an OpenSearch-compliant search feed.
org.carrot2.source.pubmed
Fetches documents from the PubMed medical abstracts database.
org.carrot2.source.solr
Fetches documents from the Solr search engine.
org.carrot2.source.xml
Fetches documents from XML streams.
Carrot2 Clustering Algorithms 
Package Description
org.carrot2.clustering.kmeans
Implementation of the bisecting k-means clustering algorithm.
org.carrot2.clustering.lingo
Implementation of the Lingo clustering algorithm.
org.carrot2.clustering.stc
Implementation of the STC clustering algorithm.
org.carrot2.clustering.synthetic
Synthetic clustering algorithms.
Carrot2 Results post-processing 
Package Description
org.carrot2.output.metrics
Cluster quality metrics calculation utilities.
Carrot2 Text preprocessing utilities 
Package Description
org.carrot2.text.analysis
Lexical analysis utilities.
org.carrot2.text.clustering
Multilingual clustering utilities.
org.carrot2.text.linguistic
Shallow linguistic processing utilities.
org.carrot2.text.linguistic.lucene
Shallow linguistic processing utilities dependent on Lucene stemmers and analyzers.
org.carrot2.text.linguistic.morfologik
Shallow linguistic processing utilities dependent on the Morfologik stemming library.
org.carrot2.text.linguistic.snowball  
org.carrot2.text.linguistic.snowball.stemmers  
org.carrot2.text.preprocessing
Contains the unified input preprocessing infrastructure (term indexing, stemming, label discovery).
org.carrot2.text.preprocessing.filter
Text feature filtering utilities.
org.carrot2.text.preprocessing.pipeline
Predefined preprocessing pipeline utilities.
org.carrot2.text.suffixtree
Implementation of the suffix tree data structure.
org.carrot2.text.util
Data structures for text preprocessing.
org.carrot2.text.vsm
Vector Space Model utilities.
Carrot2 Matrix utilities 
Package Description
org.carrot2.matrix  
org.carrot2.matrix.factorization  
org.carrot2.matrix.factorization.seeding
Matrix seeding strategies.
Carrot2 Utility classes 
Package Description
org.carrot2.util
Common utility classes.
org.carrot2.util.annotations
Marker annotations.
org.carrot2.util.attribute
Attribute handling utilities.
org.carrot2.util.factory
A simple object factory.
org.carrot2.util.httpclient
Apache Commons HTTP client utilities.
org.carrot2.util.pool
A very simple unbounded pool implementation.
org.carrot2.util.resource
Resource location abstraction layer.
org.carrot2.util.simplexml
Utilities for working with the Simple XML framework.
org.carrot2.util.tests
Unit test utilities and annotations.
org.carrot2.util.xslt
XSLT handling utilities.
org.carrot2.util.xsltfilter
XSLT processor servlet filter.
Other Packages 
Package Description
org.carrot2.log4j
Log4J utilities.
org.carrot2.mahout.collections  
org.carrot2.mahout.common  
org.carrot2.mahout.math  
org.carrot2.mahout.math.buffer  
org.carrot2.mahout.math.function  
org.carrot2.mahout.math.list  
org.carrot2.mahout.math.map  
org.carrot2.mahout.math.matrix  
org.carrot2.mahout.math.matrix.impl  
org.carrot2.mahout.math.matrix.linalg  
org.carrot2.mahout.math.set  


Downloads & more information

Java API JAR, JavaDocs and example code
Other Carrot2 applications
User and Developer Manual
Instructions for Maven2 users
Carrot2 project website
Carrot2 on-line demo

Java API usage examples

You can use the Carrot2 Java API to fetch documents from various sources (public search engines, Lucene, Solr), perform clustering, serialize the results to JSON or XML, and more. Below is example code for the most common use cases. Please see the examples/ directory in the Java API distribution archive for more examples.

Clustering text documents

The easiest way to get started with Carrot2 is to cluster a collection of Documents. Each document can consist of:

  • document content: a query-in-context snippet, document abstract or full text,
  • document title: optional, some clustering algorithms give more weight to document titles,
  • document URL: optional, used by the ByUrlClusteringAlgorithm, ignored by other algorithms.

To keep the example short, the code shown below clusters only 5 documents. Use at least 20 to get reasonable clusters. If you have access to the query that generated the documents being clustered, you should also provide it to Carrot2 to get better clusters.

            /* A few example documents, normally you would need at least 20 for reasonable clusters. */
            final String [][] data = new String [] []
            {
                {
                    "http://en.wikipedia.org/wiki/Data_mining",
                    "Data mining - Wikipedia, the free encyclopedia",
                    "Article about knowledge-discovery in databases (KDD), the practice of automatically searching large stores of data for patterns."
                },

                {
                    "http://www.ccsu.edu/datamining/resources.html",
                    "CCSU - Data Mining",
                    "A collection of Data Mining links edited by the Central Connecticut State University ... Graduate Certificate Program. Data Mining Resources. Resources. Groups ..."
                },

                {
                    "http://www.kdnuggets.com/",
                    "KDnuggets: Data Mining, Web Mining, and Knowledge Discovery",
                    "Newsletter on the data mining and knowledge industries, offering information on data mining, knowledge discovery, text mining, and web mining software, courses, jobs, publications, and meetings."
                },

                {
                    "http://en.wikipedia.org/wiki/Data-mining",
                    "Data mining - Wikipedia, the free encyclopedia",
                    "Data mining is considered a subfield within the Computer Science field of knowledge discovery. ... claim to perform \"data mining\" by automating the creation ..."
                },

                {
                    "http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm",
                    "Data Mining: What is Data Mining?",
                    "Outlines what knowledge discovery, the process of analyzing data from different perspectives and summarizing it into useful information, can do and how it works."
                },
            };

            /* Prepare Carrot2 documents */
            final ArrayList<Document> documents = new ArrayList<Document>();
            for (String [] row : data)
            {
                documents.add(new Document(row[1], row[2], row[0]));
            }

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();

            /*
             * Perform clustering by topic using the Lingo algorithm. Lingo can 
             * take advantage of the original query, so we provide it along with the documents.
             */
            final ProcessingResult byTopicClusters = controller.process(documents, "data mining",
                LingoClusteringAlgorithm.class);
            final List<Cluster> clustersByTopic = byTopicClusters.getClusters();
            
            /* Perform clustering by domain. In this case query is not useful, hence it is null. */
            final ProcessingResult byDomainClusters = controller.process(documents, null,
                ByUrlClusteringAlgorithm.class);
            final List<Cluster> clustersByDomain = byDomainClusters.getClusters();
Full source code: ClusteringDocumentList.java
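
The clusters returned by either call can then be traversed to read their labels, member documents and any nested clusters. Continuing the example above, a minimal sketch using the Cluster accessors getLabel(), getAllDocuments() and getSubclusters():

            /* A minimal sketch: print each top-level cluster, its size and its subclusters. */
            for (Cluster cluster : clustersByTopic)
            {
                System.out.println(cluster.getLabel() + " ("
                    + cluster.getAllDocuments().size() + " documents)");
                for (Cluster subcluster : cluster.getSubclusters())
                {
                    System.out.println("    " + subcluster.getLabel());
                }
            }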

Clustering documents from document sources

With default settings

One common way to use the Carrot2 Java API is to fetch a number of documents from an IDocumentSource and cluster them with an IClusteringAlgorithm. The simplest, though least flexible, way to do this is to call the Controller.process(String, Integer, Class...) method. The code shown below retrieves 100 search results for the query data mining from EToolsDocumentSource and clusters them using the LingoClusteringAlgorithm.
            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();

            /* Perform processing */
            final ProcessingResult result = controller.process("data mining", 100,
                EToolsDocumentSource.class, LingoClusteringAlgorithm.class);
    
            /* Documents fetched from the document source, clusters created by Carrot2. */
            final List<Document> documents = result.getDocuments();
            final List<Cluster> clusters = result.getClusters();
Full source code: ClusteringDataFromDocumentSources.java
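
The introduction mentions serializing results to JSON or XML. Continuing from the result obtained above, a minimal sketch is shown below; it assumes the ProcessingResult.serialize(OutputStream) and ProcessingResult.serializeJson(Writer) methods (both throw checked exceptions, so handle or declare them in the enclosing method) and writes to two hypothetical files.

            /* A minimal sketch: persist the fetched documents and clusters. */
            try (OutputStream xmlStream = new FileOutputStream("result.xml");
                 Writer jsonWriter = new FileWriter("result.json"))
            {
                result.serialize(xmlStream);      // XML representation
                result.serializeJson(jsonWriter); // JSON representation
            }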

With custom settings

If your production code needs to fetch documents from popular search engines, it is very important that you generate and use your own API key rather than Carrot2's default one. You can pass the API key, along with the query and the requested number of results, in an attribute map. The Carrot2 manual lists all supported attributes along with their keys, types and allowed values. The code shown below fetches and clusters 50 results from Bing5DocumentSource.

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
    
            /* Prepare attributes */
            final Map<String, Object> attributes = new HashMap<String, Object>();
            
            /* Put your own API key here! */
            Bing5DocumentSourceDescriptor.attributeBuilder(attributes)
                .apiKey(BingKeyAccess.getKey());

            /* Query and the required number of results */
            attributes.put(CommonAttributesDescriptor.Keys.QUERY, "clustering");
            attributes.put(CommonAttributesDescriptor.Keys.RESULTS, 50);
    
            /* Perform processing */
            final ProcessingResult result = controller.process(attributes, 
                Bing5DocumentSource.class, STCClusteringAlgorithm.class);

            /* Documents fetched from the document source, clusters created by Carrot2. */
            final List<Document> documents = result.getDocuments();
            final List<Cluster> clusters = result.getClusters();
Full source code: ClusteringDataFromDocumentSources.java

Setting attributes of clustering algorithms and document sources

By attribute keys

You can change the default behaviour of clustering algorithms and document sources by changing their attributes. For a complete list of available attributes, their identifiers, types and allowed values, please see the Carrot2 manual.

To pass attributes to Carrot2, put them into a Map along with the query or documents being clustered. The code shown below searches the web using Bing5DocumentSource and clusters the results using LingoClusteringAlgorithm, customized to create fewer clusters than by default.

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
            
            /* Prepare attribute map */
            final Map<String, Object> attributes = new HashMap<String, Object>();

            /* Put attribute values using direct keys. */
            attributes.put(CommonAttributesDescriptor.Keys.QUERY, "data mining");
            attributes.put(CommonAttributesDescriptor.Keys.RESULTS, 100);
            attributes.put("LingoClusteringAlgorithm.desiredClusterCountBase", 15);

            /* Put your own API key here! */
            attributes.put(Bing5DocumentSourceDescriptor.Keys.API_KEY, BingKeyAccess.getKey()); 
            
            /* Perform processing */
            final ProcessingResult result = controller.process(attributes,
                Bing5DocumentSource.class, LingoClusteringAlgorithm.class);
    
            /* Documents fetched from the document source, clusters created by Carrot2. */
            final List<Document> documents = result.getDocuments();
            final List<Cluster> clusters = result.getClusters();
Full source code: UsingAttributes.java

Using attribute builders

As an alternative to the raw attribute map used in the previous example, you can use attribute map builders. Attribute map builders have a number of advantages:

  • Type safety: the correct value type is enforced at compile time
  • Error prevention: unexpected results caused by typos in attribute name strings are avoided
  • Early error detection: if an attribute's key changes, your compiler will detect it
  • IDE support: your IDE will suggest the right method names and parameters

A possible disadvantage of attribute builders is that one algorithm's attributes may be spread across a number of builders and hence not all readily available in your IDE's auto-complete window. Please consult the attribute documentation in the Carrot2 manual for pointers to the appropriate builder classes and methods.

The code shown below fetches 100 results for the query data mining from Bing5DocumentSource and clusters them using the LingoClusteringAlgorithm, tuned to create slightly fewer clusters than by default. Please note how the API key is passed, and use your own key in production deployments.

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
            
            /* Prepare attribute map */
            final Map<String, Object> attributes = new HashMap<String, Object>();

            /* Put values using attribute builders */
            CommonAttributesDescriptor
                .attributeBuilder(attributes)
                    .query("data mining")
                    .results(100);
            LingoClusteringAlgorithmDescriptor
                .attributeBuilder(attributes)
                    .desiredClusterCountBase(15)
                    .matrixReducer()
                        .factorizationQuality(FactorizationQuality.HIGH);
                        
            Bing5DocumentSourceDescriptor
                .attributeBuilder(attributes)
                    .apiKey(BingKeyAccess.getKey()); // use your own key here
            
            /* Perform processing */
            final ProcessingResult result = controller.process(attributes,
                Bing5DocumentSource.class, LingoClusteringAlgorithm.class);
    
            /* Documents fetched from the document source, clusters created by Carrot2. */
            final List<Document> documents = result.getDocuments();
            final List<Cluster> clusters = result.getClusters();
Full source code: UsingAttributes.java

Collecting output attributes

Apart from clusters, some algorithms can produce additional, usually diagnostic, output. This output is available in the attributes map contained in the ProcessingResult. You can read the contents of that map directly or through the attribute map builders. The Carrot2 manual lists and describes in detail the output attributes of each component.

The code shown below clusters an example collection of Documents using the Lingo algorithm. Lingo can optionally use native platform-specific matrix computation libraries. The example code reads an attribute to find out whether such libraries were successfully loaded and used.

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
            
            /* Prepare attribute map */
            final Map<String, Object> attributes = new HashMap<String, Object>();
            CommonAttributesDescriptor
                .attributeBuilder(attributes)
                    .documents(SampleDocumentData.DOCUMENTS_DATA_MINING);
            LingoClusteringAlgorithmDescriptor
                .attributeBuilder(attributes)
                    .desiredClusterCountBase(15)
                    .matrixReducer()
                        .factorizationQuality(FactorizationQuality.HIGH);

            /* Perform processing */
            final ProcessingResult result = controller.process(attributes,
                LingoClusteringAlgorithm.class);
            
            /* Clusters created by Carrot2 and the algorithm's processing time */
            final List<Cluster> clusters = result.getClusters();
            final Long clusteringTime = CommonAttributesDescriptor.attributeBuilder(
                result.getAttributes()).processingTimeAlgorithm();
Full source code: UsingAttributes.java
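
The same value can also be read directly from the raw map returned by ProcessingResult.getAttributes(). A minimal sketch, assuming the AttributeNames.PROCESSING_TIME_ALGORITHM key constant from org.carrot2.core.attribute (the manual lists the exact keys of all output attributes):

            /* A minimal sketch: read the same output attribute without a builder. */
            final Map<String, Object> resultAttributes = result.getAttributes();
            final Long clusteringTimeMillis =
                (Long) resultAttributes.get(AttributeNames.PROCESSING_TIME_ALGORITHM);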

Pooling of processing component instances, caching of processing results

The examples shown above used a simple controller to manage the clustering process. While the simple controller is enough for one-shot requests, for long-running applications, such as web applications, it's better to use a controller which supports pooling of processing component instances and caching of processing results.

        /*
         * Create the caching controller. You need only one caching controller instance
         * per application life cycle. This controller instance will cache the results
         * fetched from any document source and also clusters generated by the Lingo
         * algorithm.
         */
        final Controller controller = ControllerFactory.createCachingPooling(
            IDocumentSource.class, LingoClusteringAlgorithm.class);

        /*
         * Before using the caching controller, you must initialize it. On initialization,
         * you can set default values for some attributes. In this example, we'll set the
         * default results number to 50 and the API key.
         */
        final Map<String, Object> globalAttributes = new HashMap<String, Object>();
        CommonAttributesDescriptor
            .attributeBuilder(globalAttributes)
                .results(50);
        Bing5DocumentSourceDescriptor
            .attributeBuilder(globalAttributes)
                .apiKey(BingKeyAccess.getKey()); // use your own ID here
        controller.init(globalAttributes);

        /*
         * The controller is now ready to perform queries. To show that the documents from
         * the document input are cached, we will perform the same query twice and measure
         * the time for each query.
         */
        ProcessingResult result;
        long start, duration;

        final Map<String, Object> attributes;
        attributes = new HashMap<String, Object>();
        CommonAttributesDescriptor.attributeBuilder(attributes).query("data mining");

        start = System.currentTimeMillis();
        result = controller.process(attributes, Bing5DocumentSource.class,
            LingoClusteringAlgorithm.class);
        duration = System.currentTimeMillis() - start;
        System.out.println(duration + " ms (empty cache)");

        start = System.currentTimeMillis();
        result = controller.process(attributes, Bing5DocumentSource.class,
            LingoClusteringAlgorithm.class);
        duration = System.currentTimeMillis() - start;
        System.out.println(duration + " ms (documents and clusters from cache)");
Full source code: UsingCachingController.java
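
When a long-running application shuts down, it should release the pooled component instances and any other resources held by the controller. A minimal sketch:

        /* Dispose of the controller (and its pooled components) on application shutdown. */
        controller.dispose();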

Clustering non-English content

This example shows how to cluster non-English content. By default, Carrot2 assumes that the documents provided for clustering are written in English. When clustering content written in a different language, it is important to indicate that language to Carrot2 so that it can use the lexical resources (stop words, tokenizer, stemmer) appropriate for that language.

There are two ways to indicate the desired clustering language to Carrot2:

  1. By setting the language of each document in its Document.LANGUAGE field. The language does not have to be the same for all documents on input; Carrot2 can handle multiple languages in one document set as well. Please see the MultilingualClustering.languageAggregationStrategy attribute for more details.
  2. By setting the fallback language. For documents with an undefined Document.LANGUAGE field, Carrot2 will assume the fallback language, which is English by default. You can change the fallback language by setting the MultilingualClustering.defaultLanguage attribute.
Additionally, some document sources automatically set the Document.LANGUAGE of the documents they produce based on their source-specific language-related attributes. Currently, two document sources support this scenario:
  1. Bing5DocumentSource through the Bing5DocumentSource.market attribute,
  2. EToolsDocumentSource through the EToolsDocumentSource.language attribute.
For document sources that do not set the documents' language automatically, the easiest way to set the clustering language is through the MultilingualClustering.defaultLanguage attribute; a short sketch of this approach follows the full example below.
        /*
         * We use a Controller that reuses instances of Carrot2 processing components
         * and caches results produced by document sources.
         */
        final Controller controller = ControllerFactory.createCachingPooling(IDocumentSource.class);

        /*
         * In the first call, we'll cluster a document list, setting the language for each
         * document separately.
         */
        final List<Document> documents = Lists.newArrayList();
        for (Document document : SampleDocumentData.DOCUMENTS_DATA_MINING)
        {
            documents.add(new Document(document.getTitle(), document.getSummary(),
                document.getContentUrl(), LanguageCode.ENGLISH));
        }

        final Map<String, Object> attributes = Maps.newHashMap();
        CommonAttributesDescriptor.attributeBuilder(attributes)
            .documents(documents);
        final ProcessingResult englishResult = controller.process(
            attributes, LingoClusteringAlgorithm.class);
        ConsoleFormatter.displayResults(englishResult);

        /*
         * In the second call, we will fetch results for a Chinese query from Bing,
         * explicitly setting Bing's language-related market attribute. Based on that
         * attribute, the document source will set the appropriate language for each
         * document.
         */
        attributes.clear();
        
        CommonAttributesDescriptor.attributeBuilder(attributes)
            .query("聚类" /* clustering? */)
            .results(100);

        Bing5DocumentSourceDescriptor.attributeBuilder(attributes)
            .market(MarketOption.CHINESE_CHINA)
            .apiKey(BingKeyAccess.getKey()); // use your own ID here!

        final ProcessingResult chineseResult = controller.process(attributes,
            Bing5DocumentSource.class, LingoClusteringAlgorithm.class);
        ConsoleFormatter.displayResults(chineseResult);

Full source code: ClusteringNonEnglishContent.java
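
The example above sets the language per document and through Bing's market attribute. For the fallback-language approach mentioned earlier, a minimal sketch is shown below; it passes the attribute through its plain key string (quoted from the attribute name above; a generated descriptor builder may also be available, see the manual) and arbitrarily picks German as the fallback, purely to show where the value goes.

        /* A minimal sketch, for illustration only: any document with an undefined
         * Document.LANGUAGE field will be processed with German lexical resources
         * instead of the default English. */
        final Map<String, Object> fallbackAttributes = Maps.newHashMap();
        CommonAttributesDescriptor.attributeBuilder(fallbackAttributes)
            .documents(SampleDocumentData.DOCUMENTS_DATA_MINING);
        fallbackAttributes.put("MultilingualClustering.defaultLanguage", LanguageCode.GERMAN);

        final ProcessingResult germanResult = controller.process(fallbackAttributes,
            LingoClusteringAlgorithm.class);
        ConsoleFormatter.displayResults(germanResult);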

Copyright (c) Dawid Weiss, Stanislaw Osinski