Carrot2 v3.6.0-SNAPSHOT API Documentation

Carrot2 is an Open Source Search Results Clustering Engine. It can automatically organize small collections of documents, for example search results, into thematic categories. See below for more.


Carrot2 Core
org.carrot2.core Definitions of Carrot2 core interfaces and their implementations.
org.carrot2.core.attribute Attribute annotations for Carrot2 core interfaces.

 

Carrot2 Data Sources
org.carrot2.source Base classes for implementing Carrot2 document sources.
org.carrot2.source.ambient Serves documents from the Ambient test set.
org.carrot2.source.etools Fetches documents from the eTools Metasearch Engine.
org.carrot2.source.google Fetches documents from a local instance of Google Desktop.
org.carrot2.source.idol Fetches documents from an Autonomy IDOL Search engine with an OpenSearch-compliant feed.
org.carrot2.source.lucene Fetches documents from a local Lucene index.
org.carrot2.source.microsoft Fetches documents from the Bing search engine using its publicly available API.
org.carrot2.source.opensearch Fetches documents from an OpenSearch-compliant search feed.
org.carrot2.source.pubmed Fetches documents from the PubMed medical abstracts database.
org.carrot2.source.solr Fetches documents from the Solr search engine.
org.carrot2.source.xml Fetches documents from XML files and streams.

 

Carrot2 Clustering Algorithms
org.carrot2.clustering.kmeans Implementation of the bisecting k-means clustering algorithm.
org.carrot2.clustering.lingo Implementation of the Lingo clustering algorithm.
org.carrot2.clustering.stc Implementation of the STC clustering algorithm.
org.carrot2.clustering.synthetic Synthetic clustering algorithms.

 

Carrot2 Results post-processing
org.carrot2.output.metrics Cluster quality metrics calculation utilities.

 

Carrot2 Text preprocessing utilities
org.carrot2.text.analysis Lexical analysis utilities.
org.carrot2.text.clustering Multilingual clustering utilities.
org.carrot2.text.linguistic Shallow linguistic processing utilities.
org.carrot2.text.linguistic.lucene Shallow linguistic processing utilities dependent on Lucene stemmers and analyzers.
org.carrot2.text.linguistic.morfologik Shallow linguistic processing utilities dependent on the Morfologik stemming library.
org.carrot2.text.preprocessing Contains the unified input preprocessing infrastructure (term indexing, stemming, label discovery).
org.carrot2.text.preprocessing.filter Text feature filtering utilities.
org.carrot2.text.preprocessing.pipeline Predefined preprocessing pipeline utilities.
org.carrot2.text.suffixtree Implementation of the suffix tree data structure.
org.carrot2.text.util Data structures for text preprocessing.
org.carrot2.text.vsm Vector Space Model utilities.

 

Carrot2 Attribute Binding
org.carrot2.util.attribute A framework for managing Carrot2 component attributes.
org.carrot2.util.attribute.constraint Constraints that can be imposed on attributes provided for Carrot2 components.
org.carrot2.util.attribute.metadata Human-readable information about components and their attributes.

 

Carrot2 Matrix utilities
org.carrot2.matrix Matrix factorization routines.
org.carrot2.matrix.factorization Matrix factorization implementations.
org.carrot2.matrix.factorization.seeding Matrix seeding strategies.

 

Carrot2 Utility classes
org.carrot2.util Common utility classes.
org.carrot2.util.annotations Marker annotations.
org.carrot2.util.factory A simple object factory.
org.carrot2.util.httpclient Apache Commons HTTP client utilities.
org.carrot2.util.pool A very simple unbounded pool implementation.
org.carrot2.util.resource Resource location abstraction layer.
org.carrot2.util.simplexml Utilities for working with the Simple XML framework.
org.carrot2.util.tests  
org.carrot2.util.xslt XSLT handling utilities.
org.carrot2.util.xsltfilter XSLT processor servlet filter.

 

Other Packages
org.carrot2.log4j Log4J utilities.

 

Downloads & more information

Java API JAR, JavaDocs and example code
Other Carrot2 applications
User and Developer Manual
Instructions for Maven2 users
Carrot2 project website
Carrot2 on-line demo

Java API usage examples

You can use the Carrot2 Java API to fetch documents from various sources (public search engines, Lucene, Solr), perform clustering, serialize the results to JSON or XML and much more. Below is some example code for the most common use cases. Please see the examples/ directory in the Java API distribution archive for more examples.

Clustering text documents

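The easiest way to get started with Carrot2 is to cluster a collection of Documents. Each document can consist of a title, a summary (for example a search result snippet) and a URL, which is the argument order of the Document constructor used in the code below.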

To make the example short, the code shown below clusters only 5 documents. Use at least 20 to get reasonable clusters. If you have access to the query that generated the documents being clustered, you should also provide it to Carrot2 to get better clusters.

            /* A few example documents, normally you would need at least 20 for reasonable clusters. */
            final String [][] data = new String [] []
            {
                {
                    "http://en.wikipedia.org/wiki/Data_mining",
                    "Data mining - Wikipedia, the free encyclopedia",
                    "Article about knowledge-discovery in databases (KDD), the practice of automatically searching large stores of data for patterns."
                },

                {
                    "http://www.ccsu.edu/datamining/resources.html",
                    "CCSU - Data Mining",
                    "A collection of Data Mining links edited by the Central Connecticut State University ... Graduate Certificate Program. Data Mining Resources. Resources. Groups ..."
                },

                {
                    "http://www.kdnuggets.com/",
                    "KDnuggets: Data Mining, Web Mining, and Knowledge Discovery",
                    "Newsletter on the data mining and knowledge industries, offering information on data mining, knowledge discovery, text mining, and web mining software, courses, jobs, publications, and meetings."
                },

                {
                    "http://en.wikipedia.org/wiki/Data-mining",
                    "Data mining - Wikipedia, the free encyclopedia",
                    "Data mining is considered a subfield within the Computer Science field of knowledge discovery. ... claim to perform \"data mining\" by automating the creation ..."
                },

                {
                    "http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm",
                    "Data Mining: What is Data Mining?",
                    "Outlines what knowledge discovery, the process of analyzing data from different perspectives and summarizing it into useful information, can do and how it works."
                },
            };

            /* Prepare Carrot2 documents */
            final ArrayList<Document> documents = new ArrayList<Document>();
            for (String [] row : data)
            {
                documents.add(new Document(row[1], row[2], row[0]));
            }

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();

            /*
             * Perform clustering by topic using the Lingo algorithm. Lingo can 
             * take advantage of the original query, so we provide it along with the documents.
             */
            final ProcessingResult byTopicClusters = controller.process(documents, "data mining",
                LingoClusteringAlgorithm.class);
            final List<Cluster> clustersByTopic = byTopicClusters.getClusters();
            
            /* Perform clustering by domain. In this case query is not useful, hence it is null. */
            final ProcessingResult byDomainClusters = controller.process(documents, null,
                ByUrlClusteringAlgorithm.class);
            final List<Cluster> clustersByDomain = byDomainClusters.getClusters();
Full source code: ClusteringDocumentList.java
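
To give an intuition for what clustering by domain means, here is a minimal plain-Java sketch that groups URLs by host name. It is an illustration only and not the implementation of ByUrlClusteringAlgorithm; the class and method names (GroupByHostSketch, groupByHost) are made up for this example, and it uses nothing but the JDK.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: groups URLs by host name. Not a Carrot2 class. */
class GroupByHostSketch
{
    /** Returns a map from host name to the URLs on that host, in input order. */
    static Map<String, List<String>> groupByHost(List<String> urls)
    {
        final Map<String, List<String>> groups = new LinkedHashMap<String, List<String>>();
        for (String url : urls)
        {
            // URI.create throws IllegalArgumentException for malformed URLs.
            String host = URI.create(url).getHost();
            if (host == null)
            {
                host = "(unknown)";
            }
            List<String> group = groups.get(host);
            if (group == null)
            {
                group = new ArrayList<String>();
                groups.put(host, group);
            }
            group.add(url);
        }
        return groups;
    }

    public static void main(String [] args)
    {
        final List<String> urls = Arrays.asList(
            "http://en.wikipedia.org/wiki/Data_mining",
            "http://www.kdnuggets.com/",
            "http://en.wikipedia.org/wiki/Data-mining");
        for (Map.Entry<String, List<String>> entry : groupByHost(urls).entrySet())
        {
            System.out.println(entry.getKey() + ": " + entry.getValue().size() + " document(s)");
        }
    }
}
```

The query plays no role in this kind of grouping, which is why the by-domain call above passes null for it.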

Clustering documents from document sources

With default settings

One common way to use the Carrot2 Java API is to fetch a number of documents from some IDocumentSource and cluster them using some IClusteringAlgorithm. The simplest, yet least flexible, way to do it is to use the Controller.process(String, Integer, Class...) method. The code shown below retrieves 100 search results for the query "data mining" from Bing2WebDocumentSource and clusters them using the LingoClusteringAlgorithm.
            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
            
            /* Perform processing */
            final ProcessingResult result = controller.process("data mining", 100,
                Bing2WebDocumentSource.class, LingoClusteringAlgorithm.class);
    
            /* Documents fetched from the document source, clusters created by Carrot2. */
            final List<Document> documents = result.getDocuments();
            final List<Cluster> clusters = result.getClusters();
Full source code: ClusteringDataFromDocumentSources.java

With custom settings

If your production code needs to fetch documents from popular search engines, it is very important that you generate and use your own API key rather than Carrot2's default one. You can pass the API key along with the query and the requested number of results in an attribute map. The Carrot2 manual lists all supported attributes along with their keys, types and allowed values. The code shown below fetches 50 results from Bing2WebDocumentSource and clusters them using the STCClusteringAlgorithm.

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
    
            /* Prepare attributes */
            final Map<String, Object> attributes = new HashMap<String, Object>();
            
            /* Put your own API key here */
            attributes.put("Bing2WebDocumentSource.appid", Bing2WebDocumentSource.CARROTSEARCH_APPID);
    
            /* Query and the required number of results */
            attributes.put(CommonAttributesDescriptor.Keys.QUERY, "clustering");
            attributes.put(CommonAttributesDescriptor.Keys.RESULTS, 50);
    
            /* Perform processing */
            final ProcessingResult result = controller.process(attributes, 
                Bing2WebDocumentSource.class, STCClusteringAlgorithm.class);

            /* Documents fetched from the document source, clusters created by Carrot2. */
            final List<Document> documents = result.getDocuments();
            final List<Cluster> clusters = result.getClusters();
Full source code: ClusteringDataFromDocumentSources.java
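
One way to keep your own API key out of the source code is to read it from the environment when the attribute map is prepared. The sketch below assumes an environment variable named BING_APP_ID, which is a name invented for this example; the class ApiKeyFromEnvironmentSketch is likewise hypothetical. The attribute key string is the one used in the example above.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: copies an API key from the environment into an attribute map. */
class ApiKeyFromEnvironmentSketch
{
    /** Hypothetical environment variable name, chosen for this sketch only. */
    static final String ENV_VARIABLE = "BING_APP_ID";

    /**
     * Puts the API key found in the given environment into the attribute map.
     * Fails fast when the key is missing instead of silently using a shared default.
     */
    static void putApiKey(Map<String, Object> attributes, Map<String, String> environment)
    {
        final String key = environment.get(ENV_VARIABLE);
        if (key == null || key.trim().isEmpty())
        {
            throw new IllegalStateException("Set the " + ENV_VARIABLE + " environment variable");
        }
        attributes.put("Bing2WebDocumentSource.appid", key.trim());
    }

    public static void main(String [] args)
    {
        final Map<String, Object> attributes = new HashMap<String, Object>();
        try
        {
            putApiKey(attributes, System.getenv());
            System.out.println("API key configured");
        }
        catch (IllegalStateException e)
        {
            System.out.println(e.getMessage());
        }
    }
}
```

Passing the environment in as a map, rather than calling System.getenv() inside the method, keeps the method easy to test.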

Setting attributes of clustering algorithms and document sources

By attribute keys

You can change the default behaviour of clustering algorithms and document sources by changing their attributes. For a complete list of available attributes, their identifiers, types and allowed values, please see the Carrot2 manual.

To pass attributes to Carrot2, put them into a Map, along with the query or the documents being clustered. The code shown below searches the web using Bing2WebDocumentSource and clusters the results using LingoClusteringAlgorithm customized to create fewer clusters than by default.

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
            
            /* Prepare attribute map */
            final Map<String, Object> attributes = new HashMap<String, Object>();

            /* Put values using attribute keys */
            attributes.put(CommonAttributesDescriptor.Keys.QUERY, "data mining");
            attributes.put(CommonAttributesDescriptor.Keys.RESULTS, 100);
            attributes.put("Bing2WebDocumentSource.appid", 
                Bing2WebDocumentSource.CARROTSEARCH_APPID); // use your own ID here
            attributes.put("LingoClusteringAlgorithm.desiredClusterCountBase", 15);
            
            /* Perform processing */
            final ProcessingResult result = controller.process(attributes,
                Bing2WebDocumentSource.class, LingoClusteringAlgorithm.class);
    
            /* Documents fetched from the document source, clusters created by Carrot2. */
            final List<Document> documents = result.getDocuments();
            final List<Cluster> clusters = result.getClusters();
Full source code: UsingAttributes.java
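
A raw Map<String, Object> accepts any string as a key, so a misspelled attribute key compiles without complaint. As a minimal sketch of one possible safeguard, the hypothetical helper below (AttributeKeyCheckSketch is not part of Carrot2) reports keys that are not in a set of keys the application expects to use. The set of expected keys is something you would maintain yourself.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustration only: finds attribute keys the application did not expect. */
class AttributeKeyCheckSketch
{
    /** Returns the keys present in the attribute map but absent from the expected set, sorted. */
    static Set<String> unexpectedKeys(Map<String, Object> attributes, Set<String> expected)
    {
        final Set<String> unexpected = new TreeSet<String>(attributes.keySet());
        unexpected.removeAll(expected);
        return unexpected;
    }

    public static void main(String [] args)
    {
        final Set<String> expected = new HashSet<String>(Arrays.asList(
            "Bing2WebDocumentSource.appid",
            "LingoClusteringAlgorithm.desiredClusterCountBase"));

        final Map<String, Object> attributes = new HashMap<String, Object>();
        // Deliberate typo: "Cuont" instead of "Count".
        attributes.put("LingoClusteringAlgorithm.desiredClusterCuontBase", 15);

        System.out.println("Unexpected keys: " + unexpectedKeys(attributes, expected));
    }
}
```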

Using attribute builders

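As an alternative to the raw attribute map used in the previous example, you can use attribute map builders. Attribute map builders have a number of advantages: the compiler checks the type of each attribute value, attribute key strings cannot be misspelled, and your IDE can suggest the available builder methods.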

A possible disadvantage of attribute builders is that one algorithm's attributes can be divided into a number of builders and hence not readily available in your IDE's auto-complete window. Please consult the attribute documentation in the Carrot2 manual for pointers to the appropriate builder classes and methods.

The code shown below fetches 100 results for query data mining from Bing2WebDocumentSource and clusters them using the LingoClusteringAlgorithm tuned to create slightly fewer clusters than by default. Please note how the API key is passed and use your own key in production deployments.

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
            
            /* Prepare attribute map */
            final Map<String, Object> attributes = new HashMap<String, Object>();

            /* Put values using attribute builders */
            CommonAttributesDescriptor
                .attributeBuilder(attributes)
                    .query("data mining")
                    .results(100);
            LingoClusteringAlgorithmDescriptor
                .attributeBuilder(attributes)
                    .desiredClusterCountBase(15)
                    .matrixReducer()
                        .factorizationQuality(FactorizationQuality.HIGH);
                        
            Bing2WebDocumentSourceDescriptor
                .attributeBuilder(attributes)
                    .appid(Bing2WebDocumentSource.CARROTSEARCH_APPID); // use your own key here
            
            /* Perform processing */
            final ProcessingResult result = controller.process(attributes,
                Bing2WebDocumentSource.class, LingoClusteringAlgorithm.class);
    
            /* Documents fetched from the document source, clusters created by Carrot2. */
            final List<Document> documents = result.getDocuments();
            final List<Cluster> clusters = result.getClusters();
Full source code: UsingAttributes.java
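
To show the idea behind a builder that writes into an attribute map, here is a minimal plain-Java sketch. TypedAttributeBuilderSketch and the "sketch." key strings are hypothetical and exist only for this example; the generated Carrot2 descriptor classes are more elaborate.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in showing the builder idea. Not a Carrot2 class. */
class TypedAttributeBuilderSketch
{
    private final Map<String, Object> attributes;

    private TypedAttributeBuilderSketch(Map<String, Object> attributes)
    {
        this.attributes = attributes;
    }

    /** Wraps the given map; all builder methods write into it. */
    static TypedAttributeBuilderSketch attributeBuilder(Map<String, Object> attributes)
    {
        return new TypedAttributeBuilderSketch(attributes);
    }

    /** The key string lives in one place, so callers cannot misspell it. */
    TypedAttributeBuilderSketch query(String query)
    {
        attributes.put("sketch.query", query);
        return this;
    }

    /** The int parameter turns results("100") into a compile-time error. */
    TypedAttributeBuilderSketch results(int results)
    {
        attributes.put("sketch.results", results);
        return this;
    }

    public static void main(String [] args)
    {
        final Map<String, Object> attributes = new HashMap<String, Object>();
        attributeBuilder(attributes).query("data mining").results(100);
        System.out.println(attributes.get("sketch.query") + ", " + attributes.get("sketch.results"));
    }
}
```

Because each builder method returns the builder itself, calls can be chained in the same way as in the example above.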

Collecting output attributes

Apart from clusters, some algorithms can produce additional, usually diagnostic, output. This output is available in the attributes map contained in the ProcessingResult. You can read the contents of that map directly or through the attribute map builders. The Carrot2 manual lists and describes in detail the output attributes of each component.

The code shown below clusters an example collection of Documents using the Lingo algorithm. After processing, the example reads an output attribute from the ProcessingResult to find out how long the clustering algorithm took.

            /* A controller to manage the processing pipeline. */
            final Controller controller = ControllerFactory.createSimple();
            
            /* Prepare attribute map */
            final Map<String, Object> attributes = new HashMap<String, Object>();
            CommonAttributesDescriptor
                .attributeBuilder(attributes)
                    .documents(SampleDocumentData.DOCUMENTS_DATA_MINING);
            LingoClusteringAlgorithmDescriptor
                .attributeBuilder(attributes)
                    .desiredClusterCountBase(15)
                    .matrixReducer()
                        .factorizationQuality(FactorizationQuality.HIGH);

            /* Perform processing */
            final ProcessingResult result = controller.process(attributes,
                LingoClusteringAlgorithm.class);
            
            /* Clusters created by Carrot2, read processing time */
            final List<Cluster> clusters = result.getClusters();
            final Long clusteringTime = CommonAttributesDescriptor.attributeBuilder(
                result.getAttributes()).processingTimeAlgorithm();
Full source code: UsingAttributes.java
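
If you read the output map directly rather than through a builder, you have to cast each value yourself. The following minimal sketch shows one way to do that with an explicit type check. OutputAttributeReaderSketch and the key "sketch.processingTimeMillis" are invented for this example; real output attribute keys are listed in the Carrot2 manual.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: typed access to a Map<String, Object> of output values. */
class OutputAttributeReaderSketch
{
    /**
     * Returns the value stored under the key, or null when the key is absent.
     * Throws ClassCastException when the stored value has a different type.
     */
    static <T> T read(Map<String, Object> attributes, String key, Class<T> type)
    {
        return type.cast(attributes.get(key));
    }

    public static void main(String [] args)
    {
        // Stand-in for the map returned by result.getAttributes().
        final Map<String, Object> output = new HashMap<String, Object>();
        output.put("sketch.processingTimeMillis", Long.valueOf(42));

        final Long time = read(output, "sketch.processingTimeMillis", Long.class);
        System.out.println("Clustering took " + time + " ms");
    }
}
```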

Pooling of processing component instances, caching of processing results

The examples shown above used a simple controller to manage the clustering process. While the simple controller is enough for one-shot requests, for long-running applications, such as web applications, it's better to use a controller which supports pooling of processing component instances and caching of processing results.

        /*
         * Create the caching controller. You need only one caching controller instance
         * per application life cycle. This controller instance will cache the results
         * fetched from any document source and also clusters generated by the Lingo
         * algorithm.
         */
        final Controller controller = ControllerFactory.createCachingPooling(
            IDocumentSource.class, LingoClusteringAlgorithm.class);

        /*
         * Before using the caching controller, you must initialize it. On initialization,
         * you can set default values for some attributes. In this example, we'll set the
         * default results number to 50 and the API key.
         */
        final Map<String, Object> globalAttributes = new HashMap<String, Object>();
        CommonAttributesDescriptor
            .attributeBuilder(globalAttributes)
                .results(50);
        Bing2WebDocumentSourceDescriptor
            .attributeBuilder(globalAttributes)
                .appid(Bing2WebDocumentSource.CARROTSEARCH_APPID); // use your own ID here
        controller.init(globalAttributes);

        /*
         * The controller is now ready to perform queries. To show that the documents from
         * the document source are cached, we will perform the same query twice and measure
         * the time for each query.
         */
        ProcessingResult result;
        long start, duration;

        final Map<String, Object> attributes;
        attributes = new HashMap<String, Object>();
        CommonAttributesDescriptor.attributeBuilder(attributes).query("data mining");

        start = System.currentTimeMillis();
        result = controller.process(attributes, Bing2WebDocumentSource.class,
            LingoClusteringAlgorithm.class);
        duration = System.currentTimeMillis() - start;
        System.out.println(duration + " ms (empty cache)");

        start = System.currentTimeMillis();
        result = controller.process(attributes, Bing2WebDocumentSource.class,
            LingoClusteringAlgorithm.class);
        duration = System.currentTimeMillis() - start;
        System.out.println(duration + " ms (documents and clusters from cache)");
Full source code: UsingCachingController.java
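
The effect of result caching can be pictured with a small plain-Java sketch: a cache keyed by the attribute map, so that a second request with equal attributes skips the expensive work. This is one common way to implement such a cache and is not a description of Carrot2's internals; ResultCacheSketch and the "query" key used here are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: caches a computed result per distinct attribute map. */
class ResultCacheSketch
{
    private final Map<Map<String, Object>, String> cache =
        new HashMap<Map<String, Object>, String>();
    private int computations = 0;

    /** Returns the cached result for equal attribute maps, computing it only once. */
    synchronized String process(Map<String, Object> attributes)
    {
        // Copy the map so later changes made by the caller do not alter the cache key.
        final Map<String, Object> key = new HashMap<String, Object>(attributes);
        String result = cache.get(key);
        if (result == null)
        {
            computations++;
            // Placeholder for the expensive step (fetching documents, clustering).
            result = "result for " + key.get("query");
            cache.put(key, result);
        }
        return result;
    }

    /** Number of times the expensive step actually ran. */
    synchronized int getComputations()
    {
        return computations;
    }

    public static void main(String [] args)
    {
        final ResultCacheSketch controller = new ResultCacheSketch();
        final Map<String, Object> attributes = new HashMap<String, Object>();
        attributes.put("query", "data mining");

        controller.process(attributes);
        controller.process(attributes);
        System.out.println("Expensive step ran " + controller.getComputations() + " time(s)");
    }
}
```

A cache like this grows without bound, so a long-running application would normally also limit its size or the lifetime of its entries.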

Clustering non-English content

This example shows how to cluster non-English content. By default Carrot2 assumes that the documents provided for clustering are written in English. When clustering content written in some different language, it is important to indicate the language to Carrot2, so that it can use the lexical resources (stop words, tokenizer, stemmer) appropriate for that language.

There are two ways to indicate the desired clustering language to Carrot2:

  1. By setting the language of each document in its Document.LANGUAGE field. The language does not have to be the same for all input documents; Carrot2 can handle multiple languages in one document set as well. Please see the MultilingualClustering.languageAggregationStrategy attribute for more details.
  2. By setting the fallback language. For documents with an undefined Document.LANGUAGE field, Carrot2 will assume a fallback language, which is English by default. You can change the fallback language by setting the MultilingualClustering.defaultLanguage attribute.
Additionally, a number of document sources automatically set the Document.LANGUAGE of documents they produce based on their specific language-related attributes. Currently, the following document sources support this scenario:
  1. Bing2WebDocumentSource through the Bing2DocumentSource.market attribute,
  2. EToolsDocumentSource through the EToolsDocumentSource.language attribute.
For the document sources that do not set the documents' language automatically, the easiest way to set the clustering language is through the MultilingualClustering.defaultLanguage attribute.
        /*
         * We use a Controller that reuses instances of Carrot2 processing components 
         * and caches results produced by document sources.
         */
        final Controller controller = ControllerFactory.createCachingPooling(IDocumentSource.class);

        /*
         * In the first call, we'll cluster a document list, setting the language for each
         * document separately.
         */
        final List<Document> documents = Lists.newArrayList();
        for (Document document : SampleDocumentData.DOCUMENTS_DATA_MINING)
        {
            documents.add(new Document(document.getTitle(), document.getSummary(),
                document.getContentUrl(), LanguageCode.ENGLISH));
        }

        final Map<String, Object> attributes = Maps.newHashMap();
        CommonAttributesDescriptor.attributeBuilder(attributes)
            .documents(documents);
        final ProcessingResult englishResult = controller.process(
            attributes, LingoClusteringAlgorithm.class);
        ConsoleFormatter.displayResults(englishResult);

        /*
         * In the second call, we will fetch results for a Chinese query from Bing,
         * explicitly setting Bing's specific language attribute. Based on that
         * attribute, the document source will set the appropriate language for each
         * document.
         */
        attributes.clear();
        
        CommonAttributesDescriptor.attributeBuilder(attributes)
            .query("聚类" /* clustering? */)
            .results(100);

        Bing2WebDocumentSourceDescriptor.attributeBuilder(attributes)
            .market(MarketOption.CHINESE_CHINA);
        Bing2WebDocumentSourceDescriptor
            .attributeBuilder(attributes)
                .appid(Bing2WebDocumentSource.CARROTSEARCH_APPID); // use your own ID here

        final ProcessingResult chineseResult = controller.process(attributes,
            Bing2WebDocumentSource.class, LingoClusteringAlgorithm.class);
        ConsoleFormatter.displayResults(chineseResult);

        /*
         * In the third call, we will fetch results for the same Chinese query from
         * Google. As the Google document source does not have its own attribute for
         * setting the language, it will not set the documents' language for us. To make
         * sure the right lexical resources are used, we will need to set the
         * MultilingualClustering.defaultLanguage attribute to Chinese on our own.
         */
        attributes.clear();
        
        CommonAttributesDescriptor.attributeBuilder(attributes)
            .query("聚类" /* clustering? */)
            .results(100);

        MultilingualClusteringDescriptor.attributeBuilder(attributes)
            .defaultLanguage(LanguageCode.CHINESE_SIMPLIFIED);
        GoogleDocumentSourceDescriptor
            .attributeBuilder(attributes)
                .apiKey(GoogleDocumentSource.CARROTSEARCH_API_KEY); // use your own key here

        final ProcessingResult chineseResult2 = controller.process(attributes,
            GoogleDocumentSource.class, LingoClusteringAlgorithm.class);
        ConsoleFormatter.displayResults(chineseResult2);
Full source code: ClusteringNonEnglishContent.java
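
The fallback rule described in this section can be summarized in a few lines of plain Java: a document's own language wins, and the fallback applies only when the language is undefined. The sketch below uses strings in place of LanguageCode values, and LanguageFallbackSketch is a hypothetical name used only here.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: picks the language to use for one document. */
class LanguageFallbackSketch
{
    /** Returns the document's own language, or the fallback when it is undefined (null). */
    static String effectiveLanguage(String documentLanguage, String fallbackLanguage)
    {
        return documentLanguage != null ? documentLanguage : fallbackLanguage;
    }

    public static void main(String [] args)
    {
        // The second document has no language set, so the fallback applies to it.
        final List<String> documentLanguages = Arrays.asList("ENGLISH", null, "CHINESE_SIMPLIFIED");
        for (String language : documentLanguages)
        {
            System.out.println(effectiveLanguage(language, "CHINESE_SIMPLIFIED"));
        }
    }
}
```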



Copyright (c) Dawid Weiss, Stanislaw Osinski