Carrot2 v3.4.0 API Documentation
| Carrot2 Core | |
|---|---|
| org.carrot2.core | Definitions of Carrot2 core interfaces and their implementations. |
| org.carrot2.core.attribute | Attribute annotations for Carrot2 core interfaces. |
| Carrot2 Data Sources | |
|---|---|
| org.carrot2.source | Base classes for implementing Carrot2 document sources. |
| org.carrot2.source.ambient | Serves documents from the Ambient test set. |
| org.carrot2.source.boss | Fetches documents from the Yahoo BOSS API. |
| org.carrot2.source.etools | Fetches documents from the eTools Metasearch Engine. |
| org.carrot2.source.google | Fetches documents from a local instance of Google Desktop. |
| org.carrot2.source.lucene | Fetches documents from a local Lucene index. |
| org.carrot2.source.microsoft | Fetches documents from the Bing search engine using its publicly available API. |
| org.carrot2.source.opensearch | Fetches documents from an OpenSearch-compliant search feed. |
| org.carrot2.source.pubmed | Fetches documents from the PubMed medical abstracts database. |
| org.carrot2.source.solr | Fetches documents from the Solr search engine. |
| org.carrot2.source.xml | Fetches documents from XML files and streams. |
| org.carrot2.source.yahoo | Fetches documents from the Yahoo Search APIs. |
| Carrot2 Clustering Algorithms | |
|---|---|
| org.carrot2.clustering.lingo | Implementation of the Lingo clustering algorithm. |
| org.carrot2.clustering.stc | Implementation of the STC clustering algorithm. |
| org.carrot2.clustering.synthetic | Synthetic clustering algorithms. |
| Carrot2 Results post-processing | |
|---|---|
| org.carrot2.output.metrics | Cluster quality metrics calculation utilities. |
| Carrot2 Text preprocessing utilities | |
|---|---|
| org.carrot2.text.analysis | Lexical analysis utilities. |
| org.carrot2.text.clustering | Multilingual clustering utilities. |
| org.carrot2.text.linguistic | Shallow linguistic processing utilities. |
| org.carrot2.text.preprocessing | Contains the unified input preprocessing infrastructure (term indexing, stemming, label discovery). |
| org.carrot2.text.preprocessing.filter | Text feature filtering utilities. |
| org.carrot2.text.preprocessing.pipeline | Predefined preprocessing pipeline utilities. |
| org.carrot2.text.suffixtree | Implementation of the suffix tree data structure. |
| org.carrot2.text.util | Data structures for text preprocessing. |
| org.carrot2.text.vsm | Vector Space Model utilities. |
| Carrot2 Attribute Binding | |
|---|---|
| org.carrot2.util.attribute | A framework for managing Carrot2 component attributes. |
| org.carrot2.util.attribute.constraint | Constraints that can be imposed on attributes provided for Carrot2 components. |
| org.carrot2.util.attribute.metadata | Human-readable information about components and their attributes. |
| Carrot2 Matrix utilities | |
|---|---|
| org.carrot2.matrix | NNI-backed implementation of Colt matrices. |
| org.carrot2.matrix.factorization | Matrix factorization implementations. |
| org.carrot2.matrix.factorization.seeding | Matrix seeding strategies. |
| org.carrot2.matrix.nni | Native interfaces for matrix operations. |
| Carrot2 Utility classes | |
|---|---|
| org.carrot2.util | Common utility classes. |
| org.carrot2.util.httpclient | Apache Commons HTTP client utilities. |
| org.carrot2.util.pool | A very simple unbounded pool implementation. |
| org.carrot2.util.resource | Resource location abstraction layer. |
| org.carrot2.util.simplexml | Utilities for working with the Simple XML framework. |
| org.carrot2.util.xslt | XSLT handling utilities. |
| org.carrot2.util.xsltfilter | XSLT processor servlet filter. |
Carrot2 is an Open Source Search Results Clustering Engine that can automatically organize small collections of documents, such as search results, into thematic categories. See below for more information.
Java API JAR, JavaDocs and example code
Other Carrot2 applications
User and Developer Manual
Instructions for Maven2 users
Carrot2 project website
Carrot2 on-line demo
You can use the Carrot2 Java API to fetch documents from various sources (public search engines, Lucene, Solr), perform clustering, serialize the results to JSON or XML, and much more. Below is some example code for the most common use cases. Please see the examples/ directory in the Java API distribution archive for more examples.
The most common way to use the Carrot2 Java API is to fetch a number of
documents from some IDocumentSource and cluster them
using some IClusteringAlgorithm. The general pattern
for this kind of invocation is to put all input data required for processing
(the query and the required number of results in this case) into a map and pass that
map to a Controller that will perform all the processing.
The code shown below retrieves 100 search results from BingDocumentSource
and clusters them using the LingoClusteringAlgorithm.
/* A controller to manage the processing pipeline. */
Controller controller = ControllerFactory.createSimple();
/* Input data for clustering, the query and number of results in this case. */
Map<String, Object> attributes = new HashMap<String, Object>();
attributes.put(AttributeNames.QUERY, "data mining");
attributes.put(AttributeNames.RESULTS, 100);
/* Perform processing */
ProcessingResult result = controller.process(attributes,
    BingDocumentSource.class, LingoClusteringAlgorithm.class);
/* Documents fetched from the document source, clusters created by Carrot2. */
List<Document> documents = result.getDocuments();
List<Cluster> clusters = result.getClusters();
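The "attributes in a map" calling convention described above can be illustrated with a small stand-alone sketch. It is a toy model only and does not use Carrot2 classes: the key strings, the `describe` helper and the default of 100 results are assumptions made for illustration, standing in for constants such as AttributeNames.QUERY and for whatever a real component does with the values.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy illustration of passing loosely typed inputs in a map; not Carrot2 code. */
final class AttributeMapSketch {
    // Hypothetical keys, standing in for constants such as AttributeNames.QUERY.
    static final String QUERY = "query";
    static final String RESULTS = "results";

    /** Reads typed values back from the map, using an assumed default when a key is absent. */
    static String describe(Map<String, Object> attributes) {
        String query = (String) attributes.get(QUERY);
        Object requested = attributes.get(RESULTS);
        // The default of 100 is an assumption made for this sketch.
        int results = (requested instanceof Integer) ? (Integer) requested : 100;
        return "query=" + query + ", results=" + results;
    }

    public static void main(String[] args) {
        Map<String, Object> attributes = new HashMap<String, Object>();
        attributes.put(QUERY, "data mining");
        attributes.put(RESULTS, 50);
        System.out.println(describe(attributes)); // prints "query=data mining, results=50"
    }
}
```

Because the map values are plain Objects, the receiving side has to check or cast types itself, which is why the sketch tests for Integer before using the requested result count.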
You can also directly pass a list of Document instances for clustering:
/* A few example documents, normally you would need at least 20 for reasonable clusters. */
final String [][] data = new String [] []
{
    {
        "http://en.wikipedia.org/wiki/Data_mining",
        "Data mining - Wikipedia, the free encyclopedia",
        "Article about knowledge-discovery in databases (KDD), the practice of automatically searching large stores of data for patterns."
    },
    {
        "http://www.ccsu.edu/datamining/resources.html",
        "CCSU - Data Mining",
        "A collection of Data Mining links edited by the Central Connecticut State University ... Graduate Certificate Program. Data Mining Resources. Resources. Groups ..."
    },
    {
        "http://www.kdnuggets.com/",
        "KDnuggets: Data Mining, Web Mining, and Knowledge Discovery",
        "Newsletter on the data mining and knowledge industries, offering information on data mining, knowledge discovery, text mining, and web mining software, courses, jobs, publications, and meetings."
    },
    {
        "http://en.wikipedia.org/wiki/Data-mining",
        "Data mining - Wikipedia, the free encyclopedia",
        "Data mining is considered a subfield within the Computer Science field of knowledge discovery. ... claim to perform \"data mining\" by automating the creation ..."
    },
    {
        "http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm",
        "Data Mining: What is Data Mining?",
        "Outlines what knowledge discovery, the process of analyzing data from different perspectives and summarizing it into useful information, can do and how it works."
    },
};
/* Each row holds the URL, title and summary; Document takes the title, summary and URL. */
ArrayList<Document> documents = new ArrayList<Document>();
for (String [] row : data)
{
    documents.add(new Document(row[1], row[2], row[0]));
}
/* A controller to manage the processing pipeline. */
Controller controller = ControllerFactory.createSimple();
/* Input data for clustering, list of Documents in this case. */
Map<String, Object> attributes = new HashMap<String, Object>();
attributes.put(AttributeNames.DOCUMENTS, documents);
/* Perform clustering */
ProcessingResult result = controller.process(attributes,
    LingoClusteringAlgorithm.class);
/* Clusters created by Carrot2. */
List<Cluster> clusters = result.getClusters();
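Once a list of clusters is available, a typical next step is to walk it and print a label for each cluster. The sketch below shows that traversal on a hypothetical stand-in type, ToyCluster, which is not a Carrot2 class. It assumes that a cluster carries a label, some member documents and optional subclusters; check the Cluster class documentation for the actual accessors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy sketch of walking a cluster hierarchy; ToyCluster is a hypothetical stand-in. */
final class ClusterTreeSketch {
    static final class ToyCluster {
        final String label;
        final List<String> documentTitles;
        final List<ToyCluster> subclusters;

        ToyCluster(String label, List<String> documentTitles, List<ToyCluster> subclusters) {
            this.label = label;
            this.documentTitles = documentTitles;
            this.subclusters = subclusters;
        }
    }

    /** Returns one line per cluster: indentation by depth, the label, and its own document count. */
    static List<String> render(List<ToyCluster> clusters) {
        List<String> lines = new ArrayList<String>();
        collect(clusters, 0, lines);
        return lines;
    }

    private static void collect(List<ToyCluster> clusters, int depth, List<String> lines) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        for (ToyCluster cluster : clusters) {
            lines.add(indent + cluster.label + " (" + cluster.documentTitles.size() + ")");
            // Recurse so that nested clusters appear directly under their parent.
            collect(cluster.subclusters, depth + 1, lines);
        }
    }

    public static void main(String[] args) {
        ToyCluster child = new ToyCluster("Web Mining", Arrays.asList("KDnuggets"),
            Collections.<ToyCluster>emptyList());
        ToyCluster parent = new ToyCluster("Knowledge Discovery",
            Arrays.asList("Data mining - Wikipedia", "KDnuggets"), Arrays.asList(child));
        for (String line : render(Arrays.asList(parent))) {
            System.out.println(line);
        }
    }
}
```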
The examples above used a simple controller to manage the clustering process. While the simple controller is enough for one-shot requests, for long-running applications, such as web applications, it's better to use a controller which supports pooling of processing component instances and caching of processing results.
/*
 * Create the caching controller. You need only one caching controller instance
 * per application life cycle. This controller instance will cache the results
 * fetched from any document source and also clusters generated by the Lingo
 * algorithm.
 */
Controller controller = ControllerFactory.createCachingPooling(
    IDocumentSource.class, LingoClusteringAlgorithm.class);
/*
 * Before using the caching controller, you must initialize it. On initialization,
 * you can set default values for some attributes. In this example, we'll set
 * the default results number to 50.
 */
Map<String, Object> globalAttributes = new HashMap<String, Object>();
globalAttributes.put(AttributeNames.RESULTS, 50);
controller.init(globalAttributes);
/*
 * The controller is now ready to perform queries. To show that the documents from
 * the document source are cached, we will perform the same query twice and measure
 * the time for each query.
 */
Map<String, Object> attributes;
ProcessingResult result;
long start, duration;
start = System.currentTimeMillis();
attributes = new HashMap<String, Object>();
attributes.put(AttributeNames.QUERY, "data mining");
result = controller.process(attributes,
    BingDocumentSource.class, LingoClusteringAlgorithm.class);
duration = System.currentTimeMillis() - start;
System.out.println(duration + " ms (empty cache)");
start = System.currentTimeMillis();
attributes = new HashMap<String, Object>();
attributes.put(AttributeNames.QUERY, "data mining");
result = controller.process(attributes,
    BingDocumentSource.class, LingoClusteringAlgorithm.class);
duration = System.currentTimeMillis() - start;
System.out.println(duration + " ms (documents and clusters from cache)");
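To see why the second, identical request can be answered faster, it may help to look at result caching in isolation. The following stand-alone sketch memoizes a result under the attribute map that produced it. It is a toy model under stated assumptions, not Carrot2's caching implementation: the class name, the "query" key and the string result are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy sketch of caching results by their input attributes; not Carrot2's implementation. */
final class ResultCacheSketch {
    private final Map<Map<String, Object>, String> cache =
        new HashMap<Map<String, Object>, String>();
    private int computations = 0;

    /** Returns a cached result for equal attribute maps, computing it only on a miss. */
    String process(Map<String, Object> attributes) {
        String cached = cache.get(attributes);
        if (cached != null) {
            return cached;
        }
        computations++;
        // Stand-in for the expensive part (fetching documents and clustering them).
        String result = "clusters for '" + attributes.get("query") + "'";
        // Store a copy of the key so later changes to the caller's map cannot corrupt the cache.
        cache.put(new HashMap<String, Object>(attributes), result);
        return result;
    }

    int getComputations() {
        return computations;
    }

    public static void main(String[] args) {
        ResultCacheSketch sketch = new ResultCacheSketch();
        Map<String, Object> first = new HashMap<String, Object>();
        first.put("query", "data mining");
        Map<String, Object> second = new HashMap<String, Object>();
        second.put("query", "data mining");
        sketch.process(first);
        sketch.process(second);
        System.out.println(sketch.getComputations()); // prints 1: the second call hit the cache
    }
}
```

The sketch relies on the fact that two HashMap instances with equal entries are equal and share a hash code, so a freshly built attribute map with the same contents finds the earlier entry.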
This example shows how to cluster non-English content. By default, Carrot2 assumes that the documents provided for clustering are written in English. When clustering content written in a different language, it is important to indicate the language to Carrot2, so that it can use the lexical resources (stop words, tokenizer, stemmer) appropriate for that language.
There are two ways to indicate the desired clustering language to Carrot2:
1. By setting the language of each document in its Document.LANGUAGE field. The language does not have to be the same for all input documents; Carrot2 can also handle multiple languages in one document set. Please see the MultilingualClustering.languageAggregationStrategy attribute for more details.
2. By setting the fallback language. For documents with an undefined Document.LANGUAGE field, Carrot2 will assume a fallback language, which is English by default. You can change the fallback language by setting the MultilingualClustering.defaultLanguage attribute.

Some document sources can automatically set the Document.LANGUAGE of the documents they produce based on their specific language-related attributes. Currently, three document sources support this scenario:
- BingDocumentSource through the BingDocumentSource.market attribute
- BossDocumentSource through the BossSearchService.languageAndRegion attribute
- EToolsDocumentSource through the EToolsDocumentSource.language attribute
For the document sources that do not set the documents' language automatically, the
easiest way to set the clustering language is through the
MultilingualClustering.defaultLanguage attribute.
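The fallback rule described above amounts to "use the document's own language if it has one, otherwise the default". A minimal stand-alone sketch of that rule follows. The Language enum and the effectiveLanguage helper are hypothetical names introduced for illustration and are not part of the Carrot2 API.

```java
/** Toy sketch of the language fallback rule; names here are hypothetical, not Carrot2 API. */
final class LanguageFallbackSketch {
    enum Language { ENGLISH, GERMAN, CHINESE_SIMPLIFIED }

    /** Returns the document's own language when set, otherwise the fallback language. */
    static Language effectiveLanguage(Language documentLanguage, Language fallback) {
        return documentLanguage != null ? documentLanguage : fallback;
    }

    public static void main(String[] args) {
        // A document that declares its language keeps it.
        System.out.println(effectiveLanguage(Language.GERMAN, Language.ENGLISH));
        // A document without a language gets the configured fallback.
        System.out.println(effectiveLanguage(null, Language.CHINESE_SIMPLIFIED));
    }
}
```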
The following example demonstrates both approaches:
/*
 * We use a caching, pooling controller to reuse instances of Carrot2 processing
 * components.
 */
Controller controller = ControllerFactory.createCachingPooling(IDocumentSource.class);
/*
 * No special initialization-time attributes in this example.
 */
final Map<String, Object> initAttributes = new HashMap<String, Object>();
controller.init(initAttributes);
/*
 * In the first call, we'll cluster a document list, setting the language for each
 * document separately.
 */
final List<Document> documents = Lists.newArrayList();
for (Document document : SampleDocumentData.DOCUMENTS_DATA_MINING)
{
    documents.add(new Document(document.getTitle(), document.getSummary(),
        document.getContentUrl(), LanguageCode.ENGLISH));
}
final Map<String, Object> attributes = new HashMap<String, Object>();
attributes.put(AttributeNames.DOCUMENTS, documents);
final ProcessingResult englishResult = controller.process(attributes,
    LingoClusteringAlgorithm.class);
ConsoleFormatter.displayResults(englishResult);
/*
 * In the second call, we will fetch results for a Chinese query from Bing,
 * explicitly setting Bing's specific language attribute. Based on that
 * attribute, the document source will set the appropriate language for each
 * document.
 */
attributes.clear();
attributes.put(AttributeNames.QUERY, "聚类"); // clustering?
attributes.put("BingDocumentSource.market", MarketOption.CHINESE_CHINA);
attributes.put(AttributeNames.RESULTS, 100);
final ProcessingResult chineseResult = controller.process(attributes,
    BingDocumentSource.class, LingoClusteringAlgorithm.class);
ConsoleFormatter.displayResults(chineseResult);
/*
 * In the third call, we will fetch results for the same Chinese query from
 * Google. As the Google document source does not have its own attribute for
 * setting the language, it will not set the documents' language for us. To make
 * sure the right lexical resources are used, we will need to set the
 * MultilingualClustering.defaultLanguage attribute to Chinese on our own.
 */
attributes.clear();
attributes.put(AttributeNames.QUERY, "聚类"); // clustering?
attributes.put("MultilingualClustering.defaultLanguage",
    LanguageCode.CHINESE_SIMPLIFIED);
attributes.put(AttributeNames.RESULTS, 100);
final ProcessingResult chineseResult2 = controller.process(attributes,
    GoogleDocumentSource.class, LingoClusteringAlgorithm.class);
ConsoleFormatter.displayResults(chineseResult2);
Please refer to project documentation at http://project.carrot2.org