Copyright © 2002-2010 Stanisław Osiński, Dawid Weiss
Abstract
This document serves as documentation for the Carrot2 framework. It describes the Carrot2 application suite and the API developers can use to integrate Carrot2 clustering algorithms into their code. It also provides a reference of all Carrot2 components and their attributes.
Carrot2 Online Demo: http://search.carrot2.org
Carrot2 website: http://project.carrot2.org
Carrot2 is a library and a set of supporting applications you can use to build a search results clustering engine. Such an engine will organize your search results into topics, fully automatically and without external knowledge such as taxonomies or preclassified content.
Carrot2 contains two document clustering algorithms designed specifically for search results clustering: Suffix Tree Clustering and Lingo. Carrot2 also contains components for fetching search results from several search engines, such as Yahoo!, MSN Live and Google, and supports other sources of documents, such as Lucene, Solr or a Google Desktop index.
Carrot2 is not a search engine itself: it has neither a crawler nor an indexer. There are a number of Open Source projects you can use to crawl (Nutch), index and search (Lucene, Solr) your content, which can then be queried and clustered by Carrot2.
In most cases your workflow with Carrot2 applications would be the following:
Use Carrot2 Document Clustering Workbench and possibly other applications from Carrot2 application suite to see what the clustering results are like for your content. If the results are promising, you can use the Carrot2 Document Clustering Workbench to further tune the clustering algorithm's settings.
If you are developing Java software, use the Carrot2 API and JAR to integrate clustering into your code. For non-Java environments, set up the Carrot2 Document Clustering Server and call Carrot2 clustering using the REST protocol.
Chapter 2 answers the questions most frequently asked on Carrot2 mailing lists; it can also serve as a question-based index to the rest of this manual. Chapter 3 introduces applications available in the Carrot2 distribution and Chapter 4 shows how to quickly set up Carrot2 to cluster your own data. Chapter 5 discusses topics related to tuning Carrot2 clustering, while Chapter 6 shows how to customize Carrot2 applications. Chapter 7 covers some more advanced use cases of Carrot2 and Chapter 8 provides solutions to common problems. Finally, Chapter 9 discusses Carrot2 architecture and internals, while Chapter 11 is an in-depth reference of Carrot2 components.
This chapter answers the questions most frequently asked on Carrot2 mailing lists. As it links extensively to further sections of the manual, it can also be treated as a sort of question-based index for this manual.
Can I use Carrot2 in a commercial project?
Yes. The only requirement is that you properly acknowledge the use of Carrot2 (on your project's website and in its documentation) and let us know about your project. Please also remember to read the license.
How can I acknowledge the use of Carrot2 on my site?
Please put a statement equivalent to “This product includes software developed by the Carrot2 Project” on your site and link it to Carrot2's website (http://www.carrot2.org). Additionally, you can use some of our powered-by logos if you like.
Can Carrot2 crawl my website?
No. Carrot2 can add clustering of search results to an existing search engine. You can use an Open Source project called Nutch to crawl your website. Nutch has a Carrot2-based search clustering plugin, so you'll get crawling, searching and clustering in one piece.
Can I use Carrot2 to cluster something other than search results?
Absolutely. Carrot2 came about as a framework for building search results clustering engines, but its algorithms should successfully cluster up to about a thousand text documents, a few paragraphs each.
How does Carrot2 clustering scale with respect to the number and length of documents?
The most important characteristic of Carrot2 algorithms to keep in mind is that they perform in-memory clustering. For this reason, as a rule of thumb, depending on the algorithm, Carrot2 should successfully deal with up to a few thousand documents, a few paragraphs each. For algorithms designed to process millions of documents, you may want to check out the Mahout project.
Can I force Carrot2 to cluster my documents to some predefined clusters / labels?
No. Assigning documents to a set of predefined categories is a problem called text classification / categorization and Carrot2 was not designed to solve it. For text classification components you may want to see the LingPipe project.
Can Carrot2 cluster content in languages other than English?
Yes. Currently, Carrot2 can cluster content in 17 languages. Please note, however, that for some of the languages you may need to tune the stop words to achieve the best results.
What is the query syntax in Carrot2?
As Carrot2 is not a search engine on its own, there is no common query syntax in Carrot2. The syntax depends on the underlying search engine you set Carrot2 to use, e.g. Yahoo!, Solr, Lucene or any other. Carrot2 passes your query without any modifications to the search engine and clusters the results it returns. For this reason, any syntax supported by the search engine is automatically supported in Carrot2.
Which Carrot2 clustering algorithm is the best?
There is no one clear answer to this question. The choice of the algorithm depends on the input data and the desired characteristics of clusters. Please see Section 5.2 for some guidelines.
Does Carrot2 support boolean querying?
If the underlying search engine supports boolean queries, so will Carrot2. Please see the question about query syntax above for more details.
What is the most suitable content for clustering in Carrot2?
Please see Section 5.1 for the answer.
How can I remove meaningless cluster labels?
Occasionally, Carrot2 may create meaningless cluster labels like read or site. Please see Section 5.5 for information on how to remove them.
How can I improve the performance of Carrot2?
Please see Section 5.7 for some clustering performance tips.
Carrot2 comes with a number of supporting applications that you can use to quickly set up clustering on your own data, further tune clustering results and expose Carrot2 clustering as a remote service.
Carrot2 application suite contains:
Carrot2 Document Clustering Workbench which is a standalone GUI application you can use to experiment with Carrot2 clustering on data from common search engines or your own data,
Carrot2 Document Clustering Server which exposes Carrot2 clustering as a REST service,
Carrot2 Command Line Interface applications which allow invoking Carrot2 clustering from command line,
Carrot2 Web Application which exposes Carrot2 clustering as a web application for end users.
Carrot2 Document Clustering Workbench is a standalone GUI application you can use to experiment with Carrot2 clustering on data from common search engines or your own data.
You can use Carrot2 Document Clustering Workbench to:
Quickly test Carrot2 clustering with your own data. Please see Chapter 4 for instructions for the most common scenarios.
Fine tune Carrot2 clustering algorithms' settings to work best with your specific data. Please see Chapter 5 for more details.
Run simple performance benchmarks using different settings to predict maximum clustering throughput on a single machine. Please see Section 5.8 for details.
Carrot2 Document Clustering Workbench features include:
Various document sources included. Carrot2 Document Clustering Workbench can fetch and cluster documents from a number of sources, including major search engines, indexing engines (Lucene, Solr, Google Desktop) as well as generic XML feeds and files.
Live tuning of clustering algorithm attributes. Carrot2 Document Clustering Workbench enables modifying clustering algorithm's attributes and observing the results in real time.
Performance benchmarking. Carrot2 Document Clustering Workbench can run simple performance benchmarks of Carrot2 clustering algorithms.
Attractive visualizations. Carrot2 Document Clustering Workbench comes with two visualizations of the cluster structure, one developed within the Carrot2 project and another one from Aduna Software.
Modular architecture and extendability. Carrot2 Document Clustering Workbench is based on Eclipse Rich Client Platform, which makes it easily extendable.
To run Carrot2 Document Clustering Workbench:
Download and install Java Runtime Environment (version 1.5.0 or newer) if you have not done so.
Download Carrot2 Document Clustering Workbench Windows binaries or Linux binaries and extract the archive to some local disk location.
Run carrot2-workbench.exe (Windows) or carrot2-workbench (Linux).
Carrot2 Document Clustering Server (DCS) exposes Carrot2 clustering as a REST service. It can cluster documents from an external source (e.g. a search engine) or documents provided directly as an XML stream and returns results in XML or JSON formats.
You can use Carrot2 Document Clustering Server to:
Integrate Carrot2 with your non-Java software.
Build a high-throughput document clustering system by setting up a number of load-balanced instances of the DCS.
Carrot2 Document Clustering Server features include:
XML and JSON response formats. Carrot2 Document Clustering Server can return results both in XML and JSON formats. JSON-P (with callback) is also supported.
Various document sources included. Carrot2 Document Clustering Server can fetch and cluster documents from a large number of sources, including major search engines and indexing engines (Lucene, Solr).
Direct XML feed. Carrot2 Document Clustering Server can cluster documents fed directly in a simple XML format.
PHP and C# examples included. Carrot2 Document Clustering Server ships with ready-to-use examples of calling Carrot2 DCS services from PHP (version 5), C#, Ruby, Java and curl.
Quick start screen. A simple quick start screen will let you make your first DCS request straight from your browser.
To run Carrot2 Document Clustering Server:
Download and install Java Runtime Environment (version 1.5.0 or newer) if you have not done so.
Download Carrot2 Document Clustering Server binaries and extract the archive to some local disk location.
Run dcs.cmd (Windows) or dcs.sh (Linux).
Point your browser to http://localhost:8080 for further instructions.
See the examples/ directory in the distribution archive for PHP, C#, Ruby and Java code examples. You can also invoke DCS clustering using the curl command.
If you need to start the DCS on a port other than 8080, you can use the -port option:
dcs -port 9090
To deploy the DCS in an external servlet container, such as Apache Tomcat, use the carrot2-dcs.war file from the war/ folder of the DCS distribution.
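For non-Java callers, a DCS request boils down to an HTTP call with a handful of form parameters. The following is a minimal sketch in Python rather than an official client: the endpoint path `/dcs/rest` and the parameter names `query`, `results`, `dcs.source` and `dcs.output.format` are assumptions that should be checked against the DCS quick start screen, and `example-source` is a hypothetical source identifier.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_dcs_request(query, source, results=100, output_format="JSON",
                      base_url="http://localhost:8080/dcs/rest"):
    """Build (but do not send) a POST request for a locally running DCS.

    The endpoint path and every parameter name below are assumptions;
    verify them on the DCS quick start screen before relying on them.
    """
    params = {
        "query": query,                      # assumed parameter name
        "dcs.source": source,                # assumed: document source identifier
        "results": str(results),             # assumed: number of results to fetch
        "dcs.output.format": output_format,  # assumed: XML or JSON
    }
    body = urlencode(params).encode("utf-8")
    return Request(base_url, data=body, method="POST")


def cluster(query, source, **options):
    """Send the request to a running DCS and return the raw response text."""
    with urlopen(build_dcs_request(query, source, **options)) as response:
        return response.read().decode("utf-8")
```

Keeping request construction separate from sending makes it possible to inspect the URL and form body without a running server; `cluster("data mining", "example-source")` would perform the actual call.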
Carrot2 Web Application exposes Carrot2 clustering as a web application for end users. It allows users to browse clusters using a conventional tree view, but also in an attractive visualization.
Carrot2 Web Application features include:
Two cluster views. Carrot2 Web Application offers two views of the clusters generated by Carrot2: conventional tree view and a Flash-based visualization.
All Carrot2 document sources and algorithms included. Carrot2 Web Application contains a large number of document sources, including major search engines. Optionally, further document sources can be added, such as Lucene or Solr ones. It also contains all Carrot2's clustering algorithms.
XSLT and JavaScript-based presentation layer. Look & feel of the Carrot2 Web Application can be easily changed by editing a number of XSLT style sheets. All common style sheets and JavaScripts can be re-used when implementing a new look & feel.
High-performance front-end. The front-end of the Carrot2 Web Application has been optimized for fast loading by using such techniques as JavaScript and CSS merging and minification, as well as using CSS sprites.
To run Carrot2 Web Application:
Make sure you have access to a Servlet API 2.4 compliant container, such as Apache Tomcat.
Download Carrot2 Web Application WAR file.
Deploy the WAR file to your servlet container.
Carrot2 Command Line Interface (CLI) is a set of applications that allow invoking Carrot2 clustering from the command line. Currently, the only available CLI application is Carrot2 Batch Processor, which performs Carrot2 clustering on one or more files in the Carrot2 XML format and saves the results as XML or JSON. Apart from clustering a large number of document sets at one time, you can use the Carrot2 Batch Processor to integrate Carrot2 with your non-Java applications.
To run Carrot2 Batch Processor:
Download and install Java Runtime Environment (version 1.5.0 or newer) if you have not done so.
Download Carrot2 Command Line Interface binaries and extract the archive to some local disk location.
Run batch.cmd (Windows) or batch.sh (Linux) for an overview of the syntax. The Carrot2 Batch Processor ships with two example input data sets located in the input/ directory.
Below is a list of some common example invocations.
To cluster one or more input files, specify their paths:
batch input/data-mining.xml input/seattle.xml
Clustering will be performed using the default clustering algorithm and the results in the XML format will be saved to the output directory relative to the current working directory.
You can also cluster files from one or more directories:
batch input/
Each directory will be processed recursively, i.e. including subdirectories. For each specified input directory, a corresponding directory with results will be created in the output directory.
To save results in the non-default directory, use the -o option:
batch input/ -o results
To repeat the input documents on the output, use the -d option:
batch input/ -d
To save the results in JSON, use the -f JSON option:
batch input/ -f JSON
To use a different clustering algorithm, use the -a option followed by the identifier of the algorithm:
batch input/ -a url
To see the list of available algorithm identifiers, run the application without arguments.
In case of processing errors, you can use the -v option to see detailed messages and stack traces.
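When the Batch Processor is driven from another program, it can help to assemble the command line in one place. The sketch below, in Python, uses only the options listed above (-o, -d, -f, -a, -v). That these options can be freely combined in a single invocation and that the process reports failure through a non-zero exit status are assumptions, and the helper names are hypothetical.

```python
import subprocess


def build_batch_command(inputs, output_dir=None, output_format=None,
                        algorithm=None, repeat_documents=False,
                        verbose=False, executable="batch.sh"):
    """Assemble a Carrot2 Batch Processor command line as an argument list."""
    command = [executable, *inputs]
    if output_dir is not None:
        command += ["-o", output_dir]      # save results in a non-default directory
    if repeat_documents:
        command.append("-d")               # repeat input documents on the output
    if output_format is not None:
        command += ["-f", output_format]   # e.g. "JSON"
    if algorithm is not None:
        command += ["-a", algorithm]       # algorithm identifier
    if verbose:
        command.append("-v")               # detailed messages and stack traces
    return command


def run_batch(inputs, **options):
    """Run the Batch Processor; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_batch_command(inputs, **options), check=True)
```

On Windows, pass `executable="batch.cmd"`; on Linux the script may need to be referenced as `./batch.sh` when it is not on the PATH.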
This chapter will show you how to use Carrot2 in a number of typical scenarios such as trying clustering on your own documents or integrating Carrot2 with your software.
All Carrot2 applications require Java Runtime Environment version 1.5.0 or higher (1.6.0 recommended). The Carrot2 Document Clustering Workbench is distributed for Windows, Linux 32-bit and 64-bit versions and Mac OS x86. All other Carrot2 applications will run on any platform supporting Java Runtime Environment version 1.5.0 or higher.
This section shows how to apply Carrot2 clustering on documents from various sources.
To try Carrot2 clustering on results from common search engines, such as Google, Yahoo or MSN, you can either:
Use the Carrot2 Web Application, for example the online demo at http://search.carrot2.org
or
Use the Carrot2 Document Clustering Workbench, which can fetch and cluster documents from the same search engines as the Carrot2 Web Application
To try Carrot2 clustering on a collection of plain text, HTML or MS Word documents, you will need to install Google Desktop:
Download and install Google Desktop if you have not done so.
Configure Google Desktop to index your documents.
You can use TweakGDS to make Google Desktop index only the folder with your documents.
Use Carrot2 Document Clustering Workbench to cluster documents fetched from your Google Desktop installation. Simply choose Google Desktop source in the search view (Figure 4.1), type your query and press the Process button to see the results.
You can use the filetype: operator to restrict searching to specific file types only, e.g. filetype:doc for MS Word documents. You can also use the under: operator to restrict searches to a specific folder and its subfolders, e.g. under:"c:\test-documents". Please see Google Desktop search operators reference for other useful query modifiers.
Carrot2 Document Clustering Workbench can automatically determine the Google Desktop Query URL only when it is run on Windows with Administrator's privileges. For other setups, please refer to Google Desktop API Documentation for instructions about obtaining the Query URL. You can set the Query URL attribute in the optional attributes section, shown after clicking the button on the Search view toolbar.
Please also note that Query URLs are different for different users. Using a Query URL that does not belong to the currently logged-in user will result in errors.
To try Carrot2 clustering on documents or search results stored in a single XML file you can use the Carrot2 Document Clustering Workbench.
In the Search view of Carrot2 Document Clustering Workbench, choose XML source.
Set path to your XML file in the XML Resource field.
(Optional) If your file is not in Carrot2 format, create an XSLT style sheet that transforms your data into Carrot2 format; see Section 4.2.4 for an example. Provide a path to your style sheet in the XSLT Stylesheet field, which is an optional field. You can show optional fields by clicking the button on the Search view toolbar (Figure 4.2).
If you know the query that generated the documents in your XML file, you can provide it in the Query field, which may improve the clustering results. Press the Process button to see the results.
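Whether the optional XSLT step is needed depends on whether the file already follows the Carrot2 format. The following Python sketch offers a rough heuristic for that decision. It assumes, based only on the element names used in Figure 4.3, that a Carrot2-format file has a `searchresult` root with `document` children; it is not a validator, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def needs_xslt(xml_text):
    """Return True when the XML does not look like Carrot2 input format.

    Heuristic only: checks for a <searchresult> root element that
    contains at least one <document> child.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        raise ValueError(f"not well-formed XML: {error}") from error
    if root.tag != "searchresult":
        return True
    return root.find("document") is None
```

For example, an RSS feed (root element `rss`) would be reported as needing a transformation, while a file whose root is `searchresult` with `document` children would not.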
To try Carrot2 clustering on documents or search results fetched from a remote XML feed, you can use the Carrot2 Document Clustering Workbench. As an example, we will cluster a news feed from BBC:
In the Search view of Carrot2 Document Clustering Workbench, choose XML source.
Set the URL of your XML feed in the XML Resource field. Optionally, the URL can contain two special placeholders that will be replaced with the Query and Results number you set in the search view.
In our example, we will use the BBC News RSS feed.
Create an XSLT style sheet that will transform the XML feed into Carrot2 format. For the news feed we can use the stylesheet shown in Figure 4.3. To add more colour to our results, the XSLT transform extracts thumbnail URLs from the feed and passes them to Carrot2 in a special attribute. Attributes that are a sequence of values can be embedded as shown in Figure 4.4.
Provide a path to the transformation style sheet in the XSLT Stylesheet field, which is an optional field. You can show optional fields by clicking the button on the Search view toolbar (Figure 4.2).
Press the Process button to see the results.
Figure 4.3 News feed XML to Carrot2 format transformation
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:media="http://search.yahoo.com/mrss">
  <xsl:output indent="yes" omit-xml-declaration="no"
      media-type="application/xml" encoding="UTF-8" />
  <xsl:template match="/">
    <searchresult>
      <xsl:apply-templates select="/rss/channel/item" />
    </searchresult>
  </xsl:template>
  <xsl:template match="item">
    <document>
      <title><xsl:value-of select="title" /></title>
      <snippet>
        <xsl:value-of select="description" />
      </snippet>
      <url><xsl:value-of select="link" /></url>
      <xsl:if test="media:thumbnail">
        <field key="thumbnail-url">
          <value type="java.lang.String"
              value="{media:thumbnail/@url}"/>
        </field>
      </xsl:if>
    </document>
  </xsl:template>
</xsl:stylesheet>
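To make the mapping performed by the stylesheet easier to follow, here is a rough Python equivalent built on the standard library's ElementTree instead of XSLT. It is an illustration of the same element mapping, not a replacement for the stylesheet inside the Workbench. The sample feed is invented for the example, and the media namespace URI is copied from the stylesheet, so it must match whatever the real feed declares.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the stylesheet; it must match the feed's declaration.
MEDIA_NS = {"media": "http://search.yahoo.com/mrss"}

# Invented sample data, shaped like an RSS 2.0 feed with optional thumbnails.
SAMPLE_FEED = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss">
  <channel>
    <item>
      <title>First story</title>
      <description>Summary of the first story.</description>
      <link>http://example.com/1</link>
      <media:thumbnail url="http://example.com/1.jpg"/>
    </item>
    <item>
      <title>Second story</title>
      <description>Summary of the second story.</description>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def rss_to_carrot2(rss_text):
    """Mirror the stylesheet: one <document> per /rss/channel/item."""
    rss = ET.fromstring(rss_text)
    result = ET.Element("searchresult")
    for item in rss.findall("channel/item"):
        doc = ET.SubElement(result, "document")
        ET.SubElement(doc, "title").text = item.findtext("title", default="")
        ET.SubElement(doc, "snippet").text = item.findtext("description", default="")
        ET.SubElement(doc, "url").text = item.findtext("link", default="")
        # Like the xsl:if test, emit the thumbnail field only when present.
        thumbnail = item.find("media:thumbnail", MEDIA_NS)
        if thumbnail is not None:
            field = ET.SubElement(doc, "field", key="thumbnail-url")
            ET.SubElement(field, "value", type="java.lang.String",
                          value=thumbnail.get("url", ""))
    return result
```

Calling `ET.tostring(rss_to_carrot2(SAMPLE_FEED), encoding="unicode")` yields the transformed XML as a string.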
To try Carrot2 clustering on documents from a local Lucene index, you can use Carrot2 Document Clustering Workbench:
In the Search view of Carrot2 Document Clustering Workbench, choose Lucene source.
Click the button on the view's toolbar (Figure 4.5) to show optional attributes.
Choose the path to your Lucene index in the Index directory field.
Choose fields from your Lucene index in at least one of Document title field and Document content field combo boxes.
Type a query and press the Process button to see the results.
To try Carrot2 clustering on documents from an instance of Apache Solr, you can use Carrot2 Document Clustering Workbench:
In the Search view of Carrot2 Document Clustering Workbench, choose Solr source.
Click the button on the view's toolbar (Figure 4.6) to show optional attributes.
Provide the URL at which your Solr instance is available in the Service URL field.
Provide the fields that should be used as document title, content and URL (optional) in the Title field name, Summary field name and URL field name fields, respectively.
Type a query and press the Process button to see the results.
Carrot2 clustering can also be performed directly within Solr by means of Solr's Carrot2 Clustering Component. Please see Section 7.1 for more details.
To save documents and/or clusters produced by Carrot2 for further processing:
Use Carrot2 Document Clustering Workbench to perform clustering on documents from the source of your choice.
Use the File > Save as... dialog to save the documents and/or clusters into a file in the Carrot2 XML format.
Saving documents into XML can be particularly useful when there is a need to capture the output of some remote or non-public document source to a local file, which can be then passed on to someone else for further inspection. Documents saved into XML can be opened for clustering within Carrot2 Document Clustering Workbench using the XML document source.
The easiest way to integrate Carrot2 with your Java programs is to use the Carrot2 Java API package:
Download Carrot2 Java API and unpack it to some local directory.
Make sure that carrot2-core.jar and all JARs from the lib/ directory are available in the classpath of your program.
Look in the examples/ directory for some sample code. Good places to start are ClusteringDocumentList and ClusteringDataFromDocumentSources.
For a complete description of the Carrot2 Java API, please see the Javadoc documentation in the javadoc/ directory.
You can use the build.xml Ant script to compile and run code from the examples/ directory.
For easier experimenting with Carrot2 Java API, you may want to set up a Carrot2 project in Eclipse IDE.
To add Carrot2 as a dependency to an existing Maven2 project:
Add the following fragment to the dependencies section of your pom.xml:
<dependency>
  <groupId>org.carrot2</groupId>
  <artifactId>carrot2-core</artifactId>
  <version>3.0-rc1</version>
</dependency>
Optionally, to enable Polish language support, add the following fragment to the dependencies section of your pom.xml:
<dependency>
  <groupId>org.carrot2</groupId>
  <artifactId>morfologik</artifactId>
  <version>1.1.2</version>
</dependency>
Add the following fragment to the repositories section of your pom.xml:
<repository>
  <id>carrot2.org</id>
  <name>Carrot2 Maven2 repository</name>
  <url>http://download.carrot2.org/maven2/</url>
</repository>
Carrot2 provides Maven2 artifacts and an archetype project with examples of use. To create a template Carrot2 project, use the following command (line breaks for clarity):
mvn archetype:generate
-DarchetypeRepository=http://download.carrot2.org/maven2/
-DarchetypeGroupId=org.carrot2
-DarchetypeArtifactId=carrot2-example-archetype
-DarchetypeVersion=3.3.0-dev
-DgroupId=com.mycompany
-DartifactId=myproject
-DinteractiveMode=false
The value of -DarchetypeVersion (3.3.0-dev in the command above) is the Carrot2 release that will be used; please see our Maven2 repository for available version numbers.
After the example project has been created, you can use standard Maven2 goals, e.g. to generate Eclipse IDE project files:
mvn eclipse:eclipse
Carrot2 Java API examples can be easily set up in Eclipse IDE. The description below assumes you are using Eclipse IDE version 3.4 or newer.
Download Carrot2 Java API and unpack it to some local directory.
In your Eclipse IDE choose File > New > Java Project.
In the New Java Project dialog (Figure 4.7), type a name for the new project, e.g. carrot2-examples. Then choose the Create project from existing source option, provide the directory to which you unpacked the Carrot2 Java API archive and click Finish.
When Eclipse compiles the example classes, you can open one of them, e.g. ClusteringDocumentList, and choose Run > Run As > Java Application. The output of the example program should be visible in the Console view.
To set up Carrot2 source code, you will need Eclipse IDE version 3.5 or higher with the Plug-in Development Environment (PDE). The required plugins are available e.g. in the Eclipse for Plug-in Developers and Eclipse Classic distributions available at http://www.eclipse.org/downloads.
Check out Carrot2 source code, e.g. from the following Subversion URL:
https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk
In the Package Explorer view in Eclipse IDE, choose Import... (see Figure 4.8), select General > Existing Projects into Workspace and click Next.
In the Import projects dialog provide your local Carrot2 checkout directory in the Select root directory field. Uncheck the org.carrot2.antlib project (see Figure 4.9) and click Finish.
All Carrot2 source code should compile without errors. If it does not:
Make sure your Eclipse's Java compiler compliance level is set to 1.5 or higher (Preferences > Java > Compiler).
Make sure your Eclipse's workspace encoding is set to UTF-8 (Preferences > General > Workspace > Text file encoding).
To integrate Carrot2 with your non-Java system, you can use the Carrot2 Document Clustering Server, which exposes Carrot2 clustering as a REST/XML service. Please see Section 3.2.1 for installation instructions and the examples/ directory in the distribution archive for example code in PHP, C# and Ruby.
This chapter discusses a number of typical fine-tuning scenarios for Carrot2 clustering algorithms. Some of the scenarios are relevant to all Carrot2 algorithms, while others are specific to individual algorithms.
The quality of clusters and their labels largely depends on the characteristics of documents provided on the input. Although there is no general rule for optimum document content, below are some tips worth considering.
Carrot2 is designed for small to medium collections of documents. The most important characteristic of Carrot2 algorithms to keep in mind is that they perform in-memory clustering. For this reason, as a rule of thumb, Carrot2 should successfully deal with up to a thousand documents, a few paragraphs each. For algorithms designed to process millions of documents, you may want to check out the Mahout project.
Provide a minimum of 20 documents. Carrot2 clustering algorithms will work best with a set of documents similar to what is normally returned by a typical search engine. While about 20 is the minimum number of documents you can reasonably cluster, the optimum would fall in the 100 – 500 range.
Provide contextual snippets if possible. If the input documents are a result of some search query, provide contextual snippets related to that query, similar to what web search engines return, instead of full document content. Not only will this speed up processing, but also should help the clustering algorithm to cover the full spectrum of topics dealt with in the search results.
Minimize "noise" in the input documents. All kinds of "noise" in the documents, such as truncated sentences (sometimes resulting from contextual snippet extraction suggested above) or random alphanumerical strings may decrease the quality of cluster labels. If you have access to e.g. a few sentences' abstract of each document, it is worth checking the quality of clustering based on those abstracts. If you can combine this with the previous tip, i.e. extract complete sentences matching user's query, this should improve the clusters even further.
Let us once again stress that there are no definite generic guidelines for the best content for clustering, so it is always worth experimenting with different combinations. You can also describe your specific application on the Carrot2 mailing list and ask for advice.
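The size-related tips above can be turned into a quick pre-flight check. The Python sketch below only encodes the rules of thumb stated in this section (a minimum of about 20 documents, an optimum of 100 to 500, and up to about a thousand for in-memory clustering); the function name and message wording are hypothetical, and the thresholds are guidelines rather than hard limits.

```python
def check_input(documents, min_docs=20, optimum=(100, 500), max_docs=1000):
    """Return warnings about the size of a document collection (a list of strings)."""
    count = len(documents)
    warnings = []
    if count < min_docs:
        warnings.append(
            f"{count} documents is below the practical minimum of about {min_docs}")
    elif count > max_docs:
        warnings.append(
            f"{count} documents may be too many for in-memory clustering "
            f"(rule of thumb: up to about {max_docs})")
    elif not optimum[0] <= count <= optimum[1]:
        warnings.append(
            f"{count} documents is outside the optimum range "
            f"of {optimum[0]} to {optimum[1]}")
    return warnings
```

An empty list means the collection size falls within the suggested optimum; content quality (snippets, noise) still needs to be judged separately.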
Currently, Carrot2 offers two specialized search results clustering algorithms: Lingo and STC. The algorithms differ in terms of the main clustering principle and hence have different quality and performance characteristics. This section describes briefly the two algorithms and provides some recommendations for choosing the most suitable one.
The key characteristic of the Lingo algorithm is that it reverses the traditional clustering pipeline: it first identifies cluster labels and only then assigns documents to the labels to form final clusters. To find the labels, Lingo builds a term-document matrix for all input documents and decomposes the matrix to obtain a number of base vectors that well approximate the matrix in a low-dimensional space. Each such vector gives rise to one cluster label. To complete the clustering process, each label is assigned documents that contain the label's words.
The key data structure used in the Suffix Tree Clustering (STC) algorithm is a Generalized Suffix Tree (GST) built for all input documents. The algorithm traverses the GST to identify words and phrases that occurred more than once in the input documents. Each such word or phrase gives rise to one base cluster. The last stage of the clustering process is merging base clusters to form the final clusters.
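The base cluster idea can be illustrated without a suffix tree. The Python sketch below swaps the Generalized Suffix Tree for a naive n-gram index, which is far less efficient but easier to read. It also simplifies "occurred more than once" to "occurred in at least two documents", caps phrase length with a hypothetical `max_phrase_len` parameter, and leaves out the final merging stage entirely, so it should not be read as the STC implementation used by Carrot2.

```python
from collections import defaultdict


def base_clusters(documents, max_phrase_len=3):
    """Map each word or phrase found in 2+ documents to the set of document indices.

    A naive n-gram index stands in for the Generalized Suffix Tree.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        words = text.lower().split()
        for n in range(1, max_phrase_len + 1):
            for start in range(len(words) - n + 1):
                phrase = " ".join(words[start:start + n])
                index[phrase].add(doc_id)
    # Keep only phrases shared by more than one document.
    return {phrase: docs for phrase, docs in index.items() if len(docs) > 1}
```

For the inputs "data mining software" and "data mining tools", the shared phrases "data", "mining" and "data mining" each give rise to one base cluster containing both documents.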
The two algorithms have two features in common. They both create overlapping clusterings, in which one document can be assigned to more than one cluster. Also, in case of both algorithms a certain number of documents can remain unclustered and fall in the Other Topics group.
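Overlapping assignment and the Other Topics group can be shown with a small Python sketch of a label-first assignment step in the spirit of Lingo. The labels are taken as given here, since the matrix decomposition that discovers them is not shown, and the matching rule (a document joins a label when it contains all of the label's words) is a simplifying assumption rather than Lingo's actual assignment method.

```python
def assign_documents(labels, documents):
    """Assign each document to every label whose words all occur in it.

    Documents matching no label are collected under "Other Topics".
    """
    clusters = {label: [] for label in labels}
    other = []
    for doc_id, text in enumerate(documents):
        words = set(text.lower().split())
        matched = False
        for label in labels:
            if set(label.lower().split()) <= words:
                clusters[label].append(doc_id)  # a document may join several clusters
                matched = True
        if not matched:
            other.append(doc_id)
    clusters["Other Topics"] = other
    return clusters
```

A document that mentions both "data mining" and "web" lands in both clusters, while a document matching neither label ends up in Other Topics.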
Table 5.1 compares the characteristics of Lingo and STC under their default settings and Figure 5.1 shows clusters generated by Lingo and STC for data mining search results.
Table 5.1 Characteristics of Lingo and STC clustering algorithms
| Feature | Lingo | STC |
|---|---|---|
| Cluster diversity | High, many small (outlier) clusters highlighted | Low, small (outlier) clusters rarely highlighted |
| Cluster labels | Longer, often more descriptive | Shorter, but still appropriate |
| Scalability | Low. For more than about 1000 documents, Lingo clustering will take a long time and a large amount of memory [a]. | High |

[a] Performance of the pure Java version of Lingo can be improved by installing native matrix computation libraries.
It is difficult to give one clear recommendation as to which algorithm is "better". Many people feel Lingo delivers better-formed and more diverse clusters at the cost of lower performance and scalability. The ultimate judgment, however, should be based on an evaluation with the specific document collection. Table 5.2 highlights the scenarios for which the algorithms are best suited.
Table 5.2 Optimum usage scenarios for Lingo and STC
| Feature | Use Lingo | Use STC |
|---|---|---|
| Well-formed longer labels required | Yes | |
| Highlighting of small (outlier) clusters required | Yes | |
| High clustering performance or large document set processing required | | Yes |
The bottom line is: use Lingo, unless you need high-performance clustering of document sets larger than 1000 documents.
For a more scientifically-oriented discussion and evaluation of the two algorithms, please check the publications on Carrot2 website.
Carrot Search, a company founded by Carrot2 authors, offers a commercial document clustering engine called Lingo3G that produces Lingo-quality hierarchical clusters at a better-than-STC speed. Please contact Carrot Search for details.
The best tool for experimenting and tuning Carrot2 clustering is the Carrot2 Document Clustering Workbench. Figure 5.2 shows the main components involved in the tuning process.
Figure 5.2 Tuning clustering in Carrot2 Document Clustering Workbench
The results editor presents documents and clusters. Changes made in the Attributes view will affect the currently active results editor.
The Attributes view, where you can see and change values of clustering algorithm's attributes.
The Attribute Info view, which shows documentation for specific attributes. Hold the mouse pointer over an attribute's label to see its documentation.
Opening the Attributes view. By default, the Attributes view shows on the right hand side of the Carrot2 Document Clustering Workbench. You can open the view at any time by choosing Window > Show view > Attributes.
Setting modified attributes as default for new queries. If you modified a number of attributes for an algorithm and would like to use the modified values for new queries, choose the Set as defaults for new queries option from the Attributes view's context menu (Figure 5.3).
Restoring default attribute values. To reset the attributes to their default values, choose the Reset to defaults option from the Attributes view's context menu (Figure 5.3). To bring the attributes back to their factory defaults, choose the Reset to factory defaults option.
Loading and saving attribute values to XML. To load or save attribute values to an XML file, use the Open and Save as... options available under the icon on the Attributes view's menu bar.
Accessing attribute documentation. To see the documentation for a specific attribute, hold the mouse pointer over the attribute's label and its documentation will show in the Attribute Info view.
Stop words are common words that carry little meaning, such as the, to, for in English, and that should be ignored while clustering. The Lingo algorithm, for example, will not create clusters whose labels start or end with a stop word.
To fine-tune the stop words list you can use the Carrot2 Document Clustering Workbench in the following way:
Start Carrot2 Document Clustering Workbench and run some query on which you'll be observing the results of your changes.
Go to the workspace/ directory, which is located in the directory to which you extracted Carrot2 Document Clustering Workbench. Modify the stopwords.* file for the language you are working on (e.g. stopwords.en for English). Add or remove stop words as required and save changes.
Open the Attributes view and use the view toolbar's button to group the attributes by semantics. In the Preprocessing section, make sure the Processing language is correctly set and check the Reload stopwords checkbox. Doing the latter will let you see the updated clustering results without restarting Carrot2 Document Clustering Workbench every time you save the changed stop word list.
To re-run clustering after you've saved changes to the stopwords.* file, choose the Restart Processing option from the Search menu, or press Ctrl+F11.
To transfer the changed stop words file to other Carrot2 applications, update the existing stop words file in the carrot2-core.jar the application is using. In case of the Carrot2 Document Clustering Server and Carrot2 Web Application, the carrot2-core.jar is located in the WEB-INF/lib directory.
The Lingo clustering algorithm, in addition to stop words editing, offers more precise control over cluster labels by means of "stop label" regular expressions. If a cluster's label matches one of the stop labels, the label will not appear on the list of clusters produced by Lingo.
The procedure for tuning stop labels and transferring them to other Carrot2 applications is similar to stop word tuning. The difference is that this time you need to edit the stoplabels.* files.
Each line of a stop labels file corresponds to one stop label and is a Java regular expression. Please note that in order to be removed, a label as a whole must match at least one of the stop label expressions. A number of example stop label expressions are shown below.
(?i)new
(?i)information
(?i)information (about|on).*
(?i)(index|list) of.*
All stop labels shown above start with the (?i) prefix, which enables case-insensitive matching for them. The stop label in the first line suppresses labels consisting solely of the word new. Similarly, the stop label in the second line removes labels consisting solely of the word information. The stop label in the third line removes labels that start with information about or information on, and the stop label in the fourth line removes labels that start with index of or list of.
Please note that defining a very large number of stop labels (100+) may significantly slow down clustering. In such cases you may want to combine separate stop label expressions into one larger regular expression.
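The whole-label matching rule can be illustrated with a short sketch. It uses Python's re module as a stand-in for Java regular expressions (the four example expressions happen to be valid in both), and the function names is_stop_label and combine are hypothetical helpers, not part of Carrot2.

```python
import re

# The four example stop label expressions discussed above.
STOP_LABELS = [
    r"(?i)new",
    r"(?i)information",
    r"(?i)information (about|on).*",
    r"(?i)(index|list) of.*",
]

def is_stop_label(label, patterns=STOP_LABELS):
    """Return True if the label as a whole matches at least one expression."""
    return any(re.fullmatch(pattern, label) for pattern in patterns)

def combine(patterns):
    """Merge separate expressions into one alternation (hypothetical helper).

    Assumes every pattern starts with the (?i) prefix, which is stripped
    and applied once to the combined expression.
    """
    bodies = ["(?:" + p.replace("(?i)", "", 1) + ")" for p in patterns]
    return "(?i)(?:" + "|".join(bodies) + ")"

print(is_stop_label("New"))                       # whole label matches
print(is_stop_label("New York"))                  # only a prefix matches
print(is_stop_label("Information about Carrot2"))
```

Because the whole label must match, "New York" is kept even though it starts with a suppressed word. The combine helper sketches the idea of folding many expressions into a single larger one; whether this helps in practice depends on the regular expression engine.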
The Other Topics cluster contains documents that do not belong to any other cluster generated by the algorithm. Depending on the input documents, the size of this cluster may vary from a few to tens of documents.
By tuning parameters of the clustering algorithm, you can reduce the number of unclustered documents; however, bringing the number down to 0 is unachievable in most cases. Please note that minimizing the Other Topics cluster size is usually achieved by forcing the algorithm to create more clusters, which may degrade the perceived clustering quality.
The easiest way to try different clustering algorithm settings is to use the Carrot2 Document Clustering Workbench.
To reduce the size of the Other Topics cluster generated by Lingo, you can try applying the following settings:
Change the Factorization method attribute to LocalNonnegativeMatrixFactorizationFactory.
Increase the Cluster count base above the default value.
Decrease the Phrase label boost. Note that this will increase the number of one-word labels, which may not always be desirable.
To apply the changes to the Carrot2 applications, please follow instructions from Chapter 6.
As a rule of thumb, the more documents you put on input and the longer the documents are, the longer clustering will take. Interestingly, in many cases short document excerpts (such as contextual snippets for search results, titles and abstracts, or the first couple of sentences of non-search results) may work just as well or even better than full documents. Hence the two most important performance tuning tips:
Reduce the size of the input documents. You can achieve this in a few ways:
Rather than full text of documents, use their titles and abstracts, if available.
In case of search results, use the contextual snippet rather than the full document text. Not only will this improve clustering performance, but it will very likely increase the quality of clusters as well because you will be clustering specifically the fragments the users asked for in their query.
If you don't have document abstracts, but have access to some automatically generated summaries, use them. Otherwise, try clustering the title and the first few sentences of each document.
In certain cases, you may get decent clustering results with document titles only; this variant is worth trying too.
Reduce the number of input documents. While removing a large part of the input document set may not always be an option, in many cases dividing the input into two or more batches, clustering them separately and then merging the results based on cluster label text may give reasonable results. The downside of this approach is that very small clusters containing just a few documents are likely to be lost during this process.
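The batch-and-merge idea can be sketched as follows. This is only an illustration under stated assumptions: each batch is assumed to have been clustered already (by whatever means), with the result represented as a plain dictionary from cluster label to document identifiers. The function names and the label normalization (lower-casing and collapsing whitespace) are hypothetical choices, not Carrot2 behavior.

```python
from collections import defaultdict

def split_into_batches(documents, batch_size):
    """Divide the input into consecutive batches of at most batch_size items."""
    return [documents[i:i + batch_size]
            for i in range(0, len(documents), batch_size)]

def merge_by_label(batch_results):
    """Merge clusters from separately clustered batches by label text.

    batch_results is a list of dicts mapping cluster label -> document ids.
    Labels are compared after lower-casing and whitespace normalization.
    """
    merged = defaultdict(list)
    for clusters in batch_results:
        for label, doc_ids in clusters.items():
            key = " ".join(label.lower().split())
            merged[key].extend(doc_ids)
    return dict(merged)

# Two batches that each produced a "data mining" cluster.
results = [
    {"Data Mining": [0, 1]},
    {"data  mining": [2], "Clustering": [3]},
]
print(merge_by_label(results))
```

A cluster whose label appears in only one batch stays as it was, which is why topics too small to surface within a single batch can be lost.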
Further performance tuning tips are specific to each clustering algorithm.
You can change a number of attributes to increase the performance of Lingo. Most often, performance gain will be achieved at the cost of lowered clustering quality or significant change in the structure of clusters.
Lower Factorization quality, which will cause the matrix factorization algorithm to perform fewer iterations and hence complete quicker. Alternatively, you can set Factorization method to org.carrot2.matrix.factorization.PartialSingularValueDecompositionFactory, which is slightly faster than the other factorizations. In the latter case Factorization quality becomes irrelevant.
Lower Maximum matrix size, which would cause the matrix factorization algorithm to complete quicker and use less memory. With small matrix sizes, Lingo may not be able to discover smaller clusters.
Not yet covered, please contact us if you need this section.
You can use the Carrot2 Document Clustering Workbench to run simple performance benchmarks of Carrot2. The benchmarks repeatedly cluster the content of the currently opened editor and report the average clustering time. You can use the benchmarking results to measure the impact of an algorithm's attribute settings on its performance and to estimate the maximum number of clustering requests that the algorithm can process per second.
To perform a performance benchmark:
Open the Benchmark view.
To assess the performance impact of different attribute settings on one algorithm, you can open two or more editors with the same results clustered by the algorithm, set different attribute values in each editor and run benchmarking for each editor separately. The benchmark view remembers the last result for each editor, so you can compare the performance figures by simply switching between the editors.
By default, the benchmarking view uses only a single processing unit on multi-processor or multi-core machines. You can increase the number of benchmark threads in the Threads section.
Benchmark results may vary and be different from the results acquired on production machines due to other programs running in the background, operating system, hardware-specific considerations and even different Java Virtual Machine settings. Always fine-tune your clustering setup in the target deployment environment.
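As a rough worked example, the average clustering time reported by a benchmark can be turned into an upper bound on requests per second. The sketch assumes the average time is expressed in milliseconds and that throughput scales linearly with the number of threads, which is optimistic; the function name is hypothetical.

```python
def max_requests_per_second(avg_clustering_ms, threads=1):
    """Estimate an upper bound on clustering requests processed per second.

    Assumes perfect linear scaling across threads, which real systems
    rarely achieve, so treat the result as a ceiling.
    """
    if avg_clustering_ms <= 0:
        raise ValueError("average clustering time must be positive")
    return threads * 1000.0 / avg_clustering_ms

# An average of 250 ms per clustering on one thread gives at most 4 requests/s.
print(max_requests_per_second(250))
print(max_requests_per_second(250, threads=2))
```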
This chapter will show you how to add new document sources and tune clustering in Carrot2 applications.
Key concepts in customizing and tuning Carrot2 applications are component suites and component attributes described in the following sections.
A component suite is a set of Carrot2 components, such as document sources or clustering algorithms, configured to work within a specific Carrot2 application. For each component, the component suite defines the component's identifier, label, description and also a number of component- and application-specific properties, such as the list of example queries.
Component suites are defined in XML files read from application-specific locations described in further sections of this chapter. An example component suite definition is shown in Figure 6.1.
Figure 6.1 Example Carrot2 component suite
<component-suite>
  <sources>
    <source id="lucene"
            component-class="org.carrot2.source.lucene.LuceneDocumentSource"
            attribute-sets-resource="lucene.attributes.xml">
      <label>Lucene</label>
      <title>Apache Lucene</title>
      <mnemonic>L</mnemonic>
      <description>
        Apache Lucene index (local index access).
      </description>
      <icon-path>icons/lucene.png</icon-path>
      <example-queries>
        <example-query>data mining</example-query>
        <example-query>london</example-query>
        <example-query>clustering</example-query>
      </example-queries>
    </source>
  </sources>
  <algorithms>
    <algorithm id="lingo"
               component-class="org.carrot2.clustering.lingo.LingoClusteringAlgorithm"
               attribute-sets-resource="lingo.attributes.xml">
      <label>Lingo</label>
      <title>Lingo Clustering</title>
    </algorithm>
  </algorithms>
  <include suite="source-yahoo-boss.xml" />
  <include suite="algorithm-stc.xml" />
</component-suite>
The component suite definition can consist of the following elements:
sources
Document source definitions, optional.
algorithms
Clustering algorithm definitions, optional.
include
Includes other XML component suite definitions, optional. The resource specified in the suite attribute will be loaded from the current thread's context class loader.
Common parts of the source and algorithm tags include:
id
Identifier of the component within the suite, required. Identifiers must be unique within the component suite scope.
component-class
Fully qualified name of the processing component class, required.
attribute-sets-resource
XML file to load the component's attributes from. The resource specified in this attribute will be loaded from the current thread's context class loader. For the syntax of the XML file, please see Section 6.1.2.
label
A human readable label of the component, required.
title
A human readable title of the component, required. The title will usually be slightly longer than the label.
description
A longer description of the component, optional.
icon-path
Application specific definition of the component's icon.
Additionally, for the source tag you can use the example-queries tag to specify some example queries the applications may show for this source.
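Before deploying a hand-edited component suite, it can be useful to check it against the rules listed above. The following sketch does this with Python's standard XML parser. It is not how Carrot2 itself loads suites; the check_suite function is a hypothetical helper and the embedded XML is a shortened copy of Figure 6.1.

```python
import xml.etree.ElementTree as ET

# Shortened component suite modeled on Figure 6.1.
SUITE_XML = """
<component-suite>
  <sources>
    <source id="lucene"
            component-class="org.carrot2.source.lucene.LuceneDocumentSource"
            attribute-sets-resource="lucene.attributes.xml">
      <label>Lucene</label>
      <title>Apache Lucene</title>
    </source>
  </sources>
  <algorithms>
    <algorithm id="lingo"
               component-class="org.carrot2.clustering.lingo.LingoClusteringAlgorithm"
               attribute-sets-resource="lingo.attributes.xml">
      <label>Lingo</label>
      <title>Lingo Clustering</title>
    </algorithm>
  </algorithms>
  <include suite="algorithm-stc.xml" />
</component-suite>
"""

def check_suite(xml_text):
    """Return (component ids, problems) for a component suite definition."""
    root = ET.fromstring(xml_text)
    components = (root.findall("./sources/source")
                  + root.findall("./algorithms/algorithm"))
    ids, problems = [], []
    for component in components:
        component_id = component.get("id")
        if component_id is None:
            problems.append("component without id")
        elif component_id in ids:
            problems.append("duplicate id: " + component_id)
        else:
            ids.append(component_id)
        if component.get("component-class") is None:
            problems.append("missing component-class: " + str(component_id))
        for required in ("label", "title"):
            if component.find(required) is None:
                problems.append("missing %s: %s" % (required, component_id))
    return ids, problems

print(check_suite(SUITE_XML))
```

The sketch does not follow include elements, so identifiers defined in included suites are not checked for uniqueness.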
A component attribute is a specific property of a Carrot2 component that influences its behavior, e.g. the number of search results fetched by a document source or the depth of cluster hierarchy produced by a clustering algorithm. Each attribute is identified by a unique string key. Chapter 11 lists and describes all available components and their attributes.
You can specify attribute values for specific components in the component suite using attribute sets. Attribute sets are defined in XML files referenced by the attribute-sets-resource attribute of the component's entry in the component suite. Figure 6.2 shows an example attribute set definition.
Figure 6.2 Example Carrot2 attribute set
<attribute-sets>
  <attribute-set id="lucene">
    <value-set>
      <label>Lucene</label>
      <attribute key="LuceneDocumentSource.directory">
        <value>
          <wrapper class="org.carrot2.source.lucene.FSDirectoryWrapper">
            <indexPath>/path/to/lucene/index/directory</indexPath>
          </wrapper>
        </value>
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.contentField">
        <value type="java.lang.String" value="summary" />
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.titleField">
        <value type="java.lang.String" value="title" />
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.urlField">
        <value type="java.lang.String" value="url" />
      </attribute>
    </value-set>
  </attribute-set>
</attribute-sets>
An attribute-sets element can contain one or more attribute-set elements. Each attribute-set must specify a unique id and a value-set.
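To inspect which values an attribute set file defines, the XML can be read with any standard parser. The sketch below lists the simple values (those written as a value attribute on the value element) per attribute set. It deliberately skips wrapped values such as the directory entry in Figure 6.2, and the simple_values function is a hypothetical helper rather than a Carrot2 API.

```python
import xml.etree.ElementTree as ET

# Shortened attribute set modeled on Figure 6.2.
ATTRIBUTES_XML = """
<attribute-sets>
  <attribute-set id="lucene">
    <value-set>
      <label>Lucene</label>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.titleField">
        <value type="java.lang.String" value="title" />
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.urlField">
        <value type="java.lang.String" value="url" />
      </attribute>
    </value-set>
  </attribute-set>
</attribute-sets>
"""

def simple_values(xml_text):
    """Map each attribute-set id to a dict of attribute key -> string value."""
    root = ET.fromstring(xml_text)
    result = {}
    for attribute_set in root.findall("attribute-set"):
        values = {}
        for attribute in attribute_set.findall("./value-set/attribute"):
            value = attribute.find("value")
            # Only plain values carry a "value" attribute; wrappers are skipped.
            if value is not None and value.get("value") is not None:
                values[attribute.get("key")] = value.get("value")
        result[attribute_set.get("id")] = values
    return result

print(simple_values(ATTRIBUTES_XML))
```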
Saving attributes to XML using Carrot2 Document Clustering Workbench
As the syntax of the value elements depends on the type of the attribute being set, the easiest way to obtain the XML file is to use the Carrot2 Document Clustering Workbench.
To generate attribute set XML for a document source:
In the Search view, choose the document source for which you would like to save attributes.
Use the Search view to set the desired attribute values.
Choose the Save as... option from the Search view's menu bar. Carrot2 Document Clustering Workbench will suggest the XML file name based on the value of the document source's attribute-sets-resource attribute.
Please note that the Carrot2 Document Clustering Workbench will remove a number of common attributes from the XML file being saved, including: query, start result index, number of results.
To generate attribute set XML for a clustering algorithm:
In the Search view, choose the clustering algorithm for which you would like to save attributes. Choose any document source and perform processing using the selected algorithm.
Use the Attributes view to set the desired attribute values.
Choose the Save as... option from the Attributes view's menu bar. Carrot2 Document Clustering Workbench will suggest the XML file name based on the value of the clustering algorithm's attribute-sets-resource attribute.
If for some reason you cannot use the Carrot2 Document Clustering Workbench to save attribute set XML files, you can modify the SavingAttributeValuesToXml class from the carrot2-examples package to correspond to the attribute values you would like to set and run the class to print the XML encoding of the attribute values to the standard output.
To add a document source tab to the Carrot2 Web Application:
Download Carrot2 Web Application WAR file.
Open for editing the suite-webapp.xml file, located in the WEB-INF/classes/suites directory of the WAR file.
Add a descriptor for the document source you want to add to the sources section of the suite-webapp.xml file. Alternatively, you may want to use the include element to reference one of the example document source descriptors shipped with the application (e.g. source-lucene.xml). Please see Section 6.1.1 for more information about the component suite XML file.
If the document source you are adding requires setting specific attribute values (e.g. index location for the Lucene document source), use the Carrot2 Document Clustering Workbench to generate the attribute set XML file. Place the generated XML file in WEB-INF/classes/suites and make sure it is appropriately referenced by the attribute-sets-resource attribute of the descriptor added in the previous step.
Deploy the WAR file with the above modifications to your container. If the new document source tab is not showing, clear cookies for the domain on which the web application is deployed.
To add a document source tab to the Carrot2 Document Clustering Server:
Download Carrot2 Document Clustering Server distribution archive and extract it to some local folder.
Open for editing the suite-dcs.xml file, located in the WEB-INF/classes/suites directory of the DCS WAR file located in the war/ directory of the DCS distribution.
Add a descriptor for the document source you want to add to the sources section of the suite-dcs.xml file. Alternatively, you may want to use the include element to reference one of the example document source descriptors shipped with the application (e.g. source-lucene.xml). Please see Section 6.1.1 for more information about the component suite XML file.
If the document source you are adding requires setting specific attribute values (e.g. index location for the Lucene document source), use the Carrot2 Document Clustering Workbench to generate the attribute set XML file. Place the generated XML file in WEB-INF/classes/suites and make sure it is appropriately referenced by the attribute-sets-resource attribute of the descriptor added in the previous step.
Restart the DCS. The new document source should be available for processing.
To run the Carrot2 Web Application with custom attributes of the Lingo clustering algorithm:
Use the Carrot2 Document Clustering Workbench to save the attribute set XML file with the desired Lingo attribute values.
Replace the contents of lingo.attributes.xml, located in the WEB-INF/classes/suites directory of the web application WAR file, with the XML file saved in the previous step.
Deploy the WAR file with the above modifications to your container.
You can use the same procedure to customize other algorithms, e.g. STC.
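WAR and JAR files are ZIP archives, so replacing a file inside them can be scripted. The sketch below rewrites an archive with one entry replaced, using only Python's standard library. The function name is hypothetical, and you should work on a copy of the archive, since the original file is overwritten.

```python
import os
import shutil
import tempfile
import zipfile

def replace_archive_entry(archive_path, entry_name, new_content):
    """Rewrite a ZIP-based archive (WAR or JAR) with one entry replaced.

    new_content must be bytes. Raises KeyError if the entry does not exist.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".zip")
    os.close(fd)
    replaced = False
    with zipfile.ZipFile(archive_path) as source, \
            zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            if item.filename == entry_name:
                data = new_content
                replaced = True
            else:
                data = source.read(item.filename)
            # Reusing the ZipInfo keeps the entry name and timestamp.
            target.writestr(item, data)
    if not replaced:
        os.remove(tmp_path)
        raise KeyError(entry_name)
    shutil.move(tmp_path, archive_path)
```

For the procedure above, the call would look like replace_archive_entry("carrot2-webapp.war", "WEB-INF/classes/suites/lingo.attributes.xml", new_xml_bytes), where the WAR file name is only an example and new_xml_bytes holds the attribute set XML saved from the Workbench.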
To run the Carrot2 Document Clustering Server with custom attributes of the Lingo clustering algorithm:
Use the Carrot2 Document Clustering Workbench to save the attribute set XML file with the desired Lingo attribute values.
Replace the contents of algorithm-lingo-attributes.xml, located in the WEB-INF/classes/suites directory of the DCS WAR file, located in the war/ directory of the DCS distribution, with the XML file saved in the previous step.
Restart the DCS.
You can use the same procedure to customize other algorithms, e.g. STC.
To run the Carrot2 Command Line Interface with custom attributes of the Lingo clustering algorithm:
Use the Carrot2 Document Clustering Workbench to save the attribute set XML file with the desired Lingo attribute values.
Replace the contents of algorithm-lingo-attributes.xml, located in the /suites directory of the carrot2-mini.jar file, located in the lib/ directory of the CLI distribution, with the XML file saved in the previous step.
Run the CLI application.
You can use the same procedure to customize other algorithms, e.g. STC.
Not yet covered, please contact us if you need this section.
This chapter discusses more advanced usage scenarios of Carrot2 such as integration with Apache Solr, running Carrot2 applications in Eclipse and building Carrot2 from source code.
As of version 1.4 of Apache Solr, Carrot2 clustering can be performed directly within Solr by means of the Carrot2 Clustering Component.
To run Carrot2 Document Clustering Workbench in Eclipse IDE (version 3.4 or higher required):
Choose Window > Preferences and then Run/Debug > String substitution. Add a temp_workspaces variable pointing to an existing disk directory where the Workbench's workspace should be created.
Choose Run > External Tools > External Tools Configurations... from the main menu and run the Attribute Metadata XML configuration. This will build the metadata files required for the Workbench to show descriptions of Carrot2 components' attributes.
Choose Run > Run Configurations... from the main menu and run the Workbench configuration.
To run the Carrot2 Web Application in Eclipse IDE:
Choose Run > External Tools > External Tools Configurations... from the main menu and run the Attribute Metadata XML configuration. This will build the metadata files required for the web application to show advanced options of document sources.
Choose Run > External Tools > External Tools Configurations... from the main menu and run the Web Application Setup [carrot2] configuration. This will preprocess various configuration files required by the web application.
Choose Run > Run Configurations... from the main menu and run the Web Application Runner [carrot2] configuration.
Point your browser to http://localhost:8080 to access the running web application.
To build Carrot2 applications from source code, you will need Java Software Development Kit (Java SDK) version 1.6 or higher and Apache Ant version 1.7.1 or higher. You can check out the latest Carrot2 source code from the following SVN location:
https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk
To build Carrot2 Document Clustering Workbench from source code:
Download Eclipse Target Platform from http://download.carrot2.org/eclipse and extract to some local folder.
Copy workbench.properties.example from the Carrot2 checkout folder to workbench.properties in the same folder. In workbench.properties, edit the target.platform property to point to the Eclipse Target Platform you have downloaded. The folder pointed to by target.platform must have the eclipse/ folder inside.
You may also need to change the configs property to match the platform you want to build Carrot2 Document Clustering Workbench for.
Run:
ant -f build-workbench.xml build
to build Carrot2 Document Clustering Workbench binaries.
Go to the tmp/workbench/tmp/carrot2-workbench folder in the Carrot2 checkout directory and run Carrot2 Document Clustering Workbench.
You can use curl to post requests to the Carrot2 Document Clustering Server. Figure 7.3 shows how to use curl to query an external document source and cluster the results using the DCS. Figure 7.4 shows how to cluster documents from an XML file in Carrot2 format using the DCS. Please see the examples/curl directory of the Carrot2 Document Clustering Server distribution archive for more curl DCS invocation examples.
Figure 7.3 Using DCS and curl to cluster data from document source
curl http://localhost/dcs/rest \
     -F "dcs.source=etools" \
     -F "query=test" \
     -o result.xml
Figure 7.4 Using DCS and curl to cluster documents from an XML file
curl http://localhost/dcs/rest \
     -F "dcs.c2stream=@documents-in-carrot2-format.xml" \
     -o result.xml
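If curl is not available, the same kind of request can be assembled in a script. curl's -F option sends the parameters as multipart/form-data, and the sketch below builds such a body for plain text fields using only Python's standard library. The function names are hypothetical; the URL and parameter names are the ones from Figure 7.3, and the sketch assumes a DCS instance is running at that address.

```python
import urllib.request
import uuid

def build_multipart(fields):
    """Encode a dict of text fields as a multipart/form-data body.

    Returns (content_type, body_bytes). File uploads are not handled.
    """
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append("--" + boundary)
        lines.append('Content-Disposition: form-data; name="%s"' % name)
        lines.append("")
        lines.append(value)
    lines.append("--" + boundary + "--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return "multipart/form-data; boundary=" + boundary, body

def cluster_with_dcs(url, fields):
    """POST the fields to the DCS and return the raw response bytes."""
    content_type, body = build_multipart(fields)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type})
    with urllib.request.urlopen(request) as response:
        return response.read()

# Example call, equivalent in intent to Figure 7.3 (requires a running DCS):
# xml_bytes = cluster_with_dcs("http://localhost/dcs/rest",
#                              {"dcs.source": "etools", "query": "test"})
```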
You can download curl for Windows from http://curl.haxx.se/latest.cgi?curl=win32-nossl.
If your server or development machine connects to HTTP servers via an HTTP proxy, you can make most Carrot2 document source implementations take this information into account by defining the following global system properties:
URL of the HTTP proxy (numeric or full address, but without the port number).
Proxy server's port number.
Two sources that currently do not support the above properties are: MicrosoftLiveDocumentSource and OpenSearchDocumentSource.
Password-based authentication is not supported at the moment.
You can alter the source code to change this in the HttpUtils class.
To speed up clustering performed by the Lingo algorithm, you can configure Carrot2 to use a native platform-specific matrix computation library. Depending on the platform, you may see up to a 400% speed-up compared to the Java-only mode.
To enable native matrix computations for Carrot2:
Download precompiled libraries for your platform and extract the archive to some local directory.
If no distribution matches your platform, and you would like to compile your own version, please ask on the mailing list for instructions. You can also try the PIII (Pentium III) versions, which seem to work quite well on modern processors as well (e.g. Core2 Duo).
Add an additional option to your JVM command line invocation providing the path to the directory to which you extracted the native library:
java -Djava.library.path=[native-lib-dir] ...
To enable native computations in web applications deployed to Apache Tomcat, pass the above directive in the JAVA_OPTS environment variable, e.g.:
export JAVA_OPTS="-Djava.library.path=[native-lib-dir]"
When Carrot2 correctly loads the native library, upon initialization of the Lingo clustering algorithm, the following entry should appear in application logs:
INFO org.carrot2.clustering.lingo.LingoClusteringAlgorithm: Native BLAS routines available
This chapter discusses solutions to some common problems with Carrot2 code or applications.
To increase Java heap size for Carrot2 Document Clustering Workbench, use the following command line parameters:
carrot2-workbench -vmargs -Xmx256m
Using the above pattern you can specify any other JVM options if needed.
To get the stack trace (useful for the Carrot2 team to spot errors) corresponding to a processing error in Carrot2 Document Clustering Workbench, use the following procedure:
Click OK on the Problem Occurred dialog box (Figure 8.1).
Go to Window > Show view > Other... and choose Error Log (Figure 8.2).
In the Error Log view double click the line corresponding to the error (Figure 8.3).
Copy the exception stack trace from the Event Details dialog and pass to Carrot2 team (Figure 8.4).
If you see question marks ("?") instead of Chinese, Polish or other special Unicode characters in clusters and documents output by the Carrot2 Web Application
The Carrot2 Web Application running under a Web application container (such as Tomcat) relies on proper decoding of Unicode characters from the request URI. This decoding is done by the container and must be properly configured at the container level. Unfortunately, this configuration is not part of the J2EE standard and is therefore different for each container.
For Apache Tomcat, you can enforce the URI decoding code page at the connector configuration level. Locate the server.xml file inside Tomcat's conf folder and add the following attribute to the Connector section:
URIEncoding="UTF-8"
A typical connector configuration should look like this:
<Connector port="8080" maxThreads="25"
           minSpareThreads="5" maxSpareThreads="10"
           minProcessors="5" maxProcessors="25"
           enableLookups="false" redirectPort="8443"
           acceptCount="10" debug="0"
           connectionTimeout="20000" URIEncoding="UTF-8" />
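The effect of a wrong URI decoding code page can be reproduced outside of any container. The sketch below percent-encodes a query as UTF-8, which is what browsers typically send, and then decodes it twice: once as UTF-8 and once as ISO-8859-1. It assumes the container falls back to a single-byte encoding when URIEncoding is not set; the query text is only an example.

```python
from urllib.parse import quote, unquote

query = "Łódź"

# Percent-encode the UTF-8 bytes of the query, as sent in the request URI.
encoded = quote(query)

# Decoding with the right code page restores the original text.
as_utf8 = unquote(encoded, encoding="utf-8")

# Decoding the same bytes as ISO-8859-1 turns each byte into one character.
as_latin1 = unquote(encoded, encoding="iso-8859-1")

print(encoded)
print(as_utf8 == query)
print(len(query), len(as_latin1))
```

The misdecoded string is longer than the original because every byte of a multi-byte UTF-8 sequence becomes a separate character. Such characters are what later show up as garbage or question marks in the output.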
This chapter discusses some Carrot2 architecture assumptions, internals and more complex API use cases.
This section provides a very brief overview of Carrot2 architecture. If you would like us to cover some specific topic in more detail, please let us know on the mailing list.
Processing in Carrot2 is based on a pipeline of processing components. The two main types of Carrot2 processing components are:
Document Sources provide data for further processing. In a typical scenario, such a component would fetch search results from e.g. an external search engine, Lucene / Solr index or an XML file. Currently, Carrot2 distribution contains 12 different document source components.
Clustering Algorithms organize documents provided by document sources into meaningful groups. Currently, two specialized clustering algorithms are available in Carrot2: Lingo and STC. Additionally, a number of "synthetic" clustering algorithms are available, such as by URL clustering.
Carrot2 applications, such as Carrot2 Document Clustering Workbench or Carrot2 Document Clustering Server operate on a pipeline consisting of one document source and one clustering algorithm, but using Carrot2 Java API you can insert additional components at any point in the pipeline. Currently, the only component not falling into the above categories is a component for computing certain cluster quality metrics, but more components may be added in the future, e.g. for spell checking of user queries.
The behavior of both document sources and clustering algorithms depends on a number of attributes (settings) such as the number of documents to fetch or the number of clusters to produce. The way you provide attribute values for specific components depends on the Carrot2 application you are working with:
Carrot2 Document Clustering Workbench. In Carrot2 Document Clustering Workbench you can provide attributes for document sources (such as number of results to fetch or preferred results language) before you issue a query in the Search view. You can change clustering algorithm attributes using the sliders in the Attributes view.
Carrot2 Document Clustering Server. In Carrot2 Document Clustering Server, you can provide attribute values as additional parameters in the POST request. The name of the POST parameter should be the identifier of the attribute you want to set (see Chapter 11 for attribute identifiers). Carrot2 will attempt to convert the string value of the parameter to the required type (integer, float etc.).
For a complete reference of attributes of each Carrot2 component, please see Chapter 11.
This section shows examples of Carrot2 input and output XML formats, used consistently by all Carrot2 applications, including Carrot2 Document Clustering Workbench, Carrot2 Document Clustering Server and Carrot2 Web Application.
To provide documents for Carrot2 clustering, use the following XML format:
Figure 9.1 Carrot2 input XML format
<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
<query>Globe</query>
<document id="0">
<title>default</title>
<url>http://www.globe.com.ph/</url>
<snippet>
Provides mobile communications (GSM) including
GenTXT, handyphones, wireline services, an
broadband Internet services.
</snippet>
</document>
<document id="1">
<title>Skate Shoes by Globe | Time For Change</title>
<url>http://www.globeshoes.com/</url>
<snippet>
Skaters, surfers, and showboarders
designing in their own style.
</snippet>
</document>
...
</searchresult>
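Documents held in another system can be exported to this format with a few lines of script. The following sketch builds the XML from a list of dictionaries using Python's standard library, which also takes care of escaping special characters. The function name and the dictionary keys are hypothetical, and the xml_declaration argument requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

def build_input_xml(query, documents):
    """Serialize documents to the input format shown in Figure 9.1.

    documents is a list of dicts with "title", "url" and "snippet" keys.
    Document identifiers are assigned sequentially, starting from 0.
    """
    root = ET.Element("searchresult")
    ET.SubElement(root, "query").text = query
    for index, doc in enumerate(documents):
        element = ET.SubElement(root, "document", id=str(index))
        ET.SubElement(element, "title").text = doc["title"]
        ET.SubElement(element, "url").text = doc["url"]
        ET.SubElement(element, "snippet").text = doc["snippet"]
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)

docs = [
    {"title": "Skate Shoes by Globe | Time For Change",
     "url": "http://www.globeshoes.com/",
     "snippet": "Skaters & surfers designing in their own style."},
    {"title": "default",
     "url": "http://www.globe.com.ph/",
     "snippet": "Provides mobile communications (GSM)."},
]
print(build_input_xml("Globe", docs).decode("utf-8"))
```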
Carrot2 saves the clusters in the following XML format:
Figure 9.2 Carrot2 output XML format
<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
<query>Globe</query>
<document id="0">
<title>default</title>
<url>http://www.globe.com.ph/</url>
<snippet>
Provides mobile communications (GSM) including
GenTXT, handyphones, wireline services, an
broadband Internet services.
</snippet>
</document>
<document id="1">
<title>Skate Shoes by Globe | Time For Change</title>
<url>http://www.globeshoes.com/</url>
<snippet>
Skaters, surfers, and showboarders
designing in their own style.
</snippet>
</document>
...
<group id="0" size="60">
<title>
<phrase>com</phrase>
</title>
<group id="1" size="2">
<title>
<phrase>amazon.com</phrase>
</title>
<document refid="43"/>
<document refid="77"/>
</group>
<group id="2" size="2">
<title>
<phrase>boston.com</phrase>
</title>
<document refid="4"/>
<document refid="7"/>
</group>
...
<group id="7" size="48">
<title>
<phrase>Other Sites</phrase>
</title>
<attribute key="other-topics">
<value type="java.lang.Boolean" value="true"/>
</attribute>
<document refid="1"/>
<document refid="2"/>
...
</group>
</group>
<group id="8" size="12">
<title>
<phrase>org</phrase>
</title>
<group id="9" size="2">
<title>
<phrase>en.wikipedia.org</phrase>
</title>
<document refid="9"/>
<document refid="14"/>
...
</group>
</group>
...
</searchresult>
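Reading the output format back is mostly a matter of walking the nested group elements. The sketch below does this with Python's standard XML parser, yielding each group's nesting depth, label, referenced document identifiers and whether it carries the other-topics attribute. The read_groups function is a hypothetical helper and the embedded XML is a shortened sample in the style of Figure 9.2.

```python
import xml.etree.ElementTree as ET

# Shortened output modeled on Figure 9.2.
OUTPUT_XML = """<searchresult>
  <query>Globe</query>
  <document id="0"><title>default</title></document>
  <group id="0" size="3">
    <title><phrase>com</phrase></title>
    <group id="1" size="2">
      <title><phrase>boston.com</phrase></title>
      <document refid="4"/>
      <document refid="7"/>
    </group>
    <group id="2" size="1">
      <title><phrase>Other Sites</phrase></title>
      <attribute key="other-topics">
        <value type="java.lang.Boolean" value="true"/>
      </attribute>
      <document refid="1"/>
    </group>
  </group>
</searchresult>"""

def read_groups(parent, depth=0):
    """Yield (depth, label, refids, is_other_topics) for each group, depth-first."""
    for group in parent.findall("group"):
        phrases = [p.text.strip() for p in group.findall("./title/phrase") if p.text]
        refids = [d.get("refid") for d in group.findall("document")]
        marker = group.find("./attribute[@key='other-topics']/value[@value='true']")
        yield depth, ", ".join(phrases), refids, marker is not None
        yield from read_groups(group, depth + 1)

for depth, label, refids, other in read_groups(ET.fromstring(OUTPUT_XML)):
    print("  " * depth + label, refids, "(other topics)" if other else "")
```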
This section shows examples of Carrot2 output JSON format, used consistently by all Carrot2 applications, including Carrot2 Document Clustering Server and Carrot2 Java API.
Carrot2 saves documents and the clusters in the following JSON format:
Figure 9.3 Carrot2 output JSON format
{
"clusters": [
{
"attributes": {
"score": 1.0
},
"documents": [
0,
2
],
"id": 0,
"phrases": [
"Cluster 1"
],
"score": 1.0,
"size": 2
},
{
"attributes": {
"score": 0.63
},
"clusters": [
{
"attributes": {
"score": 0.3
},
"documents": [
1
],
"id": 2,
"phrases": [
"Cluster 2.1"
],
"score": 0.3,
"size": 1
},
{
"attributes": {
"score": 0.15
},
"documents": [
2
],
"id": 3,
"phrases": [
"Cluster 2.2"
],
"score": 0.15,
"size": 1
}
],
"documents": [
0
],
"id": 1,
"phrases": [
"Cluster 2"
],
"score": 0.63,
"size": 3
}
],
"documents": [
{
"id": 0,
"snippet": "Document 1 Content.",
"title": "Document 1 Title",
"url": "http://document.url/1"
},
{
"id": 1,
"snippet": "Document 2 Content.",
"title": "Document 2 Title",
"url": "http://document.url/2"
},
{
"id": 2,
"snippet": "Document 3 Content.",
"title": "Document 3 Title",
"url": "http://document.url/3"
}
],
"query": "query (optional)"
}
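In the JSON format, clusters refer to documents by `id`, and subclusters are nested under the `clusters` key of their parent. The following Python sketch walks such a structure; the function name `walk_clusters` and the shortened sample are illustrative assumptions.

```python
import json


def walk_clusters(clusters, docs_by_id, depth=0):
    """Yield (depth, label, score, titles) for every cluster, depth-first."""
    for cluster in clusters:
        titles = [docs_by_id[i]["title"] for i in cluster.get("documents", [])]
        yield depth, cluster["phrases"][0], cluster["score"], titles
        # Subclusters, when present, live under the nested "clusters" key.
        yield from walk_clusters(cluster.get("clusters", []), docs_by_id, depth + 1)


# Shortened sample in the shape of the output format.
SAMPLE = """
{
  "clusters": [
    {"id": 0, "phrases": ["Cluster 1"], "score": 1.0, "size": 2,
     "documents": [0, 2]},
    {"id": 1, "phrases": ["Cluster 2"], "score": 0.63, "size": 3,
     "documents": [0],
     "clusters": [
       {"id": 2, "phrases": ["Cluster 2.1"], "score": 0.3, "size": 1,
        "documents": [1]}
     ]}
  ],
  "documents": [
    {"id": 0, "title": "Document 1 Title"},
    {"id": 1, "title": "Document 2 Title"},
    {"id": 2, "title": "Document 3 Title"}
  ]
}
"""

data = json.loads(SAMPLE)
docs_by_id = {doc["id"]: doc for doc in data["documents"]}
rows = list(walk_clusters(data["clusters"], docs_by_id))
for depth, label, score, titles in rows:
    print("  " * depth + f"{label} [{score}]: {titles}")
```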
This chapter contains information for Carrot2 developers.
Each Carrot2 release should be performed according to the following procedure:
Update JavaDoc documentation
Review JavaDoc documentation, provide missing descriptions of public and protected members, and provide missing package descriptions.
Update Carrot2 Manual
Review Carrot2 Manual, modify or add content related to the features implemented in the new release.
Update Maven dependencies
Update Maven POMs so that dependencies are in sync with the JAR versions in the repository.
Review of static code analysis reports
Review and fix reasonable-looking flaws from the following reports:
Update source code headers and line endings
In project root:
ant prerelease
Commit changes to trunk.
Precondition: successful trunk builds
The status of the C2HEAD-CORE and C2HEAD-SOURCES builds must be successful.
Precondition: resolved issues
All issues related to the software to be released that are scheduled (fix for) for the release must be resolved.
Replace the stable branch in SVN
svn remove https://carrot2.svn.sourceforge.net/svnroot/carrot2/branches/stable
svn copy https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk \
  https://carrot2.svn.sourceforge.net/svnroot/carrot2/branches/stable
Update version number strings in the stable branch
Version files
Update etc/version/carrot2.version
to contain the desired stable version number. That number will be embedded in
distribution file names, JavaDoc page title and other version-sensitive places.
Note the property name should be carrot2.version.stable, e.g.:
carrot2.version.stable=3.2.0
carrot2.version=${carrot2.version.stable}
# workbench plugin/ feature versions.
carrot2.version.workbench=${carrot2.version.stable}
Trigger stable branch build
Go to the C2STABLE-ALL build page and trigger a build. If the build is successful, all distribution files should be available in the download directory.
Verify the distribution files
Download, unpack and run each distribution file to make sure there are no obvious release blockers.
Create the release tag
svn copy https://carrot2.svn.sourceforge.net/svnroot/carrot2/branches/stable \
  https://carrot2.svn.sourceforge.net/svnroot/carrot2/tags/VERSION_3_2_0
Update version number strings in trunk
In case of major releases, update development version numbers.
Version files
Update etc/version/carrot2.version
to contain the desired development version number.
Note the property name should be carrot2.version.head, e.g.:
carrot2.version.head=3.3.0-dev
# workbench plugin/ feature versions.
carrot2.version.workbench=3.3.0.dev-snapshot
Carrot2 plugin versions in Carrot2 Document Clustering Workbench
Update Carrot2 plugin version strings in the Carrot2 Document Clustering Workbench to the current development version.
Update JIRA
Close issues scheduled for the release being made, release the version in JIRA, create a next version in JIRA.
Update project website
Release notes
Add a page named release-[version]-notes that
lists new features, major bug fixes and improvements introduced in the
new release. The page will automatically become linked from all
relevant sections of the website (done by an SVN external to
etc/version/carrot2.version).
Release note history
Add release date and link to the release's JIRA issues on the
release-notes page.
Upload distribution files to SourceForge
Perform (e.g. on the build server):
rsync -e ssh *-3.2.0.zip \
  <sf.user>,carrot2@frs.sourceforge.net:/home/frs/project/c/ca/carrot2/carrot2/3.2.0
Circulate release news
If appropriate, circulate release news to:
Carrot2 mailing lists
Consider upgrading Carrot2 in dependent projects
If reasonable, upgrade Carrot2 dependency in other known projects, such as Apache Solr and Nutch.
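The stable version file in the procedure above defines carrot2.version and carrot2.version.workbench through ${...} references to carrot2.version.stable. The Python sketch below mimics that kind of reference expansion so the resulting values can be checked before a build is triggered. It is an illustration only: the functions `load_properties` and `expand` are hypothetical and are not part of the Carrot2 build.

```python
import re


def load_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def expand(props):
    """Return a copy of props with ${name} references replaced by values."""
    pattern = re.compile(r"\$\{([^}]+)\}")

    def resolve(value, seen):
        def repl(match):
            name = match.group(1)
            if name in seen or name not in props:
                return match.group(0)  # leave unknown or cyclic references as-is
            return resolve(props[name], seen + (name,))
        return pattern.sub(repl, value)

    return {key: resolve(value, (key,)) for key, value in props.items()}


VERSION_FILE = """
carrot2.version.stable=3.2.0
carrot2.version=${carrot2.version.stable}
# workbench plugin/ feature versions.
carrot2.version.workbench=${carrot2.version.stable}
"""

expanded = expand(load_properties(VERSION_FILE))
for key, value in expanded.items():
    print(f"{key} = {value}")
```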
This is a very quick quality assurance checklist to run through before stable releases. This list also serves as a guideline for further automation of acceptance tests.
Note that this list does not contain many checks for the Carrot2 Web Application, Carrot2 Document Clustering Server and Carrot2 Java API as these are fairly well tested during builds (webtests, smoke-tests).
For each supported platform you can test, check that Carrot2 Document Clustering Workbench:
launches without errors in the error log
executes and clusters a remote search query without errors
executes and clusters a Lucene query without errors (we've had a bug that caused the Lucene directory attribute editor to disappear, hence this step).
can edit a clustering algorithm's attribute
shows both cluster visualizations
executes clustering algorithm benchmarks
Check that the Carrot2 Document Clustering Server starts up correctly from the command line on Windows and Linux. More acceptance tests are performed during builds (but these start the Carrot2 Document Clustering Server from the WAR file instead of the command line).
This section lists and describes attributes of all Carrot2 components. By changing values of these attributes, you can change the behaviour of the component. Please see Chapter 6 for information on how you pass attribute values in different Carrot2 applications.
Each attribute is described by a number of properties:
Key: The unique identifier of the attribute.
Direction:
  Input: The attribute is an input for the component; the behaviour of the component depends on its value.
  Output: The attribute is an output produced by the component.
Level: Informs how advanced the attribute is.
  Basic: Attribute value should be fairly easily tunable by a person without significant experience in text clustering.
  Medium: Attribute value should be fairly easily tunable by a person with some intuition about text clustering.
  Advanced: Attribute may require in-depth knowledge of the component for successful tuning.
Required: If true and the attribute does not have a default value, a value must be provided for the component to perform processing.
Scope:
  Initialization time: Attribute value will be respected only when the component is initializing; values provided at processing time will be ignored. This scope applies to the attributes that control time-consuming operations performed once per component instance (e.g. parsing of configuration files). As a result, only a handful of attributes fall into the initialization-time only scope.
  Processing time: Attribute values will be respected both at initialization and clustering time. Most of the attributes fall into this scope.
Value type: The Java type of the attribute's value.
Default value: The default value of the attribute or none if there is no default value defined for the attribute.
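To make the interplay of these properties concrete, the following Python sketch models an attribute descriptor and resolves an effective value from a supplied value, the default, the Required flag and the optional Min/Max bounds. It is a simplified illustration: the class `AttributeDescriptor` and its `resolve` method are hypothetical and do not mirror the Carrot2 implementation.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class AttributeDescriptor:
    """Hypothetical model of one row set in the attribute reference."""
    key: str
    direction: str                      # "Input" or "Output"
    value_type: type
    required: bool = False
    scope: str = "Processing time"
    default: Any = None                 # None stands for "no default value"
    min_value: Optional[float] = None
    max_value: Optional[float] = None

    def resolve(self, supplied=None):
        """Return the effective value, falling back to the default."""
        value = self.default if supplied is None else supplied
        if value is None:
            if self.required:
                raise ValueError(f"{self.key}: a value must be provided")
            return None
        if not isinstance(value, self.value_type):
            raise TypeError(f"{self.key}: expected {self.value_type.__name__}")
        if self.min_value is not None and value < self.min_value:
            raise ValueError(f"{self.key}: {value} is below {self.min_value}")
        if self.max_value is not None and value > self.max_value:
            raise ValueError(f"{self.key}: {value} is above {self.max_value}")
        return value


count_base = AttributeDescriptor(
    key="LingoClusteringAlgorithm.desiredClusterCountBase",
    direction="Input",
    value_type=int,
    default=30,
    min_value=2,
    max_value=100,
)
print(count_base.resolve())      # no value supplied: the default is used
print(count_base.resolve(50))    # a supplied value within bounds
```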
| Key | clusters |
| Direction | Output |
| Description | Clusters created by the algorithm. |
| Scope | Processing time |
| Value type | java.util.List |
| Default value | none |

| Key | documents |
| Direction | Input |
| Level | BASIC |
| Description | Documents to cluster. |
| Required | no |
| Scope | Processing time |
| Value type | java.util.List |
| Default value | none |

| Key | ByAttributeClusteringAlgorithm.fieldName |
| Direction | Input |
| Level | BASIC |
| Description | Name of the field to cluster by. Each non-null scalar field value with distinct hash code will give rise to a single cluster, named using the value returned by buildClusterLabel(Object). If the field value is a collection, the document will be assigned to all clusters corresponding to the values in the collection. Note that arrays will not be 'unfolded' in this way. |
| Required | yes |
| Scope | Processing time |
| Value type | java.lang.String |
| Default value | sources |
| Value content | Must not be blank |
| Key | clusters |
| Direction | Output |
| Description | Clusters created by the algorithm. |
| Scope | Processing time |
| Value type | java.util.List |
| Default value | none |

| Key | documents |
| Direction | Input |
| Level | BASIC |
| Description | Documents to cluster. |
| Required | no |
| Scope | Processing time |
| Value type | java.util.List |
| Default value | none |

| Key | LingoClusteringAlgorithm.desiredClusterCountBase |
| Direction | Input |
| Level | BASIC |
| Description | Desired cluster count base. Base factor used to calculate the number of clusters based on the number of documents on input. The larger the value, the more clusters will be created. The number of clusters created by the algorithm will be proportional to the cluster count base, but not in a linear way. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 30 |
| Min value | 2 |
| Max value | 100 |

| Key | clusters |
| Direction | Output |
| Description | Clusters created by the clustering algorithm. |
| Scope | Processing time |
| Value type | java.util.List |
| Default value | none |

| Key | LingoClusteringAlgorithm.scoreWeight |
| Direction | Input |
| Level | MEDIUM |
| Description | Balance between cluster score and size during cluster sorting. Value equal to 0.0 will cause Lingo to sort clusters based only on cluster size. Value equal to 1.0 will cause Lingo to sort clusters based only on cluster score. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 0.0 |
| Min value | 0.0 |
| Max value | 1.0 |
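The scoreWeight description above fixes only the two endpoints: 0.0 sorts by size alone and 1.0 sorts by score alone. The Python sketch below shows one way to blend the two in between, using a weighted geometric mean. The blending formula and the function name `sort_clusters` are assumptions made for illustration; the reference does not state which formula Lingo uses.

```python
def sort_clusters(clusters, score_weight):
    """Sort clusters by a blend of size and score, best first.

    Assumed blend: size ** (1 - w) * score ** w. At w = 0.0 the key equals
    the size; at w = 1.0 it equals the score.
    """
    def key(cluster):
        return (cluster["size"] ** (1.0 - score_weight)) * (
            cluster["score"] ** score_weight
        )
    return sorted(clusters, key=key, reverse=True)


clusters = [
    {"label": "Large, weak", "size": 10, "score": 0.2},
    {"label": "Small, strong", "size": 3, "score": 0.9},
]

by_size = [c["label"] for c in sort_clusters(clusters, 0.0)]
by_score = [c["label"] for c in sort_clusters(clusters, 1.0)]
print(by_size)
print(by_score)
```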
| Key | documents |
| Direction | Input |
| Level | BASIC |
| Description | Documents to cluster. |
| Required | yes |
| Scope | Processing time |
| Value type | java.util.List |
| Default value | none |

| Key | GenitiveLabelFilter.enabled |
| Direction | Input |
| Level | BASIC |
| Description | Remove labels ending in genitive form. Removes labels that end in words in the Saxon Genitive form (e.g. "Threatening the Country's"). |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Boolean |
| Default value | true |

| Key | StopWordLabelFilter.enabled |
| Direction | Input |
| Level | BASIC |
| Description | Remove leading and trailing stop words. Removes labels that consist of, start with, or end in stop words. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Boolean |
| Default value | true |

| Key | NumericLabelFilter.enabled |
| Direction | Input |
| Level | BASIC |
| Description | Remove numeric labels. Removes labels that consist only of or start with numbers. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Boolean |
| Default value | true |
| Key | QueryLabelFilter.enabled |
| Direction | Input |
| Level | BASIC |
| Description | Remove query words. Removes labels that consist only of words contained in the query. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Boolean |
| Default value | true |

| Key | MinLengthLabelFilter.enabled |
| Direction | Input |
| Level | BASIC |
| Description | Remove labels shorter than 3 characters. Removes labels whose total length in characters, including spaces, is less than 3. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Boolean |
| Default value | true |

| Key | StopLabelFilter.enabled |
| Direction | Input |
| Level | BASIC |
| Description | Remove stop labels. Removes labels that are declared as stop labels in the stoplabels.<lang> files. Please note that adding a long list of regular expressions to the stoplabels file may result in a noticeable performance penalty. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Boolean |
| Default value | true |

| Key | CompleteLabelFilter.enabled |
| Direction | Input |
| Level | BASIC |
| Description | Remove truncated phrases. Tries to remove "incomplete" cluster labels. For example, in a collection of documents related to Data Mining, the phrase Conference on Data is incomplete in the sense that most likely it should be Conference on Data Mining or even Conference on Data Mining in Large Databases. When truncated phrase removal is enabled, the algorithm will try to remove "incomplete" phrases like the former one and leave only the more informative variants. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Boolean |
| Default value | true |
| Key | LingoClusteringAlgorithm.labelAssigner |
| Direction | Input |
| Level | ADVANCED |
| Description | Cluster label assignment method. |
| Required | yes |
| Scope | Processing time |
| Value type | org.carrot2.clustering.lingo.ILabelAssigner |
| Default value | org.carrot2.clustering.lingo.UniqueLabelAssigner |
| Allowed value types | Allowed value types: No other assignable value types are allowed. |

| Key | LingoClusteringAlgorithm.clusterMergingThreshold |
| Direction | Input |
| Level | MEDIUM |
| Description | Cluster merging threshold. The percentage overlap between two clusters' documents required for the clusters to be merged into one cluster. Low values will result in more aggressive merging, which may lead to irrelevant documents in clusters. High values will result in fewer clusters being merged, which may lead to very similar or duplicated clusters. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 0.7 |
| Min value | 0.0 |
| Max value | 1.0 |
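The effect of the merging threshold can be sketched in a few lines of Python. The reference does not define how the overlap is measured, so the sketch below assumes it is the number of shared documents divided by the size of the smaller cluster; the names `overlap` and `should_merge` are hypothetical.

```python
def overlap(docs_a, docs_b):
    """Fraction of the smaller cluster's documents shared with the other one.

    Assumed overlap measure, for illustration only.
    """
    a, b = set(docs_a), set(docs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def should_merge(docs_a, docs_b, threshold=0.7):
    """Merge when the overlap reaches the cluster merging threshold."""
    return overlap(docs_a, docs_b) >= threshold


first = [1, 2, 3, 4]
second = [3, 4, 5]
print(overlap(first, second))                     # 2 shared out of 3
print(should_merge(first, second))                # default threshold 0.7
print(should_merge(first, second, threshold=0.5)) # lower threshold merges more
```

With these two clusters, the default threshold keeps them separate, while a lower threshold merges them, which matches the "low values merge more aggressively" behaviour described above.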
| Key | LingoClusteringAlgorithm.phraseLabelBoost |
| Direction | Input |
| Level | MEDIUM |
| Description | Phrase label boost. The weight of multi-word labels relative to one-word labels. Low values will result in more one-word labels being produced; higher values will favor multi-word labels. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 1.5 |
| Min value | 0.0 |
| Max value | 10.0 |

| Key | LingoClusteringAlgorithm.phraseLengthPenaltyStart |
| Direction | Input |
| Level | ADVANCED |
| Description | Phrase length penalty start. The phrase length at which the overlong multi-word labels should start to be penalized. Phrases of length smaller than phraseLengthPenaltyStart will not be penalized. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 8 |
| Min value | 2 |
| Max value | 8 |

| Key | LingoClusteringAlgorithm.phraseLengthPenaltyStop |
| Direction | Input |
| Level | ADVANCED |
| Description | Phrase length penalty stop. The phrase length at which the overlong multi-word labels should be removed completely. Phrases of length larger than phraseLengthPenaltyStop will be removed. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 8 |
| Min value | 2 |
| Max value | 8 |
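The two phrase length attributes above bound a penalty region: phrases shorter than phraseLengthPenaltyStart are untouched and phrases longer than phraseLengthPenaltyStop are removed. The Python sketch below fills the region in between with a linear falloff. The linear shape and the function name `length_penalty` are assumptions for illustration; the reference does not specify the penalty curve Lingo applies.

```python
def length_penalty(length, start=8, stop=8):
    """Return a multiplier in [0, 1] for a label of `length` words.

    1.0 means no penalty, 0.0 means the label is removed. The linear
    falloff between start and stop is an assumed shape.
    """
    if start > stop:
        raise ValueError("start must not exceed stop")
    if length < start:
        return 1.0          # shorter than the penalty start: not penalized
    if length > stop:
        return 0.0          # longer than the penalty stop: removed
    # Evenly spaced steps that stay strictly between 0 and 1.
    return 1.0 - (length - start + 1) / (stop - start + 2)


for words in range(3, 8):
    print(words, length_penalty(words, start=4, stop=6))
```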
| Key | TermDocumentMatrixBuilder.titleWordsBoost |
| Direction | Input |
| Level | MEDIUM |
| Description | Title word boost. Gives more weight to words that appeared in Document.TITLE fields. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 2.0 |
| Min value | 0.0 |
| Max value | 10.0 |

| Key | LingoClusteringAlgorithm.factorizationFactory |
| Direction | Input |
| Level | ADVANCED |
| Description | Factorization method. The method to be used to factorize the term-document matrix and create base vectors that will give rise to cluster labels. |
| Required | yes |
| Scope | Processing time |
| Value type | org.carrot2.matrix.factorization.IMatrixFactorizationFactory |
| Default value | org.carrot2.matrix.factorization.NonnegativeMatrixFactorizationEDFactory |
| Allowed value types | Allowed value types: |

| Key | LingoClusteringAlgorithm.factorizationQuality |
| Direction | Input |
| Level | ADVANCED |
| Description | Factorization quality. The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering. |
| Required | yes |
| Scope | Processing time |
| Value type | org.carrot2.matrix.factorization.IterationNumberGuesser$FactorizationQuality |
| Default value | HIGH |
| Allowed values | |
| Key | TermDocumentMatrixBuilder.maximumMatrixSize |
| Direction | Input |
| Level | MEDIUM |
| Description | Maximum matrix size. The maximum number of elements in the term-document matrix. The larger the size, the more accurate, but also the more time- and memory-consuming, the clustering. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 37500 |
| Min value | 5000 |

| Key | TermDocumentMatrixBuilder.maxWordDf |
| Direction | Input |
| Level | ADVANCED |
| Description | Maximum word document frequency. The maximum document frequency allowed for words as a fraction of all documents. Words with document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. The default value of 1.0 means that all words will be taken into account, no matter in how many documents they appear. This attribute may be useful when certain words appear in most of the input documents (e.g. company name from header or footer) and such words dominate the cluster labels. In such case, setting maxWordDf to a value lower than 1.0 should filter out the dominating words. Another useful application of this attribute is when there is a need to generate only very specific clusters, i.e. clusters containing small numbers of documents. This can be achieved by setting maxWordDf to a low value. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 1.0 |
| Min value | 0.0 |
| Max value | 1.0 |
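The maxWordDf rule can be illustrated with a short Python sketch: count in how many documents each word occurs and drop the words whose share of documents exceeds the limit. The function name `filter_by_max_df` and the whitespace-tokenized sample data are assumptions; real preprocessing in Carrot2 involves more steps than shown here.

```python
from collections import Counter


def filter_by_max_df(documents, max_word_df=1.0):
    """Return the words whose document frequency does not exceed max_word_df.

    `documents` is a list of token lists. A word is counted once per document.
    """
    total = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {word for word, count in doc_freq.items()
            if count / total <= max_word_df}


documents = [
    ["acme", "news"],
    ["acme", "sports"],
    ["acme", "weather"],
    ["acme", "news"],
    ["finance"],
]

kept_default = filter_by_max_df(documents)          # 1.0 keeps every word
kept_strict = filter_by_max_df(documents, 0.4)      # drops words in > 40% of docs
print(sorted(kept_default))
print(sorted(kept_strict))
```

In this sample, "acme" occurs in 80% of the documents, so it survives the default setting but is ignored at 0.4, while "news" (exactly 40%) is kept.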
| Key | LingoClusteringAlgorithm.nativeMatrixUsed |
| Direction | Output |
| Description | Indicates whether Lingo used fast native matrix computation routines. Value of this attribute is equal to NNIInterface.isNativeBlasAvailable() at the time of running the algorithm. |
| Scope | Processing time |
| Value type | java.lang.Boolean |
| Default value | none |

| Key | TermDocumentMatrixBuilder.termWeighting |
| Direction | Input |
| Level | ADVANCED |
| Description | Term weighting. The method for calculating the weight of words in the term-document matrices. |
| Required | yes |
| Scope | Processing time |
| Value type | org.carrot2.text.vsm.ITermWeighting |
| Default value | org.carrot2.text.vsm.LogTfIdfTermWeighting |
| Allowed value types | Allowed value types: Other assignable value types are allowed. |

| Key | MultilingualClustering.defaultLanguage |
| Direction | Input |
| Level | MEDIUM |
| Description | Default clustering language. The default language to use for documents with undefined Document.LANGUAGE. |
| Required | yes |
| Scope | Processing time |
| Value type | org.carrot2.core.LanguageCode |
| Default value | ENGLISH |
| Allowed values | |

| Key | MultilingualClustering.languageAggregationStrategy |
| Direction | Input |
| Level | MEDIUM |
| Description | Language aggregation strategy. Determines how clusters generated for individual languages should be combined to form the final result. Please see LanguageAggregationStrategy for the list of available options. |
| Required | yes |
| Scope | Processing time |
| Value type | org.carrot2.text.clustering.MultilingualClustering$LanguageAggregationStrategy |
| Default value | FLATTEN_MAJOR_LANGUAGE |
| Allowed values | |
| Key | PhraseExtractor.dfThreshold |
| Direction | Input |
| Level | ADVANCED |
| Description | Phrase Document Frequency threshold. Phrases appearing in fewer than dfThreshold documents will be ignored. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 1 |
| Min value | 1 |
| Max value | 100 |

| Key | CompleteLabelFilter.labelOverrideThreshold |
| Direction | Input |
| Level | ADVANCED |
| Description | Truncated label threshold. Determines the strength of the truncated label filter. The lowest value means the strongest elimination of truncated labels, which may lead to overlong cluster labels and many unclustered documents. The highest value effectively disables the filter, which may result in short or truncated labels. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 0.65 |
| Min value | 0.0 |
| Max value | 1.0 |

| Key | Tokenizer.documentFields |
| Direction | Input |
| Level | ADVANCED |
| Description | Textual fields of documents that should be tokenized and parsed for clustering. |
| Required | no |
| Scope | Initialization time |
| Value type | java.util.Collection |
| Default value | [title, snippet] |

| Key | DocumentAssigner.exactPhraseAssignment |
| Direction | Input |
| Level | MEDIUM |
| Description | Only exact phrase assignments. Assign only documents that contain the label in its original form, including the order of words. Enabling this option will cause fewer documents to be put in clusters, which results in higher precision of assignment, but also a larger "Other Topics" group. Disabling this option will cause more documents to be put in clusters, which will make the "Other Topics" cluster smaller, but also lower the precision of cluster-document assignments. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Boolean |
| Default value | false |
| Key | PreprocessingPipeline.languageModelFactory |
| Direction | Input |
| Level | ADVANCED |
| Description | Language model factory. Creates the language model to be used by the clustering algorithm. The language model provides the lexical resources required to perform clustering, including stop words and a word stemming algorithm. |
| Required | no |
| Scope | Initialization time and Processing time |
| Value type | org.carrot2.text.linguistic.ILanguageModelFactory |
| Default value | org.carrot2.text.linguistic.DefaultLanguageModelFactory |
| Allowed value types | Allowed value types: Other assignable value types are allowed. |

| Key | DefaultLanguageModelFactory.mergeResources |
| Direction | Input |
| Level | MEDIUM |
| Description | Merges stop words and stop labels from all known languages. If set to false, only stop words and stop labels of the active language will be used. If set to true, stop words from all LanguageCodes will be used together and stop labels from all languages will be used together, no matter the active language. Lexical resource merging is useful when clustering data in a mix of different languages and should increase clustering quality in such settings. |
| Required | no |
| Scope | Initialization time and Processing time |
| Value type | java.lang.Boolean |
| Default value | true |

| Key | DocumentAssigner.minClusterSize |
| Direction | Input |
| Level | MEDIUM |
| Description | Determines the minimum number of documents in each cluster. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 2 |
| Min value | 1 |
| Max value | 100 |

| Key | DefaultLanguageModelFactory.reloadResources |
| Direction | Input |
| Level | MEDIUM |
| Description | Reloads cached stop words and stop labels on every processing request. For best performance, lexical resource reloading should be disabled in production. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Boolean |
| Default value | false |

| Key | CaseNormalizer.dfThreshold |
| Direction | Input |
| Level | ADVANCED |
| Description | Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 1 |
| Min value | 1 |
| Max value | 100 |
| Key | query |
| Direction | Input |
| Level | BASIC |
| Description | Query that produced the documents. The query will help the algorithm to create better clusters. Therefore, providing the query is optional but desirable. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.String |
| Default value | none |

| Key | STCClusteringAlgorithm.documentCountBoost |
| Direction | Input |
| Level | MEDIUM |
| Description | Document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 1.0 |
| Min value | 0.0 |

| Key | STCClusteringAlgorithm.optimalPhraseLength |
| Direction | Input |
| Level | BASIC |
| Description | Optimal label length. A factor in calculation of the base cluster score. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 3 |
| Min value | 1 |

| Key | STCClusteringAlgorithm.optimalPhraseLengthDev |
| Direction | Input |
| Level | MEDIUM |
| Description | Phrase length tolerance. A factor in calculation of the base cluster score. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 2.0 |
| Min value | 0.5 |

| Key | STCClusteringAlgorithm.singleTermBoost |
| Direction | Input |
| Level | MEDIUM |
| Description | Single term boost. A factor in calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 0.5 |
| Min value | 0.0 |
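The four attributes above are all described as factors of the base cluster score, but the reference does not give the formula. The Python sketch below shows one plausible combination, assuming a bell-shaped penalty around the optimal phrase length whose width is the phrase length tolerance. Both the formula and the function name `base_cluster_score` are assumptions for illustration and should be checked against the STC source code before relying on them.

```python
import math


def base_cluster_score(phrase_length, document_count,
                       optimal_phrase_length=3,
                       optimal_phrase_length_dev=2.0,
                       document_count_boost=1.0,
                       single_term_boost=0.5):
    """Assumed scoring sketch: length factor * document count * count boost."""
    if phrase_length == 1 and single_term_boost > 0:
        # Single-term clusters get the fixed boost instead of the penalty.
        length_factor = single_term_boost
    else:
        diff = phrase_length - optimal_phrase_length
        # Bell-shaped penalty: 1.0 at the optimal length, falling off with
        # distance; the tolerance controls how quickly it falls.
        length_factor = math.exp(-(diff * diff)
                                 / (2.0 * optimal_phrase_length_dev ** 2))
    return length_factor * document_count * document_count_boost


for words in (1, 3, 5, 7):
    print(words, round(base_cluster_score(words, document_count=10), 3))
```

Under these assumptions, a phrase of the optimal length keeps the full document-count score, longer or shorter phrases score less, and one-word phrases receive the fixed single term boost.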
| Key | STCClusteringAlgorithm.maxBaseClusters |
| Direction | Input |
| Level | ADVANCED |
| Description | Maximum base clusters count. Trims the base cluster array after N-th position for the merging phase. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 300 |
| Min value | 2 |

| Key | STCClusteringAlgorithm.minBaseClusterScore |
| Direction | Input |
| Level | ADVANCED |
| Description | Minimum base cluster score. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 2.0 |
| Min value | 0.0 |
| Max value | 10.0 |

| Key | STCClusteringAlgorithm.minBaseClusterSize |
| Direction | Input |
| Level | ADVANCED |
| Description | Minimum documents per base cluster. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 2 |
| Min value | 2 |
| Max value | 20 |

| Key | clusters |
| Direction | Output |
| Description | Clusters created by the algorithm. |
| Scope | Processing time |
| Value type | java.util.List |
| Default value | none |

| Key | documents |
| Direction | Input |
| Level | BASIC |
| Description | Documents to cluster. |
| Required | yes |
| Scope | Processing time |
| Value type | java.util.List |
| Default value | none |
| Key | STCClusteringAlgorithm.maxPhraseOverlap |
| Direction | Input |
| Level | ADVANCED |
| Description | Maximum cluster phrase overlap. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 0.6 |
| Min value | 0.0 |
| Max value | 1.0 |

| Key | STCClusteringAlgorithm.maxPhrases |
| Direction | Input |
| Level | BASIC |
| Description | Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 3 |
| Min value | 1 |

| Key | STCClusteringAlgorithm.maxDescPhraseLength |
| Direction | Input |
| Level | BASIC |
| Description | Maximum words per label. Base clusters formed by phrases with more words than this value are trimmed. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 4 |
| Min value | 1 |

| Key | STCClusteringAlgorithm.mostGeneralPhraseCoverage |
| Direction | Input |
| Level | ADVANCED |
| Description | Minimum general phrase coverage. Minimum phrase coverage to appear in cluster description. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 0.5 |
| Min value | 0.0 |
| Max value | 1.0 |

| Key | STCClusteringAlgorithm.mergeThreshold |
| Direction | Input |
| Level | ADVANCED |
| Description | Base cluster merge threshold. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Double |
| Default value | 0.6 |
| Min value | 0.0 |
| Max value | 1.0 |

| Key | STCClusteringAlgorithm.maxClusters |
| Direction | Input |
| Level | BASIC |
| Description | Maximum final clusters. |
| Required | no |
| Scope | Processing time |
| Value type | java.lang.Integer |
| Default value | 15 |
| Min value | 1 |
| Key | MultilingualClustering.defaultLanguage |
| Direction | Input |
| Level | MEDIUM |
| Description | Default clustering language. The default language to use for documents with undefined Document.LANGUAGE. |
| Required | yes |
| Scope | Processing time |
| Value type | org.carrot2.core.LanguageCode |
| Default value | ENGLISH |
| Allowed values | |

| Key | MultilingualClustering.languageAggregationStrategy |
| Direction | Input |
| Level | MEDIUM |
| Description | Language aggregation strategy. Determines how clusters generated for individual languages should be combined to form the final result. Please see LanguageAggregationStrategy for the list of available options. |
| Required | yes |
| Scope | Processing time |
| Value type | org.carrot2.text.clustering.MultilingualClustering$LanguageAggregationStrategy |
| Default value | FLATTEN_MAJOR_LANGUAGE |
| Allowed values | |

| Key | Tokenizer.documentFields |
| Direction | Input |
| Level | ADVANCED |
| Description | Textual fields of documents that should be tokenized and parsed for clustering. |
| Required | no |
| Scope | Initialization time |
| Value type | java.util.Collection |
| Default value | [title, snippet] |

| Key | PreprocessingPipeline.languageModelFactory |
| Direction | Input |
| Level | ADVANCED |
| Description | Language model factory. Creates the language model to be used by the clustering algorithm. The language model provides the lexical resources required to perform clustering, including stop words and a word stemming algorithm. |
| Required | no |
| Scope | Initialization time and Processing time |
| Value type | org.carrot2.text.linguistic.ILanguageModelFactory |
| Default value | org.carrot2.text.linguistic.DefaultLanguageModelFactory |
| Allowed value types | Allowed value types: Other assignable value types are allowed. |

| Key | DefaultLanguageModelFactory.mergeResources |
| Direction | Input |
| Level | MEDIUM |
| Description | Merges stop words and stop labels from all known languages. If set to false, only stop words and stop labels of the active language will be used. If set to true, stop words from all LanguageCodes will be used together and stop labels from all languages will be used together, no matter the active language. Lexical resource merging is useful when clustering data in a mix of different languages and should increase clustering quality in such settings. |
| Required | no |
| Scope | Initialization time and Processing time |
| Value type | java.lang.Boolean |
| Default value | true |
| Key |
DefaultLanguageModelFactory.reloadResources
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Reloads cached stop words and stop labels on every processing request. For best performance, lexical resource reloading should be disabled in production. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value |
false
|
| Key |
CaseNormalizer.dfThreshold
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Word Document Frequency threshold.
Words appearing in fewer than dfThreshold documents will be ignored. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
1
|
| Min value |
1
|
| Max value |
100
|
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query that produced the documents. Providing the query is optional but desirable, because it helps the algorithm create better clusters. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Key |
STCClusteringAlgorithm.ignoreWordIfInHigherDocsPercent
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Maximum word-document ratio. A number between 0 and 1. A word that occurs in a larger fraction of snippets than this ratio is ignored. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Double
|
| Default value |
0.9
|
| Min value |
0.0
|
| Max value |
1.0
|
| Key |
STCClusteringAlgorithm.ignoreWordIfInFewerDocs
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Minimum word-document recurrences. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
2
|
| Min value |
2
|
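The three document-frequency attributes above (CaseNormalizer.dfThreshold, STCClusteringAlgorithm.ignoreWordIfInFewerDocs and STCClusteringAlgorithm.ignoreWordIfInHigherDocsPercent) all discard words based on how many documents contain them. The sketch below shows that kind of filter in plain Java. It is not the STC implementation: the class name is hypothetical, tokenization is a simple whitespace split, and reading ignoreWordIfInFewerDocs as a lower bound on document count is an assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: not the STC implementation. Tokenization is a plain whitespace split.
final class DocumentFrequencyFilterSketch {
    /** Counts, for every word, the number of snippets it occurs in at least once. */
    static Map<String, Integer> documentFrequencies(List<String> snippets) {
        Map<String, Integer> df = new HashMap<>();
        for (String snippet : snippets) {
            Set<String> unique = new HashSet<>();
            for (String token : snippet.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    unique.add(token);
                }
            }
            for (String word : unique) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    /**
     * Keeps words that occur in at least minDocs snippets and in no more than
     * maxDocsRatio of all snippets.
     */
    static Set<String> keptWords(List<String> snippets, int minDocs, double maxDocsRatio) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : documentFrequencies(snippets).entrySet()) {
            double ratio = (double) e.getValue() / snippets.size();
            if (e.getValue() >= minDocs && ratio <= maxDocsRatio) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> snippets = List.of(
            "data mining tools",
            "data mining software",
            "data warehouse",
            "data clustering");
        // Values taken from the defaults listed above: minimum 2 documents, maximum ratio 0.9.
        System.out.println(keptWords(snippets, 2, 0.9));
    }
}
```

In this sample, "data" occurs in every snippet and exceeds the 0.9 ratio, while the words seen only once fall below the minimum, so only "mining" survives.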
The OpenSearch document source retrieves search results from search engines that support the OpenSearch standard.
| Key |
SearchEngineBase.compressed
|
| Direction |
Output
|
| Description | Indicates whether the search engine returned a compressed result stream. |
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value | none |
| Key |
SearchEngineStats.pageRequests
|
| Direction |
Output
|
| Description | Number of individual page requests issued by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
SearchEngineStats.queries
|
| Direction |
Output
|
| Description | Number of queries handled successfully by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.Collection
|
| Default value | none |
| Key |
search-mode
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
|
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.MultipageSearchEngine$SearchMode
|
| Default value |
SPECULATIVE
|
| Allowed values |
|
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query to perform. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Value content | Must not be blank |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Key |
start
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Index of the first document/search result to fetch. The index starts at zero. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
0
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
OpenSearchDocumentSource.feedUrlParams
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Additional parameters to be appended to feedUrlTemplate on each request.
|
| Required |
no
|
| Scope | Initialization time and Processing time |
| Value type |
java.util.Map
|
| Default value | none |
| Key |
OpenSearchDocumentSource.feedUrlTemplate
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | URL to fetch the search feed from.
The URL template can contain variable place holders as defined by the OpenSearch specification that will be replaced during runtime. The format of the place holder is ${variable}. The following variables are supported:
|
| Required |
yes
|
| Scope | Initialization time |
| Value type |
java.lang.String
|
| Default value | none |
| Key |
OpenSearchDocumentSource.maximumResults
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Maximum number of results. The maximum number of results the document source can deliver. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
java.lang.Integer
|
| Default value |
1000
|
| Min value |
1
|
| Key |
OpenSearchDocumentSource.resultsPerPage
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Results per page. The number of results per page the document source will expect the feed to return. |
| Required |
yes
|
| Scope | Initialization time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
1
|
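To make the interplay of OpenSearchDocumentSource.feedUrlTemplate and OpenSearchDocumentSource.feedUrlParams more concrete, here is a minimal sketch of place holder substitution in plain Java. It is not the OpenSearchDocumentSource implementation. The class name and the example.com URL are hypothetical, and the variable names searchTerms, startIndex and count are assumed from the OpenSearch specification, since the list of supported variables is not reproduced above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: not the OpenSearchDocumentSource implementation.
final class FeedUrlSketch {
    /** Replaces each ${name} place holder with the URL-encoded value of that variable. */
    static String expand(String template, Map<String, String> variables) {
        String url = template;
        for (Map.Entry<String, String> e : variables.entrySet()) {
            String encoded = URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8);
            url = url.replace("${" + e.getKey() + "}", encoded);
        }
        return url;
    }

    /** Appends extra query parameters, in the spirit of feedUrlParams. */
    static String appendParams(String url, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(url);
        for (Map.Entry<String, String> e : params.entrySet()) {
            // Use '?' for the first parameter of the URL and '&' for the rest.
            sb.append(sb.indexOf("?") >= 0 ? '&' : '?');
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8));
            sb.append('=');
            sb.append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new LinkedHashMap<>();
        vars.put("searchTerms", "data mining"); // assumed variable name
        vars.put("startIndex", "0");            // assumed variable name
        vars.put("count", "50");                // assumed variable name
        String url = expand(
            "http://example.com/search?q=${searchTerms}&start=${startIndex}&n=${count}", vars);

        Map<String, String> extra = new LinkedHashMap<>();
        extra.put("format", "rss");
        System.out.println(appendParams(url, extra));
    }
}
```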
Searches the web using Google.
| Key |
SearchEngineBase.compressed
|
| Direction |
Output
|
| Description | Indicates whether the search engine returned a compressed result stream. |
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value | none |
| Key |
SearchEngineStats.pageRequests
|
| Direction |
Output
|
| Description | Number of individual page requests issued by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
SearchEngineStats.queries
|
| Direction |
Output
|
| Description | Number of queries handled successfully by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.Collection
|
| Default value | none |
| Key |
GoogleDocumentSource.keepHighlights
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Keep query word highlighting.
Google by default highlights query words in snippets using the bold HTML tag. Set this attribute to true to keep these highlights. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value |
false
|
| Key |
search-mode
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
|
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.MultipageSearchEngine$SearchMode
|
| Default value |
SPECULATIVE
|
| Allowed values |
|
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query to perform. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Value content | Must not be blank |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Key |
start
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Index of the first document/search result to fetch. The index starts at zero. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
0
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
GoogleDocumentSource.apiKey
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Google API Key. Please do not use the default key when deploying this component in production environments. Instead, generate and use your own key. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value |
ABQIAAAA_XmITjrzoipJYoBApAgGJhS8yIvkL4-1sNwOJWkV7nbkjq_Z_BQW0-uzOh5lKXRtEXQDTGbzIEz06Q
|
| Key |
GoogleDocumentSource.referer
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Request referer. Please do not use the default value when deploying this component in production environments. Instead, put the URL to your application here. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value |
http://www.carrot2.org
|
| Key |
GoogleDocumentSource.serviceUrl
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Service URL. Google web search service URL. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value |
http://ajax.googleapis.com/ajax/services/search/web
|
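The GoogleDocumentSource.keepHighlights attribute decides whether the bold tags around query words stay in the snippets. A minimal sketch of that choice is shown below. It is an illustration only, not the GoogleDocumentSource code: the class and method names are hypothetical, and a simple regular expression stands in for whatever HTML handling the component actually performs.

```java
// Illustrative only: not the GoogleDocumentSource implementation.
final class HighlightSketch {
    /**
     * Returns the snippet unchanged when keepHighlights is true,
     * otherwise removes opening and closing bold tags (case-insensitive).
     */
    static String snippet(String raw, boolean keepHighlights) {
        return keepHighlights ? raw : raw.replaceAll("(?i)</?b>", "");
    }

    public static void main(String[] args) {
        String raw = "<b>Data</b> <b>mining</b> is the process of discovering patterns";
        System.out.println(snippet(raw, false));
        System.out.println(snippet(raw, true));
    }
}
```

Leaving the default of false is usually the safer choice for clustering input, because the tags carry no lexical information.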
The eTools document source searches the web using the etools.ch metasearch engine.
| Key |
SearchEngineBase.compressed
|
| Direction |
Output
|
| Description | Indicates whether the search engine returned a compressed result stream. |
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value | none |
| Key |
SearchEngineStats.pageRequests
|
| Direction |
Output
|
| Description | Number of individual page requests issued by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
SearchEngineStats.queries
|
| Direction |
Output
|
| Description | Number of queries handled successfully by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.Collection
|
| Default value | none |
| Key |
EToolsDocumentSource.country
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Determines the country of origin for the returned search results. |
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.etools.EToolsDocumentSource$Country
|
| Default value |
ALL
|
| Allowed values |
|
| Key |
EToolsDocumentSource.language
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Determines the language of the returned search results. |
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.etools.EToolsDocumentSource$Language
|
| Default value |
ENGLISH
|
| Allowed values |
|
| Key |
EToolsDocumentSource.safeSearch
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | If enabled, excludes offensive content from the results. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value |
false
|
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query to perform. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Value content | Must not be blank |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Key |
start
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Index of the first document/search result to fetch. The index starts at zero. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
0
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
EToolsDocumentSource.dataSources
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Determines which data sources to search. |
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.etools.EToolsDocumentSource$DataSources
|
| Default value |
ALL
|
| Allowed values |
|
| Key |
EToolsDocumentSource.partnerId
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | eTools partner identifier. If you have commercial arrangements with eTools, specify your partner id here. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value |
Carrot2
|
| Key |
EToolsDocumentSource.serviceUrlBase
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Base URL for the eTools service. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value |
http://www.etools.ch/partnerSearch.do
|
| Key |
EToolsDocumentSource.timeout
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Maximum time in milliseconds to wait for all data sources to return results. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
4000
|
| Min value |
0
|
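The Required, Min value and Value content rows above amount to constraints that an attribute map has to satisfy before a request is processed. The following sketch checks a few of them by hand, using the attribute keys from the eTools tables. It is a plain Java illustration with a hypothetical class name, not Carrot2's own attribute binding or validation code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: Carrot2 performs its own attribute binding and validation.
final class AttributeCheckSketch {
    /** Returns a list of human-readable constraint violations, empty if none. */
    static List<String> violations(Map<String, Object> attributes) {
        List<String> problems = new ArrayList<>();
        Object query = attributes.get("query");
        if (!(query instanceof String) || ((String) query).trim().isEmpty()) {
            problems.add("query is required and must not be blank");
        }
        checkMin(attributes, "results", 1, problems);
        checkMin(attributes, "start", 0, problems);
        checkMin(attributes, "EToolsDocumentSource.timeout", 0, problems);
        return problems;
    }

    private static void checkMin(
            Map<String, Object> attributes, String key, int min, List<String> problems) {
        Object value = attributes.get(key);
        if (value == null) {
            return; // optional attribute: the documented default applies
        }
        if (!(value instanceof Integer) || (Integer) value < min) {
            problems.add(key + " must be an Integer >= " + min);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> attributes = new HashMap<>();
        attributes.put("query", "data mining");
        attributes.put("results", 50);
        attributes.put("EToolsDocumentSource.timeout", 2000);
        System.out.println(violations(attributes));
    }
}
```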
Searches the web using the MSN Live API.
| Key |
SearchEngineBase.compressed
|
| Direction |
Output
|
| Description | Indicates whether the search engine returned a compressed result stream. |
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value | none |
| Key |
SearchEngineStats.pageRequests
|
| Direction |
Output
|
| Description | Number of individual page requests issued by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
SearchEngineStats.queries
|
| Direction |
Output
|
| Description | Number of queries handled successfully by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.Collection
|
| Default value | none |
| Key |
MicrosoftLiveDocumentSource.culture
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Culture and language restriction. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
org.carrot2.source.microsoft.CultureInfo
|
| Default value |
ENGLISH_UNITED_STATES
|
| Allowed values |
|
| Key |
MicrosoftLiveDocumentSource.safeSearch
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Safe search restriction (porn filter). |
| Required |
yes
|
| Scope | Processing time |
| Value type |
org.carrot2.source.microsoft.SafeSearch
|
| Default value |
MODERATE
|
| Allowed values |
|
| Key |
search-mode
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
|
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.MultipageSearchEngine$SearchMode
|
| Default value |
SPECULATIVE
|
| Allowed values |
|
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query to perform. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Value content | Must not be blank |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Key |
start
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Index of the first document/search result to fetch. The index starts at zero. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
0
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
MicrosoftLiveDocumentSource.appid
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Microsoft-assigned application ID for querying the API. Please generate your own ID for production deployments and for branches of the Carrot2.org code. |
| Required |
yes
|
| Scope | Initialization time |
| Value type |
java.lang.String
|
| Default value |
DE531D8A42139F590B253CADFAD7A86172F93B96
|
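The start and results attributes describe one logical range of documents, while SearchEngineStats.pageRequests counts the individual page requests needed to cover it. The sketch below shows one way such a range could be split into pages. It is an assumption-laden illustration, not the MultipageSearchEngine code: the class name and the pageSize parameter are hypothetical, and the actual page size of the MSN Live API is not stated in this section.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not the MultipageSearchEngine implementation.
final class PageRangeSketch {
    /**
     * Splits the range [start, start + results) into {from, count} pairs,
     * each covering at most pageSize documents.
     */
    static List<int[]> ranges(int start, int results, int pageSize) {
        List<int[]> pages = new ArrayList<>();
        for (int from = start; from < start + results; from += pageSize) {
            int count = Math.min(pageSize, start + results - from);
            pages.add(new int[] {from, count});
        }
        return pages;
    }

    public static void main(String[] args) {
        // 120 results starting at index 10, fetched in hypothetical pages of 50.
        for (int[] page : ranges(10, 120, 50)) {
            System.out.println("from=" + page[0] + " count=" + page[1]);
        }
    }
}
```

Each pair would correspond to one page request, so the size of the returned list matches what a pageRequests counter would report for a fully fetched range.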
Searches the web using the Yahoo Boss Web Search API.
| Key |
SearchEngineBase.compressed
|
| Direction |
Output
|
| Description | Indicates whether the search engine returned a compressed result stream. |
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value | none |
| Key |
SearchEngineStats.pageRequests
|
| Direction |
Output
|
| Description | Number of individual page requests issued by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
SearchEngineStats.queries
|
| Direction |
Output
|
| Description | Number of queries handled successfully by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.Collection
|
| Default value | none |
| Key |
BossDocumentSource.keepHighlights
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Determines whether to keep the original query word highlights.
Yahoo by default highlights query words in search results using the <b> HTML tag. Set this attribute to true to keep these highlights. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value |
false
|
| Key |
BossWebSearchService.filter
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Filters out adult or hate content.
Must be a comma-separated list of content types to filter out. The following content types are supported:
Adult content filtering is supported for all languages, hate content filtering is supported for English only. |
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.boss.BossWebSearchService$OffensiveContentFilter
|
| Default value | none |
| Allowed values |
|
| Key |
BossSearchService.sites
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Restricts search results to a set of sites.
Must be a comma-separated list of site domain names, e.g. abc.com,cnn.com. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Key |
BossSearchService.languageAndRegion
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Restricts search to the specified language and region.
Must be a concatenation of constants defined by the language codes supported by the Yahoo Boss API. The following languages and regions are currently (July 2009) supported:
Use |
| Required |
no
|
| Scope | Initialization time and Processing time |
| Value type |
org.carrot2.source.boss.BossLanguageCodes
|
| Default value | none |
| Allowed values |
|
| Key |
BossWebSearchService.type
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Restricts search to documents of the specified types.
Must be a comma-separated list of the required document types or type groups. The following document types are supported:
The following document type groups are supported:
You can also specify a format group and then exclude an item: |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Key |
search-mode
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
|
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.MultipageSearchEngine$SearchMode
|
| Default value |
SPECULATIVE
|
| Allowed values |
|
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query to perform. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Value content | Must not be blank |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Key |
start
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Index of the first document/search result to fetch. The index starts at zero. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
0
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
BossSearchService.appid
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Application ID required for BOSS services. Please generate your own ID for production deployments and for branches of the Carrot2.org code. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
java.lang.String
|
| Default value |
txRLTt7V34GgabH9baqIrsnRLuy87i4dQ2kQyok0IIqlUXdw4HmxjE59xhq2_6mT0LM-
|
| Key |
BossDocumentSource.service
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | The specific search service to be used by this document source. Use this attribute to choose which BOSS service to query, e.g. Web, News or Image search. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
org.carrot2.source.boss.BossSearchService
|
| Default value |
org.carrot2.source.boss.BossWebSearchService
|
| Allowed value types | No other assignable value types are allowed. |
| Key |
BossWebSearchService.serviceURI
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Boss Web search service URI.
Specifies the URI at which Yahoo Boss Web Search API is available. The ${query} place holder will be replaced with the URL-encoded text of the processed query. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
java.lang.String
|
| Default value |
http://boss.yahooapis.com/ysearch/web/v1/${query}
|
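The BossWebSearchService.serviceURI default places the ${query} place holder in the path of the URL rather than in a query string. A minimal sketch of the substitution follows. It is not the BossWebSearchService implementation, and the class and method names are hypothetical. One detail worth noting as an assumption of this sketch: java.net.URLEncoder produces form encoding, so the sketch converts the resulting '+' characters to %20 for use inside a path.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative only: not the BossWebSearchService implementation.
final class BossUriSketch {
    /** Substitutes the URL-encoded query for the ${query} place holder. */
    static String serviceUri(String template, String query) {
        // URLEncoder emits form encoding, where a space becomes '+'.
        // A literal '+' in the input is emitted as %2B, so every '+' in the
        // encoded text stands for a space and can safely become %20.
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8).replace("+", "%20");
        return template.replace("${query}", encoded);
    }

    public static void main(String[] args) {
        String template = "http://boss.yahooapis.com/ysearch/web/v1/${query}";
        System.out.println(serviceUri(template, "data mining"));
    }
}
```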
Searches Wikipedia using the Yahoo Boss Web Search API.
| Key |
SearchEngineBase.compressed
|
| Direction |
Output
|
| Description | Indicates whether the search engine returned a compressed result stream. |
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value | none |
| Key |
SearchEngineStats.pageRequests
|
| Direction |
Output
|
| Description | Number of individual page requests issued by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
SearchEngineStats.queries
|
| Direction |
Output
|
| Description | Number of queries handled successfully by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.Collection
|
| Default value | none |
| Key |
BossDocumentSource.keepHighlights
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Determines whether to keep the original query word highlights.
Yahoo by default highlights query words in search results using the <b> HTML tag. Set this attribute to true to keep these highlights. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value |
false
|
| Key |
BossWebSearchService.filter
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Filters out adult or hate content.
Must be a comma-separated list of content types to filter out. The following content types are supported:
Adult content filtering is supported for all languages, hate content filtering is supported for English only. |
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.boss.BossWebSearchService$OffensiveContentFilter
|
| Default value | none |
| Allowed values |
|
| Key |
BossSearchService.sites
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Restricts search results to a set of sites.
Must be a comma-separated list of site domain names, e.g. abc.com,cnn.com. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value |
en.wikipedia.org
|
| Key |
BossSearchService.languageAndRegion
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Restricts search to the specified language and region.
Must be a concatenation of constants defined by the language codes supported by the Yahoo Boss API. The following languages and regions are currently (July 2009) supported:
Use |
| Required |
no
|
| Scope | Initialization time and Processing time |
| Value type |
org.carrot2.source.boss.BossLanguageCodes
|
| Default value | none |
| Allowed values |
|
| Key |
BossWebSearchService.type
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Restricts search to documents of the specified types.
Must be a comma-separated list of the required document types or type groups. The following document types are supported:
The following document type groups are supported:
You can also specify a format group and then exclude an item: |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Key |
search-mode
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
|
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.MultipageSearchEngine$SearchMode
|
| Default value |
SPECULATIVE
|
| Allowed values |
|
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query to perform. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Value content | Must not be blank |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Key |
start
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Index of the first document/search result to fetch. The index starts at zero. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
0
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
BossSearchService.appid
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Application ID required for BOSS services. Please generate your own ID for production deployments and for branches of the Carrot2.org code. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
java.lang.String
|
| Default value |
txRLTt7V34GgabH9baqIrsnRLuy87i4dQ2kQyok0IIqlUXdw4HmxjE59xhq2_6mT0LM-
|
| Key |
BossDocumentSource.service
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | The specific search service to be used by this document source. Use this attribute to choose which BOSS service to query, e.g. Web, News or Image search. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
org.carrot2.source.boss.BossSearchService
|
| Default value |
org.carrot2.source.boss.BossWebSearchService
|
| Allowed value types | No other assignable value types are allowed. |
| Key |
BossWebSearchService.serviceURI
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Boss Web search service URI.
Specifies the URI at which Yahoo Boss Web Search API is available. The ${query} place holder will be replaced with the URL-encoded text of the processed query. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
java.lang.String
|
| Default value |
http://boss.yahooapis.com/ysearch/web/v1/${query}
|
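The Wikipedia source differs from the plain Boss Web source mainly in its BossSearchService.sites default of en.wikipedia.org. Because that attribute is a comma-separated list of domain names, adding another site is a matter of extending the string. The sketch below parses such a value into individual domains. It is illustrative only, with a hypothetical class name, and the trimming and lower-casing are conveniences of the sketch rather than documented Carrot2 behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not the BossSearchService implementation.
final class SitesSketch {
    /** Splits a comma-separated sites value into trimmed, lower-cased domain names. */
    static List<String> parseSites(String sites) {
        List<String> result = new ArrayList<>();
        if (sites == null) {
            return result; // no restriction configured
        }
        for (String site : sites.split(",")) {
            String trimmed = site.trim().toLowerCase();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parseSites("en.wikipedia.org"));
        System.out.println(parseSites("en.wikipedia.org, de.wikipedia.org"));
    }
}
```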
Searches web images using the Yahoo Boss Image Search API.
| Key |
SearchEngineBase.compressed
|
| Direction |
Output
|
| Description | Indicates whether the search engine returned a compressed result stream. |
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value | none |
| Key |
SearchEngineStats.pageRequests
|
| Direction |
Output
|
| Description | Number of individual page requests issued by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
SearchEngineStats.queries
|
| Direction |
Output
|
| Description | Number of queries handled successfully by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.Collection
|
| Default value | none |
| Key |
BossDocumentSource.keepHighlights
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Determines whether to keep the original query word highlights.
Yahoo by default highlights query words in search results using the <b> HTML tag. Set this attribute to true to keep these highlights. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value |
false
|
| Key |
BossSearchService.sites
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Restricts search results to a set of sites.
Must be a comma-separated list of site domain names, e.g. abc.com,cnn.com. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
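A value for the sites attribute can be assembled from a list of domain names as in the sketch below. The helper is hypothetical; trimming whitespace and skipping empty entries are assumptions added for robustness and are not documented Carrot2 behavior.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of the Carrot2 API. */
class SitesSketch {
    /** Joins domain names into the comma-separated form, e.g. abc.com,cnn.com. */
    static String toSitesValue(List<String> domains) {
        return domains.stream()
            .map(String::trim)              // tolerate stray whitespace
            .filter(d -> !d.isEmpty())      // skip empty entries
            .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Prints: abc.com,cnn.com
        System.out.println(toSitesValue(List.of("abc.com", " cnn.com ")));
    }
}
```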
| Key |
BossSearchService.languageAndRegion
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Restricts search to the specified language and region.
Must be a concatenation of constants defined by the language codes supported by the Yahoo Boss API. The following languages and regions are currently (July 2009) supported:
Use |
| Required |
no
|
| Scope | Initialization time and Processing time |
| Value type |
org.carrot2.source.boss.BossLanguageCodes
|
| Default value | none |
| Allowed values |
|
| Key |
BossImageSearchService.filter
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | If enabled, excludes offensive content from the results. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value |
true
|
| Key |
BossImageSearchService.dimensions
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | The size of images to fetch. Small images are generally thumbnail or icon sized. Medium sized images are average sized; usually not exceeding an average screen size. Large images are screen size or larger. |
| Required |
no
|
| Scope | Initialization time and Processing time |
| Value type |
org.carrot2.source.boss.Dimensions
|
| Default value |
ALL
|
| Allowed values |
|
| Key |
search-mode
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
|
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.MultipageSearchEngine$SearchMode
|
| Default value |
SPECULATIVE
|
| Allowed values |
|
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query to perform. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Value content | Must not be blank |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Key |
start
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Index of the first document/search result to fetch. The index starts at zero. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
0
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
BossSearchService.appid
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Application ID required for BOSS services. Please generate your own ID for production deployments and for branches of the Carrot2.org code. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
java.lang.String
|
| Default value |
txRLTt7V34GgabH9baqIrsnRLuy87i4dQ2kQyok0IIqlUXdw4HmxjE59xhq2_6mT0LM-
|
| Key |
BossDocumentSource.service
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | The specific search service to be used by this document source. Use this attribute to choose which BOSS service to query, e.g. Web, News or Image search. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
org.carrot2.source.boss.BossSearchService
|
| Default value |
org.carrot2.source.boss.BossImageSearchService
|
| Allowed value types | Allowed value types: No other assignable value types are allowed. |
| Key |
BossImageSearchService.serviceURI
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Boss Image search service URI.
Specifies the URI at which the Yahoo Boss Image Search API is available. The ${query} placeholder will be replaced with the URL-encoded text of the processed query. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
java.lang.String
|
| Default value |
http://boss.yahooapis.com/ysearch/images/v1/${query}
|
Yahoo Boss News Search searches news using Yahoo Boss API.
| Key |
SearchEngineBase.compressed
|
| Direction |
Output
|
| Description | Indicates whether the search engine returned a compressed result stream. |
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value | none |
| Key |
SearchEngineStats.pageRequests
|
| Direction |
Output
|
| Description | Number of individual page requests issued by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
SearchEngineStats.queries
|
| Direction |
Output
|
| Description | Number of queries handled successfully by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.Collection
|
| Default value | none |
| Key |
BossDocumentSource.keepHighlights
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Determines whether to keep the original query word highlights.
Yahoo by default highlights query words in search results using the <b> HTML tag. Set this attribute to true to keep these highlights. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value |
false
|
| Key |
BossNewsSearchService.age
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Maximum age of returned news in days. The index retains stories for 30 days. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
7
|
| Min value |
1
|
| Max value |
30
|
| Key |
BossSearchService.sites
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Restricts search results to a set of sites.
Must be a comma-separated list of site domain names, e.g. abc.com,cnn.com. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Key |
BossSearchService.languageAndRegion
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Restricts search to the specified language and region.
Must be a concatenation of constants defined by the language codes supported by the Yahoo Boss API. The following languages and regions are currently (July 2009) supported:
Use |
| Required |
no
|
| Scope | Initialization time and Processing time |
| Value type |
org.carrot2.source.boss.BossLanguageCodes
|
| Default value | none |
| Allowed values |
|
| Key |
search-mode
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
|
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.MultipageSearchEngine$SearchMode
|
| Default value |
SPECULATIVE
|
| Allowed values |
|
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query to perform. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Value content | Must not be blank |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Key |
start
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Index of the first document/search result to fetch. The index starts at zero. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
0
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
BossSearchService.appid
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Application ID required for BOSS services. Please generate your own ID for production deployments and for branches of the Carrot2.org code. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
java.lang.String
|
| Default value |
txRLTt7V34GgabH9baqIrsnRLuy87i4dQ2kQyok0IIqlUXdw4HmxjE59xhq2_6mT0LM-
|
| Key |
BossDocumentSource.service
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | The specific search service to be used by this document source. Use this attribute to choose which BOSS service to query, e.g. Web, News or Image search. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
org.carrot2.source.boss.BossSearchService
|
| Default value |
org.carrot2.source.boss.BossNewsSearchService
|
| Allowed value types | Allowed value types: No other assignable value types are allowed. |
| Key |
BossNewsSearchService.serviceURI
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Boss News search service URI. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
java.lang.String
|
| Default value |
http://boss.yahooapis.com/ysearch/news/v1/${query}
|
Searches jobs from indeed.com.
| Key |
SearchEngineBase.compressed
|
| Direction |
Output
|
| Description | Indicates whether the search engine returned a compressed result stream. |
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value | none |
| Key |
SearchEngineStats.pageRequests
|
| Direction |
Output
|
| Description | Number of individual page requests issued by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
SearchEngineStats.queries
|
| Direction |
Output
|
| Description | Number of queries handled successfully by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.Collection
|
| Default value | none |
| Key |
search-mode
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
|
| Required |
no
|
| Scope | Processing time |
| Value type |
org.carrot2.source.MultipageSearchEngine$SearchMode
|
| Default value |
SPECULATIVE
|
| Allowed values |
|
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query to perform. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Value content | Must not be blank |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Key |
start
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Index of the first document/search result to fetch. The index starts at zero. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
0
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
OpenSearchDocumentSource.feedUrlParams
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Additional parameters to be appended to feedUrlTemplate on each request.
|
| Required |
no
|
| Scope | Initialization time and Processing time |
| Value type |
java.util.Map
|
| Default value | none |
| Key |
OpenSearchDocumentSource.feedUrlTemplate
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | URL to fetch the search feed from.
The URL template can contain variable placeholders, as defined by the OpenSearch specification, that will be replaced at runtime. The format of the placeholder is ${variable}. The following variables are supported:
|
| Required |
yes
|
| Scope | Initialization time |
| Value type |
java.lang.String
|
| Default value |
http://www.indeed.com/opensearch?q=${searchTerms}&start=${startIndex}&limit=${count}
|
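The ${variable} substitution in feedUrlTemplate can be sketched as below. The variable names searchTerms, startIndex and count are taken from the default value above; the class FeedUrlTemplateSketch, its expand method and the decision to reject unknown variables are assumptions made for this illustration and do not describe Carrot2 internals.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the Carrot2 API. */
class FeedUrlTemplateSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    /** Replaces every ${name} in the template with the URL-encoded value of that variable. */
    static String expand(String template, Map<String, String> variables) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = variables.get(m.group(1));
            if (value == null) {
                // Assumption: an unknown variable is treated as an error.
                throw new IllegalArgumentException("No value for variable: " + m.group(1));
            }
            String encoded = URLEncoder.encode(value, StandardCharsets.UTF_8);
            m.appendReplacement(out, Matcher.quoteReplacement(encoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template =
            "http://www.indeed.com/opensearch?q=${searchTerms}&start=${startIndex}&limit=${count}";
        // Prints http://www.indeed.com/opensearch?q=java+developer&start=0&limit=50
        System.out.println(expand(template,
            Map.of("searchTerms", "java developer", "startIndex", "0", "count", "50")));
    }
}
```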
| Key |
OpenSearchDocumentSource.maximumResults
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Maximum number of results. The maximum number of results the document source can deliver. |
| Required |
no
|
| Scope | Initialization time |
| Value type |
java.lang.Integer
|
| Default value |
400
|
| Min value |
1
|
| Key |
OpenSearchDocumentSource.resultsPerPage
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Results per page. The number of results per page the document source will expect the feed to return. |
| Required |
yes
|
| Scope | Initialization time |
| Value type |
java.lang.Integer
|
| Default value |
50
|
| Min value |
1
|
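The interplay of start, results, resultsPerPage and maximumResults can be illustrated with a small paging calculation. This is only a sketch under stated assumptions: that the requested range is capped at maximumResults and that the last page may be requested with a smaller count. The source does not say how Carrot2 actually splits requests, and all names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Carrot2 API. */
class PagingSketch {
    /** Returns {startIndex, count} pairs that cover the requested range page by page. */
    static List<int[]> pages(int start, int results, int resultsPerPage, int maximumResults) {
        // Exclusive end of the range, capped by what the source can deliver (assumption).
        int end = Math.min(start + results, maximumResults);
        List<int[]> pages = new ArrayList<>();
        for (int from = start; from < end; from += resultsPerPage) {
            pages.add(new int[] {from, Math.min(resultsPerPage, end - from)});
        }
        return pages;
    }

    public static void main(String[] args) {
        // With the defaults above (50 per page, 400 maximum), 120 results from index 0
        // need three requests: 0..49, 50..99 and 100..119.
        for (int[] page : pages(0, 120, 50, 400)) {
            System.out.println("start=" + page[0] + " count=" + page[1]);
        }
    }
}
```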
XML document source retrieves documents from local XML files or remote XML streams. It can optionally apply an XSLT transformation to convert the XML to the required format.
| Key |
documents
|
| Direction |
Output
|
| Description | Documents read from the XML data. |
| Scope | Processing time |
| Value type |
java.util.List
|
| Default value | none |
| Key |
query
|
| Direction |
Input
and
Output
|
| Level |
BASIC
|
| Description | After processing this field may hold the query read from the XML data, if any.
For the semantics of this field on input, see xml. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Key |
XmlDocumentSource.readAll
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | If true, all documents are read from the input XML stream, regardless of the limit set by results.
|
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value |
true
|
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | The maximum number of documents to read from the XML data if readAll is false.
|
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
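The relationship between readAll and results can be summarized in a few lines. The sketch assumes the documents have already been parsed into a list; the class and method names are hypothetical and the code is not taken from Carrot2.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of the Carrot2 API. */
class ReadLimitSketch {
    /** Returns all parsed documents when readAll is true, otherwise at most 'results' of them. */
    static <T> List<T> select(List<T> parsed, boolean readAll, int results) {
        if (readAll || parsed.size() <= results) {
            return parsed;
        }
        return parsed.subList(0, results);
    }

    public static void main(String[] args) {
        List<String> docs = List.of("doc1", "doc2", "doc3");
        System.out.println(select(docs, true, 2));   // prints [doc1, doc2, doc3]
        System.out.println(select(docs, false, 2));  // prints [doc1, doc2]
    }
}
```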
| Key |
processing-result.title
|
| Direction |
Output
|
| Description | The title (file name or query attribute, if present) for the search result fetched from the resource. A typical title for a processing result will be the query used to fetch documents from that source. For certain document sources the query may not be needed (on-disk XML, feed of syndicated news); in such cases, the input component should set its title properly for visual interfaces such as the workbench. |
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Key |
XmlDocumentSource.xmlParameters
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Values for custom placeholders in the XML URL.
If the type of resource provided in the xml attribute is URLResourceWithParams, this map provides values for custom placeholders found in the XML URL. Keys of the map correspond to placeholder names, values of the map will be used to replace the placeholders. Please see xml for the placeholder syntax. |
| Required |
no
|
| Scope | Initialization time and Processing time |
| Value type |
java.util.Map
|
| Default value |
{}
|
| Key |
XmlDocumentSource.xml
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | The resource to load XML data from.
You can either create instances of IResource implementations directly or use ResourceUtils to look up IResource instances from a variety of locations. One special
Additionally, custom placeholders can be used. Values for the custom placeholders should be provided in the |
| Required |
yes
|
| Scope | Initialization time and Processing time |
| Value type |
org.carrot2.util.resource.IResource
|
| Default value | none |
| Allowed value types | Allowed value types: Other assignable value types are allowed. |
| Key |
XmlDocumentSource.xsltParameters
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Parameters to be passed to the XSLT transformer. Keys of the map will be used as parameter names, values of the map as parameter values. |
| Required |
no
|
| Scope | Initialization time and Processing time |
| Value type |
java.util.Map
|
| Default value |
{}
|
| Key |
XmlDocumentSource.xslt
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | The resource to load XSLT stylesheet from.
The XSLT stylesheet is optional and is useful when the source XML stream does not follow the Carrot2 format. The XSLT transformation will be applied to the source XML stream, the transformed XML stream will be deserialized into Documents. The XSLT To pass additional parameters to the XSLT transformer, use the |
| Required |
no
|
| Scope | Initialization time and Processing time |
| Value type |
org.carrot2.util.resource.IResource
|
| Default value | none |
| Allowed value types | Allowed value types: Other assignable value types are allowed. |
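How a parameter map such as xsltParameters reaches an XSLT stylesheet can be shown with the standard javax.xml.transform API. The sketch below is independent of Carrot2: the stylesheet, the parameter name prefix and the input XML are made up for the example, and the output is plain text rather than the Carrot2 XML format.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper for illustration; not part of the Carrot2 API. */
class XsltParametersSketch {
    /** Example stylesheet with one parameter, "prefix" (a made-up name). */
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:param name='prefix'/>"
        + "<xsl:template match='/'>"
        + "<xsl:value-of select='$prefix'/><xsl:value-of select='/doc/title'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML; map keys become parameter names, values parameter values. */
    static String transform(String xml, String xslt, Map<String, ?> params) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            for (Map.Entry<String, ?> e : params.entrySet()) {
                transformer.setParameter(e.getKey(), e.getValue());
            }
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<doc><title>Carrot2</title></doc>",
            STYLESHEET, Map.of("prefix", "Title: ")));
    }
}
```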
Google Desktop document source searches the local instance of Google Desktop.
| Key |
SearchEngineBase.compressed
|
| Direction |
Output
|
| Description | Indicates whether the search engine returned a compressed result stream. |
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value | none |
| Key |
SearchEngineStats.pageRequests
|
| Direction |
Output
|
| Description | Number of individual page requests issued by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
SearchEngineStats.queries
|
| Direction |
Output
|
| Description | Number of queries handled successfully by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.Collection
|
| Default value | none |
| Key |
GoogleDesktopDocumentSource.keepHighlights
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Keep query word highlighting.
Google by default highlights query words in snippets using the bold HTML tag. Set this attribute to true to keep these highlights. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value |
false
|
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query to perform. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Value content | Must not be blank |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Key |
start
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Index of the first document/search result to fetch. The index starts at zero. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
0
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
GoogleDesktopDocumentSource.queryUrl
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Query URL.
Installation-specific URL at which the Google Desktop search service is available. On Windows machines, the URL is available at the HKEY_CURRENT_USER\Software\Google\Google Desktop\API\search_url system registry key and Carrot2 will attempt to automatically read the value from the registry when run with Administrator privileges. Please consult the Google Desktop API documentation for further instructions on how to determine the query URL on other systems. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
Solr document source queries an instance of Apache Solr search engine.
| Key |
SearchEngineBase.compressed
|
| Direction |
Output
|
| Description | Indicates whether the search engine returned a compressed result stream. |
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value | none |
| Key |
SearchEngineStats.pageRequests
|
| Direction |
Output
|
| Description | Number of individual page requests issued by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
SearchEngineStats.queries
|
| Direction |
Output
|
| Description | Number of queries handled successfully by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.Collection
|
| Default value | none |
| Key |
SolrDocumentSource.solrSummaryFieldName
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Summary field name. Name of the Solr field that will provide document summary. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value |
description
|
| Key |
SolrDocumentSource.solrTitleFieldName
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Title field name. Name of the Solr field that will provide document titles. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value |
title
|
| Key |
SolrDocumentSource.solrUrlFieldName
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | URL field name. Name of the Solr field that will provide document URLs. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value |
url
|
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query to perform. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Value content | Must not be blank |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Key |
start
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Index of the first document/search result to fetch. The index starts at zero. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
0
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
SolrDocumentSource.serviceUrlBase
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Solr service URL base. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value |
http://localhost:8983/solr/select
|
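A request against the Solr service URL base could be assembled roughly as follows. The q, start and rows parameters are standard Solr query parameters; that the query, start and results attributes map onto them in exactly this way is an assumption of this sketch, and the helper names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of the Carrot2 API. */
class SolrUrlSketch {
    /** Builds a select URL from the service URL base and the basic search attributes. */
    static String selectUrl(String serviceUrlBase, String query, int start, int results) {
        return serviceUrlBase
            + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
            + "&start=" + start      // index of the first result
            + "&rows=" + results;    // maximum number of results
    }

    public static void main(String[] args) {
        // Prints http://localhost:8983/solr/select?q=data+mining&start=0&rows=100
        System.out.println(selectUrl("http://localhost:8983/solr/select", "data mining", 0, 100));
    }
}
```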
Serves documents from the Ambient test set. Ambient (AMBIguous ENTries) is a data set designed for evaluating subtopic information retrieval. It consists of 44 topics, each with a set of subtopics and a list of 100 ranked documents. For more information, please see: http://credo.fub.it/ambient.
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.List
|
| Default value | none |
| Key |
FubDocumentSource.minTopicSize
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Minimum topic size. Documents belonging to a topic with fewer documents than minimum topic size will not be returned. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
1
|
| Min value |
1
|
| Key |
query
|
| Direction |
Output
|
| Description | Query to perform. |
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Max value |
100
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
AmbientDocumentSource.topic
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Ambient Topic. The Ambient Topic to load documents from. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
org.carrot2.source.ambient.AmbientDocumentSource$AmbientTopic
|
| Default value |
AIDA
|
| Allowed values |
|
| Key |
FubDocumentSource.topicIds
|
| Direction |
Output
|
| Description | Topics and subtopics covered in the output documents.
The set is computed for the output documents and it may vary for the same main topic based e.g. on the requested number of results or minTopicSize. |
| Scope | Processing time |
| Value type |
java.util.Set
|
| Default value | none |
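The effect of minTopicSize on the returned documents can be sketched as a simple filter. The example assumes documents are grouped by subtopic identifier in a map; this data layout, the identifiers and the class name are hypothetical and do not reflect how the document source stores the test set.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the Carrot2 API. */
class MinTopicSizeSketch {
    /** Keeps only documents whose subtopic has at least minTopicSize documents. */
    static List<String> filter(Map<String, List<String>> docsBySubtopic, int minTopicSize) {
        List<String> kept = new ArrayList<>();
        for (List<String> docs : docsBySubtopic.values()) {
            if (docs.size() >= minTopicSize) {
                kept.addAll(docs);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("1.1", List.of("a", "b"));   // made-up subtopic identifiers
        docs.put("1.2", List.of("c"));
        System.out.println(filter(docs, 2));  // prints [a, b]
    }
}
```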
Serves documents from the ODP239 test set. ODP239 is a data set designed for evaluating subtopic information retrieval. It consists of 239 topics extracted from the Open Directory Project, each with a set of subtopics and a list of about 100 documents. For more information, please see: http://credo.fub.it/odp239.
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.List
|
| Default value | none |
| Key |
FubDocumentSource.minTopicSize
|
| Direction |
Input
|
| Level |
MEDIUM
|
| Description | Minimum topic size. Documents belonging to a topic with fewer documents than minimum topic size will not be returned. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
1
|
| Min value |
1
|
| Key |
query
|
| Direction |
Output
|
| Description | Query to perform. |
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
1000
|
| Min value |
1
|
| Max value |
1000
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |
| Key |
Odp239DocumentSource.topic
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | ODP239 Topic. The ODP239 Topic to load documents from. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
org.carrot2.source.ambient.Odp239DocumentSource$Odp239Topic
|
| Default value |
ARTS_ANIMATION
|
| Allowed values |
|
| Key |
FubDocumentSource.topicIds
|
| Direction |
Output
|
| Description | Topics and subtopics covered in the output documents.
The set is computed for the output documents and it may vary for the same main topic based e.g. on the requested number of results or minTopicSize. |
| Scope | Processing time |
| Value type |
java.util.Set
|
| Default value | none |
Searches the PubMed medical abstracts database.
| Key |
SearchEngineBase.compressed
|
| Direction |
Output
|
| Description | Indicates whether the search engine returned a compressed result stream. |
| Scope | Processing time |
| Value type |
java.lang.Boolean
|
| Default value | none |
| Key |
SearchEngineStats.pageRequests
|
| Direction |
Output
|
| Description | Number of individual page requests issued by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
SearchEngineStats.queries
|
| Direction |
Output
|
| Description | Number of queries handled successfully by this data source. |
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value | none |
| Key |
documents
|
| Direction |
Output
|
| Description | Documents returned by the search engine/document retrieval system. |
| Scope | Processing time |
| Value type |
java.util.Collection
|
| Default value | none |
| Key |
query
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Query to perform. |
| Required |
yes
|
| Scope | Processing time |
| Value type |
java.lang.String
|
| Default value | none |
| Value content | Must not be blank |
| Key |
results
|
| Direction |
Input
|
| Level |
BASIC
|
| Description | Maximum number of documents/search results to fetch. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
100
|
| Min value |
1
|
| Key |
start
|
| Direction |
Input
|
| Level |
ADVANCED
|
| Description | Index of the first document/search result to fetch. The index starts at zero. |
| Required |
no
|
| Scope | Processing time |
| Value type |
java.lang.Integer
|
| Default value |
0
|
| Min value |
0
|
| Key |
results-total
|
| Direction |
Output
|
| Description | Estimated total number of matching documents. |
| Scope | Processing time |
| Value type |
java.lang.Long
|
| Default value | none |