Carrot2

User and Developer Manual

Abstract

This document serves as documentation for the Carrot2 framework. It describes the Carrot2 application suite and the API developers can use to integrate Carrot2 clustering algorithms into their code. It also provides a reference of all Carrot2 components and their attributes.

Carrot2 Online Demo: http://search.carrot2.org
Carrot2 website: http://project.carrot2.org


Table of Contents

1. Introduction
2. FAQ
2.1. Is Carrot2 suitable for me?
2.2. How do I use Carrot2?
2.3. How can I improve clustering?
3. Application suite
3.1. Carrot2 Document Clustering Workbench
3.2. Carrot2 Document Clustering Server
3.3. Carrot2 Web Application
3.4. Carrot2 Command Line Interface
4. Getting started
4.1. Requirements
4.2. Trying Carrot2 clustering
4.2.1. Clustering results from common search engines
4.2.2. Clustering plain text, HTML and MS Word documents
4.2.3. Clustering documents from XML files
4.2.4. Clustering documents from XML feeds
4.2.5. Clustering documents from a Lucene index
4.2.6. Clustering documents from a Solr index
4.2.7. Saving documents or clusters for further processing
4.3. Integrating Carrot2 with your software
4.3.1. Compiling a Java program using Carrot2 API
4.3.2. Adding Carrot2 dependency to a Maven2 project
4.3.3. Setting up a Maven2 project with Carrot2 dependency
4.3.4. Setting up a Carrot2 project in Eclipse IDE
4.3.5. Setting up Carrot2 source code in Eclipse IDE
4.3.6. Calling Carrot2 clustering from non-Java software
5. Tuning clustering
5.1. Desirable characteristics of documents for clustering
5.2. Choosing the clustering algorithm
5.3. Tuning clustering in Carrot2 Document Clustering Workbench
5.4. Modifying the list of stop words
5.5. Excluding specific clusters from results
5.6. Reducing the size of the Other Topics cluster
5.7. Improving clustering performance
5.7.1. Improving performance of Lingo
5.7.2. Improving performance of STC
5.8. Benchmarking clustering performance
6. Customizing applications
6.1. Component suites and attributes
6.1.1. Component suites
6.1.2. Component attributes
6.2. Adding document sources to Carrot2 Web Application
6.3. Adding document sources to Carrot2 Document Clustering Server
6.4. Customizing Lingo for Carrot2 Web Application
6.5. Customizing Lingo for Carrot2 Document Clustering Server
6.6. Customizing Lingo for Carrot2 Command Line Interface
6.7. Adding document sources to Carrot2 Document Clustering Workbench
7. Advanced topics
7.1. Integration with Apache Solr
7.2. Running Carrot2 in Eclipse IDE
7.2.1. Running Carrot2 Document Clustering Workbench in Eclipse IDE
7.2.2. Running Carrot2 Web Application in Eclipse IDE
7.3. Building Carrot2 from source code
7.3.1. Building Carrot2 Document Clustering Workbench
7.3.2. Building Carrot2 Web Application
7.4. Using Carrot2 Document Clustering Server with curl
7.5. Working with HTTP proxies
7.6. Enabling native matrix computations
8. Troubleshooting
8.1. Troubleshooting Carrot2 Document Clustering Workbench
8.1.1. Increasing memory size
8.1.2. Getting exception stack trace
8.2. Troubleshooting Carrot2 Web Application
8.2.1. "?" characters instead of Unicode special characters
9. Architecture and API
9.1. Carrot2 architecture overview
9.1.1. Processing component pipeline
9.1.2. Processing component attributes
9.2. Carrot2 XML data formats
9.2.1. Carrot2 input XML format
9.2.2. Carrot2 output XML format
9.3. Carrot2 JSON data format
9.3.1. Carrot2 output JSON format
10. Carrot2 Development
10.1. Stable release procedure
10.2. QA check list
11. Component reference
11.1. By Source Clustering
11.2. By URL Clustering
11.3. Lingo Clustering
11.4. Suffix Tree Clustering
11.5. Open Search
11.6. Google Web Search
11.7. eTools Metasearch Engine
11.8. MSN Live Search
11.9. Yahoo Web Search
11.10. Wikipedia Search (with Yahoo Boss)
11.11. Yahoo Image Search
11.12. Yahoo Boss News Search
11.13. Jobs from indeed.com
11.14. XML
11.15. Google Desktop search
11.16. Solr Search Engine
11.17. Ambient Test Set
11.18. ODP239 Test Set
11.19. PubMed medical database

List of Figures

3.1. Carrot2 Document Clustering Workbench screenshot
3.2. Carrot2 Document Clustering Server quick start screen
3.3. Carrot2 Web Application results screen
4.1. Carrot2 Document Clustering Workbench Google Desktop search view
4.2. Carrot2 Document Clustering Workbench XML search view
4.3. News feed XML to Carrot2 format transformation
4.4. Document attribute that contains a list of values.
4.5. Carrot2 Document Clustering Workbench Lucene search view
4.6. Carrot2 Document Clustering Workbench Solr search view
4.7. Setting up Carrot2 Java API in Eclipse IDE
4.8. Eclipse IDE Carrot2 project import step 1
4.9. Eclipse IDE Carrot2 project import step 2
5.1. Lingo and STC clusters for the 'data mining' search results
5.2. Tuning clustering in Carrot2 Document Clustering Workbench
5.3. Attributes view's context menu
5.4. Preprocessing attributes section
5.5. Carrot2 Document Clustering Workbench Benchmark view
6.1. Example Carrot2 component suite
6.2. Example Carrot2 attribute set
7.1. Attribute Metadata XML Run Configuration
7.2. Workbench Run Configuration
7.3. Using DCS and curl to cluster data from document source
7.4. Using DCS and curl to cluster data from document source
8.1. Carrot2 Document Clustering Workbench error dialog
8.2. Carrot2 Document Clustering Workbench Show View dialog
8.3. Carrot2 Document Clustering Workbench Error Log view
8.4. Carrot2 Document Clustering Workbench Event Details dialog
9.1. Carrot2 input XML format
9.2. Carrot2 output XML format
9.3. Carrot2 output JSON format

List of Tables

5.1. Characteristics of Lingo and STC clustering algorithms
5.2. Optimum usage scenarios for Lingo and STC

1 Introduction

What is Carrot2 and what it is not

Carrot2 is a library and a set of supporting applications you can use to build a search results clustering engine. Such an engine will organize your search results into topics, fully automatically and without external knowledge such as taxonomies or preclassified content.

Carrot2 contains two document clustering algorithms designed specifically for search results clustering: Suffix Tree Clustering and Lingo. Carrot2 also contains components for fetching search results from several search engines, such as Yahoo!, MSN Live and Google, and supports other sources of documents such as a Lucene, Solr or Google Desktop index.

Carrot2 is not a search engine itself: it does not have a crawler or an indexer. There are a number of Open Source projects you can use to crawl (Nutch), index and search (Lucene, Solr) your content, which can then be queried and clustered by Carrot2.

In most cases your workflow with Carrot2 applications would be the following:

  1. Use Carrot2 Document Clustering Workbench and possibly other applications from Carrot2 application suite to see what the clustering results are like for your content. If the results are promising, you can use the Carrot2 Document Clustering Workbench to further tune the clustering algorithm's settings.

  2. If you are developing Java software, use the Carrot2 API and JAR to integrate clustering into your code. For non-Java environments, set up the Carrot2 Document Clustering Server and call Carrot2 clustering using the REST protocol.

Chapter 2 answers the questions most frequently asked on Carrot2 mailing lists; it can also serve as a question-based index to the rest of this manual. Chapter 3 introduces applications available in the Carrot2 distribution and Chapter 4 shows how to quickly set up Carrot2 to cluster your own data. Chapter 5 discusses topics related to tuning Carrot2 clustering, while Chapter 6 shows how to customize Carrot2 applications. Chapter 7 covers some more advanced use cases of Carrot2 and Chapter 8 provides solutions to common problems. Finally, Chapter 9 discusses Carrot2 architecture and internals, while Chapter 11 is an in-depth reference of Carrot2 components.

2 FAQ

Frequently Asked Questions

This chapter answers the questions most frequently asked on Carrot2 mailing lists. As it links extensively to further sections of the manual, it can also be treated as a sort of question-based index for this manual.

2.1 Is Carrot2 suitable for me?

Can I use Carrot2 in a commercial project?
How can I acknowledge the use of Carrot2 on my site?
Can Carrot2 crawl my website?
Can I use Carrot2 to cluster something else than search results?
How does Carrot2 clustering scale with respect to the number and length of documents?
Can I force Carrot2 to cluster my documents to some predefined clusters / labels?
Can Carrot2 cluster content in other languages than English?

Can I use Carrot2 in a commercial project?

Yes. The only requirement is that you properly acknowledge the use of Carrot2 (on your project's website and documentation) and let us know about your project. Please also remember to read the license.

How can I acknowledge the use of Carrot2 on my site?

Please put a statement equivalent to “This product includes software developed by the Carrot2 Project” on your site and link it to Carrot2's website (http://www.carrot2.org). Additionally, you can use some of our powered-by logos if you like.

Can Carrot2 crawl my website?

No. Carrot2 can add clustering of search results to an existing search engine. You can use an Open Source project called Nutch to crawl your website. Nutch has a Carrot2-based search clustering plugin, so you'll get all crawling, searching and clustering in one piece.

Can I use Carrot2 to cluster something else than search results?

Absolutely. Carrot2 came about as a framework for building search results clustering engines but its algorithms should successfully cluster up to about a thousand text documents, a few paragraphs each.

How does Carrot2 clustering scale with respect to the number and length of documents?

The most important characteristic of Carrot2 algorithms to keep in mind is that they perform in-memory clustering. For this reason, as a rule of thumb, depending on the algorithm, Carrot2 should successfully deal with up to a few thousand documents, a few paragraphs each. For algorithms designed to process millions of documents, you may want to check out the Mahout project.

Can I force Carrot2 to cluster my documents to some predefined clusters / labels?

No. Assigning documents to a set of predefined categories is a problem called text classification / categorization and Carrot2 was not designed to solve it. For text classification components you may want to see the LingPipe project.

Can Carrot2 cluster content in other languages than English?

Yes. Currently, Carrot2 can cluster content in 17 languages:

  • Chinese Simplified (experimental)
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish

Please note, however, that for some of the languages you may need to tune the stop words to achieve best results.

2.2 How do I use Carrot2?

What is the query syntax in Carrot2?
Which Carrot2 clustering algorithm is the best?
Does Carrot2 support boolean querying?

What is the query syntax in Carrot2?

As Carrot2 is not a search engine on its own, there is no common query syntax in Carrot2. The syntax depends on the underlying search engine you set Carrot2 to use, e.g. Yahoo!, Solr, Lucene or any other. Carrot2 passes your query without any modifications to the search engine and clusters the results it returns. For this reason, any syntax supported by the search engine is automatically supported in Carrot2.

Which Carrot2 clustering algorithm is the best?

There is no one clear answer to this question. The choice of the algorithm depends on the input data and the desired characteristics of clusters. Please see Section 5.2 for some guidelines.

Does Carrot2 support boolean querying?

If the underlying search engine supports boolean queries, so will Carrot2. Please see the question about query syntax above for more details.

2.3 How can I improve clustering?

What is the most suitable content for clustering in Carrot2?
How can I remove meaningless cluster labels?
How can I improve the performance of Carrot2?

What is the most suitable content for clustering in Carrot2?

Please see Section 5.1 for the answer.

How can I remove meaningless cluster labels?

Occasionally, Carrot2 may create meaningless cluster labels like "read" or "site". Please see Section 5.5 for information on how to remove them.

How can I improve the performance of Carrot2?

Please see Section 5.7 for some clustering performance tips.

3 Application suite

Applications shipped with Carrot2

Carrot2 comes with a number of supporting applications that you can use to quickly set up clustering on your own data, further tune clustering results and expose Carrot2 clustering as a remote service.

Carrot2 application suite contains:

  • Carrot2 Document Clustering Workbench, a standalone GUI application you can use to experiment with Carrot2 clustering on data from common search engines or your own data,

  • Carrot2 Document Clustering Server, which exposes Carrot2 clustering as a REST service,

  • Carrot2 Command Line Interface applications, which allow invoking Carrot2 clustering from the command line,

  • Carrot2 Web Application, which exposes Carrot2 clustering as a web application for end users.

3.1 Carrot2 Document Clustering Workbench

Carrot2 Document Clustering Workbench is a standalone GUI application you can use to experiment with Carrot2 clustering on data from common search engines or your own data.

You can use Carrot2 Document Clustering Workbench to:

  • Quickly test Carrot2 clustering with your own data. Please see Chapter 4 for instructions for the most common scenarios.

  • Fine tune Carrot2 clustering algorithms' settings to work best with your specific data. Please see Chapter 5 for more details.

  • Run simple performance benchmarks using different settings to predict maximum clustering throughput on a single machine. Please see Section 5.8 for details.

Carrot2 Document Clustering Workbench features include:

  • Various document sources included.  Carrot2 Document Clustering Workbench can fetch and cluster documents from a number of sources, including major search engines, indexing engines (Lucene, Solr, Google Desktop) as well as generic XML feeds and files.

  • Live tuning of clustering algorithm attributes.  Carrot2 Document Clustering Workbench enables modifying clustering algorithm's attributes and observing the results in real time.

  • Performance benchmarking.  Carrot2 Document Clustering Workbench can run simple performance benchmarks of Carrot2 clustering algorithms.

  • Attractive visualizations.  Carrot2 Document Clustering Workbench comes with two visualizations of the cluster structure, one developed within the Carrot2 project and another one from Aduna Software.

  • Modular architecture and extendability.  Carrot2 Document Clustering Workbench is based on Eclipse Rich Client Platform, which makes it easily extendable.

Figure 3.1 Carrot2 Document Clustering Workbench screenshot

3.1.1 Installation and running

To run Carrot2 Document Clustering Workbench:

  1. Download and install Java Runtime Environment (version 1.5.0 or newer) if you have not done so.

  2. Download Carrot2 Document Clustering Workbench Windows binaries or Linux binaries and extract the archive to some local disk location.

  3. Run carrot2-workbench.exe (Windows) or carrot2-workbench (Linux).

3.2 Carrot2 Document Clustering Server

Carrot2 Document Clustering Server (DCS) exposes Carrot2 clustering as a REST service. It can cluster documents from an external source (e.g. a search engine) or documents provided directly as an XML stream and returns results in XML or JSON formats.

You can use Carrot2 Document Clustering Server to:

  • Integrate Carrot2 with your non-Java software.

  • Build a high-throughput document clustering system by setting up a number of load-balanced instances of the DCS.

Carrot2 Document Clustering Server features include:

  • XML and JSON response formats.  Carrot2 Document Clustering Server can return results both in XML and JSON formats. JSON-P (with callback) is also supported.

  • Various document sources included.  Carrot2 Document Clustering Server can fetch and cluster documents from a large number of sources, including major search engines and indexing engines (Lucene, Solr).

  • Direct XML feed.  Carrot2 Document Clustering Server can cluster documents fed directly in a simple XML format.

  • PHP and C# examples included.  Carrot2 Document Clustering Server ships with ready-to-use examples of calling Carrot2 DCS services from PHP (version 5), C#, Ruby, Java and curl.

  • Quick start screen.  A simple quick start screen will let you make your first DCS request straight from your browser.

Figure 3.2 Carrot2 Document Clustering Server quick start screen

3.2.1 Installation and running

To run Carrot2 Document Clustering Server:

  1. Download and install Java Runtime Environment (version 1.5.0 or newer) if you have not done so.

  2. Download Carrot2 Document Clustering Server binaries and extract the archive to some local disk location.

  3. Run dcs.cmd (Windows) or dcs.sh (Linux).

  4. Point your browser to http://localhost:8080 for further instructions.

  5. See the examples/ directory in the distribution archive for PHP, C#, Ruby and Java code examples. You can also invoke DCS clustering using the curl command.

Tip

If you need to start the DCS on a port other than 8080, you can use the -port option:

dcs -port 9090

Tip

To deploy the DCS in an external servlet container, such as Apache Tomcat, use the carrot2-dcs.war file from the war/ folder of the DCS distribution.

3.3 Carrot2 Web Application

Carrot2 Web Application exposes Carrot2 clustering as a web application for end users. It allows users to browse clusters using a conventional tree view, but also in an attractive visualization.

Carrot2 Web Application features include:

  • Two cluster views.  Carrot2 Web Application offers two views of the clusters generated by Carrot2: conventional tree view and a Flash-based visualization.

  • All Carrot2 document sources and algorithms included.  Carrot2 Web Application contains a large number of document sources, including major search engines. Optionally, further document sources can be added, such as Lucene or Solr ones. It also contains all Carrot2's clustering algorithms.

  • XSLT and JavaScript-based presentation layer.  Look & feel of the Carrot2 Web Application can be easily changed by editing a number of XSLT style sheets. All common style sheets and JavaScripts can be re-used when implementing a new look & feel.

  • High-performance front-end.  The front-end of the Carrot2 Web Application has been optimized for fast loading by using such techniques as JavaScript and CSS merging and minification, as well as using CSS sprites.

Figure 3.3 Carrot2 Web Application results screen

3.3.1 Installation and running

To run Carrot2 Web Application:

  1. Make sure you have access to a Servlet API 2.4 compliant container, such as Apache Tomcat.

  2. Download Carrot2 Web Application WAR file.

  3. Deploy the WAR file to your servlet container.

3.4 Carrot2 Command Line Interface

Carrot2 Command Line Interface (CLI) is a set of applications that allow invoking Carrot2 clustering from the command line. Currently, the only available CLI application is Carrot2 Batch Processor, which performs Carrot2 clustering on one or more files in the Carrot2 XML format and saves the results as XML or JSON. Apart from clustering a large number of document sets at once, you can use the Carrot2 Batch Processor to integrate Carrot2 with your non-Java applications.

3.4.1 Installation and running

To run Carrot2 Batch Processor:

  1. Download and install Java Runtime Environment (version 1.5.0 or newer) if you have not done so.

  2. Download Carrot2 Command Line Interface binaries and extract the archive to some local disk location.

  3. Run batch.cmd (Windows) or batch.sh (Linux) for an overview of the syntax. The Carrot2 Batch Processor ships with two example input data sets located in the input/ directory. Below is a list of some common example invocations.

    • To cluster one or more input files, specify their paths:

      batch input/data-mining.xml input/seattle.xml

      Clustering will be performed using the default clustering algorithm and the results in the XML format will be saved to the output directory relative to the current working directory.

    • You can also cluster files from one or more directories:

      batch input/

      Each directory will be processed recursively, i.e. including subdirectories. For each specified input directory, a corresponding directory with results will be created in the output directory.

    • To save results in a non-default directory, use the -o option:

      batch input/ -o results

    • To repeat the input documents in the output, use the -d option:

      batch input/ -d

    • To save the results in JSON, use the -f JSON option:

      batch input/ -f JSON

    • To use a different clustering algorithm, use the -a option followed by the identifier of the algorithm:

      batch input/ -a url

      To see the list of available algorithm identifiers, run the application without arguments.

    • In case of processing errors, you can use the -v option to see detailed messages and stack traces.
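The options listed above can also be assembled programmatically, for example when a Java program needs to launch the Carrot2 Batch Processor as an external process. The following is a minimal sketch and not part of the Carrot2 distribution: the BatchCommand class and its build method are hypothetical names, and only the -o, -d, -f JSON and -a options described above are used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds a Carrot2 Batch Processor command line. */
final class BatchCommand {

    /**
     * @param launcher         batch.cmd (Windows) or batch.sh (Linux)
     * @param input            input file or directory
     * @param outputDir        value for -o, or null for the default output directory
     * @param includeDocuments true to add -d (repeat input documents in the output)
     * @param json             true to add -f JSON
     * @param algorithm        value for -a, or null for the default algorithm
     */
    static List<String> build(String launcher, String input, String outputDir,
                              boolean includeDocuments, boolean json, String algorithm) {
        List<String> command = new ArrayList<>();
        command.add(launcher);
        command.add(input);
        if (outputDir != null) {
            command.add("-o");
            command.add(outputDir);
        }
        if (includeDocuments) {
            command.add("-d");
        }
        if (json) {
            command.add("-f");
            command.add("JSON");
        }
        if (algorithm != null) {
            command.add("-a");
            command.add(algorithm);
        }
        return command;
    }

    public static void main(String[] args) {
        List<String> command = build("batch.sh", "input/", "results", false, true, "url");
        System.out.println(String.join(" ", command));
        // To actually run the command, start it from the CLI installation directory, e.g.:
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

Keeping the arguments in a list rather than a single string avoids quoting problems when paths contain spaces, because ProcessBuilder passes each list element to the process as a separate argument.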

4 Getting started

Trying Carrot2 clustering with your own data

This chapter will show you how to use Carrot2 in a number of typical scenarios such as trying clustering on your own documents or integrating Carrot2 with your software.

4.1 Requirements

All Carrot2 applications require Java Runtime Environment version 1.5.0 or higher (1.6.0 recommended). The Carrot2 Document Clustering Workbench is distributed for Windows, Linux 32-bit and 64-bit versions and Mac OS x86. All other Carrot2 applications will run on any platform supporting Java Runtime Environment version 1.5.0 or higher.

4.2 Trying Carrot2 clustering

This section shows how to apply Carrot2 clustering on documents from various sources.

4.2.1 Clustering results from common search engines

To try Carrot2 clustering on results from common search engines, such as Google, Yahoo or MSN, you can either:

  • Use the Carrot2 Web Application, for example the online demo at http://search.carrot2.org

or

  • Use the Carrot2 Document Clustering Workbench, which can fetch and cluster documents from the same search engines as the Carrot2 Web Application.

4.2.2 Clustering plain text, HTML and MS Word documents

To try Carrot2 clustering on a collection of plain text, HTML or MS Word documents, you will need to install Google Desktop:

  1. Download and install Google Desktop if you have not done so.

  2. Configure Google Desktop to index your documents.

    Tip

    You can use TweakGDS to make Google Desktop index only the folder with your documents.

  3. Use Carrot2 Document Clustering Workbench to cluster documents fetched from your Google Desktop installation. Simply choose Google Desktop source in the search view (Figure 4.1), type your query and press the Process button to see the results.

    Figure 4.1 Carrot2 Document Clustering Workbench Google Desktop search view

Tip

You can use the filetype: operator to restrict searching to specific file types only, e.g. filetype:doc for MS Word documents. You can also use the under: operator to restrict searches to a specific folder and its subfolders, e.g. under:"c:\test-documents". Please see Google Desktop search operators reference for other useful query modifiers.

Note

Carrot2 Document Clustering Workbench can automatically determine the Google Desktop Query URL only when it is run on Windows with Administrator's privileges. For other setups, please refer to Google Desktop API Documentation for instructions about obtaining the Query URL. You can set the Query URL attribute in the optional attributes section, shown after clicking the button on the Search view toolbar.

Please also note that Query URLs are different for different users. Using a Query URL that does not belong to the currently logged-in user will result in errors.

4.2.3 Clustering documents from XML files

To try Carrot2 clustering on documents or search results stored in a single XML file you can use the Carrot2 Document Clustering Workbench.

  1. In the Search view of Carrot2 Document Clustering Workbench, choose XML source.

  2. Set the path to your XML file in the XML Resource field.

  3. (Optional) If your file is not in Carrot2 format, create an XSLT style sheet that transforms your data into Carrot2 format (see Section 4.2.4 for an example). Provide a path to your style sheet in the XSLT Stylesheet field, which is an optional field. You can show optional fields by clicking the button on the Search view toolbar (Figure 4.2).

  4. If you know the query that generated the documents in your XML file, you can provide it in the Query field, which may improve the clustering results. Press the Process button to see the results.

Figure 4.2 Carrot2 Document Clustering Workbench XML search view
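When the documents to be clustered come from your own program rather than from an existing file, a file for the XML source can be generated directly. The sketch below is only an illustration under stated assumptions: it uses the JDK's StAX API rather than any Carrot2 class, the Carrot2XmlWriter name is hypothetical, and it emits only the searchresult, document, title, snippet and url elements that appear in the transformation in Figure 4.3. Please see Section 9.2.1 for the authoritative description of the input format.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical helper: serializes documents using the element names from Figure 4.3. */
final class Carrot2XmlWriter {

    /** Each row of documents is {title, snippet, url}. */
    static String toXml(String[][] documents) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("searchresult");
            for (String[] document : documents) {
                xml.writeStartElement("document");
                writeElement(xml, "title", document[0]);
                writeElement(xml, "snippet", document[1]);
                writeElement(xml, "url", document[2]);
                xml.writeEndElement(); // document
            }
            xml.writeEndElement(); // searchresult
            xml.writeEndDocument();
            xml.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not serialize documents", e);
        }
    }

    /** Writes one element with text content; special characters are escaped by StAX. */
    private static void writeElement(XMLStreamWriter xml, String name, String text)
            throws XMLStreamException {
        xml.writeStartElement(name);
        xml.writeCharacters(text);
        xml.writeEndElement();
    }

    public static void main(String[] args) {
        String[][] documents = {
            {"Data & text mining", "An overview of mining methods.", "http://example.com/1"},
            {"Clustering basics", "Grouping similar documents.", "http://example.com/2"},
        };
        System.out.println(toXml(documents));
    }
}
```

The output can be saved to a file and opened through the XML Resource field as described in the steps above. Using an XML writer instead of string concatenation ensures that characters such as & and < in titles or snippets are escaped correctly.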

4.2.4 Clustering documents from XML feeds

To try Carrot2 clustering on documents or search results fetched from a remote XML feed, you can use the Carrot2 Document Clustering Workbench. As an example, we will cluster a news feed from BBC:

  1. In the Search view of Carrot2 Document Clustering Workbench, choose XML source.

  2. Set the URL of your XML feed in the XML Resource field. Optionally, the URL can contain two special placeholders that will be replaced with the Query and Results number you set in the search view.

    In our example, we will use the BBC News RSS feed.

  3. Create an XSLT style sheet that will transform the XML feed into Carrot2 format. For the news feed we can use the stylesheet shown in Figure 4.3. To add more colour to our results, the XSLT transform extracts thumbnail URLs from the feed and passes them to Carrot2 in a special attribute. Attributes that are a sequence of values can be embedded as shown in Figure 4.4.

  4. Provide a path to the transformation style sheet in the XSLT Stylesheet field, which is an optional field. You can show optional fields by clicking the button on the Search view toolbar (Figure 4.2).

  5. Press the Process button to see the results.

Figure 4.3 News feed XML to Carrot2 format transformation

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
     xmlns:media="http://search.yahoo.com/mrss">

  <xsl:output indent="yes" omit-xml-declaration="no"
       media-type="application/xml" encoding="UTF-8" />

  <xsl:template match="/">
    <searchresult>
      <xsl:apply-templates select="/rss/channel/item" />
    </searchresult>
  </xsl:template>

  <xsl:template match="item">
    <document>
      <title><xsl:value-of select="title" /></title>
      <snippet>
        <xsl:value-of select="description" />
      </snippet>
      <url><xsl:value-of select="link" /></url>
      <xsl:if test="media:thumbnail">
        <field key="thumbnail-url">
           <value type="java.lang.String"
                  value="{media:thumbnail/@url}"/>
        </field>
      </xsl:if>
    </document>
  </xsl:template>
</xsl:stylesheet>

Figure 4.4 Document attribute that contains a list of values.

<field key="key">
  <value><wrapper class="org.carrot2.util.simplexml.ListSimpleXmlWrapper">
    <list>
      <value value="value1"/>
      <value value="value2"/>
    </list>
  </wrapper></value>
</field>
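Before pointing the Carrot2 Document Clustering Workbench at a style sheet, it can help to check what the transformation actually produces. The sketch below applies a style sheet to a small RSS sample using the JDK's javax.xml.transform API. It is not Carrot2 code: the XsltPreview class is a hypothetical name, the embedded style sheet is a trimmed-down version of Figure 4.3 without thumbnail handling, and the RSS sample is made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: previews the result of an XSLT transformation. */
final class XsltPreview {

    /** Trimmed-down version of the Figure 4.3 style sheet (title, snippet and url only). */
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<searchresult><xsl:apply-templates select='/rss/channel/item'/></searchresult>"
      + "</xsl:template>"
      + "<xsl:template match='item'>"
      + "<document>"
      + "<title><xsl:value-of select='title'/></title>"
      + "<snippet><xsl:value-of select='description'/></snippet>"
      + "<url><xsl:value-of select='link'/></url>"
      + "</document>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the given style sheet to the given XML and returns the result as a string. */
    static String transform(String stylesheet, String inputXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(inputXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up feed content, for illustration only.
        String rss = "<rss><channel><item><title>Headline</title>"
            + "<description>Summary text</description>"
            + "<link>http://example.com/story</link></item></channel></rss>";
        System.out.println(transform(STYLESHEET, rss));
    }
}
```

If the printed result contains one document element per feed item, with non-empty title and snippet elements, the style sheet is ready to be used in the XSLT Stylesheet field.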

4.2.5 Clustering documents from a Lucene index

To try Carrot2 clustering on documents from a local Lucene index, you can use Carrot2 Document Clustering Workbench:

  1. In the Search view of Carrot2 Document Clustering Workbench, choose Lucene source. Click the button on the view's toolbar (Figure 4.5) to show optional attributes.

  2. Choose the path to your Lucene index in the Index directory field.

  3. Choose fields from your Lucene index in at least one of Document title field and Document content field combo boxes.

  4. Type a query and press the Process button to see the results.

Figure 4.5 Carrot2 Document Clustering Workbench Lucene search view

4.2.6 Clustering documents from a Solr index

To try Carrot2 clustering on documents from an instance of Apache Solr, you can use Carrot2 Document Clustering Workbench:

  1. In the Search view of Carrot2 Document Clustering Workbench, choose Solr source. Click the button on the view's toolbar (Figure 4.6) to show optional attributes.

  2. Provide the URL at which your Solr instance is available in the Service URL field.

  3. Provide fields that should be used as document title, content and URL (optional) in the Title field name, Summary field name and URL field name fields, respectively.

  4. Type a query and press the Process button to see the results.

Tip

Carrot2 clustering can also be performed directly within Solr by means of Solr's Carrot2 Clustering Component. Please see Section 7.1 for more details.

Figure 4.6 Carrot2 Document Clustering Workbench Solr search view

4.2.7 Saving documents or clusters for further processing

To save documents and/or clusters produced by Carrot2 for further processing:

  1. Use Carrot2 Document Clustering Workbench to perform clustering on documents from the source of your choice.

  2. Use the File > Save as... dialog to save the documents and/or clusters into a file in the Carrot2 XML format.

Tip

Saving documents into XML can be particularly useful when there is a need to capture the output of some remote or non-public document source to a local file, which can then be passed on to someone else for further inspection. Documents saved into XML can be opened for clustering within Carrot2 Document Clustering Workbench using the XML document source.

4.3 Integrating Carrot2 with your software

4.3.1 Compiling a Java program using Carrot2 API

The easiest way to integrate Carrot2 with your Java programs is to use the Carrot2 Java API package:

  1. Download Carrot2 Java API and unpack it to some local directory.

  2. Make sure that carrot2-core.jar and all JARs from the lib/ directory are available in the classpath of your program.

  3. Look in the examples/ directory for some sample code. Good places to start are ClusteringDocumentList and ClusteringDataFromDocumentSources. For a complete description of Carrot2 Java API, please see Javadoc documentation in the javadoc/ directory.

  4. You can use the build.xml Ant script to compile and run code from the examples/ directory.

    Tip

    For easier experimenting with Carrot2 Java API, you may want to set up a Carrot2 project in Eclipse IDE.

4.3.2 Adding Carrot2 dependency to a Maven2 project

To add Carrot2 as a dependency to an existing Maven2 project:

  1. Add the following fragment to the dependencies section of your pom.xml:

    <dependency>
      <groupId>org.carrot2</groupId>
      <artifactId>carrot2-core</artifactId>
      <version>3.0-rc1</version>
    </dependency>

    Optionally, to enable Polish language support, add the following fragment to the dependencies section of your pom.xml:

    <dependency>
      <groupId>org.carrot2</groupId>
      <artifactId>morfologik</artifactId>
      <version>1.1.2</version>
    </dependency>

  2. Add the following fragment to the repositories section of your pom.xml:

    <repository>
      <id>carrot2.org</id>
      <name>Carrot2 Maven2 repository</name>
      <url>http://download.carrot2.org/maven2/</url>
    </repository>

4.3.3 Setting up a Maven2 project with Carrot2 dependency

Carrot2 provides Maven2 artifacts and an archetype project with examples of use. To create a template Carrot2 project, use the following command (line breaks for clarity):

mvn archetype:generate 
 -DarchetypeRepository=http://download.carrot2.org/maven2/ 
 -DarchetypeGroupId=org.carrot2 
 -DarchetypeArtifactId=carrot2-example-archetype 
 -DarchetypeVersion=3.3.0-dev 
 -DgroupId=com.mycompany 
 -DartifactId=myproject 
 -DinteractiveMode=false

The archetypeVersion parameter (3.3.0-dev in the command above) selects the Carrot2 release that will be used. Please see our Maven2 repository for available version numbers.

After the example project gets created, you can use standard Maven2 goals e.g. to generate Eclipse IDE project files:

mvn eclipse:eclipse

4.3.4 Setting up a Carrot2 project in Eclipse IDE

Carrot2 Java API examples can be easily set up in Eclipse IDE. The description below assumes you are using Eclipse IDE version 3.4 or newer.

  1. Download Carrot2 Java API and unpack it to some local directory.

  2. In your Eclipse IDE choose File > New > Java Project.

  3. In the New Java Project dialog (Figure 4.7), type a name for the new project, e.g. carrot2-examples. Then choose the Create project from existing source option, provide the directory to which you unpacked the Carrot2 Java API archive and click Finish.

  4. Once Eclipse has compiled the example classes, you can open one of them, e.g. ClusteringDocumentList, and choose Run > Run As > Java Application. The output of the example program should be visible in the Console view.

Figure 4.7 Setting up Carrot2 Java API in Eclipse IDE


4.3.5 Setting up Carrot2 source code in Eclipse IDE

Important

To set up Carrot2 source code, you will need Eclipse IDE version 3.5 or higher with the Plug-in Development Environment (PDE). The required plugins are available e.g. in the Eclipse for Plug-in Developers and Eclipse Classic distributions, which can be downloaded from http://www.eclipse.org/downloads.

  1. Check out Carrot2 source code, e.g. from the following Subversion URL:

    https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk
  2. In the Package Explorer view in Eclipse IDE, choose Import... (see Figure 4.8), select General > Existing Projects into Workspace and click Next.

    Figure 4.8 Eclipse IDE Carrot2 project import step 1

  3. In the Import projects dialog provide your local Carrot2 checkout directory in the Select root directory field. Uncheck the org.carrot2.antlib project (see Figure 4.9) and click Finish.

    Figure 4.9 Eclipse IDE Carrot2 project import step 2

  4. All Carrot2 source code should compile without errors. If it does not:

    • Make sure your Eclipse's Java compiler compliance level is set to 1.5 or higher (Preferences > Java > Compiler).

    • Make sure your Eclipse's workspace encoding is set to UTF-8 (Preferences > General > Workspace > Text file encoding).

4.3.6 Calling Carrot2 clustering from non-Java software

To integrate Carrot2 with your non-Java system, you can use the Carrot2 Document Clustering Server, which exposes Carrot2 clustering as a REST/XML service. Please see Section 3.2.1 for installation instructions and the examples/ directory in the distribution archive for example code in PHP, C# and Ruby.

5 Tuning clustering

Fine-tuning Carrot2 clustering

This chapter discusses a number of typical fine-tuning scenarios for Carrot2 clustering algorithms. Some of the scenarios are relevant to all Carrot2 algorithms, while others are specific to individual algorithms.

5.1 Desirable characteristics of documents for clustering

The quality of clusters and their labels largely depends on the characteristics of documents provided on the input. Although there is no general rule for optimum document content, below are some tips worth considering.

  • Carrot2 is designed for small to medium collections of documents.  The most important characteristic of Carrot2 algorithms to keep in mind is that they perform in-memory clustering. For this reason, as a rule of thumb, Carrot2 should successfully deal with up to a thousand documents, a few paragraphs each. For algorithms designed to process millions of documents, you may want to check out the Mahout project.

  • Provide a minimum of 20 documents.  Carrot2 clustering algorithms will work best with a set of documents similar to what is normally returned by a typical search engine. While about 20 is the minimum number of documents you can reasonably cluster, the optimum would fall in the 100 – 500 range.

  • Provide contextual snippets if possible.  If the input documents are a result of some search query, provide contextual snippets related to that query, similar to what web search engines return, instead of full document content. Not only will this speed up processing, but it should also help the clustering algorithm to cover the full spectrum of topics dealt with in the search results.

  • Minimize "noise" in the input documents.  All kinds of "noise" in the documents, such as truncated sentences (sometimes resulting from the contextual snippet extraction suggested above) or random alphanumerical strings, may decrease the quality of cluster labels. If you have access to e.g. a few sentences' abstract of each document, it is worth checking the quality of clustering based on those abstracts. If you can combine this with the previous tip, i.e. extract complete sentences matching the user's query, this should improve the clusters even further.

Let us once again stress that there are no definite generic guidelines for the best content for clustering, so it is always worth experimenting with different combinations. You can also describe your specific application on the Carrot2 mailing list and ask for advice.

5.2 Choosing the clustering algorithm

Currently, Carrot2 offers two specialized search results clustering algorithms: Lingo and STC. The algorithms differ in terms of the main clustering principle and hence have different quality and performance characteristics. This section describes briefly the two algorithms and provides some recommendations for choosing the most suitable one.

The key characteristic of the Lingo algorithm is that it reverses the traditional clustering pipeline: it first identifies cluster labels and only then assigns documents to the labels to form final clusters. To find the labels, Lingo builds a term-document matrix for all input documents and decomposes the matrix to obtain a number of base vectors that well approximate the matrix in a low-dimensional space. Each such vector gives rise to one cluster label. To complete the clustering process, each label is assigned documents that contain the label's words.

The key data structure used in the Suffix Tree Clustering (STC) algorithm is a Generalized Suffix Tree (GST) built for all input documents. The algorithm traverses the GST to identify words and phrases that occurred more than once in the input documents. Each such word or phrase gives rise to one base cluster. The last stage of the clustering process is merging base clusters to form the final clusters.
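The idea of base clusters can be illustrated without a suffix tree. The sketch below is not Carrot2's STC implementation: it swaps the Generalized Suffix Tree for brute-force enumeration of word n-grams, which is far less efficient but easier to read. The class name BaseClusterSketch, the maxWords parameter and the rule that a phrase must be shared by at least two documents are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sketch only: finds shared phrases by n-gram enumeration, not with a suffix tree. */
class BaseClusterSketch {
    /** Maps each phrase of 1..maxWords words to the indices of the documents containing it. */
    static Map<String, Set<Integer>> baseClusters(List<String> docs, int maxWords) {
        Map<String, Set<Integer>> phraseToDocs = new TreeMap<>();
        for (int d = 0; d < docs.size(); d++) {
            // Naive tokenization on non-word characters (ASCII word characters only).
            List<String> words = new ArrayList<>();
            for (String w : docs.get(d).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!w.isEmpty()) words.add(w);
            }
            for (int start = 0; start < words.size(); start++) {
                StringBuilder phrase = new StringBuilder();
                for (int len = 1; len <= maxWords && start + len <= words.size(); len++) {
                    if (len > 1) phrase.append(' ');
                    phrase.append(words.get(start + len - 1));
                    phraseToDocs.computeIfAbsent(phrase.toString(), k -> new TreeSet<>()).add(d);
                }
            }
        }
        // Assumption: only phrases shared by at least two documents form a base cluster.
        phraseToDocs.values().removeIf(ids -> ids.size() < 2);
        return phraseToDocs;
    }
}
```

For the three documents "Data mining tools", "Open source data mining" and "Cooking recipes" with maxWords set to 3, the sketch yields the base clusters data, data mining and mining, each containing the first two documents. Merging such overlapping base clusters into final clusters is the step this sketch leaves out.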

The two algorithms have two features in common. They both create overlapping clusterings, in which one document can be assigned to more than one cluster. Also, in case of both algorithms a certain number of documents can remain unclustered and fall in the Other Topics group.
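The label-first assignment step and the two shared features described above can be sketched in a few lines of plain Java. This is a simplified illustration, not Carrot2 code: label discovery is skipped entirely, the LabelAssignmentSketch class is hypothetical, and the rule that a document must contain every word of a label is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Sketch only: assigns documents to labels that have already been chosen. */
class LabelAssignmentSketch {
    static final String OTHER_TOPICS = "Other Topics";

    /** Lower-cased set of words in the text (ASCII word characters only). */
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) result.add(w);
        }
        return result;
    }

    /** Returns label -> document indices; documents matching no label go to Other Topics. */
    static Map<String, List<Integer>> assign(List<String> labels, List<String> docs) {
        Map<String, List<Integer>> clusters = new LinkedHashMap<>();
        for (String label : labels) clusters.put(label, new ArrayList<>());
        List<Integer> other = new ArrayList<>();
        for (int d = 0; d < docs.size(); d++) {
            Set<String> docWords = words(docs.get(d));
            boolean assigned = false;
            for (String label : labels) {
                // Assumption: a document matches when it contains every word of the label.
                if (docWords.containsAll(words(label))) {
                    clusters.get(label).add(d);   // a document may join several clusters
                    assigned = true;
                }
            }
            if (!assigned) other.add(d);
        }
        clusters.put(OTHER_TOPICS, other);
        return clusters;
    }
}
```

With the labels "Data Mining" and "Open Source" and the documents "Open source data mining software", "Data mining course" and "Weather forecast", the first document lands in both clusters (overlap), while the third one matches no label and falls into Other Topics.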

Table 5.1 compares the characteristics of Lingo and STC under their default settings and Figure 5.1 shows clusters generated by Lingo and STC for data mining search results.

Table 5.1 Characteristics of Lingo and STC clustering algorithms

Feature | Lingo | STC
Cluster diversity | High, many small (outlier) clusters highlighted | Low, small (outlier) clusters rarely highlighted
Cluster labels | Longer, often more descriptive | Shorter, but still appropriate
Scalability | Low. For more than about 1000 documents, Lingo clustering will take a long time and require a large amount of memory [a]. | High

[a] Performance of the pure Java version of Lingo can be improved by installing native matrix computation libraries.

Figure 5.1 Lingo and STC clusters for the 'data mining' search results


It is difficult to give one clear recommendation as to which algorithm is "better". Many people feel Lingo delivers better-formed and more diverse clusters at the cost of lower performance and scalability. The ultimate judgment, however, should be based on an evaluation with the specific document collection. Table 5.2 highlights the scenarios for which the algorithms are best suited.

Table 5.2 Optimum usage scenarios for Lingo and STC

Feature | Use Lingo | Use STC
Well-formed longer labels required | yes |
Highlighting of small (outlier) clusters required | yes |
High clustering performance or large document set processing required | | yes

The bottom line is: use Lingo, unless you need high-performance clustering of document sets larger than 1000 documents.

Tip

For a more scientifically-oriented discussion and evaluation of the two algorithms, please check the publications on Carrot2 website.

Note

Carrot Search, a company founded by Carrot2 authors, offers a commercial document clustering engine called Lingo3G that produces Lingo-quality hierarchical clusters at a better-than-STC speed. Please contact Carrot Search for details.

5.3 Tuning clustering in Carrot2 Document Clustering Workbench

The best tool for experimenting and tuning Carrot2 clustering is the Carrot2 Document Clustering Workbench. Figure 5.2 shows the main components involved in the tuning process.

Figure 5.2 Tuning clustering in Carrot2 Document Clustering Workbench


  1. The results editor presents documents and clusters. Changes made in the Attributes view will affect the currently active results editor.

  2. The Attributes view, where you can see and change values of the clustering algorithm's attributes.

  3. The Attribute Info view, which shows documentation for specific attributes. Hold the mouse pointer over an attribute's label to see its documentation.

Opening the Attributes view.  By default, the Attributes view shows on the right hand side of the Carrot2 Document Clustering Workbench. You can open the view at any time by choosing Window > Show view > Attributes.

Setting modified attributes as default for new queries.  If you modified a number of attributes for an algorithm and would like to use the modified values for new queries, choose the Set as defaults for new queries option from the Attributes view's context menu (Figure 5.3).

Figure 5.3 Attributes view's context menu


Restoring default attribute values.  To reset the attributes to their default values, choose the Reset to defaults option from the Attributes view's context menu (Figure 5.3). To bring the attributes back to their factory defaults, choose the Reset to factory defaults option.

Loading and saving attribute values to XML.  To load or save attribute values to an XML file, use the Open and Save as... options available under the icon on the Attributes view's menu bar.

Accessing attribute documentation.  To see the documentation for a specific attribute, hold the mouse pointer over the attribute's label and its documentation will show in the Attribute Info view.

5.4 Modifying the list of stop words

Stop words are common words that carry little meaning on their own, such as the, to and for in English, and that should be ignored while clustering. The Lingo algorithm, for example, will not create clusters whose labels start or end with a stop word.

To fine-tune the stop words list you can use the Carrot2 Document Clustering Workbench in the following way:

  1. Start Carrot2 Document Clustering Workbench and run some query on which you'll be observing the results of your changes.

  2. Go to the workspace/ directory which is located in the directory to which you extracted Carrot2 Document Clustering Workbench. Modify the stopwords.* file for the language you are working on (e.g. stopwords.en for English). Add or remove stop words as required and save changes.

  3. Open the Attributes view and use the view toolbar's button to group the attributes by semantics. In the Preprocessing section, make sure the Processing language is correctly set and check the Reload stopwords checkbox. Doing the latter will let you see the updated clustering results without restarting Carrot2 Document Clustering Workbench every time you save the changed stop word list.

    Figure 5.4 Preprocessing attributes section


  4. To re-run clustering after you've saved changes to the stopwords.* file, choose the Restart Processing option from the Search menu, or press Ctrl+F11.

Tip

To transfer the changed stop words file to other Carrot2 applications, update the existing stop words file in the carrot2-core.jar the application is using. In case of the Carrot2 Document Clustering Server and Carrot2 Web Application, the carrot2-core.jar is located in the WEB-INF/lib directory.

5.5 Excluding specific clusters from results

The Lingo clustering algorithm, in addition to stop words editing, offers more precise control over cluster labels by means of "stop label" regular expressions. If a cluster's label matches one of the stop labels, the label will not appear on the list of clusters produced by Lingo.

The procedure for tuning stop labels and transferring them to other Carrot2 applications is similar to stop word tuning. The difference is that this time you need to edit the stoplabels.* files. Each line of a stop labels file corresponds to one stop label and is a Java regular expression. Please note that in order to be removed, a label as a whole must match at least one of the stop label expressions. A number of example stop label expressions are shown below.

(?i)new
(?i)information
(?i)information (about|on).*
(?i)(index|list) of.*

All stop labels shown above start with the (?i) prefix, which enables case-insensitive matching for them. The stop label in the first line suppresses labels consisting solely of the word new. Similarly, the stop label in the second line removes labels consisting solely of the word information. The stop label in the third line removes labels that start with information about or information on, and the stop label in the fourth line removes labels that start with index of or list of.
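The whole-label matching rule can be tried out with plain java.util.regex before editing the stoplabels.* files. The sketch below is not Carrot2 code; StopLabelFilter is a hypothetical helper that only mirrors the rule described above by calling Matcher.matches(), which requires the entire label to match, rather than Matcher.find().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper that applies stop label expressions to whole labels. */
class StopLabelFilter {
    private final List<Pattern> stopLabels = new ArrayList<>();

    /** Each expression corresponds to one line of a stop labels file. */
    StopLabelFilter(List<String> expressions) {
        for (String expression : expressions) {
            stopLabels.add(Pattern.compile(expression));
        }
    }

    /** True when the label as a whole matches at least one stop label expression. */
    boolean isStopLabel(String label) {
        for (Pattern pattern : stopLabels) {
            if (pattern.matcher(label).matches()) {   // matches(), not find()
                return true;
            }
        }
        return false;
    }
}
```

With the four example expressions loaded, the labels New, Information on Clustering and List of Algorithms are reported as stop labels, while New York and Tourist Information are not, because the expressions cover only a part of those labels.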

Note

Please note that defining a very large number of stop labels (100+) may significantly slow down clustering. In such cases you may want to combine separate stop label expressions into one larger regular expression.
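One possible way to combine separate expressions is to wrap each of them in a non-capturing group and join the groups with the alternation operator. The helper below is a hypothetical sketch, not part of Carrot2. It relies on the fact that in Java regular expressions an embedded flag such as (?i) placed inside a group stays in effect only until the end of that group, so each expression keeps its own flags.

```java
import java.util.List;

/** Hypothetical helper that joins stop label expressions into one alternation. */
class CombinedStopLabels {
    /** Returns a single expression that matches whatever any of the inputs matches. */
    static String combine(List<String> expressions) {
        StringBuilder combined = new StringBuilder();
        for (String expression : expressions) {
            if (combined.length() > 0) {
                combined.append('|');
            }
            // The non-capturing group keeps the expression's flags and alternations local.
            combined.append("(?:").append(expression).append(')');
        }
        return combined.toString();
    }
}
```

The returned string can be stored as a single line of a stop labels file. Whether this actually speeds up clustering depends on the expressions involved, so it is worth measuring the effect, for example with the benchmarking procedure from Section 5.8.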

5.6 Reducing the size of the Other Topics cluster

The Other Topics cluster contains documents that do not belong to any other cluster generated by the algorithm. Depending on the input documents, the size of this cluster may vary from a few to tens of documents.

By tuning parameters of the clustering algorithm, you can reduce the number of unclustered documents; however, bringing the number down to 0 is unachievable in most cases. Please note that minimizing the Other Topics cluster size is usually achieved by forcing the algorithm to create more clusters, which may degrade the perceived clustering quality.

Tip

The easiest way to try different clustering algorithm settings is to use the Carrot2 Document Clustering Workbench.

Tuning Lingo algorithm for smallest Other Topics cluster

To reduce the size of the Other Topics cluster generated by Lingo, you can try applying the following settings:

  1. Change the Factorization method attribute to LocalNonnegativeMatrixFactorizationFactory.

  2. Increase the Cluster count base above the default value.

  3. Decrease the Phrase label boost. Note that this will increase the number of one-word labels, which may not always be desirable.

Tip

To apply the changes to the Carrot2 applications, please follow instructions from Chapter 6.

5.7 Improving clustering performance

As a rule of thumb, the more documents you provide on input and the longer the documents are, the longer clustering takes. Interestingly, in many cases short document excerpts (such as contextual snippets for search results, or titles and abstracts or the first couple of sentences for non-search results) may work just as well as or even better than full documents. Hence the two most important performance tuning tips:

Reduce the size of the input documents  You can achieve this in a few ways:

  • Rather than full text of documents, use their titles and abstracts, if available.

  • In case of search results, use the contextual snippet rather than the full document text. Not only will this improve clustering performance, but it will very likely increase the quality of clusters as well because you will be clustering specifically the fragments the users asked for in their query.

  • If you don't have document abstracts, but have access to some automatically generated summaries, use them. Otherwise, try clustering the title and the first few sentences of each document.

  • In certain cases, you may get decent clustering results with document titles only, so this variant is worth trying too.
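The "title and the first few sentences" variant from the list above can be prepared with the sentence iterator from the Java standard library. The sketch below is one possible approach, not a Carrot2 facility. The DocumentExcerpt class, the choice of the English locale and the newline used to separate the title from the excerpt are assumptions made for this example.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical helper that shortens a document before it is passed to clustering. */
class DocumentExcerpt {
    /** Returns the title followed by at most maxSentences leading sentences of the body. */
    static String excerpt(String title, String body, int maxSentences) {
        // Assumption: English text; pass a different Locale for other languages.
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(body);
        int end = sentences.first();
        for (int i = 0; i < maxSentences; i++) {
            int next = sentences.next();
            if (next == BreakIterator.DONE) {
                break;   // the body has fewer sentences than requested
            }
            end = next;
        }
        return title + "\n" + body.substring(0, end).trim();
    }
}
```

For example, excerpt("Carrot2", "First sentence. Second sentence. Third one.", 2) keeps the title and the first two sentences and drops the third one.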

Reduce the number of input documents  While removing a large part of the input document set may not always be an option, in many cases dividing the input into two or more batches, clustering them separately and then merging the results based on cluster label text may give reasonable results. The downside of this approach is that very small clusters containing just a few documents are likely to be lost during this process.

Further performance tuning tips are specific to each clustering algorithm.
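The merging step of the batch approach can be sketched as follows. The code does not call Carrot2; it assumes that each batch has already been clustered and reduced to a map from cluster label to document identifiers. The BatchMerge class and the label normalization (trimming, collapsing whitespace, lower-casing) are assumptions chosen for this illustration, and stricter or looser label matching may suit a specific collection better.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper that merges clusters from separately clustered batches. */
class BatchMerge {
    /** Merges clusters (label -> document ids) whose normalized labels are equal. */
    static Map<String, Set<String>> merge(List<Map<String, List<String>>> batches) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map<String, List<String>> batch : batches) {
            for (Map.Entry<String, List<String>> cluster : batch.entrySet()) {
                // Assumption: labels are the same when they differ only in case or spacing.
                String key = cluster.getKey().trim()
                        .replaceAll("\\s+", " ")
                        .toLowerCase(Locale.ROOT);
                merged.computeIfAbsent(key, k -> new TreeSet<>()).addAll(cluster.getValue());
            }
        }
        return merged;
    }
}
```

A cluster labeled Data Mining in one batch and data mining in another ends up as a single merged cluster, while a small cluster that appears in only one batch stays as small as it was there.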

5.7.1 Improving performance of Lingo

You can change a number of attributes to increase the performance of Lingo. Most often, performance gain will be achieved at the cost of lowered clustering quality or significant change in the structure of clusters.

  • Lower Factorization quality, which will cause the matrix factorization algorithm to perform fewer iterations and hence complete quicker. Alternatively, you can set Factorization method to org.carrot2.matrix.factorization.PartialSingularValueDecompositionFactory, which is slightly faster than the other factorizations. In the latter case Factorization quality becomes irrelevant.

  • Lower Maximum matrix size, which would cause the matrix factorization algorithm to complete quicker and use less memory. With small matrix sizes, Lingo may not be able to discover smaller clusters.

5.7.2 Improving performance of STC

Not yet covered, please contact us if you need this section.

5.8 Benchmarking clustering performance

You can use the Carrot2 Document Clustering Workbench to run simple performance benchmarks of Carrot2. The benchmarks repeatedly cluster the content of the currently opened editor and report the average clustering time. You can use the benchmarking results to measure the impact of different attribute settings on an algorithm's performance and to estimate the maximum number of clustering requests that the algorithm can process per second.

To perform a performance benchmark:

  1. In the Search view, choose the algorithm to benchmark and perform the query to be used for benchmarking.
  2. Open the Benchmark view.

    Figure 5.5 Carrot2 Document Clustering Workbench Benchmark view

  3. Press Start to start the benchmark. After the benchmark completes, you should see the measured clustering time average, standard deviation, minimum and maximum.

Tip

To assess the performance impact of different attribute settings on one algorithm, you can open two or more editors with the same results clustered by the algorithm, set different attribute values in each editor and run benchmarking for each editor separately. The benchmark view remembers the last result for each editor, so you can compare the performance figures by simply switching between the editors.

Tip

By default, the benchmarking view uses only a single processing unit on multi-processor or multi-core machines. You can increase the number of benchmark threads in the Threads section.

Caution

Benchmark results may vary and be different from the results acquired on production machines due to other programs running in the background, operating system, hardware-specific considerations and even different Java Virtual Machine settings. Always fine-tune your clustering setup in the target deployment environment.
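If you need similar figures in the target deployment environment, where the Workbench may not be available, a small timing harness can collect them. The sketch below is not the Workbench's benchmark code. The ClusteringBenchmark class is hypothetical, the task to be timed is passed in as a Runnable (in practice it would wrap your clustering call), and the population formula for the standard deviation is an assumption, since the manual does not specify which formula the Benchmark view uses.

```java
/** Hypothetical timing harness reporting average, standard deviation, minimum and maximum. */
class ClusteringBenchmark {
    /** Runs the task and returns {average, stdDev, min, max} in milliseconds; rounds must be >= 1. */
    static double[] measure(Runnable task, int warmupRounds, int rounds) {
        for (int i = 0; i < warmupRounds; i++) {
            task.run();   // untimed runs that give the JVM a chance to warm up
        }
        double[] timesMs = new double[rounds];
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            task.run();
            timesMs[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return summarize(timesMs);
    }

    /** Summarizes measured times as {average, population stdDev, min, max}. */
    static double[] summarize(double[] timesMs) {
        double sum = 0, min = Double.MAX_VALUE, max = 0;
        for (double t : timesMs) {
            sum += t;
            min = Math.min(min, t);
            max = Math.max(max, t);
        }
        double average = sum / timesMs.length;
        double squares = 0;
        for (double t : timesMs) {
            squares += (t - average) * (t - average);
        }
        return new double[] {average, Math.sqrt(squares / timesMs.length), min, max};
    }

    /** Rough single-thread throughput estimate derived from the average time. */
    static double requestsPerSecond(double averageMs) {
        return 1000.0 / averageMs;
    }
}
```

For instance, an average clustering time of 20 ms corresponds to roughly 50 requests per second on one thread. As with the Workbench figures, such numbers are only indicative and depend on the machine and JVM settings.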

6 Customizing applications

Customizing Carrot2 applications

This chapter will show you how to add new document sources and tune clustering in Carrot2 applications.

6.1 Component suites and attributes

Key concepts in customizing and tuning Carrot2 applications are component suites and component attributes described in the following sections.

6.1.1 Component suites

A component suite is a set of Carrot2 components, such as document sources or clustering algorithms, configured to work within a specific Carrot2 application. For each component, the component suite defines the component's identifier, label and description, as well as a number of component- and application-specific properties, such as the list of example queries.

Component suites are defined in XML files read from application-specific locations described in further sections of this chapter. An example component suite definition is shown in Figure 6.1.

Figure 6.1 Example Carrot2 component suite

<component-suite>
  <sources>
    <source id="lucene"
        component-class="org.carrot2.source.lucene.LuceneDocumentSource"
        attribute-sets-resource="lucene.attributes.xml">
      <label>Lucene</label>
      <title>Apache Lucene</title>
      <mnemonic>L</mnemonic>
      <description>
        Apache Lucene index (local index access).
      </description>
      <icon-path>icons/lucene.png</icon-path>
      <example-queries>
        <example-query>data mining</example-query>
        <example-query>london</example-query>
        <example-query>clustering</example-query>
      </example-queries>
    </source>
  </sources>
  
  <algorithms>
    <algorithm id="lingo" 
        component-class="org.carrot2.clustering.lingo.LingoClusteringAlgorithm" 
        attribute-sets-resource="lingo.attributes.xml">
      <label>Lingo</label>
      <title>Lingo Clustering</title>
    </algorithm>
  </algorithms>
  
  <include suite="source-yahoo-boss.xml" />
  <include suite="algorithm-stc.xml" />
</component-suite>

The component suite definition can consist of the following elements:

  • sources  Document source definitions, optional.

  • algorithms  Clustering algorithm definitions, optional.

  • include  Includes other XML component suite definitions, optional. The resource specified in the suite attribute will be loaded from the current thread's context class loader.

Common parts of the source and algorithm tags include:

  • id  Identifier of the component within the suite, required. Identifiers must be unique within the component suite scope.

  • component-class  Fully qualified name of the processing component class, required.

  • attribute-sets-resource  XML file to load the component's attributes from. The resource specified in this attribute will be loaded from the current thread's context class loader. For the syntax of the XML file, please see Section 6.1.2.

  • label  A human readable label of the component, required.

  • title  A human readable title of the component, required. The title will usually be slightly longer than the label.

  • description  A longer description of the component, optional.

  • icon-path  Application specific definition of the component's icon.

Additionally, for the source tag you can use the example-queries tag to specify some example queries the applications may show for this source.
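Because identifiers must be unique within the suite, it can be handy to check a hand-edited suite file before deploying it. The sketch below uses only the DOM parser from the Java standard library and is not how Carrot2 itself loads suites. The SuiteIdCheck class is hypothetical, suites referenced through include elements are not resolved, and treating sources and algorithms as one shared identifier namespace is an assumption.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical checker for duplicate component ids in a component suite XML. */
class SuiteIdCheck {
    /** Returns the ids used by more than one source or algorithm element. */
    static Set<String> duplicateIds(String suiteXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(suiteXml)));
            Set<String> seen = new HashSet<>();
            Set<String> duplicates = new TreeSet<>();
            // Assumption: sources and algorithms share one id namespace; includes are ignored.
            for (String tag : new String[] {"source", "algorithm"}) {
                NodeList nodes = doc.getElementsByTagName(tag);
                for (int i = 0; i < nodes.getLength(); i++) {
                    String id = ((Element) nodes.item(i)).getAttribute("id");
                    if (!seen.add(id)) {
                        duplicates.add(id);
                    }
                }
            }
            return duplicates;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse component suite XML", e);
        }
    }
}
```

An empty result means that no identifier was repeated in the file itself. Identifiers defined in included suites would need a separate check.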

6.1.2 Component attributes

A component attribute is a specific property of a Carrot2 component that influences its behavior, e.g. the number of search results fetched by a document source or the depth of cluster hierarchy produced by a clustering algorithm. Each attribute is identified by a unique string key. Chapter 11 lists and describes all available components and their attributes.

You can specify attribute values for specific components in the component suite using attribute sets. Attribute sets are defined in XML files referenced by the attribute-sets-resource attribute of the component's entry in the component suite. Figure 6.2 shows an example attribute set definition.

Figure 6.2 Example Carrot2 attribute set

<attribute-sets>
  <attribute-set id="lucene">
    <value-set>
      <label>Lucene</label>
      <attribute key="LuceneDocumentSource.directory">
        <value>
           <wrapper class="org.carrot2.source.lucene.FSDirectoryWrapper">
              <indexPath>/path/to/lucene/index/directory</indexPath>
           </wrapper>
        </value>
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.contentField">
        <value type="java.lang.String" value="summary" />
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.titleField">
        <value type="java.lang.String" value="title" />
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.urlField">
        <value type="java.lang.String" value="url" />
      </attribute>
    </value-set>
  </attribute-set>
</attribute-sets>

An attribute-sets element can contain one or more attribute-set elements. Each attribute-set must specify a unique id and a value-set.
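To get a quick overview of which attribute keys a given attribute set file defines, the file can be read with the DOM parser from the Java standard library. The sketch below is a generic XML read and not the mechanism Carrot2 uses to load attribute sets. The AttributeSetReader class is hypothetical and the sketch lists keys only, leaving the type-dependent value elements untouched.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical reader that lists the attribute keys of each attribute set. */
class AttributeSetReader {
    /** Returns attribute-set id -> attribute keys, both in document order. */
    static Map<String, List<String>> keysBySet(String attributeSetsXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(attributeSetsXml)));
            Map<String, List<String>> result = new LinkedHashMap<>();
            NodeList sets = doc.getElementsByTagName("attribute-set");
            for (int i = 0; i < sets.getLength(); i++) {
                Element set = (Element) sets.item(i);
                List<String> keys = new ArrayList<>();
                // Collects every attribute element nested in this set's value-set.
                NodeList attributes = set.getElementsByTagName("attribute");
                for (int j = 0; j < attributes.getLength(); j++) {
                    keys.add(((Element) attributes.item(j)).getAttribute("key"));
                }
                result.put(set.getAttribute("id"), keys);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse attribute sets XML", e);
        }
    }
}
```

Applied to the attribute set from Figure 6.2, the reader would report the lucene set with its four keys, starting with LuceneDocumentSource.directory.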

Saving attributes to XML using Carrot2 Document Clustering Workbench  As the syntax of the value elements depends on the type of the attribute being set, the easiest way to obtain the XML file is to use the Carrot2 Document Clustering Workbench.

To generate attribute set XML for a document source:

  1. In the Search view, choose the document source for which you would like to save attributes.

  2. Use the Search view to set the desired attribute values.

  3. Choose the Save as... option from Search view's menu bar. Carrot2 Document Clustering Workbench will suggest the XML file name based on the value of the document source's attribute-sets-resource attribute.

Note

Please note that the Carrot2 Document Clustering Workbench will remove a number of common attributes from the XML file being saved, including: query, start result index, number of results.

To generate attribute set XML for a clustering algorithm:

  1. In the Search view, choose the clustering algorithm for which you would like to save attributes. Choose any document source and perform processing using the selected algorithm.

  2. Use the Attributes view to set the desired attribute values.

  3. Choose the Save as... option from the Attributes view's menu bar. Carrot2 Document Clustering Workbench will suggest the XML file name based on the value of the clustering algorithm's attribute-sets-resource attribute.

Tip

If for some reason you cannot use the Carrot2 Document Clustering Workbench to save attribute set XML files, you can modify the SavingAttributeValuesToXml class from the carrot2-examples package to correspond to the attribute values you would like to set and run the class to print the XML encoding of the attribute values to the standard output.

6.2 Adding document sources to Carrot2 Web Application

To add a document source tab to the Carrot2 Web Application:

  1. Download Carrot2 Web Application WAR file.

  2. Open for editing the suite-webapp.xml file, located in the WEB-INF/classes/suites directory of the WAR file.

  3. Add a descriptor for the document source you want to add to the sources section of the suite-webapp.xml file. Alternatively, you may want to use the include element to reference one of the example document source descriptors shipped with the application (e.g. source-lucene.xml). Please see Section 6.1.1 for more information about the component suite XML file.

  4. If the document source you are adding requires setting specific attribute values (e.g. index location for the Lucene document source), use the Carrot2 Document Clustering Workbench to generate the attribute set XML file. Place the generated XML file in WEB-INF/classes/suites and make sure it is appropriately referenced by the attribute-sets-resource attribute of the descriptor added in the previous step.

  5. Deploy the WAR file with the above modifications to your container. If the new document source tab is not showing, clear cookies for the domain on which the web application is deployed.

6.3 Adding document sources to Carrot2 Document Clustering Server

To add a document source to the Carrot2 Document Clustering Server:

  1. Download Carrot2 Document Clustering Server distribution archive and extract it to some local folder.

  2. Open for editing the suite-dcs.xml file, located in the WEB-INF/classes/suites directory of the DCS WAR file, which can be found in the war/ directory of the DCS distribution.

  3. Add a descriptor for the document source you want to add to the sources section of the suite-dcs.xml file. Alternatively, you may want to use the include element to reference one of the example document source descriptors shipped with the application (e.g. source-lucene.xml). Please see Section 6.1.1 for more information about the component suite XML file.

  4. If the document source you are adding requires setting specific attribute values (e.g. index location for the Lucene document source), use the Carrot2 Document Clustering Workbench to generate the attribute set XML file. Place the generated XML file in WEB-INF/classes/suites and make sure it is appropriately referenced by the attribute-sets-resource attribute of the descriptor added in the previous step.

  5. Restart the DCS. The new document source should be available for processing.

6.4 Customizing Lingo for Carrot2 Web Application

To run the Carrot2 Web Application with custom attributes of the Lingo clustering algorithm:

  1. Use the Carrot2 Document Clustering Workbench to save the attribute set XML file with the desired Lingo attribute values.

  2. Replace the contents of lingo.attributes.xml, located in the WEB-INF/classes/suites directory of the web application WAR file, with the XML file saved in the previous step.

  3. Deploy the WAR file with the above modifications to your container.

You can use the same procedure to customize other algorithms, e.g. STC.

6.5 Customizing Lingo for Carrot2 Document Clustering Server

To run the Carrot2 Document Clustering Server with custom attributes of the Lingo clustering algorithm:

  1. Use the Carrot2 Document Clustering Workbench to save the attribute set XML file with the desired Lingo attribute values.

  2. Replace the contents of algorithm-lingo-attributes.xml, located in the WEB-INF/classes/suites directory of the DCS WAR file, located in the war/ directory of the DCS distribution, with the XML file saved in the previous step.

  3. Restart the DCS.

You can use the same procedure to customize other algorithms, e.g. STC.

6.6 Customizing Lingo for Carrot2 Command Line Interface

To run the Carrot2 Command Line Interface with custom attributes of the Lingo clustering algorithm:

  1. Use the Carrot2 Document Clustering Workbench to save the attribute set XML file with the desired Lingo attribute values.

  2. Replace the contents of algorithm-lingo-attributes.xml, located in the /suites directory of the carrot2-mini.jar file, located in the lib/ directory of the CLI distribution, with the XML file saved in the previous step.

  3. Run the CLI application.

You can use the same procedure to customize other algorithms, e.g. STC.

6.7 Adding document sources to Carrot2 Document Clustering Workbench

Not yet covered, please contact us if you need this section.

7 Advanced topics

Building and running Carrot2 from source code

This chapter discusses more advanced usage scenarios of Carrot2 such as integration with Apache Solr, running Carrot2 applications in Eclipse and building Carrot2 from source code.

7.1 Integration with Apache Solr

As of version 1.4 of Apache Solr, Carrot2 clustering can be performed directly within Solr by means of the Carrot2 Clustering Component.

7.2 Running Carrot2 in Eclipse IDE

7.2.1 Running Carrot2 Document Clustering Workbench in Eclipse IDE

To run Carrot2 Document Clustering Workbench in Eclipse IDE (version 3.4 or higher required):

  1. Set up Carrot2 source code in your Eclipse IDE.

  2. Choose Window > Preferences and then Run/Debug > String substitution. Add a temp_workspaces variable pointing to an existing disk directory where the Workbench's workspace should be created.

  3. Choose Run > External Tools > External Tools Configurations... from the main menu and run the Attribute Metadata XML configuration. This will build the metadata files required for Workbench to show descriptions of Carrot2 components' attributes.

    Figure 7.1 Attribute Metadata XML Run Configuration

  4. Choose Run > Run Configurations... from the main menu and run the Workbench configuration.

    Figure 7.2 Workbench Run Configuration

7.2.2 Running Carrot2 Web Application in Eclipse IDE

To run Carrot2 Web Application in Eclipse IDE:

  1. Set up Carrot2 source code in your Eclipse IDE.

  2. Choose Run > External Tools > External Tools Configurations... from the main menu and run the Attribute Metadata XML configuration. This will build the metadata files required for the web application to show advanced options of document sources.

  3. Choose Run > External Tools > External Tools Configurations... from the main menu and run the Web Application Setup [carrot2] configuration. This will preprocess various configuration files required by the web application.

  4. Choose Run > Run Configurations... from the main menu and run the Web Application Runner [carrot2] configuration.

  5. Point your browser to http://localhost:8080 to access the running web application.

7.3 Building Carrot2 from source code

To build Carrot2 applications from source code, you will need Java Software Development Kit (Java SDK) version 1.6 or higher and Apache Ant version 1.7.1 or higher. You can check out the latest Carrot2 source code from the following SVN location:

https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk

7.3.1 Building Carrot2 Document Clustering Workbench

To build Carrot2 Document Clustering Workbench from source code:

  1. Download Eclipse Target Platform from http://download.carrot2.org/eclipse and extract to some local folder.

  2. Copy workbench.properties.example from Carrot2 checkout folder to workbench.properties in the same folder. In workbench.properties edit the target.platform property to point to the Eclipse Target Platform you have downloaded.

    Important

    The folder pointed to by target.platform must have the eclipse/ folder inside.

    You may also need to change the configs property to match the platform you want to build Carrot2 Document Clustering Workbench for.

  3. Run:

    ant -f build-workbench.xml build
    to build Carrot2 Document Clustering Workbench binaries.

  4. Go to the tmp/workbench/tmp/carrot2-workbench folder in the Carrot2 checkout directory and run Carrot2 Document Clustering Workbench.

7.3.2 Building Carrot2 Web Application

To build Carrot2 Web Application from source code:

  1. Run:

    ant webapp
    in the main Carrot2 checkout directory.

  2. Go to the tmp/webapp/ folder in the Carrot2 checkout dir where you will find the web application WAR file.

7.4 Using Carrot2 Document Clustering Server with curl

You can use curl to post requests to the Carrot2 Document Clustering Server. Figure 7.3 shows how to use curl to query an external document source and cluster the results using the DCS. Figure 7.4 shows how to cluster documents from an XML file in Carrot2 format using the DCS. Please see the examples/curl directory of the Carrot2 Document Clustering Server distribution archive for more curl DCS invocation examples.

Figure 7.3 Using DCS and curl to cluster data from document source

curl http://localhost/dcs/rest \
     -F "dcs.source=etools" \
     -F "query=test" \
     -o result.xml

Figure 7.4 Using DCS and curl to cluster data from an XML file

curl http://localhost/dcs/rest \
     -F "dcs.c2stream=@documents-in-carrot2-format.xml" \
     -o result.xml

Tip

You can download curl for Windows from http://curl.haxx.se/latest.cgi?curl=win32-nossl.

7.5 Working with HTTP proxies

If your server or development machine connects to HTTP servers via an HTTP proxy, you can configure most Carrot2 document source implementations to take this into account by defining the following global system properties:

http.proxyhost

URL of the HTTP proxy (numeric or full address, but without the port number).

http.proxyport

Proxy server's port number.

Two sources that currently do not support the above properties are: MicrosoftLiveDocumentSource and OpenSearchDocumentSource.

Note

Password-based authentication is not supported at the moment. You can alter the source code to change this in the HttpUtils class.

7.6 Enabling native matrix computations

To speed up clustering performed by the Lingo algorithm, you can configure Carrot2 to use a native platform-specific matrix computation library. Depending on the platform, you may see up to a 400% speed-up compared to the Java-only mode.

To enable native matrix computations for Carrot2:

  1. Download precompiled libraries for your platform and extract the archive to some local directory.

    Note

    If no distribution matches your platform, and you would like to compile your own version, please ask on the mailing list for instructions. You can also try the PIII (Pentium III) versions, which seem to work quite well on modern processors as well (e.g. Core2 Duo).

  2. Add an additional option to your JVM command line invocation providing the path to the directory to which you extracted the native library:

    java -Djava.library.path=[native-lib-dir] ...

    To enable native computations in web applications deployed to Apache Tomcat, pass the above directive in the JAVA_OPTS environment variable, e.g.:

    export JAVA_OPTS="-Djava.library.path=[native-lib-dir]"
  3. When Carrot2 correctly loads the native library, upon initialization of the Lingo clustering algorithm, the following entry should appear in application logs:

    INFO org.carrot2.clustering.lingo.LingoClusteringAlgorithm: Native BLAS routines available

8 Troubleshooting

Solving common problems with Carrot2

This chapter discusses solutions to some common problems with Carrot2 code or applications.

8.1 Troubleshooting Carrot2 Document Clustering Workbench

8.1.1 Increasing memory size

To increase Java heap size for Carrot2 Document Clustering Workbench, use the following command line parameters:

carrot2-workbench -vmargs -Xmx256m

Tip

Using the above pattern you can specify any other JVM options if needed.

8.1.2 Getting exception stack trace

To get the stack trace corresponding to a processing error in Carrot2 Document Clustering Workbench (useful for the Carrot2 team to spot errors), use the following procedure:

  1. Click OK on the Problem Occurred dialog box (Figure 8.1).

    Figure 8.1 Carrot2 Document Clustering Workbench error dialog

  2. Go to Window > Show view > Other... and choose Error Log (Figure 8.2).

    Figure 8.2 Carrot2 Document Clustering Workbench Show View dialog

  3. In the Error Log view double click the line corresponding to the error (Figure 8.3).

    Figure 8.3 Carrot2 Document Clustering Workbench Error Log view

  4. Copy the exception stack trace from the Event Details dialog and pass it to the Carrot2 team (Figure 8.4).

    Figure 8.4 Carrot2 Document Clustering Workbench Event Details dialog

8.2 Troubleshooting Carrot2 Web Application

8.2.1 "?" characters instead of Unicode special characters

Symptoms

Question marks ("?") appear instead of Chinese, Polish or other special Unicode characters in clusters and documents output by the Carrot2 Web Application.

Cause

The Carrot2 Web Application running under a Web application container (such as Tomcat) relies on proper decoding of Unicode characters from the request URI. This decoding is done by the container and must be properly configured at the container level. Unfortunately, this configuration is not part of the J2EE standard and is therefore different for each container.

Solution for Apache Tomcat

For Apache Tomcat, you can enforce the URI decoding code page at the connector configuration level. Locate server.xml file inside Tomcat's conf folder and add the following attribute to the Connector section:

URIEncoding="UTF-8"

A typical connector configuration should look like this:

<Connector port="8080" maxThreads="25" 
    minSpareThreads="5" maxSpareThreads="10" 
    minProcessors="5" maxProcessors="25" 
    enableLookups="false" redirectPort="8443" 
    acceptCount="10" debug="0" 
    connectionTimeout="20000" URIEncoding="UTF-8" />
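
The effect of a mismatched URI decoding code page can be reproduced outside the container. The sketch below is only an illustration in Python, not part of Carrot2 or Tomcat: it percent-encodes a query as UTF-8, the way browsers typically do, and then decodes it once as UTF-8 and once as ISO-8859-1, which was the default URIEncoding in older Tomcat versions. The sample query is an assumption chosen for the example.

```python
from urllib.parse import quote, unquote

# Hypothetical query containing Polish characters.
query = "Łódź"

# Percent-encode the query as UTF-8 bytes, as it would appear in a request URI.
encoded = quote(query, encoding="utf-8")

# Decoding with the matching code page restores the original text.
decoded_utf8 = unquote(encoded, encoding="utf-8")

# Decoding the same bytes as ISO-8859-1 turns every UTF-8 byte into a
# separate character, so the query no longer matches the original.
decoded_latin1 = unquote(encoded, encoding="iso-8859-1")

print(encoded)
print(decoded_utf8 == query)
print(len(decoded_utf8), len(decoded_latin1))
```

With the wrong code page the four-character query becomes a longer string of unrelated characters, which later stages may render as "?" or other garbage. Setting URIEncoding="UTF-8" makes the container perform the first kind of decoding.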

9 Architecture and API

Discussion of Carrot2 internals

This chapter discusses some Carrot2 architecture assumptions, internals and more complex API use cases.

9.1 Carrot2 architecture overview

This section provides a very brief overview of Carrot2 architecture. If you would like us to cover some specific topic in more detail, please let us know on the mailing list.

9.1.1 Processing component pipeline

Processing in Carrot2 is based on a pipeline of processing components. The two main types of Carrot2 processing components are:

  • Document Sources  provide data for further processing. In a typical scenario, such a component would fetch search results from e.g. an external search engine, Lucene / Solr index or an XML file. Currently, Carrot2 distribution contains 12 different document source components.

  • Clustering Algorithms  organize documents provided by document sources into meaningful groups. Currently, two specialized clustering algorithms are available in Carrot2: Lingo and STC. Additionally, a number of "synthetic" clustering algorithms are available, such as by URL clustering.

Carrot2 applications, such as Carrot2 Document Clustering Workbench or Carrot2 Document Clustering Server, operate on a pipeline consisting of one document source and one clustering algorithm, but using the Carrot2 Java API you can insert additional components at any point in the pipeline. Currently, the only component not falling into the above categories is a component for computing certain cluster quality metrics, but more components may be added in the future, e.g. for spell checking of user queries.

9.1.2 Processing component attributes

The behavior of both document sources and clustering algorithms depends on a number of attributes (settings) such as the number of documents to fetch or the number of clusters to produce. The way you provide attribute values for specific components depends on the Carrot2 application you are working with:

  • Carrot2 Document Clustering Workbench.  In Carrot2 Document Clustering Workbench you can provide attributes for document sources (such as number of results to fetch or preferred results language) before you issue a query in the Search view. You can change clustering algorithm attributes using the sliders in the Attributes view.

  • Carrot2 Document Clustering Server.  In Carrot2 Document Clustering Server, you can provide attribute values as additional parameters in the POST request. The name of each POST parameter should be the identifier of the attribute you want to set (see Chapter 11 for attribute identifiers). Carrot2 will attempt to convert the string value of the parameter to the required type (integer, float etc.).

For a complete reference of attributes of each Carrot2 component, please see Chapter 11.
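
As a minimal sketch of passing attributes to the DCS, the Python helper below assembles a curl invocation in the style of Figure 7.3 and adds one -F parameter per attribute. The helper name dcs_curl_command and its arguments are hypothetical and exist only for this example; the URL, dcs.source, query and output file are taken from Figure 7.3, and the attribute key is one of the identifiers listed in Chapter 11.

```python
import shlex


def dcs_curl_command(source, query, attributes=None,
                     url="http://localhost/dcs/rest", output="result.xml"):
    """Build a curl argument list that posts a query and attributes to the DCS."""
    command = ["curl", url,
               "-F", "dcs.source=" + source,
               "-F", "query=" + query]
    # Each attribute becomes one extra POST parameter named after its key.
    for key, value in (attributes or {}).items():
        command += ["-F", "{}={}".format(key, value)]
    command += ["-o", output]
    return command


command = dcs_curl_command(
    "etools", "test",
    {"LingoClusteringAlgorithm.desiredClusterCountBase": 20})

# Print a shell-quoted command line that can be pasted into a terminal.
print(" ".join(shlex.quote(part) for part in command))
```

Note that curl treats a -F value starting with @ as a file to upload (as in Figure 7.4), so attribute values beginning with that character would need different handling.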

9.2 Carrot2 XML data formats

This section shows examples of Carrot2 input and output XML formats, used consistently by all Carrot2 applications, including Carrot2 Document Clustering Workbench, Carrot2 Document Clustering Server and Carrot2 Web Application.

9.2.1 Carrot2 input XML format

To provide documents for Carrot2 clustering, use the following XML format:

Figure 9.1 Carrot2 input XML format

<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
  <query>Globe</query>
  <document id="0">
    <title>default</title>
    <url>http://www.globe.com.ph/</url>
    <snippet>
      Provides mobile communications (GSM) including 
      GenTXT, handyphones, wireline services, an
      broadband Internet services.
    </snippet>
  </document>
  <document id="1">
    <title>Skate Shoes by Globe | Time For Change</title>
    <url>http://www.globeshoes.com/</url>
    <snippet>
      Skaters, surfers, and showboarders
      designing in their own style.
    </snippet>
  </document>

  ...

</searchresult>
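
Files in this format are easy to generate from other software. The following Python sketch, which uses only the standard library, writes a query and a list of documents using the element names from Figure 9.1. The function name and the dictionary-based document representation are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_carrot2_input(query, documents):
    """Serialize documents to the Carrot2 input XML format as UTF-8 bytes."""
    root = ET.Element("searchresult")
    ET.SubElement(root, "query").text = query
    for doc_id, doc in enumerate(documents):
        # Documents are numbered sequentially, as in Figure 9.1.
        node = ET.SubElement(root, "document", id=str(doc_id))
        ET.SubElement(node, "title").text = doc["title"]
        ET.SubElement(node, "url").text = doc["url"]
        ET.SubElement(node, "snippet").text = doc["snippet"]
    # ElementTree escapes special characters such as & and < automatically.
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(root, encoding="utf-8")


xml_bytes = build_carrot2_input("Globe", [
    {"title": "Skate Shoes by Globe | Time For Change",
     "url": "http://www.globeshoes.com/",
     "snippet": "Skaters & surfers designing in their own style."},
])
print(xml_bytes.decode("utf-8"))
```

The resulting bytes can be saved to a file and passed to any Carrot2 application that reads this format.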

9.2.2 Carrot2 output XML format

Carrot2 saves the clusters in the following XML format:

Figure 9.2 Carrot2 output XML format

<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
  <query>Globe</query>
  <document id="0">
    <title>default</title>
    <url>http://www.globe.com.ph/</url>
    <snippet>
      Provides mobile communications (GSM) including 
      GenTXT, handyphones, wireline services, an
      broadband Internet services.
    </snippet>
  </document>
  <document id="1">
    <title>Skate Shoes by Globe | Time For Change</title>
    <url>http://www.globeshoes.com/</url>
    <snippet>
      Skaters, surfers, and showboarders
      designing in their own style.
    </snippet>
  </document>

  ...

  <group id="0" size="60">
    <title>
      <phrase>com</phrase>
    </title>
    <group id="1" size="2">
      <title>
        <phrase>amazon.com</phrase>
      </title>
      <document refid="43"/>
      <document refid="77"/>
    </group>
    <group id="2" size="2">
      <title>
        <phrase>boston.com</phrase>
      </title>
      <document refid="4"/>
      <document refid="7"/>
    </group>
    
    ...
    
    <group id="7" size="48">
      <title>
        <phrase>Other Sites</phrase>
      </title>
      <attribute key="other-topics">
        <value type="java.lang.Boolean" value="true"/>
      </attribute>
      <document refid="1"/>
      <document refid="2"/>
      ...
    </group>
  </group>
  <group id="8" size="12">
    <title>
      <phrase>org</phrase>
    </title>
    <group id="9" size="2">
      <title>
        <phrase>en.wikipedia.org</phrase>
      </title>
      <document refid="9"/>
      <document refid="14"/>
      ...
    </group>
  </group>
  ...


</searchresult>
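
As an illustration of consuming this format, the Python sketch below reads the nested group elements into plain dictionaries using only the standard library. The function name and the shape of the returned dictionaries are assumptions for the example; the element and attribute names come from Figure 9.2.

```python
import xml.etree.ElementTree as ET


def read_groups(parent):
    """Return the <group> children of parent as nested dictionaries."""
    groups = []
    for group in parent.findall("group"):
        phrases = [(p.text or "").strip() for p in group.findall("title/phrase")]
        groups.append({
            "label": ", ".join(phrases),
            "size": int(group.get("size")),
            # Documents are referenced by the id of a top-level <document>.
            "documents": [d.get("refid") for d in group.findall("document")],
            "subgroups": read_groups(group),
        })
    return groups


# Abbreviated sample in the output format shown above.
sample = """<searchresult>
  <query>Globe</query>
  <group id="0" size="4">
    <title><phrase>com</phrase></title>
    <group id="1" size="2">
      <title><phrase>amazon.com</phrase></title>
      <document refid="43"/>
      <document refid="77"/>
    </group>
    <group id="2" size="2">
      <title><phrase>boston.com</phrase></title>
      <document refid="4"/>
      <document refid="7"/>
    </group>
  </group>
</searchresult>"""

clusters = read_groups(ET.fromstring(sample))
for cluster in clusters:
    print(cluster["label"], cluster["size"])
    for sub in cluster["subgroups"]:
        print("  ", sub["label"], sub["documents"])
```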

9.3 Carrot2 JSON data format

This section shows examples of Carrot2 output JSON format, used consistently by all Carrot2 applications, including Carrot2 Document Clustering Server and Carrot2 Java API.

9.3.1 Carrot2 output JSON format

Carrot2 saves documents and the clusters in the following JSON format:

Figure 9.3 Carrot2 output JSON format

{
  "clusters": [
    {
      "attributes": {
        "score": 1.0
      }, 
      "documents": [
        0, 
        2
      ], 
      "id": 0, 
      "phrases": [
        "Cluster 1"
      ], 
      "score": 1.0, 
      "size": 2
    }, 
    {
      "attributes": {
        "score": 0.63
      }, 
      "clusters": [
        {
          "attributes": {
            "score": 0.3
          }, 
          "documents": [
            1
          ], 
          "id": 2, 
          "phrases": [
            "Cluster 2.1"
          ], 
          "score": 0.3, 
          "size": 1
        }, 
        {
          "attributes": {
            "score": 0.15
          }, 
          "documents": [
            2
          ], 
          "id": 3, 
          "phrases": [
            "Cluster 2.2"
          ], 
          "score": 0.15, 
          "size": 1
        }
      ], 
      "documents": [
        0
      ], 
      "id": 1, 
      "phrases": [
        "Cluster 2"
      ], 
      "score": 0.63, 
      "size": 3
    }
  ], 
  "documents": [
    {
      "id": 0, 
      "snippet": "Document 1 Content.", 
      "title": "Document 1 Title", 
      "url": "http://document.url/1"
    }, 
    {
      "id": 1, 
      "snippet": "Document 2 Content.", 
      "title": "Document 2 Title", 
      "url": "http://document.url/2"
    }, 
    {
      "id": 2, 
      "snippet": "Document 3 Content.", 
      "title": "Document 3 Title", 
      "url": "http://document.url/3"
    }
  ], 
  "query": "query (optional)"
}
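
A client can walk this structure with any JSON parser. The Python sketch below is one possible way to do it: it resolves the document identifiers stored in each cluster against the top-level documents list and produces an indented outline. The function name and the shortened sample are assumptions for the example; the key names come from Figure 9.3.

```python
import json


def outline(result):
    """Return an indented text outline of clusters and their document titles."""
    titles = {doc["id"]: doc["title"] for doc in result["documents"]}
    lines = []

    def walk(clusters, depth):
        for cluster in clusters:
            label = ", ".join(cluster["phrases"])
            lines.append("{}{} ({})".format("  " * depth, label, cluster["size"]))
            # "documents" holds identifiers that refer to the top-level list.
            for doc_id in cluster.get("documents", []):
                lines.append("{}- {}".format("  " * (depth + 1), titles[doc_id]))
            # Subclusters, if any, are nested under the "clusters" key.
            walk(cluster.get("clusters", []), depth + 1)

    walk(result["clusters"], 0)
    return lines


sample = json.loads("""
{
  "clusters": [
    {"id": 0, "phrases": ["Cluster 1"], "size": 2, "documents": [0, 1]},
    {"id": 1, "phrases": ["Cluster 2"], "size": 1, "documents": [],
     "clusters": [
       {"id": 2, "phrases": ["Cluster 2.1"], "size": 1, "documents": [1]}
     ]}
  ],
  "documents": [
    {"id": 0, "title": "Document 1 Title"},
    {"id": 1, "title": "Document 2 Title"}
  ]
}
""")

lines = outline(sample)
print("\n".join(lines))
```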

10 Carrot2 Development

Contributing to Carrot2

This chapter contains information for Carrot2 developers.

10.1 Stable release procedure

Each Carrot2 release should be performed according to the following procedure:

  1. Update JavaDoc documentation  Review JavaDoc documentation, provide missing public and protected members description, provide missing package descriptions.

  2. Update Carrot2 Manual  Review Carrot2 Manual, modify or add content related to the features implemented in the new release.

  3. Update Maven dependencies  Update Maven POMs so that dependencies are in sync with the JAR versions in the repository.

  4. Review of static code analysis reports  Review and fix reasonably-looking flaws from the following reports:

  5. Update source code headers and line endings  In project root:

    ant prerelease
    Commit changes to trunk.

  6. Precondition: successful trunk builds  The status of the C2HEAD-CORE and C2HEAD-SOURCES builds must be successful.

  7. Precondition: resolved issues  All issues scheduled (fix for) for the release must be resolved.

  8. Replace the stable branch in SVN 

    svn remove https://carrot2.svn.sourceforge.net/svnroot/carrot2/branches/stable
    svn copy https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk \
             https://carrot2.svn.sourceforge.net/svnroot/carrot2/branches/stable

  9. Update version number strings in the stable branch 

    1. Version files  Update etc/version/carrot2.version to contain the desired stable version number. That number will be embedded in distribution file names, JavaDoc page title and other version-sensitive places. Note the property name should be carrot2.version.stable, e.g.:

      carrot2.version.stable=3.2.0
      carrot2.version=${carrot2.version.stable}
      
      # workbench plugin/ feature versions.
      carrot2.version.workbench=${carrot2.version.stable}

    Commit changes to the stable branch.

  10. Trigger stable branch build  Go to the C2STABLE-ALL build page and trigger a build. If the build is successful, all distribution files should be available in the download directory.

  11. Verify the distribution files  Download, unpack and run each distribution file to make sure there are no obvious release blockers.

  12. Create the release tag 

    svn copy https://carrot2.svn.sourceforge.net/svnroot/carrot2/branches/stable \
             https://carrot2.svn.sourceforge.net/svnroot/carrot2/tags/VERSION_3_2_0

  13. Update version number strings in trunk  In case of major releases, update development version numbers.

    1. Version files  Update etc/version/carrot2.version to contain the desired development version number. Note the property name should be carrot2.version.head, e.g.:

      carrot2.version.head=3.3.0-dev
      
      # workbench plugin/ feature versions.
      carrot2.version.workbench=3.3.0.dev-snapshot

    2. Carrot2 plugin versions in Carrot2 Document Clustering Workbench  Update Carrot2 plugin version strings in the Carrot2 Document Clustering Workbench to the current development version.

    Commit changes to trunk.

  14. Update JIRA  Close issues scheduled for the release being made, release the version in JIRA, create a next version in JIRA.

  15. Update project website 

    1. Release notes  Add a page named release-[version]-notes that lists new features, major bug fixes and improvements introduced in the new release. The page will automatically become linked from all relevant sections of the website (done by an SVN external to etc/version/carrot2.version).

    2. Release note history  Add release date and link to the release's JIRA issues on the release-notes page.

  16. Upload distribution files to SourceForge  Perform (e.g. on the build server):

    rsync -e ssh *-3.2.0.zip \
    <sf.user>,carrot2@frs.sourceforge.net:/home/frs/project/c/ca/carrot2/carrot2/3.2.0

  17. Circulate release news  If appropriate, circulate release news to:

    1. Carrot2 mailing lists

    2. SourceForge

  18. Consider upgrading Carrot2 in dependent projects  If reasonable, upgrade Carrot2 dependency in other known projects, such as Apache Solr and Nutch.

10.2 QA check list

This is a very quick quality assurance check list to run through before stable releases. This list also serves as a guideline for further automation of acceptance tests.

Note

Note that this list does not contain many checks for the Carrot2 Web Application, Carrot2 Document Clustering Server and Carrot2 Java API as these are fairly well tested during builds (webtests, smoke-tests).

  1. For each supported platform you can test, check that Carrot2 Document Clustering Workbench:

    1. launches without errors in the error log

    2. executes and clusters a remote search query without errors

    3. executes and clusters a Lucene query without errors (we've had a bug that caused the Lucene directory attribute editor to disappear, hence this step).

    4. can edit a clustering algorithm's attribute

    5. shows both cluster visualizations

    6. executes clustering algorithm benchmarks

  2. Check that the Carrot2 Document Clustering Server starts up correctly using command line on Windows and Linux. More acceptance tests are performed during builds (but starting Carrot2 Document Clustering Server using the WAR file instead of command line).

11 Component reference

Detailed description of all Carrot2 components

This section lists and describes attributes of all Carrot2 components. By changing values of these attributes, you can change the behaviour of the component. Please see Chapter 6 for information on how you pass attribute values in different Carrot2 applications.

Each attribute is described by a number of properties:

  • Key  The unique identifier of the attribute.

  • Direction 

    • Input  The attribute is an input for the component, the behaviour of the component depends on its value.

    • Output  The attribute is an output produced by the component.

  • Level  Informs how advanced the attribute is.

    • Basic  Attribute value should be fairly easily tunable by a person without significant experience in text clustering.

    • Medium  Attribute value should be fairly easily tunable by a person with some intuition about text clustering.

    • Advanced  Attribute may require in-depth knowledge of the component for successful tuning.

  • Required  If true and the attribute does not have a default value, a value must be provided for the component to perform processing.

  • Scope 

    • Initialization time  Attribute value will be respected only when the component is initializing; values provided at processing time will be ignored. This scope applies to the attributes that control time-consuming operations performed once per component instance (e.g. parsing of configuration files). As a result, only a handful of attributes fall into the initialization-time only scope.

    • Processing time  Attribute values will be respected both at initialization and clustering time. Most of the attributes fall into this scope.

    Please note that certain attributes can be both initialization- and processing-time. In most such cases it is advisable to provide the value at initialization time because processing the same value passed at processing time may degrade the performance a little (e.g. due to re-reading configuration files).

  • Value type  The Java type of the attribute's value.

  • Default value  The default value of the attribute or none if there is no default value defined for the attribute.

11.1 By Source Clustering

11.1.3 Clusters

Clusters

Key clusters
Direction Output
Description Clusters created by the algorithm.
Scope Processing time
Value type java.util.List
Default value none

11.1.4 Documents

Documents

Key documents
Direction Input
Level BASIC
Description Documents to cluster.
Required no
Scope Processing time
Value type java.util.List
Default value none

11.1.5 Field

Field name

Key ByAttributeClusteringAlgorithm.fieldName
Direction Input
Level BASIC
Description Name of the field to cluster by. Each non-null scalar field value with distinct hash code will give rise to a single cluster, named using the value returned by buildClusterLabel(Object). If the field value is a collection, the document will be assigned to all clusters corresponding to the values in the collection. Note that arrays will not be 'unfolded' in this way.
Required yes
Scope Processing time
Value type java.lang.String
Default value sources
Value content Must not be blank
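
The grouping rule described for the field name attribute can be sketched in a few lines of Python. This is a simplified illustration, not the Carrot2 implementation: documents are plain dictionaries, Python lists and sets stand in for Java collections, and str() stands in for buildClusterLabel(Object).

```python
def cluster_by_field(documents, field_name="sources"):
    """Group document indices by the value(s) of one field."""
    clusters = {}
    for index, document in enumerate(documents):
        value = document.get(field_name)
        if value is None:
            continue  # documents without the field join no cluster
        # A collection value assigns the document to one cluster per element.
        if isinstance(value, (list, set, frozenset)):
            values = value
        else:
            values = [value]
        for item in values:
            clusters.setdefault(str(item), []).append(index)
    return clusters


documents = [
    {"title": "A", "sources": ["web", "news"]},
    {"title": "B", "sources": "web"},
    {"title": "C"},
]
print(cluster_by_field(documents))
```

In this sample the first document lands in two clusters because its field value is a collection, while the third document, which has no value for the field, is not assigned to any cluster.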

11.2 By URL Clustering

11.2.3 Clusters

Clusters

Key clusters
Direction Output
Description Clusters created by the algorithm.
Scope Processing time
Value type java.util.List
Default value none

11.2.4 Documents

Documents

Key documents
Direction Input
Level BASIC
Description Documents to cluster.
Required no
Scope Processing time
Value type java.util.List
Default value none

11.3 Lingo Clustering

11.3.3 Clusters

Cluster count base

Key LingoClusteringAlgorithm.desiredClusterCountBase
Direction Input
Level BASIC
Description Desired cluster count base. Base factor used to calculate the number of clusters based on the number of documents on input. The larger the value, the more clusters will be created. The number of clusters created by the algorithm grows with the cluster count base, but not linearly.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 30
Min value 2
Max value 100

Clusters

Key clusters
Direction Output
Description Clusters created by the clustering algorithm.
Scope Processing time
Value type java.util.List
Default value none

Size-Score sorting ratio

Key LingoClusteringAlgorithm.scoreWeight
Direction Input
Level MEDIUM
Description Balance between cluster score and size during cluster sorting. Value equal to 0.0 will cause Lingo to sort clusters based only on cluster size. Value equal to 1.0 will cause Lingo to sort clusters based only on cluster score.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.0
Min value 0.0
Max value 1.0
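
To make the two extremes of the size-score sorting ratio concrete, the Python sketch below sorts clusters by a blended key. The geometric blend used here is an assumption chosen because it reduces to pure size at 0.0 and pure score at 1.0; the manual does not state the exact formula Lingo uses, so treat this as an illustration only.

```python
def sort_clusters(clusters, score_weight=0.0):
    """Sort clusters by a weighted blend of size and score, best first."""
    def key(cluster):
        # Assumed blend: size^(1 - w) * score^w.
        # w = 0.0 sorts by size only, w = 1.0 sorts by score only.
        return (cluster["size"] ** (1.0 - score_weight)
                * cluster["score"] ** score_weight)
    return sorted(clusters, key=key, reverse=True)


clusters = [
    {"label": "large, weak", "size": 20, "score": 0.2},
    {"label": "small, strong", "size": 5, "score": 0.9},
]
by_size = sort_clusters(clusters, 0.0)
by_score = sort_clusters(clusters, 1.0)
print([c["label"] for c in by_size])
print([c["label"] for c in by_score])
```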

11.3.4 Documents

Documents

Key documents
Direction Input
Level BASIC
Description Documents to cluster.
Required yes
Scope Processing time
Value type java.util.List
Default value none

11.3.5 Label filtering

Remove labels ending in genitive form

Key GenitiveLabelFilter.enabled
Direction Input
Level BASIC
Description Remove labels ending in genitive form. Removes labels that end in words in the Saxon Genitive form (e.g. "Threatening the Country's").
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true

Remove leading and trailing stop words

Key StopWordLabelFilter.enabled
Direction Input
Level BASIC
Description Remove leading and trailing stop words. Removes labels that consist of, start with or end in stop words.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true

Remove numeric labels

Key NumericLabelFilter.enabled
Direction Input
Level BASIC
Description Remove numeric labels. Removes labels that consist only of or start with numbers.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true

Remove query words

Key QueryLabelFilter.enabled
Direction Input
Level BASIC
Description Remove query words. Removes labels that consist only of words contained in the query.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true

Remove short labels

Key MinLengthLabelFilter.enabled
Direction Input
Level BASIC
Description Remove labels shorter than 3 characters. Removes labels whose total length in characters, including spaces, is less than 3.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true

Remove stop labels

Key StopLabelFilter.enabled
Direction Input
Level BASIC
Description Remove stop labels. Removes labels that are declared as stop labels in the stoplabels.<lang> files. Please note that adding a long list of regular expressions to the stoplabels file may result in a noticeable performance penalty.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true

Remove truncated phrases

Key CompleteLabelFilter.enabled
Direction Input
Level BASIC
Description Remove truncated phrases. Tries to remove "incomplete" cluster labels. For example, in a collection of documents related to Data Mining, the phrase Conference on Data is incomplete in the sense that most likely it should be Conference on Data Mining or even Conference on Data Mining in Large Databases. When truncated phrase removal is enabled, the algorithm tries to remove "incomplete" phrases like the former one and leave only the more informative variants.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
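
Several of the label filters above are simple enough to sketch. The Python function below is a deliberately simplified illustration of the short-label, numeric-label, genitive and query-word rules as they are described in this section; it is not the Carrot2 implementation, and the function name and whitespace-based word splitting are assumptions.

```python
def keep_label(label, query="", min_length=3):
    """Return True if the label passes the simplified filters below."""
    words = label.split()
    # Remove short labels: total length in characters, including spaces.
    if len(label) < min_length or not words:
        return False
    # Remove numeric labels: labels that consist only of or start with numbers.
    if words[0].isdigit():
        return False
    # Remove labels ending in Saxon Genitive form, e.g. "Country's".
    if words[-1].lower().endswith("'s"):
        return False
    # Remove query words: every word of the label occurs in the query.
    query_words = {word.lower() for word in query.split()}
    if all(word.lower() in query_words for word in words):
        return False
    return True


candidates = ["Data Mining", "Threatening the Country's",
              "42 results", "ab", "Globe"]
kept = [label for label in candidates if keep_label(label, query="globe")]
print(kept)
```

Stop-word, stop-label and truncated-phrase filtering depend on language resources and on the full set of candidate labels, so they are left out of this sketch.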

11.3.6 Labels

Cluster label assignment method

Key LingoClusteringAlgorithm.labelAssigner
Direction Input
Level ADVANCED
Description Cluster label assignment method.
Required yes
Scope Processing time
Value type org.carrot2.clustering.lingo.ILabelAssigner
Default value org.carrot2.clustering.lingo.UniqueLabelAssigner
Allowed value types No other assignable value types are allowed.

Cluster merging threshold

Key LingoClusteringAlgorithm.clusterMergingThreshold
Direction Input
Level MEDIUM
Description Cluster merging threshold. The percentage overlap between two clusters' documents required for the clusters to be merged into one cluster. Low values will result in more aggressive merging, which may lead to irrelevant documents in clusters. High values will result in fewer clusters being merged, which may lead to very similar or duplicated clusters.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.7
Min value 0.0
Max value 1.0

Phrase label boost

Key LingoClusteringAlgorithm.phraseLabelBoost
Direction Input
Level MEDIUM
Description Phrase label boost. The weight of multi-word labels relative to one-word labels. Low values will result in more one-word labels being produced, higher values will favor multi-word labels.
Required no
Scope Processing time
Value type java.lang.Double
Default value 1.5
Min value 0.0
Max value 10.0

Phrase length penalty start

Key LingoClusteringAlgorithm.phraseLengthPenaltyStart
Direction Input
Level ADVANCED
Description Phrase length penalty start. The phrase length at which the overlong multi-word labels should start to be penalized. Phrases of length smaller than phraseLengthPenaltyStart will not be penalized.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 8
Min value 2
Max value 8

Phrase length penalty stop

Key LingoClusteringAlgorithm.phraseLengthPenaltyStop
Direction Input
Level ADVANCED
Description Phrase length penalty stop. The phrase length at which the overlong multi-word labels should be removed completely. Phrases of length larger than phraseLengthPenaltyStop will be removed.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 8
Min value 2
Max value 8

Title word boost

Key TermDocumentMatrixBuilder.titleWordsBoost
Direction Input
Level MEDIUM
Description Title word boost. Gives more weight to words that appear in Document.TITLE fields.
Required no
Scope Processing time
Value type java.lang.Double
Default value 2.0
Min value 0.0
Max value 10.0
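
Attribute values are addressed by their keys, so one way to prepare them is a map from key to value. The sketch below collects three of the label attributes described above and checks each value against its documented range before storing it. The helper class and its methods are hypothetical and not part of the Carrot2 API; only the keys and ranges come from this reference.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LingoAttributeSketch {
    /** Stores value under key, rejecting values outside the documented [min, max] range. */
    static void put(Map<String, Object> attrs, String key, double value, double min, double max) {
        if (value < min || value > max) {
            throw new IllegalArgumentException(
                key + " must be in [" + min + ", " + max + "], got " + value);
        }
        attrs.put(key, value);
    }

    /** Builds a map holding the documented default values. */
    static Map<String, Object> example() {
        Map<String, Object> attrs = new LinkedHashMap<>();
        put(attrs, "LingoClusteringAlgorithm.clusterMergingThreshold", 0.7, 0.0, 1.0);
        put(attrs, "LingoClusteringAlgorithm.phraseLabelBoost", 1.5, 0.0, 10.0);
        put(attrs, "TermDocumentMatrixBuilder.titleWordsBoost", 2.0, 0.0, 10.0);
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```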

11.3.7 Matrix model

Factorization method

Factorization quality

Key LingoClusteringAlgorithm.factorizationQuality
Direction Input
Level ADVANCED
Description Factorization quality. The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.
Required yes
Scope Processing time
Value type org.carrot2.matrix.factorization.IterationNumberGuesser$FactorizationQuality
Default value HIGH
Allowed values
  • LOW  (Low)
  • MEDIUM  (Medium)
  • HIGH  (High)

Maximum matrix size

Key TermDocumentMatrixBuilder.maximumMatrixSize
Direction Input
Level MEDIUM
Description Maximum matrix size. The maximum number of the term-document matrix elements. The larger the size, the more accurate, but also the more time- and memory-consuming, the clustering.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 37500
Min value 5000

Maximum word document frequency

Key TermDocumentMatrixBuilder.maxWordDf
Direction Input
Level ADVANCED
Description Maximum word document frequency. The maximum document frequency allowed for words as a fraction of all documents. Words with document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. The default value of 1.0 means that all words will be taken into account, no matter how many documents they appear in.

This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such cases, setting maxWordDf to a value lower than 1.0, e.g. 0.9, may improve the clusters.

Another useful application of this attribute is when there is a need to generate only very specific clusters, i.e. clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, e.g. 0.1 or 0.05.

Required no
Scope Processing time
Value type java.lang.Double
Default value 1.0
Min value 0.0
Max value 1.0
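
The effect of maxWordDf can be illustrated with a small, self-contained sketch. It is not the code Carrot2 uses internally; the class and method names are hypothetical, and documents are simplified to lists of already tokenized words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class MaxWordDfSketch {
    /** Returns the words whose document frequency, as a fraction of all documents, is at most maxWordDf. */
    static Set<String> retainedWords(List<List<String>> documents, double maxWordDf) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> document : documents) {
            for (String word : new HashSet<>(document)) { // count each word once per document
                df.merge(word, 1, Integer::sum);
            }
        }
        Set<String> retained = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : df.entrySet()) {
            if ((double) entry.getValue() / documents.size() <= maxWordDf) {
                retained.add(entry.getKey());
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
            Arrays.asList("acme", "java", "clustering"),
            Arrays.asList("acme", "search"),
            Arrays.asList("acme", "java"),
            Arrays.asList("acme", "lucene"));
        // "acme" occurs in every document, so it is dropped at 0.9 and kept at 1.0.
        System.out.println(retainedWords(documents, 0.9));
        System.out.println(retainedWords(documents, 1.0));
    }
}
```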

Native matrix operations used

Key LingoClusteringAlgorithm.nativeMatrixUsed
Direction Output
Description Indicates whether Lingo used fast native matrix computation routines. The value of this attribute is equal to NNIInterface.isNativeBlasAvailable() at the time of running the algorithm.
Scope Processing time
Value type java.lang.Boolean
Default value none

Term weighting

Key TermDocumentMatrixBuilder.termWeighting
Direction Input
Level ADVANCED
Description Term weighting. The method for calculating the weight of words in the term-document matrix.
Required yes
Scope Processing time
Value type org.carrot2.text.vsm.ITermWeighting
Default value org.carrot2.text.vsm.LogTfIdfTermWeighting
Allowed value types Other assignable value types are allowed.

11.3.8 Multilingual clustering

Default clustering language

Key MultilingualClustering.defaultLanguage
Direction Input
Level MEDIUM
Description Default clustering language. The default language to use for documents with undefined Document.LANGUAGE.
Required yes
Scope Processing time
Value type org.carrot2.core.LanguageCode
Default value ENGLISH
Allowed values
  • ARABIC  (Arabic)
  • CHINESE_SIMPLIFIED  (Chinese Simplified)
  • DANISH  (Danish)
  • DUTCH  (Dutch)
  • ENGLISH  (English)
  • FINNISH  (Finnish)
  • FRENCH  (French)
  • GERMAN  (German)
  • HUNGARIAN  (Hungarian)
  • ITALIAN  (Italian)
  • KOREAN  (Korean)
  • NORWEGIAN  (Norwegian)
  • POLISH  (Polish)
  • PORTUGUESE  (Portuguese)
  • ROMANIAN  (Romanian)
  • RUSSIAN  (Russian)
  • SPANISH  (Spanish)
  • SWEDISH  (Swedish)
  • TURKISH  (Turkish)

Language aggregation strategy

Key MultilingualClustering.languageAggregationStrategy
Direction Input
Level MEDIUM
Description Language aggregation strategy. Determines how clusters generated for individual languages should be combined to form the final result. Please see LanguageAggregationStrategy for the list of available options.
Required yes
Scope Processing time
Value type org.carrot2.text.clustering.MultilingualClustering$LanguageAggregationStrategy
Default value FLATTEN_MAJOR_LANGUAGE
Allowed values
  • FLATTEN_ALL  (Flatten clusters from all languages)
  • FLATTEN_MAJOR_LANGUAGE  (Flatten clusters from the majority language)
  • FLATTEN_NONE  (Dedicated parent cluster for each language)

11.3.9 Phrase extraction

Phrase Document Frequency threshold

Key PhraseExtractor.dfThreshold
Direction Input
Level ADVANCED
Description Phrase Document Frequency threshold. Phrases appearing in fewer than dfThreshold documents will be ignored.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1
Min value 1
Max value 100

Truncated label threshold

Key CompleteLabelFilter.labelOverrideThreshold
Direction Input
Level ADVANCED
Description Truncated label threshold. Determines the strength of the truncated label filter. The lowest value means the strongest elimination of truncated labels, which may lead to overlong cluster labels and many unclustered documents. The highest value effectively disables the filter, which may result in short or truncated labels.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.65
Min value 0.0
Max value 1.0

11.3.10 Preprocessing

Document fields

Key Tokenizer.documentFields
Direction Input
Level ADVANCED
Description Textual fields of documents that should be tokenized and parsed for clustering.
Required no
Scope Initialization time
Value type java.util.Collection
Default value [title, snippet]

Exact phrase assignment

Key DocumentAssigner.exactPhraseAssignment
Direction Input
Level MEDIUM
Description Only exact phrase assignments. Assign only documents that contain the label in its original form, including the order of words. Enabling this option will cause fewer documents to be put in clusters, which results in higher precision of assignment, but also a larger "Other Topics" group. Disabling this option will cause more documents to be put in clusters, which will make the "Other Topics" cluster smaller, but also lower the precision of cluster-document assignments.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false
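
The difference between the two settings can be shown with a simplified matcher. This is only a sketch under stated assumptions: the class name is hypothetical, text is split on whitespace, and the relaxed mode is assumed to require only that every label word occurs somewhere in the document, which may differ from what Carrot2 actually does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PhraseAssignmentSketch {
    static boolean matches(String document, String label, boolean exactPhraseAssignment) {
        List<String> docWords = Arrays.asList(document.toLowerCase().split("\\s+"));
        List<String> labelWords = Arrays.asList(label.toLowerCase().split("\\s+"));
        if (exactPhraseAssignment) {
            // The label words must occur contiguously and in their original order.
            return Collections.indexOfSubList(docWords, labelWords) >= 0;
        }
        // Relaxed mode (assumed): every label word occurs somewhere in the document.
        return docWords.containsAll(labelWords);
    }

    public static void main(String[] args) {
        String document = "mining data for clusters";
        System.out.println(matches(document, "data mining", true));  // false
        System.out.println(matches(document, "data mining", false)); // true
    }
}
```

The example document is assigned to the "data mining" label only in the relaxed mode, which is why disabling the option tends to shrink the "Other Topics" group at the cost of precision.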

Language model factory

Key PreprocessingPipeline.languageModelFactory
Direction Input
Level ADVANCED
Description Language model factory. Creates the language model to be used by the clustering algorithm. The language model provides the lexical resources required to perform clustering, including stop words and a word stemming algorithm.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.text.linguistic.ILanguageModelFactory
Default value org.carrot2.text.linguistic.DefaultLanguageModelFactory
Allowed value types Other assignable value types are allowed.

Merge lexical resources

Key DefaultLanguageModelFactory.mergeResources
Direction Input
Level MEDIUM
Description Merges stop words and stop labels from all known languages. If set to false, only stop words and stop labels of the active language will be used. If set to true, stop words from all LanguageCodes will be used together and stop labels from all languages will be used together, no matter the active language. Lexical resource merging is useful when clustering data in a mix of different languages and should increase clustering quality in such settings.
Required no
Scope Initialization time and Processing time
Value type java.lang.Boolean
Default value true
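
A minimal sketch of what merging means for stop words follows. The class, the method and the tiny word lists are hypothetical and serve only as an illustration; the real lexical resources are loaded by the language model factory.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class StopWordMergeSketch {
    /** Returns the stop words in effect for the active language. */
    static Set<String> activeStopWords(Map<String, Set<String>> byLanguage,
                                       String activeLanguage, boolean mergeResources) {
        if (!mergeResources) {
            // Only the active language's own stop words are used.
            return byLanguage.getOrDefault(activeLanguage, Collections.<String>emptySet());
        }
        // Stop words of all known languages are used together.
        Set<String> merged = new TreeSet<>();
        for (Set<String> words : byLanguage.values()) {
            merged.addAll(words);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> byLanguage = new LinkedHashMap<>();
        byLanguage.put("ENGLISH", new TreeSet<>(Arrays.asList("the", "of")));
        byLanguage.put("GERMAN", new TreeSet<>(Arrays.asList("der", "und")));
        System.out.println(activeStopWords(byLanguage, "ENGLISH", false)); // [of, the]
        System.out.println(activeStopWords(byLanguage, "ENGLISH", true));  // [der, of, the, und]
    }
}
```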

Minimum cluster size

Key DocumentAssigner.minClusterSize
Direction Input
Level MEDIUM
Description Determines the minimum number of documents in each cluster.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 2
Min value 1
Max value 100

Reload lexical resources

Key DefaultLanguageModelFactory.reloadResources
Direction Input
Level MEDIUM
Description Reloads cached stop words and stop labels on every processing request. For best performance, lexical resource reloading should be disabled in production.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false

Word Document Frequency threshold

Key CaseNormalizer.dfThreshold
Direction Input
Level ADVANCED
Description Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1
Min value 1
Max value 100

11.3.11 Search query

Query

Key query
Direction Input
Level BASIC
Description Query that produced the documents. The query helps the algorithm to create better clusters, so providing it is optional but desirable.
Required no
Scope Processing time
Value type java.lang.String
Default value none

11.4 Suffix Tree Clustering

11.4.3 Base cluster boosts

Document count boost

Key STCClusteringAlgorithm.documentCountBoost
Direction Input
Level MEDIUM
Description Document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.
Required no
Scope Processing time
Value type java.lang.Double
Default value 1.0
Min value 0.0

Optimal label length

Key STCClusteringAlgorithm.optimalPhraseLength
Direction Input
Level BASIC
Description Optimal label length. A factor in calculation of the base cluster score.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 3
Min value 1

Phrase length tolerance

Key STCClusteringAlgorithm.optimalPhraseLengthDev
Direction Input
Level MEDIUM
Description Phrase length tolerance. A factor in calculation of the base cluster score.
Required no
Scope Processing time
Value type java.lang.Double
Default value 2.0
Min value 0.5

Single term boost

Key STCClusteringAlgorithm.singleTermBoost
Direction Input
Level MEDIUM
Description Single term boost. A factor in calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.5
Min value 0.0
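
The four boost attributes above all feed into the base cluster score. The sketch below shows one plausible way they could combine. The bell-shaped length penalty centred on the optimal label length is an assumption made for illustration, not a statement of the formula STC uses, and the class name is hypothetical.

```java
final class BaseClusterScoreSketch {
    static double score(int phraseLength, int documentCount,
                        double documentCountBoost, int optimalPhraseLength,
                        double optimalPhraseLengthDev, double singleTermBoost) {
        double lengthFactor;
        if (phraseLength == 1 && singleTermBoost > 0) {
            // Single-term base clusters get this value regardless of the penalty function.
            lengthFactor = singleTermBoost;
        } else {
            // Assumed penalty: 1.0 at the optimal length, falling off with distance from it;
            // a larger tolerance makes the fall-off gentler.
            double distance = phraseLength - optimalPhraseLength;
            lengthFactor = Math.exp(-(distance * distance)
                / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
        }
        return lengthFactor * documentCount * documentCountBoost;
    }

    public static void main(String[] args) {
        // Documented defaults: boost 1.0, optimal length 3, tolerance 2.0, single term boost 0.5.
        System.out.println(score(3, 10, 1.0, 3, 2.0, 0.5)); // 10.0
        System.out.println(score(1, 10, 1.0, 3, 2.0, 0.5)); // 5.0
        System.out.println(score(6, 10, 1.0, 3, 2.0, 0.5)); // lower than 10.0
    }
}
```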

11.4.4 Base clusters

Maximum base clusters count

Key STCClusteringAlgorithm.maxBaseClusters
Direction Input
Level ADVANCED
Description Maximum base clusters count. Trims the base cluster array after the N-th position for the merging phase.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 300
Min value 2

Minimum base cluster score

Key STCClusteringAlgorithm.minBaseClusterScore
Direction Input
Level ADVANCED
Description Minimum base cluster score.
Required no
Scope Processing time
Value type java.lang.Double
Default value 2.0
Min value 0.0
Max value 10.0

Minimum documents per base cluster

Key STCClusteringAlgorithm.minBaseClusterSize
Direction Input
Level ADVANCED
Description Minimum documents per base cluster.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 2
Min value 2
Max value 20
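
Taken together, the three attributes in this section select which base clusters reach the merging phase. A minimal sketch, assuming that candidates are first filtered by size and score and then sorted by descending score before trimming; the class names are hypothetical and the ordering is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class BaseClusterSelectionSketch {
    static final class BaseCluster {
        final String phrase;
        final int documentCount;
        final double score;

        BaseCluster(String phrase, int documentCount, double score) {
            this.phrase = phrase;
            this.documentCount = documentCount;
            this.score = score;
        }
    }

    static List<BaseCluster> select(List<BaseCluster> candidates, int minBaseClusterSize,
                                    double minBaseClusterScore, int maxBaseClusters) {
        List<BaseCluster> kept = new ArrayList<>();
        for (BaseCluster c : candidates) {
            if (c.documentCount >= minBaseClusterSize && c.score >= minBaseClusterScore) {
                kept.add(c);
            }
        }
        // Best-scoring base clusters first, then trim after the N-th position.
        kept.sort(Comparator.comparingDouble((BaseCluster c) -> c.score).reversed());
        return new ArrayList<>(kept.subList(0, Math.min(maxBaseClusters, kept.size())));
    }

    public static void main(String[] args) {
        List<BaseCluster> candidates = Arrays.asList(
            new BaseCluster("a", 5, 9.0), new BaseCluster("b", 1, 8.0),
            new BaseCluster("c", 3, 1.0), new BaseCluster("d", 4, 4.0),
            new BaseCluster("e", 2, 6.0));
        for (BaseCluster c : select(candidates, 2, 2.0, 2)) {
            System.out.println(c.phrase + " " + c.score); // prints "a 9.0" then "e 6.0"
        }
    }
}
```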

11.4.5 Clusters

Clusters

Key clusters
Direction Output
Description Clusters created by the algorithm.
Scope Processing time
Value type java.util.List
Default value none

11.4.6 Documents

Documents

Key documents
Direction Input
Level BASIC
Description Documents to cluster.
Required yes
Scope Processing time
Value type java.util.List
Default value none

11.4.7 Label creation

Maximum cluster phrase overlap

Key STCClusteringAlgorithm.maxPhraseOverlap
Direction Input
Level ADVANCED
Description Maximum cluster phrase overlap.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.6
Min value 0.0
Max value 1.0

Maximum phrases per label

Key STCClusteringAlgorithm.maxPhrases
Direction Input
Level BASIC
Description Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 3
Min value 1

Maximum words per label

Key STCClusteringAlgorithm.maxDescPhraseLength
Direction Input
Level BASIC
Description Maximum words per label. Base clusters formed by phrases with more words than this limit are trimmed.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 4
Min value 1

Minimum general phrase coverage

Key STCClusteringAlgorithm.mostGeneralPhraseCoverage
Direction Input
Level ADVANCED
Description Minimum general phrase coverage. Minimum phrase coverage to appear in cluster description.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.5
Min value 0.0
Max value 1.0

11.4.8 Merging and output

Base cluster merge threshold

Key STCClusteringAlgorithm.mergeThreshold
Direction Input
Level ADVANCED
Description Base cluster merge threshold.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.6
Min value 0.0
Max value 1.0

Maximum final clusters

Key STCClusteringAlgorithm.maxClusters
Direction Input
Level BASIC
Description Maximum final clusters.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 15
Min value 1

11.4.9 Multilingual clustering

Default clustering language

Key MultilingualClustering.defaultLanguage
Direction Input
Level MEDIUM
Description Default clustering language. The default language to use for documents with undefined Document.LANGUAGE.
Required yes
Scope Processing time
Value type org.carrot2.core.LanguageCode
Default value ENGLISH
Allowed values
  • ARABIC  (Arabic)
  • CHINESE_SIMPLIFIED  (Chinese Simplified)
  • DANISH  (Danish)
  • DUTCH  (Dutch)
  • ENGLISH  (English)
  • FINNISH  (Finnish)
  • FRENCH  (French)
  • GERMAN  (German)
  • HUNGARIAN  (Hungarian)
  • ITALIAN  (Italian)
  • KOREAN  (Korean)
  • NORWEGIAN  (Norwegian)
  • POLISH  (Polish)
  • PORTUGUESE  (Portuguese)
  • ROMANIAN  (Romanian)
  • RUSSIAN  (Russian)
  • SPANISH  (Spanish)
  • SWEDISH  (Swedish)
  • TURKISH  (Turkish)

Language aggregation strategy

Key MultilingualClustering.languageAggregationStrategy
Direction Input
Level MEDIUM
Description Language aggregation strategy. Determines how clusters generated for individual languages should be combined to form the final result. Please see LanguageAggregationStrategy for the list of available options.
Required yes
Scope Processing time
Value type org.carrot2.text.clustering.MultilingualClustering$LanguageAggregationStrategy
Default value FLATTEN_MAJOR_LANGUAGE
Allowed values
  • FLATTEN_ALL  (Flatten clusters from all languages)
  • FLATTEN_MAJOR_LANGUAGE  (Flatten clusters from the majority language)
  • FLATTEN_NONE  (Dedicated parent cluster for each language)

11.4.10 Preprocessing

Document fields

Key Tokenizer.documentFields
Direction Input
Level ADVANCED
Description Textual fields of documents that should be tokenized and parsed for clustering.
Required no
Scope Initialization time
Value type java.util.Collection
Default value [title, snippet]

Language model factory

Key PreprocessingPipeline.languageModelFactory
Direction Input
Level ADVANCED
Description Language model factory. Creates the language model to be used by the clustering algorithm. The language model provides the lexical resources required to perform clustering, including stop words and a word stemming algorithm.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.text.linguistic.ILanguageModelFactory
Default value org.carrot2.text.linguistic.DefaultLanguageModelFactory
Allowed value types Other assignable value types are allowed.

Merge lexical resources

Key DefaultLanguageModelFactory.mergeResources
Direction Input
Level MEDIUM
Description Merges stop words and stop labels from all known languages. If set to false, only stop words and stop labels of the active language will be used. If set to true, stop words from all LanguageCodes will be used together and stop labels from all languages will be used together, no matter the active language. Lexical resource merging is useful when clustering data in a mix of different languages and should increase clustering quality in such settings.
Required no
Scope Initialization time and Processing time
Value type java.lang.Boolean
Default value true

Reload lexical resources

Key DefaultLanguageModelFactory.reloadResources
Direction Input
Level MEDIUM
Description Reloads cached stop words and stop labels on every processing request. For best performance, lexical resource reloading should be disabled in production.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false

Word Document Frequency threshold

Key CaseNormalizer.dfThreshold
Direction Input
Level ADVANCED
Description Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1
Min value 1
Max value 100

11.4.11 Search query

Query

Key query
Direction Input
Level BASIC
Description Query that produced the documents. The query helps the algorithm to create better clusters, so providing it is optional but desirable.
Required no
Scope Processing time
Value type java.lang.String
Default value none

11.4.12 Word filtering

Maximum word-document ratio

Key STCClusteringAlgorithm.ignoreWordIfInHigherDocsPercent
Direction Input
Level MEDIUM
Description Maximum word-document ratio. A number between 0 and 1. If a word occurs in a larger fraction of snippets than this ratio, it is ignored.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.9
Min value 0.0
Max value 1.0

Minimum word-document recurrences

Key STCClusteringAlgorithm.ignoreWordIfInFewerDocs
Direction Input
Level MEDIUM
Description Minimum word-document recurrences.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 2
Min value 2

11.5 Open Search

The Open Search document source retrieves search results from search engines that support the OpenSearch standard.

11.5.3 Data source status

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none

Page Requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

Successful Queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

11.5.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine or document retrieval system.
Scope Processing time
Value type java.util.Collection
Default value none

11.5.5 Results paging

Search Mode

Key search-mode
Direction Input
Level ADVANCED
Description Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
Required no
Scope Processing time
Value type org.carrot2.source.MultipageSearchEngine$SearchMode
Default value SPECULATIVE
Allowed values
  • CONSERVATIVE  (CONSERVATIVE)
  • SPECULATIVE  (SPECULATIVE)

11.5.6 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents or search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

Start Index

Key start
Direction Input
Level ADVANCED
Description Index of the first document or search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0

11.5.7 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.5.8 Service

Feed URL parameters

Key OpenSearchDocumentSource.feedUrlParams
Direction Input
Level ADVANCED
Description Additional parameters to be appended to feedUrlTemplate on each request.
Required no
Scope Initialization time and Processing time
Value type java.util.Map
Default value none

Feed URL template

Key OpenSearchDocumentSource.feedUrlTemplate
Direction Input
Level ADVANCED
Description URL to fetch the search feed from. The URL template can contain variable placeholders, as defined by the OpenSearch specification, that are replaced at runtime. The format of a placeholder is ${variable}. The following variables are supported:
  • searchTerms: replaced by the query.
  • startIndex: index of the first result to be fetched. Mutually exclusive with startPage.
  • startPage: index of the first page of results to be fetched. Mutually exclusive with startIndex.
  • count: the number of search results per page.
Required yes
Scope Initialization time
Value type java.lang.String
Default value none
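
Placeholder substitution of this kind can be sketched in a few lines. The example is illustrative only: the class name and the example.com URL are hypothetical, and URL-encoding of the substituted values is an assumption rather than documented behavior.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FeedUrlTemplateSketch {
    /** Replaces every ${name} placeholder with the URL-encoded value of that variable. */
    static String expand(String template, Map<String, String> variables) {
        String url = template;
        for (Map.Entry<String, String> variable : variables.entrySet()) {
            url = url.replace("${" + variable.getKey() + "}", encode(variable.getValue()));
        }
        return url;
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("searchTerms", "data mining");
        variables.put("startIndex", "0");
        variables.put("count", "50");
        // Hypothetical feed URL, used only for illustration.
        String template = "http://example.com/search?q=${searchTerms}&start=${startIndex}&n=${count}";
        System.out.println(expand(template, variables));
        // prints http://example.com/search?q=data+mining&start=0&n=50
    }
}
```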

Maximum results

Key OpenSearchDocumentSource.maximumResults
Direction Input
Level ADVANCED
Description Maximum number of results. The maximum number of results the document source can deliver.
Required no
Scope Initialization time
Value type java.lang.Integer
Default value 1000
Min value 1

Results per page

Key OpenSearchDocumentSource.resultsPerPage
Direction Input
Level ADVANCED
Description Results per page. The number of results per page the document source will expect the feed to return.
Required yes
Scope Initialization time
Value type java.lang.Integer
Default value 0
Min value 1
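
The paging attributes (start, results, resultsPerPage and maximumResults) together determine which page requests are needed. The sketch below splits a requested range into per-page requests. It is an illustration with a hypothetical class name, and treating maximumResults as an upper bound on the result index is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class PageRangeSketch {
    /** Returns {startIndex, count} pairs covering [start, start + results), capped at maximumResults. */
    static List<int[]> pages(int start, int results, int resultsPerPage, int maximumResults) {
        List<int[]> pages = new ArrayList<>();
        int end = Math.min(start + results, maximumResults);
        for (int from = start; from < end; from += resultsPerPage) {
            pages.add(new int[] { from, Math.min(resultsPerPage, end - from) });
        }
        return pages;
    }

    public static void main(String[] args) {
        // 100 results in pages of 30: three full pages and a final page of 10.
        for (int[] page : pages(0, 100, 30, 1000)) {
            System.out.println("startIndex=" + page[0] + " count=" + page[1]);
        }
    }
}
```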

11.6 Google Web Search

Searches the web using Google.

11.6.1 Google Web Search input attributes by level

11.6.3 Data source status

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none

Page Requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

Successful Queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

11.6.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine or document retrieval system.
Scope Processing time
Value type java.util.Collection
Default value none

11.6.5 Postprocessing

Keep highlights

Key GoogleDocumentSource.keepHighlights
Direction Input
Level ADVANCED
Description Keep query word highlighting. Google by default highlights query words in snippets using the bold HTML tag. Set this attribute to true to keep these highlights.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false
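
The effect of this attribute on a snippet can be sketched as follows. The class name is hypothetical and the regular expression only covers plain bold tags; the actual document source may handle markup differently.

```java
import java.util.regex.Pattern;

final class HighlightSketch {
    // Matches opening and closing bold tags, e.g. <b> and </b>.
    private static final Pattern BOLD_TAG = Pattern.compile("</?b>", Pattern.CASE_INSENSITIVE);

    /** Returns the snippet unchanged when highlights are kept, otherwise with bold tags removed. */
    static String snippet(String raw, boolean keepHighlights) {
        return keepHighlights ? raw : BOLD_TAG.matcher(raw).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "<b>Carrot2</b> clusters search results";
        System.out.println(snippet(raw, false)); // Carrot2 clusters search results
        System.out.println(snippet(raw, true));  // <b>Carrot2</b> clusters search results
    }
}
```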

11.6.6 Results paging

Search Mode

Key search-mode
Direction Input
Level ADVANCED
Description Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
Required no
Scope Processing time
Value type org.carrot2.source.MultipageSearchEngine$SearchMode
Default value SPECULATIVE
Allowed values
  • CONSERVATIVE  (CONSERVATIVE)
  • SPECULATIVE  (SPECULATIVE)

11.6.7 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents or search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

Start Index

Key start
Direction Input
Level ADVANCED
Description Index of the first document or search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0

11.6.8 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.6.9 Service

Google API Key

Key GoogleDocumentSource.apiKey
Direction Input
Level ADVANCED
Description Google API Key. Please do not use the default key when deploying this component in production environments. Instead, generate and use your own key.
Required no
Scope Processing time
Value type java.lang.String
Default value ABQIAAAA_XmITjrzoipJYoBApAgGJhS8yIvkL4-1sNwOJWkV7nbkjq_Z_BQW0-uzOh5lKXRtEXQDTGbzIEz06Q

Referer

Key GoogleDocumentSource.referer
Direction Input
Level ADVANCED
Description Request referer. Please do not use the default value when deploying this component in production environments. Instead, put the URL to your application here.
Required no
Scope Processing time
Value type java.lang.String
Default value http://www.carrot2.org

Service URL

Key GoogleDocumentSource.serviceUrl
Direction Input
Level ADVANCED
Description Service URL. Google web search service URL.
Required no
Scope Processing time
Value type java.lang.String
Default value http://ajax.googleapis.com/ajax/services/search/web

11.7 eTools Metasearch Engine

The eTools document source searches the web using the etools.ch metasearch engine.

11.7.1 eTools Metasearch Engine input attributes by level

Medium

11.7.3 Data source status

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none

Page Requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

Successful Queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

11.7.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine or document retrieval system.
Scope Processing time
Value type java.util.Collection
Default value none

11.7.5 Results filtering

Country

Key EToolsDocumentSource.country
Direction Input
Level MEDIUM
Description Determines the country of origin for the returned search results.
Required no
Scope Processing time
Value type org.carrot2.source.etools.EToolsDocumentSource$Country
Default value ALL
Allowed values
  • ALL  (All)
  • AUSTRIA  (Austria)
  • FRANCE  (France)
  • GERMANY  (Germany)
  • GREAT_BRITAIN  (Great Britain)
  • ITALY  (Italy)
  • LICHTENSTEIN  (Lichtenstein)
  • SPAIN  (Spain)
  • SWITZERLAND  (Switzerland)

Language

Key EToolsDocumentSource.language
Direction Input
Level MEDIUM
Description Determines the language of the returned search results.
Required no
Scope Processing time
Value type org.carrot2.source.etools.EToolsDocumentSource$Language
Default value ENGLISH
Allowed values
  • ALL  (All)
  • ENGLISH  (English)
  • FRENCH  (French)
  • GERMAN  (German)
  • ITALIAN  (Italian)
  • SPANISH  (Spanish)

Safe search

Key EToolsDocumentSource.safeSearch
Direction Input
Level BASIC
Description If enabled, excludes offensive content from the results.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false

11.7.6 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents or search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

Start Index

Key start
Direction Input
Level ADVANCED
Description Index of the first document or search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0

11.7.7 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.7.8 Service

Data sources

Key EToolsDocumentSource.dataSources
Direction Input
Level ADVANCED
Description Determines which data sources to search.
Required no
Scope Processing time
Value type org.carrot2.source.etools.EToolsDocumentSource$DataSources
Default value ALL
Allowed values
  • ALL  (All)
  • FASTEST  (Fastest)

Partner

Key EToolsDocumentSource.partnerId
Direction Input
Level ADVANCED
Description eTools partner identifier. If you have commercial arrangements with eTools, specify your partner id here.
Required no
Scope Processing time
Value type java.lang.String
Default value Carrot2

Service URL

Key EToolsDocumentSource.serviceUrlBase
Direction Input
Level ADVANCED
Description Base URL for the eTools service.
Required no
Scope Processing time
Value type java.lang.String
Default value http://www.etools.ch/partnerSearch.do

Timeout

Key EToolsDocumentSource.timeout
Direction Input
Level ADVANCED
Description Maximum time in milliseconds to wait for all data sources to return results.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 4000
Min value 0

11.8 MSN Live Search

Searches the web using the MSN Live API.

11.8.1 MSN Live Search input attributes by level

11.8.3 Data source status

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none

Page Requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

Successful Queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

11.8.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine or document retrieval system.
Scope Processing time
Value type java.util.Collection
Default value none

11.8.5 Results filtering

Culture

Key MicrosoftLiveDocumentSource.culture
Direction Input
Level MEDIUM
Description Culture and language restriction.
Required yes
Scope Processing time
Value type org.carrot2.source.microsoft.CultureInfo
Default value ENGLISH_UNITED_STATES
Allowed values
  • ARABIC_ARABIA  (Arabic – Arabia)
  • BULGARIAN_BULGARIA  (Bulgarian – Bulgaria)
  • CHINESE_CHINA  (Chinese – China)
  • CHINESE_HONG_KONG_SAR  (Chinese – Hong Kong SAR)
  • CHINESE_TAIWAN  (Chinese – Taiwan)
  • CROATIAN_CROATIA  (Croatian – Croatia)
  • CZECH_CZECH_REPUBLIC  (Czech – Czech Republic)
  • DANISH_DENMARK  (Danish – Denmark)
  • DUTCH_BELGIUM  (Dutch – Belgium)
  • DUTCH_NETHERLANDS  (Dutch – Netherlands)
  • ENGLISH_AUSTRALIA  (English – Australia)
  • ENGLISH_ARABIA  (English – Arabia)
  • ENGLISH_CANADA  (English – Canada)
  • ENGLISH_INDIA  (English – India)
  • ENGLISH_INDONESIA  (English – Indonesia)
  • ENGLISH_IRELAND  (English – Ireland)
  • ENGLISH_MALAYSIA  (English – Malaysia)
  • ENGLISH_NEW_ZEALAND  (English – New Zealand)
  • ENGLISH_PHILIPPINES  (English – Philippines)
  • ENGLISH_SINGAPORE  (English – Singapore)
  • ENGLISH_SOUTH_AFRICA  (English – South Africa)
  • ENGLISH_UNITED_KINGDOM  (English – United Kingdom)
  • ENGLISH_UNITED_STATES  (English – United States)
  • ESTONIAN_ESTONIA  (Estonian – Estonia)
  • FINNISH_FINLAND  (Finnish – Finland)
  • FRENCH_BELGIUM  (French – Belgium)
  • FRENCH_FRANCE  (French – France)
  • FRENCH_CANADA  (French – Canada)
  • FRENCH_SWITZERLAND  (French – Switzerland)
  • GERMAN_AUSTRIA  (German – Austria)
  • GERMAN_GERMANY  (German – Germany)
  • GERMAN_SWITZERLAND  (German – Switzerland)
  • GREEK_GREECE  (Greek – Greece)
  • HEBREW_ISRAEL  (Hebrew – Israel)
  • HUNGARIAN_HUNGARY  (Hungarian – Hungary)
  • ITALIAN_ITALY  (Italian – Italy)
  • JAPANESE_JAPAN  (Japanese – Japan)
  • KOREAN_KOREA  (Korean – Korea)
  • LATVIAN_LATVIA  (Latvian – Latvia)
  • LITHUANIAN_LITHUANIA  (Lithuanian – Lithuania)
  • NORWEGIAN_NORWAY  (Norwegian – Norway)
  • POLISH_POLAND  (Polish – Poland)
  • PORTUGUESE_BRAZIL  (Portuguese – Brazil)
  • PORTUGUESE_PORTUGAL  (Portuguese – Portugal)
  • ROMANIAN_ROMANIA  (Romanian – Romania)
  • RUSSIAN_RUSSIA  (Russian – Russia)
  • SLOVAK_SLOVAK_REPUBLIC  (Slovak – Slovak Republic)
  • SLOVENIAN_SLOVENIA  (Slovenian – Slovenia)
  • SPANISH_ARGENTINA  (Spanish – Argentina)
  • SPANISH_CHILE  (Spanish – Chile)
  • SPANISH_LATIN_AMERICA  (Spanish – Latin America)
  • SPANISH_MEXICO  (Spanish – Mexico)
  • SPANISH_SPAIN  (Spanish – Spain)
  • SPANISH_UNITED_STATES  (Spanish – United States)
  • SWEDISH_SWEDEN  (Swedish – Sweden)
  • THAI_THAILAND  (Thai – Thailand)
  • TURKISH_TURKEY  (Turkish – Turkey)
  • UKRAINIAN_UKRAINE  (Ukrainian – Ukraine)

Safe Search

Key MicrosoftLiveDocumentSource.safeSearch
Direction Input
Level MEDIUM
Description Safe search restriction (porn filter).
Required yes
Scope Processing time
Value type org.carrot2.source.microsoft.SafeSearch
Default value MODERATE
Allowed values
  • OFF  (Off)
  • MODERATE  (Moderate)
  • STRICT  (Strict)

11.8.6 Results paging

Search Mode

Key search-mode
Direction Input
Level ADVANCED
Description Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
Required no
Scope Processing time
Value type org.carrot2.source.MultipageSearchEngine$SearchMode
Default value SPECULATIVE
Allowed values
  • CONSERVATIVE  (CONSERVATIVE)
  • SPECULATIVE  (SPECULATIVE)

11.8.7 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

Start Index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0
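
The constraints listed for the query, results and start attributes can be checked before a request is issued. The following is a minimal sketch, assuming the attributes are collected in a plain map under the keys shown above. The SearchAttributes class is a hypothetical helper written for illustration and is not part of the Carrot2 API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Carrot2: checks the documented constraints. */
final class SearchAttributes {
    /** Returns a list of constraint violations; the list is empty when the map is valid. */
    static List<String> validate(Map<String, Object> attrs) {
        List<String> errors = new ArrayList<>();

        // "query": required java.lang.String, must not be blank.
        Object query = attrs.get("query");
        if (!(query instanceof String) || ((String) query).trim().isEmpty()) {
            errors.add("query must be a non-blank string");
        }

        // "results": optional java.lang.Integer, default 100, min value 1.
        Object results = attrs.getOrDefault("results", 100);
        if (!(results instanceof Integer) || (Integer) results < 1) {
            errors.add("results must be an integer >= 1");
        }

        // "start": optional java.lang.Integer, default 0, min value 0.
        Object start = attrs.getOrDefault("start", 0);
        if (!(start instanceof Integer) || (Integer) start < 0) {
            errors.add("start must be an integer >= 0");
        }
        return errors;
    }

    public static void main(String[] args) {
        Map<String, Object> attrs = new LinkedHashMap<>();
        attrs.put("query", "data mining");
        attrs.put("results", 50);
        // "start" is omitted, so the documented default of 0 applies.
        System.out.println(validate(attrs));
    }
}
```

Returning a list of violations rather than throwing on the first one lets a caller report all problems with an attribute set at once.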

11.8.8 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.8.9 Service

Application ID

Key MicrosoftLiveDocumentSource.appid
Direction Input
Level ADVANCED
Description Microsoft-assigned application ID for querying the API. Please generate your own ID for production deployments and branches off the Carrot2.org code.
Required yes
Scope Initialization time
Value type java.lang.String
Default value DE531D8A42139F590B253CADFAD7A86172F93B96

11.9 Yahoo Web Search

Searches the web using the Yahoo Boss Web Search API.

11.9.3 Data source status

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none

Page Requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

Successful Queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

11.9.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system.
Scope Processing time
Value type java.util.Collection
Default value none

11.9.5 Postprocessing

Keep highlights

Key BossDocumentSource.keepHighlights
Direction Input
Level ADVANCED
Description Determines whether to keep the original query word highlights. By default, Yahoo highlights query words in search results using the <b> HTML tag. Set this attribute to true to keep these highlights.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false
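
The effect of this attribute can be pictured as an optional tag-stripping step applied to each result snippet. The sketch below is an illustration only: HighlightStripper is a hypothetical class, and the actual postprocessing in BossDocumentSource may differ.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of the keepHighlights behaviour; not Carrot2 code. */
final class HighlightStripper {
    // Matches opening and closing <b> tags only, case-insensitively.
    private static final Pattern BOLD_TAG =
        Pattern.compile("</?b>", Pattern.CASE_INSENSITIVE);

    /** Returns the snippet unchanged when highlights are kept, otherwise without <b> tags. */
    static String postprocess(String snippet, boolean keepHighlights) {
        return keepHighlights ? snippet : BOLD_TAG.matcher(snippet).replaceAll("");
    }

    public static void main(String[] args) {
        String snippet = "Fast <b>data</b> <b>mining</b> tools";
        System.out.println(postprocess(snippet, false)); // highlights removed
        System.out.println(postprocess(snippet, true));  // highlights kept
    }
}
```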

11.9.6 Results filtering

Content filter

Key BossWebSearchService.filter
Direction Input
Level MEDIUM
Description Filters out adult or hate content. Must be a comma-separated list of content types to filter out.

The following content types are supported:

Value   Content
-porn   Filters out adult content
-hate   Filters out hate content

Adult content filtering is supported for all languages; hate content filtering is supported for English only.

Required no
Scope Processing time
Value type org.carrot2.source.boss.BossWebSearchService$OffensiveContentFilter
Default value none
Allowed values
  • PORN  (remove porn)
  • HATE  (remove hate)
  • PORN_AND_HATE  (remove porn and hate)

Domain restriction

Key BossSearchService.sites
Direction Input
Level MEDIUM
Description Restricts search results to a set of sites. Must be a comma-separated list of site domain names, e.g. abc.com,cnn.com.
Required no
Scope Processing time
Value type java.lang.String
Default value none

Language and Region

Key BossSearchService.languageAndRegion
Direction Input
Level MEDIUM
Description Restricts search to the specified language and region. Must be a concatenation of constants defined by the language codes supported by the Yahoo Boss API.

The following languages and regions are currently (July 2009) supported:

Country                   Region  Language
Argentina                 ar      es
Austria                   at      de
Australia                 au      en
Brazil                    br      pt
Canada - English          ca      en
Canada - French           ca      fr
Catalan                   ct      ca
Chile                     cl      es
Columbia                  co      es
Czech Republic            cz      cs
Denmark                   dk      da
Finland                   fi      fi
Hong Kong                 hk      tzh
Hungary                   hu      hu
Indonesia - English       id      en
Indonesia - Indonesian    id      id
Israel                    il      he
India                     in      en
Japan                     jp      jp
Korea                     kr      kr
Mexico                    mx      es
Malaysia - English        my      en
Malaysia                  my      ms
Netherlands               nl      nl
Norway                    no      no
New Zealand               nz      en
Peru                      pe      es
Philippines               ph      tl
Philippines - English     ph      en
Russia                    ru      ru
Romania                   ro      ro
Sweden                    se      sv
Singapore                 sg      en
Taiwan                    tw      tzh
Thailand                  th      th
Turkey                    tr      tr
Switzerland - German      ch      de
Switzerland - French      ch      fr
Switzerland - Italian     ch      it
German                    de      de
Spanish                   es      es
French                    fr      fr
Italian                   it      it
United Kingdom            uk      en
United States - English   us      en
United States - Spanish   us      es
Vietnam                   vn      vi
Venezuela                 ve      es

Use BossLanguageCodes.getAttributeValue() to acquire the proper constant for this field.

Required no
Scope Initialization time and Processing time
Value type org.carrot2.source.boss.BossLanguageCodes
Default value none
Allowed values
  • ARGENTINA  (Argentina)
  • AUSTRIA  (Austria)
  • AUSTRALIA  (Australia)
  • BRAZIL  (Brazil)
  • CANADA_ENGLISH  (Canada – English)
  • CANADA_FRENCH  (Canada – French)
  • CATALAN  (Catalan)
  • CHILE  (Chile)
  • COLUMBIA  (Columbia)
  • CZECH_REPUBLIC  (Czech Republic)
  • DENMARK  (Denmark)
  • FINLAND  (Finland)
  • FRENCH  (French)
  • GERMAN  (German)
  • HONG_KONG  (Hong Kong)
  • HUNGARY  (Hungary)
  • INDIA  (India)
  • INDONESIA_ENGLISH  (Indonesia – English)
  • INDONESIA_INDONESIAN  (Indonesia – Indonesian)
  • ISRAEL  (Israel)
  • ITALIAN  (Italian)
  • JAPAN  (Japan)
  • KOREA  (Korea)
  • MALAYSIA_ENGLISH  (Malaysia – English)
  • MALAYSIA_MALAYSIAN  (Malaysia)
  • MEXICO  (Mexico)
  • NETHERLANDS  (Netherlands)
  • NEW_ZEALAND  (New Zealand)
  • NORWAY  (Norway)
  • PERU  (Peru)
  • PHILIPPINES  (Philippines)
  • PHILIPPINES_ENGLISH  (Philippines – English)
  • ROMANIA  (Romania)
  • RUSSIA  (Russia)
  • SINGAPORE  (Singapore)
  • SPANISH  (Spanish)
  • SWEDEN  (Sweden)
  • SWITZERLAND_GERMAN  (Switzerland – German)
  • SWITZERLAND_FRENCH  (Switzerland – French)
  • SWITZERLAND_ITALIAN  (Switzerland – Italian)
  • TAIWAN  (Taiwan)
  • THAILAND  (Thailand)
  • TURKEY  (Turkey)
  • UNITED_KINGDOM  (United Kingdom)
  • UNITED_STATES  (United States – English)
  • UNITED_STATES_SPANISH  (United States – Spanish)
  • VIETNAM  (Vietnam)
  • VENEZUELA  (Venezuela)
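
Each allowed value pairs a region code with a language code, and one region may appear with several languages (for example Canada or Switzerland). The sketch below models a few rows of the table above as an enum to make that pairing explicit. LanguageCodeSketch is a hypothetical stand-in written for illustration; it is not the real BossLanguageCodes class and does not reproduce the output of getAttributeValue().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a few rows of the table; not the real BossLanguageCodes. */
enum LanguageCodeSketch {
    CANADA_ENGLISH("ca", "en"),
    CANADA_FRENCH("ca", "fr"),
    HONG_KONG("hk", "tzh"),
    UNITED_KINGDOM("uk", "en");

    final String region;
    final String language;

    LanguageCodeSketch(String region, String language) {
        this.region = region;
        this.language = language;
    }

    /** Lists every constant defined for the given region code. */
    static List<LanguageCodeSketch> forRegion(String region) {
        List<LanguageCodeSketch> matches = new ArrayList<>();
        for (LanguageCodeSketch code : values()) {
            if (code.region.equals(region)) {
                matches.add(code);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Canada is listed twice in the table, once per language.
        for (LanguageCodeSketch code : forRegion("ca")) {
            System.out.println(code + ": region=" + code.region + ", language=" + code.language);
        }
    }
}
```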

Type filter

Key BossWebSearchService.type
Direction Input
Level ADVANCED
Description Restricts search to documents of the specified types. Must be a comma-separated list of the required document types or type groups.

The following document types are supported:

Value    Document type
html     HTML documents
text     Plain text documents
pdf      Portable Document Format documents
xl       Microsoft Excel documents
msword   Microsoft Word documents
ppt      Microsoft PowerPoint documents

The following document type groups are supported:

Value      Document type group
msoffice   All Microsoft Office documents (xl, msword, ppt)
nohtml     Anything other than HTML documents (text, pdf, xl, msword, ppt)

You can also specify a format group and then exclude an item: type=msoffice,-ppt.

Required no
Scope Processing time
Value type java.lang.String
Default value none
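
The group and exclusion syntax described above can be illustrated by expanding a filter value into the concrete document types it selects. The sketch below is only a model of the documented semantics: TypeFilterSketch is a hypothetical class, and the actual expansion is performed by the Yahoo Boss service, not by Carrot2.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the type filter syntax; not Carrot2 or Yahoo code. */
final class TypeFilterSketch {
    private static final List<String> TYPES =
        Arrays.asList("html", "text", "pdf", "xl", "msword", "ppt");
    private static final Map<String, List<String>> GROUPS = new LinkedHashMap<>();
    static {
        GROUPS.put("msoffice", Arrays.asList("xl", "msword", "ppt"));
        GROUPS.put("nohtml", Arrays.asList("text", "pdf", "xl", "msword", "ppt"));
    }

    /** Expands a value such as "msoffice,-ppt" into the set of selected types. */
    static Set<String> expand(String filter) {
        Set<String> included = new LinkedHashSet<>();
        Set<String> excluded = new LinkedHashSet<>();
        for (String raw : filter.split(",")) {
            String item = raw.trim();
            if (item.isEmpty()) {
                continue;
            }
            // A leading "-" marks an item to exclude.
            boolean exclude = item.startsWith("-");
            String name = exclude ? item.substring(1) : item;
            List<String> members;
            if (GROUPS.containsKey(name)) {
                members = GROUPS.get(name);
            } else if (TYPES.contains(name)) {
                members = Collections.singletonList(name);
            } else {
                throw new IllegalArgumentException("Unknown document type: " + name);
            }
            (exclude ? excluded : included).addAll(members);
        }
        // Exclusions are applied last, so their position in the list does not matter.
        included.removeAll(excluded);
        return included;
    }

    public static void main(String[] args) {
        System.out.println(expand("msoffice,-ppt")); // prints [xl, msword]
    }
}
```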

11.9.7 Results paging

Search Mode

Key search-mode
Direction Input
Level ADVANCED
Description Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
Required no
Scope Processing time
Value type org.carrot2.source.MultipageSearchEngine$SearchMode
Default value SPECULATIVE
Allowed values
  • CONSERVATIVE  (CONSERVATIVE)
  • SPECULATIVE  (SPECULATIVE)

11.9.8 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

Start Index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0
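
Together, the start and results attributes define a window of search results. A multipage data source has to cover that window with individual page requests, which is what the Page Requests statistic counts. The sketch below shows one way such a window could be split. PageRanges and the page size of 50 are assumptions made for illustration; the actual page size and request scheduling depend on the search service and the selected search mode.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of splitting a result window into page requests. */
final class PageRanges {
    /** Returns {offset, count} pairs covering `results` documents beginning at `start`. */
    static List<int[]> split(int start, int results, int pageSize) {
        if (start < 0 || results < 1 || pageSize < 1) {
            throw new IllegalArgumentException("start >= 0, results >= 1, pageSize >= 1 required");
        }
        List<int[]> ranges = new ArrayList<>();
        int end = start + results; // exclusive end of the window
        for (int offset = start; offset < end; offset += pageSize) {
            // The last page may be shorter than pageSize.
            ranges.add(new int[] {offset, Math.min(pageSize, end - offset)});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Assumed page size of 50; start=10, results=120.
        for (int[] range : split(10, 120, 50)) {
            System.out.println("offset=" + range[0] + ", count=" + range[1]);
        }
    }
}
```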

11.9.9 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.9.10 Service

Application ID

Key BossSearchService.appid
Direction Input
Level ADVANCED
Description Application ID required for BOSS services. Please generate your own ID for production deployments and branches off the Carrot2.org code.
Required no
Scope Initialization time
Value type java.lang.String
Default value txRLTt7V34GgabH9baqIrsnRLuy87i4dQ2kQyok0IIqlUXdw4HmxjE59xhq2_6mT0LM-

Boss Search Service

Key BossDocumentSource.service
Direction Input
Level ADVANCED
Description The specific search service to be used by this document source. Use this attribute to choose which BOSS service to query, e.g. Web, News or Image search.
Required no
Scope Initialization time
Value type org.carrot2.source.boss.BossSearchService
Default value org.carrot2.source.boss.BossWebSearchService
Allowed value types No other assignable value types are allowed.

Service URI

Key BossWebSearchService.serviceURI
Direction Input
Level ADVANCED
Description Boss Web search service URI. Specifies the URI at which the Yahoo Boss Web Search API is available. The ${query} placeholder will be replaced with the URL-encoded text of the processed query.
Required no
Scope Initialization time
Value type java.lang.String
Default value http://boss.yahooapis.com/ysearch/web/v1/${query}
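
The ${query} substitution can be sketched as follows. ServiceUriSketch is a hypothetical class written for illustration. It uses java.net.URLEncoder, which applies form encoding (spaces become "+"); the encoding Carrot2 actually applies may differ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical illustration of the ${query} placeholder substitution. */
final class ServiceUriSketch {
    /** Replaces the literal ${query} placeholder with the URL-encoded query text. */
    static String expand(String serviceUri, String query) {
        try {
            // URLEncoder performs form encoding: a space is encoded as "+".
            String encoded = URLEncoder.encode(query, "UTF-8");
            // String.replace treats "${query}" literally, not as a regular expression.
            return serviceUri.replace("${query}", encoded);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String uri = "http://boss.yahooapis.com/ysearch/web/v1/${query}";
        System.out.println(expand(uri, "data mining"));
        // prints http://boss.yahooapis.com/ysearch/web/v1/data+mining
    }
}
```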

11.10 Wikipedia Search (with Yahoo Boss)

Searches Wikipedia using the Yahoo Boss Web Search API.

11.10.3 Data source status

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none

Page Requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

Successful Queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

11.10.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system.
Scope Processing time
Value type java.util.Collection
Default value none

11.10.5 Postprocessing

Keep highlights

Key BossDocumentSource.keepHighlights
Direction Input
Level ADVANCED
Description Determines whether to keep the original query word highlights. By default, Yahoo highlights query words in search results using the <b> HTML tag. Set this attribute to true to keep these highlights.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false

11.10.6 Results filtering

Content filter

Key BossWebSearchService.filter
Direction Input
Level MEDIUM
Description Filters out adult or hate content. Must be a comma-separated list of content types to filter out.

The following content types are supported:

Value   Content
-porn   Filters out adult content
-hate   Filters out hate content

Adult content filtering is supported for all languages; hate content filtering is supported for English only.

Required no
Scope Processing time
Value type org.carrot2.source.boss.BossWebSearchService$OffensiveContentFilter
Default value none
Allowed values
  • PORN  (remove porn)
  • HATE  (remove hate)
  • PORN_AND_HATE  (remove porn and hate)

Domain restriction

Key BossSearchService.sites
Direction Input
Level MEDIUM
Description Restricts search results to a set of sites. Must be a comma-separated list of site domain names, e.g. abc.com,cnn.com.
Required no
Scope Processing time
Value type java.lang.String
Default value en.wikipedia.org

Language and Region

Key BossSearchService.languageAndRegion
Direction Input
Level MEDIUM
Description Restricts search to the specified language and region. Must be a concatenation of constants defined by the language codes supported by the Yahoo Boss API.

The following languages and regions are currently (July 2009) supported:

Country                   Region  Language
Argentina                 ar      es
Austria                   at      de
Australia                 au      en
Brazil                    br      pt
Canada - English          ca      en
Canada - French           ca      fr
Catalan                   ct      ca
Chile                     cl      es
Columbia                  co      es
Czech Republic            cz      cs
Denmark                   dk      da
Finland                   fi      fi
Hong Kong                 hk      tzh
Hungary                   hu      hu
Indonesia - English       id      en
Indonesia - Indonesian    id      id
Israel                    il      he
India                     in      en
Japan                     jp      jp
Korea                     kr      kr
Mexico                    mx      es
Malaysia - English        my      en
Malaysia                  my      ms
Netherlands               nl      nl
Norway                    no      no
New Zealand               nz      en
Peru                      pe      es
Philippines               ph      tl
Philippines - English     ph      en
Russia                    ru      ru
Romania                   ro      ro
Sweden                    se      sv
Singapore                 sg      en
Taiwan                    tw      tzh
Thailand                  th      th
Turkey                    tr      tr
Switzerland - German      ch      de
Switzerland - French      ch      fr
Switzerland - Italian     ch      it
German                    de      de
Spanish                   es      es
French                    fr      fr
Italian                   it      it
United Kingdom            uk      en
United States - English   us      en
United States - Spanish   us      es
Vietnam                   vn      vi
Venezuela                 ve      es

Use BossLanguageCodes.getAttributeValue() to acquire the proper constant for this field.

Required no
Scope Initialization time and Processing time
Value type org.carrot2.source.boss.BossLanguageCodes
Default value none
Allowed values
  • ARGENTINA  (Argentina)
  • AUSTRIA  (Austria)
  • AUSTRALIA  (Australia)
  • BRAZIL  (Brazil)
  • CANADA_ENGLISH  (Canada – English)
  • CANADA_FRENCH  (Canada – French)
  • CATALAN  (Catalan)
  • CHILE  (Chile)
  • COLUMBIA  (Columbia)
  • CZECH_REPUBLIC  (Czech Republic)
  • DENMARK  (Denmark)
  • FINLAND  (Finland)
  • FRENCH  (French)
  • GERMAN  (German)
  • HONG_KONG  (Hong Kong)
  • HUNGARY  (Hungary)
  • INDIA  (India)
  • INDONESIA_ENGLISH  (Indonesia – English)
  • INDONESIA_INDONESIAN  (Indonesia – Indonesian)
  • ISRAEL  (Israel)
  • ITALIAN  (Italian)
  • JAPAN  (Japan)
  • KOREA  (Korea)
  • MALAYSIA_ENGLISH  (Malaysia – English)
  • MALAYSIA_MALAYSIAN  (Malaysia)
  • MEXICO  (Mexico)
  • NETHERLANDS  (Netherlands)
  • NEW_ZEALAND  (New Zealand)
  • NORWAY  (Norway)
  • PERU  (Peru)
  • PHILIPPINES  (Philippines)
  • PHILIPPINES_ENGLISH  (Philippines – English)
  • ROMANIA  (Romania)
  • RUSSIA  (Russia)
  • SINGAPORE  (Singapore)
  • SPANISH  (Spanish)
  • SWEDEN  (Sweden)
  • SWITZERLAND_GERMAN  (Switzerland – German)
  • SWITZERLAND_FRENCH  (Switzerland – French)
  • SWITZERLAND_ITALIAN  (Switzerland – Italian)
  • TAIWAN  (Taiwan)
  • THAILAND  (Thailand)
  • TURKEY  (Turkey)
  • UNITED_KINGDOM  (United Kingdom)
  • UNITED_STATES  (United States – English)
  • UNITED_STATES_SPANISH  (United States – Spanish)
  • VIETNAM  (Vietnam)
  • VENEZUELA  (Venezuela)

Type filter

Key BossWebSearchService.type
Direction Input
Level ADVANCED
Description Restricts search to documents of the specified types. Must be a comma-separated list of the required document types or type groups.

The following document types are supported:

Value    Document type
html     HTML documents
text     Plain text documents
pdf      Portable Document Format documents
xl       Microsoft Excel documents
msword   Microsoft Word documents
ppt      Microsoft PowerPoint documents

The following document type groups are supported:

Value      Document type group
msoffice   All Microsoft Office documents (xl, msword, ppt)
nohtml     Anything other than HTML documents (text, pdf, xl, msword, ppt)

You can also specify a format group and then exclude an item: type=msoffice,-ppt.

Required no
Scope Processing time
Value type java.lang.String
Default value none

11.10.7 Results paging

Search Mode

Key search-mode
Direction Input
Level ADVANCED
Description Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
Required no
Scope Processing time
Value type org.carrot2.source.MultipageSearchEngine$SearchMode
Default value SPECULATIVE
Allowed values
  • CONSERVATIVE  (CONSERVATIVE)
  • SPECULATIVE  (SPECULATIVE)

11.10.8 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

Start Index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0

11.10.9 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.10.10 Service

Application ID

Key BossSearchService.appid
Direction Input
Level ADVANCED
Description Application ID required for BOSS services. Please generate your own ID for production deployments and branches off the Carrot2.org code.
Required no
Scope Initialization time
Value type java.lang.String
Default value txRLTt7V34GgabH9baqIrsnRLuy87i4dQ2kQyok0IIqlUXdw4HmxjE59xhq2_6mT0LM-

Boss Search Service

Key BossDocumentSource.service
Direction Input
Level ADVANCED
Description The specific search service to be used by this document source. Use this attribute to choose which BOSS service to query, e.g. Web, News or Image search.
Required no
Scope Initialization time
Value type org.carrot2.source.boss.BossSearchService
Default value org.carrot2.source.boss.BossWebSearchService
Allowed value types No other assignable value types are allowed.

Service URI

Key BossWebSearchService.serviceURI
Direction Input
Level ADVANCED
Description Boss Web search service URI. Specifies the URI at which the Yahoo Boss Web Search API is available. The ${query} placeholder will be replaced with the URL-encoded text of the processed query.
Required no
Scope Initialization time
Value type java.lang.String
Default value http://boss.yahooapis.com/ysearch/web/v1/${query}

11.11 Yahoo Image Search

Searches web images using the Yahoo Boss Image Search API.

11.11.3 Data source status

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none

Page Requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

Successful Queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

11.11.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system.
Scope Processing time
Value type java.util.Collection
Default value none

11.11.5 Postprocessing

Keep highlights

Key BossDocumentSource.keepHighlights
Direction Input
Level ADVANCED
Description Determines whether to keep the original query word highlights. By default, Yahoo highlights query words in search results using the <b> HTML tag. Set this attribute to true to keep these highlights.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false

11.11.6 Results filtering

Domain restriction

Key BossSearchService.sites
Direction Input
Level MEDIUM
Description Restricts search results to a set of sites. Must be a comma-separated list of site domain names, e.g. abc.com,cnn.com.
Required no
Scope Processing time
Value type java.lang.String
Default value none

Language and Region

Key BossSearchService.languageAndRegion
Direction Input
Level MEDIUM
Description Restricts search to the specified language and region. Must be a concatenation of constants defined by the language codes supported by the Yahoo Boss API.

The following languages and regions are currently (July 2009) supported:

Country                   Region  Language
Argentina                 ar      es
Austria                   at      de
Australia                 au      en
Brazil                    br      pt
Canada - English          ca      en
Canada - French           ca      fr
Catalan                   ct      ca
Chile                     cl      es
Columbia                  co      es
Czech Republic            cz      cs
Denmark                   dk      da
Finland                   fi      fi
Hong Kong                 hk      tzh
Hungary                   hu      hu
Indonesia - English       id      en
Indonesia - Indonesian    id      id
Israel                    il      he
India                     in      en
Japan                     jp      jp
Korea                     kr      kr
Mexico                    mx      es
Malaysia - English        my      en
Malaysia                  my      ms
Netherlands               nl      nl
Norway                    no      no
New Zealand               nz      en
Peru                      pe      es
Philippines               ph      tl
Philippines - English     ph      en
Russia                    ru      ru
Romania                   ro      ro
Sweden                    se      sv
Singapore                 sg      en
Taiwan                    tw      tzh
Thailand                  th      th
Turkey                    tr      tr
Switzerland - German      ch      de
Switzerland - French      ch      fr
Switzerland - Italian     ch      it
German                    de      de
Spanish                   es      es
French                    fr      fr
Italian                   it      it
United Kingdom            uk      en
United States - English   us      en
United States - Spanish   us      es
Vietnam                   vn      vi
Venezuela                 ve      es

Use BossLanguageCodes.getAttributeValue() to acquire the proper constant for this field.

Required no
Scope Initialization time and Processing time
Value type org.carrot2.source.boss.BossLanguageCodes
Default value none
Allowed values
  • ARGENTINA  (Argentina)
  • AUSTRIA  (Austria)
  • AUSTRALIA  (Australia)
  • BRAZIL  (Brazil)
  • CANADA_ENGLISH  (Canada – English)
  • CANADA_FRENCH  (Canada – French)
  • CATALAN  (Catalan)
  • CHILE  (Chile)
  • COLUMBIA  (Columbia)
  • CZECH_REPUBLIC  (Czech Republic)
  • DENMARK  (Denmark)
  • FINLAND  (Finland)
  • FRENCH  (French)
  • GERMAN  (German)
  • HONG_KONG  (Hong Kong)
  • HUNGARY  (Hungary)
  • INDIA  (India)
  • INDONESIA_ENGLISH  (Indonesia – English)
  • INDONESIA_INDONESIAN  (Indonesia – Indonesian)
  • ISRAEL  (Israel)
  • ITALIAN  (Italian)
  • JAPAN  (Japan)
  • KOREA  (Korea)
  • MALAYSIA_ENGLISH  (Malaysia – English)
  • MALAYSIA_MALAYSIAN  (Malaysia)
  • MEXICO  (Mexico)
  • NETHERLANDS  (Netherlands)
  • NEW_ZEALAND  (New Zealand)
  • NORWAY  (Norway)
  • PERU  (Peru)
  • PHILIPPINES  (Philippines)
  • PHILIPPINES_ENGLISH  (Philippines – English)
  • ROMANIA  (Romania)
  • RUSSIA  (Russia)
  • SINGAPORE  (Singapore)
  • SPANISH  (Spanish)
  • SWEDEN  (Sweden)
  • SWITZERLAND_GERMAN  (Switzerland – German)
  • SWITZERLAND_FRENCH  (Switzerland – French)
  • SWITZERLAND_ITALIAN  (Switzerland – Italian)
  • TAIWAN  (Taiwan)
  • THAILAND  (Thailand)
  • TURKEY  (Turkey)
  • UNITED_KINGDOM  (United Kingdom)
  • UNITED_STATES  (United States – English)
  • UNITED_STATES_SPANISH  (United States – Spanish)
  • VIETNAM  (Vietnam)
  • VENEZUELA  (Venezuela)

Offensive content filter

Key BossImageSearchService.filter
Direction Input
Level MEDIUM
Description If enabled, excludes offensive content from the results.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true

Preferred size

Key BossImageSearchService.dimensions
Direction Input
Level MEDIUM
Description The size of images to fetch. Small images are generally thumbnail or icon sized. Medium images are average sized, usually not exceeding an average screen size. Large images are screen size or larger.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.source.boss.Dimensions
Default value ALL
Allowed values
  • ALL  (All)
  • SMALL  (Small)
  • MEDIUM  (Medium)
  • LARGE  (Large)
  • WALLPAPER  (Wallpaper)
  • WIDE_WALLPAPER  (Wide Wallpaper)

11.11.7 Results paging

Search Mode

Key search-mode
Direction Input
Level ADVANCED
Description Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
Required no
Scope Processing time
Value type org.carrot2.source.MultipageSearchEngine$SearchMode
Default value SPECULATIVE
Allowed values
  • CONSERVATIVE  (CONSERVATIVE)
  • SPECULATIVE  (SPECULATIVE)

11.11.8 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

Start Index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0

11.11.9 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.11.10 Service

Application ID

Key BossSearchService.appid
Direction Input
Level ADVANCED
Description Application ID required for BOSS services. Please generate your own ID for production deployments and branches off the Carrot2.org code.
Required no
Scope Initialization time
Value type java.lang.String
Default value txRLTt7V34GgabH9baqIrsnRLuy87i4dQ2kQyok0IIqlUXdw4HmxjE59xhq2_6mT0LM-

Boss Search Service

Key BossDocumentSource.service
Direction Input
Level ADVANCED
Description The specific search service to be used by this document source. Use this attribute to choose which BOSS service to query, e.g. Web, News or Image search.
Required no
Scope Initialization time
Value type org.carrot2.source.boss.BossSearchService
Default value org.carrot2.source.boss.BossImageSearchService
Allowed value types No other assignable value types are allowed.

Service URI

Key BossImageSearchService.serviceURI
Direction Input
Level ADVANCED
Description Boss Image search service URI. Specifies the URI at which the Yahoo Boss Image Search API is available. The ${query} placeholder will be replaced with the URL-encoded text of the processed query.
Required no
Scope Initialization time
Value type java.lang.String
Default value http://boss.yahooapis.com/ysearch/images/v1/${query}

11.12 Yahoo Boss News Search

Searches news using the Yahoo Boss News Search API.

11.12.3 Data source status

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none

Page Requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

Successful Queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

11.12.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system.
Scope Processing time
Value type java.util.Collection
Default value none

11.12.5 Postprocessing

Keep highlights

Key BossDocumentSource.keepHighlights
Direction Input
Level ADVANCED
Description Determines whether to keep the original query word highlights. By default, Yahoo highlights query words in search results using the <b> HTML tag. Set this attribute to true to keep these highlights.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false

11.12.6 Results filtering

Age

Key BossNewsSearchService.age
Direction Input
Level MEDIUM
Description Maximum age of returned news in days. The index keeps stories for 30 days.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 7
Min value 1
Max value 30

Domain restriction

Key BossSearchService.sites
Direction Input
Level MEDIUM
Description Restricts search results to a set of sites. Must be a comma-separated list of site domain names, e.g. abc.com,cnn.com.
Required no
Scope Processing time
Value type java.lang.String
Default value none

Language and Region

Key BossSearchService.languageAndRegion
Direction Input
Level MEDIUM
Description Restricts search to the specified language and region. Must be a concatenation of constants defined by the language codes supported by the Yahoo Boss API.

The following languages and regions are currently (July 2009) supported:

Country                   Region  Language
Argentina                 ar      es
Austria                   at      de
Australia                 au      en
Brazil                    br      pt
Canada - English          ca      en
Canada - French           ca      fr
Catalan                   ct      ca
Chile                     cl      es
Columbia                  co      es
Czech Republic            cz      cs
Denmark                   dk      da
Finland                   fi      fi
Hong Kong                 hk      tzh
Hungary                   hu      hu
Indonesia - English       id      en
Indonesia - Indonesian    id      id
Israel                    il      he
India                     in      en
Japan                     jp      jp
Korea                     kr      kr
Mexico                    mx      es
Malaysia - English        my      en
Malaysia                  my      ms
Netherlands               nl      nl
Norway                    no      no
New Zealand               nz      en
Peru                      pe      es
Philippines               ph      tl
Philippines - English     ph      en
Russia                    ru      ru
Romania                   ro      ro
Sweden                    se      sv
Singapore                 sg      en
Taiwan                    tw      tzh
Thailand                  th      th
Turkey                    tr      tr
Switzerland - German      ch      de
Switzerland - French      ch      fr
Switzerland - Italian     ch      it
German                    de      de
Spanish                   es      es
French                    fr      fr
Italian                   it      it
United Kingdom            uk      en
United States - English   us      en
United States - Spanish   us      es
Vietnam                   vn      vi
Venezuela                 ve      es

Use BossLanguageCodes.getAttributeValue() to acquire the proper constant for this field.

Required no
Scope Initialization time and Processing time
Value type org.carrot2.source.boss.BossLanguageCodes
Default value none
Allowed values
  • ARGENTINA  (Argentina)
  • AUSTRIA  (Austria)
  • AUSTRALIA  (Australia)
  • BRAZIL  (Brazil)
  • CANADA_ENGLISH  (Canada – English)
  • CANADA_FRENCH  (Canada – French)
  • CATALAN  (Catalan)
  • CHILE  (Chile)
  • COLUMBIA  (Columbia)
  • CZECH_REPUBLIC  (Czech Republic)
  • DENMARK  (Denmark)
  • FINLAND  (Finland)
  • FRENCH  (French)
  • GERMAN  (German)
  • HONG_KONG  (Hong Kong)
  • HUNGARY  (Hungary)
  • INDIA  (India)
  • INDONESIA_ENGLISH  (Indonesia – English)
  • INDONESIA_INDONESIAN  (Indonesia – Indonesian)
  • ISRAEL  (Israel)
  • ITALIAN  (Italian)
  • JAPAN  (Japan)
  • KOREA  (Korea)
  • MALAYSIA_ENGLISH  (Malaysia – English)
  • MALAYSIA_MALAYSIAN  (Malaysia)
  • MEXICO  (Mexico)
  • NETHERLANDS  (Netherlands)
  • NEW_ZEALAND  (New Zealand)
  • NORWAY  (Norway)
  • PERU  (Peru)
  • PHILIPPINES  (Philippines)
  • PHILIPPINES_ENGLISH  (Philippines – English)
  • ROMANIA  (Romania)
  • RUSSIA  (Russia)
  • SINGAPORE  (Singapore)
  • SPANISH  (Spanish)
  • SWEDEN  (Sweden)
  • SWITZERLAND_GERMAN  (Switzerland – German)
  • SWITZERLAND_FRENCH  (Switzerland – French)
  • SWITZERLAND_ITALIAN  (Switzerland – Italian)
  • TAIWAN  (Taiwan)
  • THAILAND  (Thailand)
  • TURKEY  (Turkey)
  • UNITED_KINGDOM  (United Kingdom)
  • UNITED_STATES  (United States – English)
  • UNITED_STATES_SPANISH  (United States – Spanish)
  • VIETNAM  (Vietnam)
  • VENEZUELA  (Venezuela)

11.12.7 Results paging

Search Mode

Key search-mode
Direction Input
Level ADVANCED
Description Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
Required no
Scope Processing time
Value type org.carrot2.source.MultipageSearchEngine$SearchMode
Default value SPECULATIVE
Allowed values
  • CONSERVATIVE  (CONSERVATIVE)
  • SPECULATIVE  (SPECULATIVE)

11.12.8 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

Start Index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0

11.12.9 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.12.10 Service

Application ID

Key BossSearchService.appid
Direction Input
Level ADVANCED
Description Application ID required for BOSS services. Please generate your own ID for production deployments and for branches of the Carrot2.org code.
Required no
Scope Initialization time
Value type java.lang.String
Default value txRLTt7V34GgabH9baqIrsnRLuy87i4dQ2kQyok0IIqlUXdw4HmxjE59xhq2_6mT0LM-

Boss Search Service

Key BossDocumentSource.service
Direction Input
Level ADVANCED
Description The specific search service to be used by this document source. Use this attribute to choose which BOSS service to query, e.g. Web, News or Image search.
Required no
Scope Initialization time
Value type org.carrot2.source.boss.BossSearchService
Default value org.carrot2.source.boss.BossNewsSearchService
Allowed value types No other assignable value types are allowed.

Service URI

Key BossNewsSearchService.serviceURI
Direction Input
Level ADVANCED
Description Boss News search service URI.
Required no
Scope Initialization time
Value type java.lang.String
Default value http://boss.yahooapis.com/ysearch/news/v1/${query}

11.13 Jobs from indeed.com

Searches jobs from indeed.com

11.13.1 Jobs from indeed.com input attributes by level

11.13.3 Data source status

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none

Page Requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

Successful Queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

11.13.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system.
Scope Processing time
Value type java.util.Collection
Default value none

11.13.5 Results paging

Search Mode

Key search-mode
Direction Input
Level ADVANCED
Description Search mode defines how fetchers returned from createFetcher(SearchRange) are called.
Required no
Scope Processing time
Value type org.carrot2.source.MultipageSearchEngine$SearchMode
Default value SPECULATIVE
Allowed values
  • CONSERVATIVE  (CONSERVATIVE)
  • SPECULATIVE  (SPECULATIVE)

11.13.6 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

Start Index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0

11.13.7 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.13.8 Service

Feed URL parameters

Key OpenSearchDocumentSource.feedUrlParams
Direction Input
Level ADVANCED
Description Additional parameters to be appended to feedUrlTemplate on each request.
Required no
Scope Initialization time and Processing time
Value type java.util.Map
Default value none

Feed URL template

Key OpenSearchDocumentSource.feedUrlTemplate
Direction Input
Level ADVANCED
Description URL to fetch the search feed from. The URL template can contain variable placeholders, as defined by the OpenSearch specification, that will be replaced at runtime. The format of a placeholder is ${variable}. The following variables are supported:
  • searchTerms will be replaced by the query.
  • startIndex index of the first result to be fetched. Mutually exclusive with startPage.
  • startPage index of the first page of results to be fetched. Mutually exclusive with startIndex.
  • count the number of search results per page.
Required yes
Scope Initialization time
Value type java.lang.String
Default value http://www.indeed.com/opensearch?q=${searchTerms}&start=${startIndex}&limit=${count}
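The placeholder substitution described for feedUrlTemplate can be illustrated with a minimal sketch. It is not the Carrot2 implementation: the class name FeedUrlTemplateSketch and the expand() method are hypothetical, and URL-encoding the values and leaving unknown placeholders untouched are assumptions made for the illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Carrot2 API.
public class FeedUrlTemplateSketch {
    // Matches placeholders of the form ${variable}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([A-Za-z]+)\\}");

    /** Replaces each ${name} with its URL-encoded value; unknown names are left as they are. */
    static String expand(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value == null) ? m.group() : encode(value);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<String, String>();
        values.put("searchTerms", "java developer");
        values.put("startIndex", "0");
        values.put("count", "50");
        // Prints http://www.indeed.com/opensearch?q=java+developer&start=0&limit=50
        System.out.println(expand(
            "http://www.indeed.com/opensearch?q=${searchTerms}&start=${startIndex}&limit=${count}",
            values));
    }
}
```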

Maximum results

Key OpenSearchDocumentSource.maximumResults
Direction Input
Level ADVANCED
Description Maximum number of results. The maximum number of results the document source can deliver.
Required no
Scope Initialization time
Value type java.lang.Integer
Default value 400
Min value 1

Results per page

Key OpenSearchDocumentSource.resultsPerPage
Direction Input
Level ADVANCED
Description Results per page. The number of results per page the document source will expect the feed to return.
Required yes
Scope Initialization time
Value type java.lang.Integer
Default value 50
Min value 1
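How start, results, resultsPerPage and maximumResults might combine into individual page requests can be sketched as below. This is an assumption about the interaction of these attributes, not a description of the actual Carrot2 paging code; PagingSketch and pageStarts() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Carrot2 API.
public class PagingSketch {
    /**
     * Returns the start index of each page request needed to cover the range
     * [start, start + results), capped at maximumResults (assumed behavior).
     */
    static List<Integer> pageStarts(int start, int results, int resultsPerPage, int maximumResults) {
        int end = Math.min(start + results, maximumResults);
        List<Integer> starts = new ArrayList<Integer>();
        for (int i = start; i < end; i += resultsPerPage) {
            starts.add(i);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Defaults from this section: 100 results, 50 per page, 400 maximum.
        System.out.println(pageStarts(0, 100, 50, 400));   // prints [0, 50]
        // A request reaching past the maximum is cut off at 400.
        System.out.println(pageStarts(350, 100, 50, 400)); // prints [350]
    }
}
```

Under these assumptions the size of the returned list corresponds to the number of page requests a single query would issue.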

11.14 XML

XML document source retrieves documents from local XML files or remote XML streams. It can optionally apply an XSLT transformation to convert the XML to the required format.

11.14.1 XML input attributes by level

11.14.3 Documents

Documents

Key documents
Direction Output
Description Documents read from the XML data.
Scope Processing time
Value type java.util.List
Default value none

11.14.4 Search query

Query

Key query
Direction Input and Output
Level BASIC
Description After processing, this field may hold the query read from the XML data, if any. For the semantics of this field on input, see xml.
Required no
Scope Processing time
Value type java.lang.String
Default value none

Read all documents

Key XmlDocumentSource.readAll
Direction Input
Level BASIC
Description If true, all documents are read from the input XML stream, regardless of the limit set by results.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true

Results

Key results
Direction Input
Level BASIC
Description The maximum number of documents to read from the XML data if readAll is false.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

11.14.5 Search request information

Title

Key processing-result.title
Direction Output
Description The title (file name or query attribute, if present) for the search result fetched from the resource. A typical title for a processing result will be the query used to fetch documents from that source. For certain document sources the query may not be needed (on-disk XML, feed of syndicated news); in such cases, the input component should set its title properly for visual interfaces such as the workbench.
Scope Processing time
Value type java.lang.String
Default value none

11.14.6 XML data

XML Parameters

Key XmlDocumentSource.xmlParameters
Direction Input
Level ADVANCED
Description Values for custom placeholders in the XML URL. If the type of resource provided in the xml attribute is URLResourceWithParams, this map provides values for custom placeholders found in the XML URL. Keys of the map correspond to placeholder names, values of the map will be used to replace the placeholders. Please see xml for the placeholder syntax.
Required no
Scope Initialization time and Processing time
Value type java.util.Map
Default value {}

XML Resource

Key XmlDocumentSource.xml
Direction Input
Level BASIC
Description The resource to load XML data from. You can either create instances of IResource implementations directly or use ResourceUtils to look up IResource instances from a variety of locations.

One special IResource implementation you can use is URLResourceWithParams. It allows you to specify attribute placeholders in the URL that will be replaced with actual values at runtime. The placeholder format is ${attribute}. The following common attributes will be substituted:

  • query will be replaced with the current query being processed. If the query has not been provided, this attribute will fall back to an empty string.
  • results will be replaced with the number of results requested. If the number of results has not been provided, this attribute will be substituted with an empty string.

Additionally, custom placeholders can be used. Values for the custom placeholders should be provided in the xmlParameters attribute.

Required yes
Scope Initialization time and Processing time
Value type org.carrot2.util.resource.IResource
Default value none
Allowed value types Other assignable value types are allowed.
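The substitution rules described for the xml attribute (query and results falling back to an empty string, plus custom placeholders taken from xmlParameters) can be sketched as follows. The sketch does not use URLResourceWithParams itself; XmlUrlPlaceholderSketch and substitute() are hypothetical names, and the precedence of the common attributes over custom ones is an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Carrot2 API.
public class XmlUrlPlaceholderSketch {
    /**
     * Replaces ${query}, ${results} and any custom ${name} placeholders in the URL.
     * Missing query or results are substituted with an empty string.
     * Custom values are assumed to be non-null; values are not URL-encoded here.
     */
    static String substitute(String url, String query, Integer results, Map<String, String> custom) {
        Map<String, String> values = new LinkedHashMap<String, String>(custom);
        // Assumption: the common attributes win over custom ones with the same name.
        values.put("query", query == null ? "" : query);
        values.put("results", results == null ? "" : results.toString());

        String out = url;
        for (Map.Entry<String, String> e : values.entrySet()) {
            out = out.replace("${" + e.getKey() + "}", e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // The example.com URL and the "lang" placeholder are made up for illustration.
        String url = "http://example.com/feed?q=${query}&n=${results}&lang=${lang}";
        // Prints http://example.com/feed?q=clustering&n=&lang=en
        System.out.println(substitute(url, "clustering", null,
            Collections.singletonMap("lang", "en")));
    }
}
```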

11.14.7 XML transformation

XSLT Parameters

Key XmlDocumentSource.xsltParameters
Direction Input
Level ADVANCED
Description Parameters to be passed to the XSLT transformer. Keys of the map will be used as parameter names, values of the map as parameter values.
Required no
Scope Initialization time and Processing time
Value type java.util.Map
Default value {}

XSLT Stylesheet

Key XmlDocumentSource.xslt
Direction Input
Level MEDIUM
Description The resource to load the XSLT stylesheet from. The XSLT stylesheet is optional and is useful when the source XML stream does not follow the Carrot2 format. The XSLT transformation will be applied to the source XML stream, and the transformed XML stream will be deserialized into Documents.

The XSLT IResource can be provided both at initialization and at processing time. The stylesheet provided at initialization will be cached for the lifetime of the component, while processing-time stylesheets will be compiled every time processing is requested and will override the initialization-time stylesheet.

To pass additional parameters to the XSLT transformer, use the xsltParameters attribute.

Required no
Scope Initialization time and Processing time
Value type org.carrot2.util.resource.IResource
Default value none
Allowed value types Other assignable value types are allowed.
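The effect of an XSLT stylesheet with parameters can be tried outside Carrot2 using the standard javax.xml.transform API. The sketch below is an illustration only: the input format (items/item), the stylesheet parameter named source and the output element names (searchresult, document, title, snippet, url) are assumptions made for the example; please check the Carrot2 XML format description for the exact target structure.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical stand-alone example, not part of the Carrot2 API.
public class XsltSketch {
    // Converts a made-up <items> format into document elements (assumed target format).
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:param name='source'/>"
      + "<xsl:template match='/items'>"
      + "<searchresult>"
      + "<xsl:for-each select='item'>"
      + "<document>"
      + "<title><xsl:value-of select='@name'/></title>"
      + "<snippet><xsl:value-of select='.'/></snippet>"
      + "<url><xsl:value-of select='concat($source, @id)'/></url>"
      + "</document>"
      + "</xsl:for-each>"
      + "</searchresult>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies STYLESHEET to the XML, passing sourceParam as the 'source' stylesheet parameter. */
    static String transform(String xml, String sourceParam) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            t.setParameter("source", sourceParam); // plays the role of an xsltParameters entry
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<items><item id='1' name='First'>Some text</item></items>";
        System.out.println(transform(xml, "http://example.com/"));
    }
}
```

In this sketch the stylesheet is compiled on every call, which mirrors the processing-time case described above; a cached Templates object would correspond to the initialization-time case.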

11.15 Google Desktop search

Google Desktop document source searches the local instance of Google Desktop.

11.15.1 Google Desktop search input attributes by level

11.15.2 Google Desktop search attributes by direction

11.15.3 Data source status

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none

Page Requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

Successful Queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

11.15.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system.
Scope Processing time
Value type java.util.Collection
Default value none

11.15.5 Postprocessing

Keep highlights

Key GoogleDesktopDocumentSource.keepHighlights
Direction Input
Level ADVANCED
Description Keep query word highlighting. Google by default highlights query words in snippets using the bold HTML tag. Set this attribute to true to keep these highlights.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false
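What dropping the highlights amounts to when keepHighlights is false can be sketched with a single regular expression. This is an illustration under the assumption that highlights are plain b tags; HighlightStripSketch and stripBold() are hypothetical names, not the Carrot2 implementation.

```java
// Hypothetical helper, not part of the Carrot2 API.
public class HighlightStripSketch {
    /** Removes <b> and </b> tags (in any letter case) and leaves all other text untouched. */
    static String stripBold(String snippet) {
        return snippet.replaceAll("(?i)</?b>", "");
    }

    public static void main(String[] args) {
        // Prints: Apache Solr search
        System.out.println(stripBold("Apache <b>Solr</b> search"));
    }
}
```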

11.15.6 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

Start Index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0

11.15.7 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.15.8 Service

Query URL

Key GoogleDesktopDocumentSource.queryUrl
Direction Input
Level ADVANCED
Description Query URL. Installation-specific URL at which the Google Desktop search service is available. On Windows machines, the URL is available at the HKEY_CURRENT_USER\Software\Google\Google Desktop\API\search_url system registry key and Carrot2 will attempt to automatically read the value from the registry when run with Administrator privileges. Please consult the Google Desktop API documentation for further instructions on how to determine the query URL on other systems.
Required no
Scope Processing time
Value type java.lang.String
Default value none

11.16 Solr Search Engine

Solr document source queries an instance of Apache Solr search engine.

11.16.1 Solr Search Engine input attributes by level

11.16.3 Data source status

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none

Page Requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

Successful Queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

11.16.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system.
Scope Processing time
Value type java.util.Collection
Default value none

11.16.5 Index field mapping

Summary field name

Key SolrDocumentSource.solrSummaryFieldName
Direction Input
Level MEDIUM
Description Summary field name. Name of the Solr field that will provide document summary.
Required no
Scope Processing time
Value type java.lang.String
Default value description

Title field name

Key SolrDocumentSource.solrTitleFieldName
Direction Input
Level MEDIUM
Description Title field name. Name of the Solr field that will provide document titles.
Required no
Scope Processing time
Value type java.lang.String
Default value title

URL field name

Key SolrDocumentSource.solrUrlFieldName
Direction Input
Level MEDIUM
Description URL field name. Name of the Solr field that will provide document URLs.
Required no
Scope Processing time
Value type java.lang.String
Default value url

11.16.6 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

Start Index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0

11.16.7 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.16.8 Service

Service URL

Key SolrDocumentSource.serviceUrlBase
Direction Input
Level ADVANCED
Description Solr service URL base.
Required no
Scope Processing time
Value type java.lang.String
Default value http://localhost:8983/solr/select
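To show how the attributes of this section relate to a Solr request, the sketch below builds a select URL from the service URL base, the query, the start index, the number of results and the three mapped field names. The q, start, rows and fl parameters are standard Solr query parameters; whether Carrot2 assembles its request in exactly this way is an assumption, and SolrUrlSketch and selectUrl() are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the Carrot2 API.
public class SolrUrlSketch {
    /** Builds a Solr select URL; only the query text is URL-encoded in this sketch. */
    static String selectUrl(String serviceUrlBase, String query, int start, int results,
                            String titleField, String summaryField, String urlField) {
        try {
            return serviceUrlBase
                + "?q=" + URLEncoder.encode(query, "UTF-8")
                + "&start=" + start
                + "&rows=" + results
                // Ask Solr to return only the fields mapped to title, summary and URL.
                + "&fl=" + titleField + "," + summaryField + "," + urlField;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Uses the default values listed in this section.
        // Prints http://localhost:8983/solr/select?q=data+mining&start=0&rows=100&fl=title,description,url
        System.out.println(selectUrl("http://localhost:8983/solr/select",
            "data mining", 0, 100, "title", "description", "url"));
    }
}
```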

11.17 Ambient Test Set

Serves documents from the Ambient test set. Ambient (AMBIguous ENTries) is a data set designed for evaluating subtopic information retrieval. It consists of 44 topics, each with a set of subtopics and a list of 100 ranked documents. For more information, please see: http://credo.fub.it/ambient.

11.17.1 Ambient Test Set input attributes by level

11.17.3 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system.
Scope Processing time
Value type java.util.List
Default value none

11.17.4 Filtering

Minimum topic size

Key FubDocumentSource.minTopicSize
Direction Input
Level MEDIUM
Description Minimum topic size. Documents belonging to a topic with fewer documents than minimum topic size will not be returned.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1
Min value 1
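The filtering rule behind minTopicSize can be sketched as follows: count the documents per topic, then keep only documents whose topic reaches the threshold. The representation of documents as a map from document identifier to topic identifier, and the names MinTopicSizeSketch and filter(), are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Carrot2 API.
public class MinTopicSizeSketch {
    /**
     * Keeps the identifiers of documents whose topic contains at least minTopicSize documents.
     * topicOf maps a document identifier to its topic identifier.
     */
    static List<String> filter(Map<String, String> topicOf, int minTopicSize) {
        // First pass: count documents per topic.
        Map<String, Integer> sizes = new HashMap<String, Integer>();
        for (String topic : topicOf.values()) {
            Integer n = sizes.get(topic);
            sizes.put(topic, n == null ? 1 : n + 1);
        }
        // Second pass: keep documents from sufficiently large topics, in input order.
        List<String> kept = new ArrayList<String>();
        for (Map.Entry<String, String> e : topicOf.entrySet()) {
            if (sizes.get(e.getValue()) >= minTopicSize) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> topicOf = new LinkedHashMap<String, String>();
        topicOf.put("d1", "1.1");
        topicOf.put("d2", "1.1");
        topicOf.put("d3", "1.2");
        System.out.println(filter(topicOf, 2)); // prints [d1, d2]
    }
}
```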

11.17.5 Search query

Query

Key query
Direction Output
Description Query to perform.
Scope Processing time
Value type java.lang.String
Default value none

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1
Max value 100

11.17.6 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.17.7 Topic ID

Ambient Topic

Key AmbientDocumentSource.topic
Direction Input
Level BASIC
Description Ambient Topic. The Ambient Topic to load documents from.
Required yes
Scope Processing time
Value type org.carrot2.source.ambient.AmbientDocumentSource$AmbientTopic
Default value AIDA
Allowed values
  • AIDA  (Aida)
  • B_52  (B-52)
  • BEAGLE  (Beagle)
  • BRONX  (Bronx)
  • CAIN  (Cain)
  • CAMEL  (Camel)
  • CORAL_SEA  (Coral Sea)
  • CUBE  (Cube)
  • EOS  (Eos)
  • EXCALIBUR  (Excalibur)
  • FAHRENHEIT  (Fahrenheit)
  • GLOBE  (Globe)
  • HORNET  (Hornet)
  • INDIGO  (Indigo)
  • IWO_JIMA  (Iwo Jima)
  • JAGUAR  (Jaguar)
  • LA_PLATA  (La Plata)
  • LABYRINTH  (Labyrinth)
  • LANDAU  (Landau)
  • LIFE_ON_MARS  (Life on Mars)
  • LOCUST  (Locust)
  • MAGIC_MOUNTAIN  (Magic Mountain)
  • MATADOR  (Matador)
  • METAMORPHOSIS  (Metamorphosis)
  • MINOTAUR  (Minotaur)
  • MIRA  (Mira)
  • MIRAGE  (Mirage)
  • MONTE_CARLO  (Monte Carlo)
  • OPPENHEIM  (Oppenheim)
  • OUT_OF_CONTROL  (Out of Control)
  • PELICAN  (Pelican)
  • PURPLE_HAZE  (Purple Haze)
  • RAAM  (Raam)
  • RHEA  (Rhea)
  • SCORPION  (Scorpion)
  • THE_LITTLE_MERMAID  (The Little Mermaid)
  • TORTUGA  (Tortuga)
  • URANIA  (Urania)
  • WINK  (Wink)
  • XANADU  (Xanadu)
  • ZEBRA  (Zebra)
  • ZENITH  (Zenith)
  • ZODIAC  (Zodiac)
  • ZOMBIE  (Zombie)

Topics and subtopics covered in the output documents

Key FubDocumentSource.topicIds
Direction Output
Description Topics and subtopics covered in the output documents. The set is computed for the output documents and it may vary for the same main topic, depending e.g. on the requested number of results or on minTopicSize.
Scope Processing time
Value type java.util.Set
Default value none

11.18 ODP239 Test Set

Serves documents from the ODP239 test set. ODP239 is a data set designed for evaluating subtopic information retrieval. It consists of 239 topics extracted from the Open Directory Project, each with a set of subtopics and a list of about 100 documents. For more information, please see: http://credo.fub.it/odp239.

11.18.1 ODP239 Test Set input attributes by level

11.18.3 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system.
Scope Processing time
Value type java.util.List
Default value none

11.18.4 Filtering

Minimum topic size

Key FubDocumentSource.minTopicSize
Direction Input
Level MEDIUM
Description Minimum topic size. Documents belonging to a topic with fewer documents than minimum topic size will not be returned.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1
Min value 1

11.18.5 Search query

Query

Key query
Direction Output
Description Query to perform.
Scope Processing time
Value type java.lang.String
Default value none

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1000
Min value 1
Max value 1000

11.18.6 Search request information

Total Results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none

11.18.7 Topic ID

ODP239 Topic

Key Odp239DocumentSource.topic
Direction Input
Level BASIC
Description ODP239 Topic. The ODP239 Topic to load documents from.
Required yes
Scope Processing time
Value type org.carrot2.source.ambient.Odp239DocumentSource$Odp239Topic
Default value ARTS_ANIMATION
Allowed values
  • ARTS_ANIMATION  (Arts > Animation)
  • ARTS_ARCHITECTURE  (Arts > Architecture)
  • ARTS_BODYART  (Arts > Bodyart)
  • ARTS_COMICS  (Arts > Comics)
  • ARTS_CRAFTS  (Arts > Crafts)
  • ARTS_EDUCATION  (Arts > Education)
  • ARTS_ILLUSTRATION  (Arts > Illustration)
  • ARTS_LITERATURE  (Arts > Literature)
  • ARTS_MOVIES  (Arts > Movies)
  • ARTS_MUSIC  (Arts > Music)
  • ARTS_ONLINE_WRITING  (Arts > Online Writing)
  • ARTS_PEOPLE  (Arts > People)
  • ARTS_PERFORMING_ARTS  (Arts > Performing Arts)
  • ARTS_PHOTOGRAPHY  (Arts > Photography)
  • ARTS_RADIO  (Arts > Radio)
  • ARTS_TELEVISION  (Arts > Television)
  • ARTS_VIDEO  (Arts > Video)
  • ARTS_VISUAL_ARTS  (Arts > Visual Arts)
  • ARTS_WRITERS_RESOURCES  (Arts > Writers Resources)
  • BUSINESS_AGRICULTURE_AND_FORESTRY  (Business > Agriculture and Forestry)
  • BUSINESS_ARTS_AND_ENTERTAINMENT  (Business > Arts and Entertainment)
  • BUSINESS_AUTOMOTIVE  (Business > Automotive)
  • BUSINESS_BUSINESS_SERVICES  (Business > Business Services)
  • BUSINESS_CHEMICALS  (Business > Chemicals)
  • BUSINESS_CONSTRUCTION_AND_MAINTENANCE  (Business > Construction and Maintenance)
  • BUSINESS_CONSUMER_GOODS_AND_SERVICES  (Business > Consumer Goods and Services)
  • BUSINESS_ECOMMERCE  (Business > E-Commerce)
  • BUSINESS_EDUCATION_AND_TRAINING  (Business > Education and Training)
  • BUSINESS_ELECTRONICS_AND_ELECTRICAL  (Business > Electronics and Electrical)
  • BUSINESS_ENERGY  (Business > Energy)
  • BUSINESS_FINANCIAL_SERVICES  (Business > Financial Services)
  • BUSINESS_FOOD_AND_RELATED_PRODUCTS  (Business > Food and Related Products)
  • BUSINESS_HEALTHCARE  (Business > Healthcare)
  • BUSINESS_HOSPITALITY  (Business > Hospitality)
  • BUSINESS_HUMAN_RESOURCES  (Business > Human Resources)
  • BUSINESS_INDUSTRIAL_GOODS_AND_SERVICES  (Business > Industrial Goods and Services)
  • BUSINESS_INFORMATION_TECHNOLOGY  (Business > Information Technology)
  • BUSINESS_INVESTING  (Business > Investing)
  • BUSINESS_MANAGEMENT  (Business > Management)
  • BUSINESS_MARKETING_AND_ADVERTISING  (Business > Marketing and Advertising)
  • BUSINESS_MATERIALS  (Business > Materials)
  • BUSINESS_OPPORTUNITIES  (Business > Opportunities)
  • BUSINESS_REAL_ESTATE  (Business > Real Estate)
  • BUSINESS_RETAIL_TRADE  (Business > Retail Trade)
  • BUSINESS_SMALL_BUSINESS  (Business > Small Business)
  • BUSINESS_TELECOMMUNICATIONS  (Business > Telecommunications)
  • BUSINESS_TEXTILES_AND_NONWOVENS  (Business > Textiles and Nonwovens)
  • BUSINESS_TRANSPORTATION_AND_LOGISTICS  (Business > Transportation and Logistics)
  • COMPUTERS_ALGORITHMS  (Computers > Algorithms)
  • COMPUTERS_ARTIFICIAL_INTELLIGENCE  (Computers > Artificial Intelligence)
  • COMPUTERS_ARTIFICIAL_LIFE  (Computers > Artificial Life)
  • COMPUTERS_CAD_AND_CAM  (Computers > CAD and CAM)
  • COMPUTERS_COMPANIES  (Computers > Companies)
  • COMPUTERS_COMPUTER_SCIENCE  (Computers > Computer Science)
  • COMPUTERS_CONSULTANTS  (Computers > Consultants)
  • COMPUTERS_DATA_COMMUNICATIONS  (Computers > Data Communications)
  • COMPUTERS_DATA_FORMATS  (Computers > Data Formats)
  • COMPUTERS_EMULATORS  (Computers > Emulators)
  • COMPUTERS_GRAPHICS  (Computers > Graphics)
  • COMPUTERS_HACKING  (Computers > Hacking)
  • COMPUTERS_HARDWARE  (Computers > Hardware)
  • COMPUTERS_INTERNET  (Computers > Internet)
  • COMPUTERS_MOBILE_COMPUTING  (Computers > Mobile Computing)
  • COMPUTERS_MULTIMEDIA  (Computers > Multimedia)
  • COMPUTERS_OPEN_SOURCE  (Computers > Open Source)
  • COMPUTERS_PARALLEL_COMPUTING  (Computers > Parallel Computing)
  • COMPUTERS_PROGRAMMING  (Computers > Programming)
  • COMPUTERS_ROBOTICS  (Computers > Robotics)
  • COMPUTERS_SECURITY  (Computers > Security)
  • COMPUTERS_SOFTWARE  (Computers > Software)
  • COMPUTERS_SPEECH_TECHNOLOGY  (Computers > Speech Technology)
  • COMPUTERS_SYSTEMS  (Computers > Systems)
  • COMPUTERS_USENET  (Computers > Usenet)
  • COMPUTERS_VIRTUAL_REALITY  (Computers > Virtual Reality)
  • GAMES_BOARD_GAMES  (Games > Board Games)
  • GAMES_GAMBLING  (Games > Gambling)
  • GAMES_MINIATURES  (Games > Miniatures)
  • GAMES_ROLEPLAYING  (Games > Roleplaying)
  • GAMES_TRADING_CARD_GAMES  (Games > Trading Card Games)
  • GAMES_VIDEO_GAMES  (Games > Video Games)
  • HEALTH_ALTERNATIVE  (Health > Alternative)
  • HEALTH_ANIMAL  (Health > Animal)
  • HEALTH_BEAUTY  (Health > Beauty)
  • HEALTH_CHILD_HEALTH  (Health > Child Health)
  • HEALTH_CONDITIONS_AND_DISEASES  (Health > Conditions and Diseases)
  • HEALTH_DENTISTRY  (Health > Dentistry)
  • HEALTH_FITNESS  (Health > Fitness)
  • HEALTH_MEDICINE  (Health > Medicine)
  • HEALTH_MENTAL_HEALTH  (Health > Mental Health)
  • HEALTH_NURSING  (Health > Nursing)
  • HEALTH_NUTRITION  (Health > Nutrition)
  • HEALTH_OCCUPATIONAL_HEALTH_AND_SAFETY  (Health > Occupational Health and Safety)
  • HEALTH_PROFESSIONS  (Health > Professions)
  • HEALTH_PUBLIC_HEALTH_AND_SAFETY  (Health > Public Health and Safety)
  • HEALTH_REPRODUCTIVE_HEALTH  (Health > Reproductive Health)
  • HEALTH_SENIOR_HEALTH  (Health > Senior Health)
  • HEALTH_WOMENS_HEALTH  (Health > Women's Health)
  • HOME_CONSUMER_INFORMATION  (Home > Consumer Information)
  • HOME_COOKING  (Home > Cooking)
  • HOME_FAMILY  (Home > Family)
  • HOME_GARDENING  (Home > Gardening)
  • HOME_HOME_IMPROVEMENT  (Home > Home Improvement)
  • HOME_PERSONAL_FINANCE  (Home > Personal Finance)
  • KIDS_AND_TEENS_ARTS  (Kids and Teens > Arts)
  • KIDS_AND_TEENS_ENTERTAINMENT  (Kids and Teens > Entertainment)
  • KIDS_AND_TEENS_GAMES  (Kids and Teens > Games)
  • KIDS_AND_TEENS_HEALTH  (Kids and Teens > Health)
  • KIDS_AND_TEENS_INTERNATIONAL  (Kids and Teens > International)
  • KIDS_AND_TEENS_PEOPLE_AND_SOCIETY  (Kids and Teens > People and Society)
  • KIDS_AND_TEENS_PRESCHOOL  (Kids and Teens > Pre-School)
  • KIDS_AND_TEENS_SCHOOL_TIME  (Kids and Teens > School Time)
  • KIDS_AND_TEENS_SPORTS_AND_HOBBIES  (Kids and Teens > Sports and Hobbies)
  • KIDS_AND_TEENS_TEEN_LIFE  (Kids and Teens > Teen Life)
  • NEWS_MEDIA  (News > Media)
  • NEWS_NEWSPAPERS  (News > Newspapers)
  • NEWS_WEATHER  (News > Weather)
  • RECREATION_ANTIQUES  (Recreation > Antiques)
  • RECREATION_AUDIO  (Recreation > Audio)
  • RECREATION_AUTOS  (Recreation > Autos)
  • RECREATION_AVIATION  (Recreation > Aviation)
  • RECREATION_BIRDING  (Recreation > Birding)
  • RECREATION_BOATING  (Recreation > Boating)
  • RECREATION_CAMPS  (Recreation > Camps)
  • RECREATION_CLIMBING  (Recreation > Climbing)
  • RECREATION_COLLECTING  (Recreation > Collecting)
  • RECREATION_FOOD  (Recreation > Food)
  • RECREATION_GUNS  (Recreation > Guns)
  • RECREATION_HUMOR  (Recreation > Humor)
  • RECREATION_KITES  (Recreation > Kites)
  • RECREATION_LIVING_HISTORY  (Recreation > Living History)
  • RECREATION_MODELS  (Recreation > Models)
  • RECREATION_MOTORCYCLES  (Recreation > Motorcycles)
  • RECREATION_OUTDOORS  (Recreation > Outdoors)
  • RECREATION_PETS  (Recreation > Pets)
  • RECREATION_ROADS_AND_HIGHWAYS  (Recreation > Roads and Highways)
  • RECREATION_SCOUTING  (Recreation > Scouting)
  • RECREATION_THEME_PARKS  (Recreation > Theme Parks)
  • RECREATION_TOBACCO  (Recreation > Tobacco)
  • RECREATION_TRAINS_AND_RAILROADS  (Recreation > Trains and Railroads)
  • REFERENCE_ARCHIVES  (Reference > Archives)
  • REFERENCE_DICTIONARIES  (Reference > Dictionaries)
  • REFERENCE_EDUCATION  (Reference > Education)
  • REFERENCE_KNOWLEDGE_MANAGEMENT  (Reference > Knowledge Management)
  • REFERENCE_LIBRARIES  (Reference > Libraries)
  • REFERENCE_MAPS  (Reference > Maps)
  • REFERENCE_MUSEUMS  (Reference > Museums)
  • REFERENCE_QUOTATIONS  (Reference > Quotations)
  • SCIENCE_AGRICULTURE  (Science > Agriculture)
  • SCIENCE_ANOMALIES_AND_ALTERNATIVE_SCIENCE  (Science > Anomalies and Alternative Science)
  • SCIENCE_ASTRONOMY  (Science > Astronomy)
  • SCIENCE_BIOLOGY  (Science > Biology)
  • SCIENCE_CHEMISTRY  (Science > Chemistry)
  • SCIENCE_EARTH_SCIENCES  (Science > Earth Sciences)
  • SCIENCE_EDUCATIONAL_RESOURCES  (Science > Educational Resources)
  • SCIENCE_ENVIRONMENT  (Science > Environment)
  • SCIENCE_INSTRUMENTS_AND_SUPPLIES  (Science > Instruments and Supplies)
  • SCIENCE_MATH  (Science > Math)
  • SCIENCE_PHYSICS  (Science > Physics)
  • SCIENCE_SCIENCE_IN_SOCIETY  (Science > Science in Society)
  • SCIENCE_SOCIAL_SCIENCES  (Science > Social Sciences)
  • SCIENCE_TECHNOLOGY  (Science > Technology)
  • SHOPPING_ANTIQUES_AND_COLLECTIBLES  (Shopping > Antiques and Collectibles)
  • SHOPPING_AUCTIONS  (Shopping > Auctions)
  • SHOPPING_CHILDREN  (Shopping > Children)
  • SHOPPING_CLASSIFIEDS  (Shopping > Classifieds)
  • SHOPPING_CLOTHING  (Shopping > Clothing)
  • SHOPPING_CONSUMER_ELECTRONICS  (Shopping > Consumer Electronics)
  • SHOPPING_CRAFTS  (Shopping > Crafts)
  • SHOPPING_ENTERTAINMENT  (Shopping > Entertainment)
  • SHOPPING_ETHNIC_AND_REGIONAL  (Shopping > Ethnic and Regional)
  • SHOPPING_FOOD  (Shopping > Food)
  • SHOPPING_GENERAL_MERCHANDISE  (Shopping > General Merchandise)
  • SHOPPING_GIFTS  (Shopping > Gifts)
  • SHOPPING_HEALTH  (Shopping > Health)
  • SHOPPING_HOME_AND_GARDEN  (Shopping > Home and Garden)
  • SHOPPING_JEWELRY  (Shopping > Jewelry)
  • SHOPPING_NICHE  (Shopping > Niche)
  • SHOPPING_PETS  (Shopping > Pets)
  • SHOPPING_PHOTOGRAPHY  (Shopping > Photography)
  • SHOPPING_PUBLICATIONS  (Shopping > Publications)
  • SHOPPING_RECREATION  (Shopping > Recreation)
  • SHOPPING_SPORTS  (Shopping > Sports)
  • SHOPPING_TOOLS  (Shopping > Tools)
  • SHOPPING_TOYS_AND_GAMES  (Shopping > Toys and Games)
  • SHOPPING_VEHICLES  (Shopping > Vehicles)
  • SHOPPING_VISUAL_ARTS  (Shopping > Visual Arts)
  • SOCIETY_ACTIVISM  (Society > Activism)
  • SOCIETY_CRIME  (Society > Crime)
  • SOCIETY_DISABLED  (Society > Disabled)
  • SOCIETY_ETHNICITY  (Society > Ethnicity)
  • SOCIETY_FUTURE  (Society > Future)
  • SOCIETY_GAY_LESBIAN_AND_BISEXUAL  (Society > Gay, Lesbian, and Bisexual)
  • SOCIETY_GENEALOGY  (Society > Genealogy)
  • SOCIETY_GOVERNMENT  (Society > Government)
  • SOCIETY_HISTORY  (Society > History)
  • SOCIETY_HOLIDAYS  (Society > Holidays)
  • SOCIETY_ISSUES  (Society > Issues)
  • SOCIETY_LAW  (Society > Law)
  • SOCIETY_LIFESTYLE_CHOICES  (Society > Lifestyle Choices)
  • SOCIETY_MILITARY  (Society > Military)
  • SOCIETY_ORGANIZATIONS  (Society > Organizations)
  • SOCIETY_PARANORMAL  (Society > Paranormal)
  • SOCIETY_PEOPLE  (Society > People)
  • SOCIETY_PHILANTHROPY  (Society > Philanthropy)
  • SOCIETY_PHILOSOPHY  (Society > Philosophy)
  • SOCIETY_POLITICS  (Society > Politics)
  • SOCIETY_RELATIONSHIPS  (Society > Relationships)
  • SOCIETY_RELIGION_AND_SPIRITUALITY  (Society > Religion and Spirituality)
  • SOCIETY_SEXUALITY  (Society > Sexuality)
  • SOCIETY_SUBCULTURES  (Society > Subcultures)
  • SOCIETY_SUPPORT_GROUPS  (Society > Support Groups)
  • SOCIETY_TRANSGENDERED  (Society > Transgendered)
  • SOCIETY_WORK  (Society > Work)
  • SPORTS_ADVENTURE_RACING  (Sports > Adventure Racing)
  • SPORTS_BASEBALL  (Sports > Baseball)
  • SPORTS_BASKETBALL  (Sports > Basketball)
  • SPORTS_BOWLING  (Sports > Bowling)
  • SPORTS_BOXING  (Sports > Boxing)
  • SPORTS_CHEERLEADING  (Sports > Cheerleading)
  • SPORTS_CRICKET  (Sports > Cricket)
  • SPORTS_CYCLING  (Sports > Cycling)
  • SPORTS_DISABLED  (Sports > Disabled)
  • SPORTS_EQUESTRIAN  (Sports > Equestrian)
  • SPORTS_FANTASY  (Sports > Fantasy)
  • SPORTS_GOLF  (Sports > Golf)
  • SPORTS_HOCKEY  (Sports > Hockey)
  • SPORTS_LACROSSE  (Sports > Lacrosse)
  • SPORTS_MARTIAL_ARTS  (Sports > Martial Arts)
  • SPORTS_MOTORSPORTS  (Sports > Motorsports)
  • SPORTS_PAINTBALL  (Sports > Paintball)
  • SPORTS_RESOURCES  (Sports > Resources)
  • SPORTS_RODEO  (Sports > Rodeo)
  • SPORTS_RUNNING  (Sports > Running)
  • SPORTS_SKATEBOARDING  (Sports > Skateboarding)
  • SPORTS_SOCCER  (Sports > Soccer)
  • SPORTS_TENNIS  (Sports > Tennis)
  • SPORTS_TRACK_AND_FIELD  (Sports > Track and Field)
  • SPORTS_VOLLEYBALL  (Sports > Volleyball)
  • SPORTS_WATER_SPORTS  (Sports > Water Sports)

Topics and subtopics covered in the output documents

Key FubDocumentSource.topicIds
Direction Output
Description Topics and subtopics covered in the output documents. The set is computed from the output documents, so for the same main topic it may vary depending on, for example, the requested number of results or minTopicSize.
Scope Processing time
Value type java.util.Set
Default value none

11.19 PubMed medical database

Searches the PubMed medical abstracts database

11.19.1 PubMed medical database input attributes by level

Advanced

11.19.2 PubMed medical database attributes by direction

11.19.3 Data source status

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none

Page Requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

Successful Queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none

11.19.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system.
Scope Processing time
Value type java.util.Collection
Default value none

11.19.5 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1

Start Index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0
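
The three input attributes above (query, results, start) can be collected in a plain attribute map before processing. The following is a minimal sketch, not part of the Carrot2 API: the class name PubMedInputAttributes and the method build are hypothetical, and only the attribute keys, defaults and constraints are taken from the tables above. Handing the resulting map to a Carrot2 controller is outside the scope of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Carrot2 class. It only mirrors the constraints
// listed in the attribute tables: query must not be blank, results >= 1
// (default 100), start >= 0 (default 0).
final class PubMedInputAttributes {

    // Pass null for results or start to fall back to the documented default.
    static Map<String, Object> build(String query, Integer results, Integer start) {
        if (query == null || query.trim().isEmpty()) {
            throw new IllegalArgumentException("query must not be blank");
        }
        int effectiveResults = (results == null) ? 100 : results;
        int effectiveStart = (start == null) ? 0 : start;
        if (effectiveResults < 1) {
            throw new IllegalArgumentException("results must be at least 1");
        }
        if (effectiveStart < 0) {
            throw new IllegalArgumentException("start must be at least 0");
        }

        Map<String, Object> attributes = new HashMap<>();
        attributes.put("query", query);               // java.lang.String, required
        attributes.put("results", effectiveResults);  // java.lang.Integer
        attributes.put("start", effectiveStart);      // java.lang.Integer, zero-based
        return attributes;
    }

    public static void main(String[] args) {
        // Example query string is arbitrary.
        System.out.println(build("heart attack", 50, null));
    }
}
```

Validating the values up front is a design choice of this sketch; it simply reports constraint violations before any request is issued.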

11.19.6 Search request information

Total Results

Key results-total
Direction Output
DescriptionEstimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none
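
Because start is zero-based and results caps the number of documents fetched per request, the two attributes together with results-total are enough to page through a result set. The sketch below shows the arithmetic only. The class ResultPaging and its methods are hypothetical names introduced for illustration, and since results-total is described as an estimate, the computed page count should be treated as an estimate as well.

```java
// Hypothetical helper, not a Carrot2 class: paging arithmetic over the
// "start", "results" and "results-total" attributes.
final class ResultPaging {

    // Value for the "start" attribute that fetches the given zero-based page.
    static int startIndexForPage(int page, int results) {
        if (page < 0 || results < 1) {
            throw new IllegalArgumentException("page must be >= 0 and results >= 1");
        }
        return Math.multiplyExact(page, results);
    }

    // Number of pages needed to cover resultsTotal documents (rounded up).
    static long pageCount(long resultsTotal, int results) {
        if (resultsTotal < 0 || results < 1) {
            throw new IllegalArgumentException("resultsTotal must be >= 0 and results >= 1");
        }
        return (resultsTotal + results - 1) / results;
    }

    public static void main(String[] args) {
        // With the default of 100 results per request, the third page
        // (page index 2) starts at document index 200.
        System.out.println(startIndexForPage(2, 100));
        // An estimated total of 250 documents needs three requests.
        System.out.println(pageCount(250L, 100));
    }
}
```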