org.carrot2.core
Class Cluster

java.lang.Object
  extended by org.carrot2.core.Cluster

public final class Cluster
extends Object

A cluster (group) of Documents. Each cluster has a human-readable label consisting of one or more phrases, a list of documents it contains and a list of its subclusters. Optionally, additional attributes can be associated with a cluster, e.g. OTHER_TOPICS. This class is not thread-safe.


Field Summary
static Comparator<Cluster> BY_LABEL_COMPARATOR
          Compares clusters by the natural order of their labels as returned by getLabel().
static Comparator<Cluster> BY_REVERSED_SCORE_AND_LABEL_COMPARATOR
          Compares clusters first by their score as returned by SCORE and then by their labels as returned by getLabel().
static Comparator<Cluster> BY_REVERSED_SIZE_AND_LABEL_COMPARATOR
          Compares clusters first by their size as returned by size() and then by their labels as returned by getLabel().
static Comparator<Cluster> BY_SCORE_COMPARATOR
          Compares clusters by score as returned by SCORE.
static Comparator<Cluster> BY_SIZE_COMPARATOR
          Compares clusters by size as returned by size().
static String OTHER_TOPICS
          Indicates that the cluster is an Other Topics cluster.
static Comparator<Cluster> OTHER_TOPICS_AT_THE_END
          A comparator that puts OTHER_TOPICS clusters at the end of the list.
static String SCORE
          Score of this cluster that indicates the clustering algorithm's beliefs on the quality of this cluster.
 
Constructor Summary
Cluster()
          Creates a Cluster with an empty label, no documents and no subclusters.
Cluster(String phrase, Document... documents)
          Creates a Cluster with the provided phrase to be used as the cluster's label and documents contained in the cluster.
 
Method Summary
 Cluster addDocuments(Document... documents)
          Adds documents to this cluster.
 Cluster addDocuments(Iterable<Document> documents)
          Adds documents to this cluster.
 Cluster addPhrases(Iterable<String> phrases)
          Adds phrases to the description of this cluster.
 Cluster addPhrases(String... phrases)
          Adds phrases to the description of this cluster.
 Cluster addSubclusters(Cluster... subclusters)
          Adds subclusters to this cluster.
 Cluster addSubclusters(Iterable<Cluster> clusters)
          Adds subclusters to this cluster.
static void appendOtherTopics(List<Document> allDocuments, List<Cluster> clusters)
          If there are unclustered documents, appends the "Other Topics" group to the clusters.
static void appendOtherTopics(List<Document> allDocuments, List<Cluster> clusters, String label)
          If there are unclustered documents, appends the "Other Topics" group to the clusters.
static void assignClusterIds(Collection<Cluster> clusters)
          Assigns sequential identifiers to the provided clusters (and their sub-clusters).
static Cluster buildOtherTopics(List<Document> allDocuments, List<Cluster> clusters)
          Builds an "Other Topics" cluster that groups those documents from allDocuments that were not referenced in any cluster in clusters.
static Cluster buildOtherTopics(List<Document> allDocuments, List<Cluster> clusters, String label)
          Builds an "Other Topics" cluster that groups those documents from allDocuments that were not referenced in any cluster in clusters.
static Comparator<Cluster> byReversedWeightedScoreAndSizeComparator(double scoreWeight)
          Returns a comparator that compares clusters based on the aggregation of their size and score.
static Cluster find(int id, Collection<Cluster> clusters)
          Locates the first cluster whose identifier equals id.
 List<Document> getAllDocuments()
          Returns all documents contained in this cluster and (recursively) all documents from this cluster's subclusters.
 List<Document> getAllDocuments(Comparator<Document> comparator)
          Returns all documents in this cluster ordered according to the provided comparator.
<T> T
getAttribute(String key)
          Returns the attribute associated with this cluster under the provided key.
 Map<String,Object> getAttributes()
          Returns all attributes of this cluster.
 List<Document> getDocuments()
          Returns all documents contained in this cluster.
 Integer getId()
          Internal identifier of this cluster within the ProcessingResult.
 String getLabel()
          Formats this cluster's label.
 List<String> getPhrases()
          Returns all phrases describing this cluster.
 Double getScore()
          Returns this cluster's "score" field.
 List<Cluster> getSubclusters()
          Returns all subclusters of this cluster.
 boolean isOtherTopics()
          Returns true if this cluster is the OTHER_TOPICS cluster.
<T> Cluster
setAttribute(String key, T value)
          Associates an attribute with this cluster.
 Cluster setOtherTopics(boolean isOtherTopics)
          Sets the OTHER_TOPICS attribute of this cluster.
 Cluster setScore(Double score)
          Sets this cluster's SCORE field.
 int size()
          Returns the size of the cluster calculated as the number of unique documents it contains, including its subclusters.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

OTHER_TOPICS

public static final String OTHER_TOPICS
Indicates that the cluster is an Other Topics cluster. Such a cluster contains documents that remain unclustered at a given level of the cluster hierarchy.

Type of this attribute is Boolean.

See Also:
setAttribute(String, Object), getAttribute(String), Constant Field Values

SCORE

public static final String SCORE
Score of this cluster that indicates the clustering algorithm's beliefs on the quality of this cluster. The exact semantics of the score vary across algorithms.

Type of this attribute is Double.

See Also:
setAttribute(String, Object), getAttribute(String), Constant Field Values

BY_SIZE_COMPARATOR

public static final Comparator<Cluster> BY_SIZE_COMPARATOR
Compares clusters by size as returned by size(). Clusters with more documents are larger.


BY_SCORE_COMPARATOR

public static final Comparator<Cluster> BY_SCORE_COMPARATOR
Compares clusters by score as returned by SCORE. Clusters with a larger score are larger.


BY_LABEL_COMPARATOR

public static final Comparator<Cluster> BY_LABEL_COMPARATOR
Compares clusters by the natural order of their labels as returned by getLabel().


BY_REVERSED_SIZE_AND_LABEL_COMPARATOR

public static final Comparator<Cluster> BY_REVERSED_SIZE_AND_LABEL_COMPARATOR
Compares clusters first by their size as returned by size() and then by their labels as returned by getLabel(). In case of equal sizes, the natural order of the labels decides.

Please note: this is a reversed comparator, so "larger" clusters end up nearer the beginning of the list being sorted (which is usually the order in which the applications want to display clusters).


BY_REVERSED_SCORE_AND_LABEL_COMPARATOR

public static final Comparator<Cluster> BY_REVERSED_SCORE_AND_LABEL_COMPARATOR
Compares clusters first by their score as returned by SCORE and then by their labels as returned by getLabel(). In case of equal scores, the natural order of the labels decides.

Please note: this is a reversed comparator, so "larger" clusters end up nearer the beginning of the list being sorted (which is usually the order in which the applications want to display clusters).


OTHER_TOPICS_AT_THE_END

public static final Comparator<Cluster> OTHER_TOPICS_AT_THE_END
A comparator that puts OTHER_TOPICS clusters at the end of the list. In other words, to this comparator an OTHER_TOPICS cluster is "bigger" than a non-OTHER_TOPICS cluster.

Note: This comparator is designed for use in combination with other comparators, such as BY_REVERSED_SIZE_AND_LABEL_COMPARATOR. If you only need to partition a list of clusters into regular and other topic ones, this is better done in linear time without resorting to Collections.sort(List).
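
The two approaches mentioned in the note can be sketched as follows. This is a self-contained illustration only: Item, OTHER_LAST, sorted and partitioned are hypothetical stand-ins written for this example, not part of the Carrot2 API, and the comparators only mimic the ordering described for OTHER_TOPICS_AT_THE_END and BY_REVERSED_SIZE_AND_LABEL_COMPARATOR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class OtherTopicsOrderingSketch {
    /** Hypothetical stand-in for a cluster; not the Carrot2 class. */
    static final class Item {
        final String label;
        final int size;
        final boolean otherTopics;

        Item(String label, int size, boolean otherTopics) {
            this.label = label;
            this.size = size;
            this.otherTopics = otherTopics;
        }
    }

    /** Regular items compare as smaller than "other topics" items. */
    static final Comparator<Item> OTHER_LAST =
        (a, b) -> Boolean.compare(a.otherTopics, b.otherTopics);

    /** Larger sizes first; ties are broken by the natural order of labels. */
    static final Comparator<Item> BY_REVERSED_SIZE_AND_LABEL = (a, b) -> {
        int bySize = Integer.compare(b.size, a.size);
        return bySize != 0 ? bySize : a.label.compareTo(b.label);
    };

    /** Sorts a copy: "other topics" items last, the rest by size and label. */
    static List<Item> sorted(List<Item> items) {
        List<Item> copy = new ArrayList<>(items);
        copy.sort(OTHER_LAST.thenComparing(BY_REVERSED_SIZE_AND_LABEL));
        return copy;
    }

    /** Linear-time partition; input order is kept within each group. */
    static List<Item> partitioned(List<Item> items) {
        List<Item> regular = new ArrayList<>();
        List<Item> other = new ArrayList<>();
        for (Item item : items) {
            (item.otherTopics ? other : regular).add(item);
        }
        regular.addAll(other);
        return regular;
    }

    /** Extracts labels so that an ordering is easy to inspect. */
    static List<String> labels(List<Item> items) {
        List<String> result = new ArrayList<>();
        for (Item item : items) {
            result.add(item.label);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
            new Item("Other Topics", 10, true),
            new Item("B", 3, false),
            new Item("A", 3, false),
            new Item("C", 5, false));
        System.out.println(labels(sorted(items)));
        System.out.println(labels(partitioned(items)));
    }
}
```

With the real class, the equivalent composition would presumably chain OTHER_TOPICS_AT_THE_END with one of the other comparators before sorting; the partition variant avoids sorting altogether when only the regular/other split matters.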

Constructor Detail

Cluster

public Cluster()
Creates a Cluster with an empty label, no documents and no subclusters.


Cluster

public Cluster(String phrase,
               Document... documents)
Creates a Cluster with the provided phrase to be used as the cluster's label and documents contained in the cluster.

Parameters:
phrase - the phrase to form the cluster's label
documents - documents contained in the cluster
Method Detail

getLabel

public String getLabel()
Formats this cluster's label. If there is more than one phrase describing this cluster, phrases will be separated by a comma followed by a space, e.g. "Phrase one, Phrase two". To format a multi-phrase label in a different way, use getPhrases().

Returns:
formatted label of this cluster
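
The formatting rule above amounts to joining the phrases with ", ". The sketch below is a minimal illustration on a plain list of strings; formatLabel and formatLabelWith are hypothetical helpers written for this example, standing in for getLabel() and for custom formatting based on getPhrases().

```java
import java.util.Arrays;
import java.util.List;

final class LabelFormatSketch {
    /** Joins phrases with a comma followed by a space. */
    static String formatLabel(List<String> phrases) {
        return String.join(", ", phrases);
    }

    /** Alternative formatting built from the phrase list with a custom separator. */
    static String formatLabelWith(String separator, List<String> phrases) {
        return String.join(separator, phrases);
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("Phrase one", "Phrase two");
        System.out.println(formatLabel(phrases));
        System.out.println(formatLabelWith(" / ", phrases));
    }
}
```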

getPhrases

public List<String> getPhrases()
Returns all phrases describing this cluster. The returned list is unmodifiable.

Returns:
phrases describing this cluster

getSubclusters

public List<Cluster> getSubclusters()
Returns all subclusters of this cluster. The returned list is unmodifiable.

Returns:
subclusters of this cluster

getDocuments

public List<Document> getDocuments()
Returns all documents contained in this cluster. The returned list is unmodifiable.

Returns:
documents contained in this cluster

getAllDocuments

public List<Document> getAllDocuments()
Returns all documents contained in this cluster and (recursively) all documents from this cluster's subclusters. The returned list contains unique documents, i.e. if a document is attached to multiple subclusters of this cluster, the document will appear only once on the list. The documents are enumerated in breadth-first order, i.e. first come documents returned by getDocuments() and then documents from subclusters.

Returns:
all documents from this cluster and its subclusters
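
The enumeration described above (unique documents, breadth-first) can be sketched as below. This is an illustration of the documented contract, not the library's implementation: Node is a hypothetical stand-in for Cluster, and documents are represented as plain strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class AllDocumentsSketch {
    /** Hypothetical stand-in for a cluster; documents are plain strings here. */
    static final class Node {
        final List<String> documents = new ArrayList<>();
        final List<Node> subclusters = new ArrayList<>();

        Node(String... docs) {
            documents.addAll(Arrays.asList(docs));
        }

        Node add(Node... children) {
            subclusters.addAll(Arrays.asList(children));
            return this;
        }
    }

    /** Collects unique documents level by level, starting with the root's own documents. */
    static List<String> allDocuments(Node root) {
        Set<String> seen = new LinkedHashSet<>(); // keeps first-seen order, drops repeats
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            seen.addAll(current.documents);
            queue.addAll(current.subclusters);
        }
        return new ArrayList<>(seen);
    }

    /** Size in the sense of size(): the number of unique documents, subclusters included. */
    static int size(Node root) {
        return allDocuments(root).size();
    }

    public static void main(String[] args) {
        Node root = new Node("a", "b").add(
            new Node("b", "c"),
            new Node("c", "d").add(new Node("e")));
        System.out.println(allDocuments(root));
        System.out.println(size(root));
    }
}
```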

getAllDocuments

public List<Document> getAllDocuments(Comparator<Document> comparator)
Returns all documents in this cluster ordered according to the provided comparator. See Document for common comparators, e.g. Document.BY_ID_COMPARATOR.


addPhrases

public Cluster addPhrases(String... phrases)
Adds phrases to the description of this cluster.

Parameters:
phrases - to be added to the description of this cluster
Returns:
this cluster for convenience

addPhrases

public Cluster addPhrases(Iterable<String> phrases)
Adds phrases to the description of this cluster.

Parameters:
phrases - to be added to the description of this cluster
Returns:
this cluster for convenience

addDocuments

public Cluster addDocuments(Document... documents)
Adds documents to this cluster.

Parameters:
documents - to be added to this cluster
Returns:
this cluster for convenience

addDocuments

public Cluster addDocuments(Iterable<Document> documents)
Adds documents to this cluster.

Parameters:
documents - to be added to this cluster
Returns:
this cluster for convenience

addSubclusters

public Cluster addSubclusters(Cluster... subclusters)
Adds subclusters to this cluster.

Parameters:
subclusters - to be added to this cluster
Returns:
this cluster for convenience

addSubclusters

public Cluster addSubclusters(Iterable<Cluster> clusters)
Adds subclusters to this cluster.

Parameters:
clusters - to be added to this cluster
Returns:
this cluster for convenience

getScore

public Double getScore()
Returns this cluster's "score" field.


setScore

public Cluster setScore(Double score)
Sets this cluster's SCORE field.

Parameters:
score - score to set
Returns:
this cluster for convenience

getAttribute

public <T> T getAttribute(String key)
Returns the attribute associated with this cluster under the provided key. If there is no attribute under the provided key, null will be returned.

Parameters:
key - of the attribute
Returns:
attribute value or null

setAttribute

public <T> Cluster setAttribute(String key,
                                T value)
Associates an attribute with this cluster.

Parameters:
key - for the attribute
value - for the attribute
Returns:
this cluster for convenience

getAttributes

public Map<String,Object> getAttributes()
Returns all attributes of this cluster. The returned map is unmodifiable.

Returns:
all attributes of this cluster
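
The attribute accessors above follow a common pattern: a generic getter that casts to the caller's expected type, a chaining setter, and an unmodifiable view of the underlying map. The sketch below illustrates that pattern in isolation. AttributeHolderSketch and SCORE_KEY are hypothetical names for this example; the actual string value of the SCORE constant is not given in this document.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class AttributeHolderSketch {
    /** Hypothetical key used only in this example. */
    static final String SCORE_KEY = "score";

    private final Map<String, Object> attributes = new HashMap<>();
    private final Map<String, Object> attributesView =
        Collections.unmodifiableMap(attributes);

    /** Stores the value and returns this holder so that calls can be chained. */
    <T> AttributeHolderSketch setAttribute(String key, T value) {
        attributes.put(key, value);
        return this;
    }

    /** Returns the value under the key, or null if there is none. */
    @SuppressWarnings("unchecked")
    <T> T getAttribute(String key) {
        return (T) attributes.get(key);
    }

    /** Read-only view; attempts to modify it throw UnsupportedOperationException. */
    Map<String, Object> getAttributes() {
        return attributesView;
    }

    public static void main(String[] args) {
        AttributeHolderSketch holder =
            new AttributeHolderSketch().setAttribute(SCORE_KEY, 0.75);
        Double score = holder.getAttribute(SCORE_KEY);
        Object missing = holder.getAttribute("no-such-key");
        System.out.println(score + " " + missing);
    }
}
```

Because the cast in a getter of this shape is unchecked, assigning the result to a variable of the wrong type fails only at the call site with a ClassCastException, so callers need to know the documented type of each attribute (Boolean for OTHER_TOPICS, Double for SCORE).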

size

public int size()
Returns the size of the cluster calculated as the number of unique documents it contains, including its subclusters.

Returns:
size of the cluster

getId

public Integer getId()
Internal identifier of this cluster within the ProcessingResult. This identifier is assigned dynamically after clusters are passed to ProcessingResult.

See Also:
ProcessingResult

isOtherTopics

public boolean isOtherTopics()
Returns true if this cluster is the OTHER_TOPICS cluster.


setOtherTopics

public Cluster setOtherTopics(boolean isOtherTopics)
Sets the OTHER_TOPICS attribute of this cluster.

Parameters:
isOtherTopics - if true, this cluster will be marked as an Other Topics cluster.
Returns:
this cluster for convenience

byReversedWeightedScoreAndSizeComparator

public static Comparator<Cluster> byReversedWeightedScoreAndSizeComparator(double scoreWeight)
Returns a comparator that compares clusters based on the aggregation of their size and score. If scoreWeight is 0.0, the order depends only on cluster sizes. If scoreWeight is 1.0, the order depends only on cluster scores. For scoreWeight values between 0.0 and 1.0, the higher the scoreWeight, the more cluster scores contribute to the order. In case of a tie on the aggregated cluster size and score, clusters are compared by the natural order of their labels.

Please note: this is a reversed comparator, so "larger" clusters end up nearer the beginning of the list being sorted (which is usually the order in which the applications want to display clusters).
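
This document does not state the aggregation formula. One aggregation that satisfies the description is a weighted geometric mean, size^(1 - scoreWeight) * score^scoreWeight, which reduces to the size at 0.0 and to the score at 1.0. The sketch below uses that formula as an assumption. Item and the range check on scoreWeight are hypothetical additions for this example, and scores are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class WeightedOrderingSketch {
    /** Hypothetical stand-in for a cluster; not the Carrot2 class. */
    static final class Item {
        final String label;
        final int size;
        final double score;

        Item(String label, int size, double score) {
            this.label = label;
            this.size = size;
            this.score = score;
        }
    }

    /** Assumed aggregation: weighted geometric mean of size and score. */
    static double aggregate(Item item, double scoreWeight) {
        return Math.pow(item.size, 1.0 - scoreWeight) * Math.pow(item.score, scoreWeight);
    }

    /** Reversed comparator: larger aggregates first, ties broken by label. */
    static Comparator<Item> byReversedWeightedScoreAndSize(double scoreWeight) {
        // The range check is an assumption of this sketch.
        if (scoreWeight < 0.0 || scoreWeight > 1.0) {
            throw new IllegalArgumentException("scoreWeight must be between 0.0 and 1.0");
        }
        return (a, b) -> {
            int byAggregate =
                Double.compare(aggregate(b, scoreWeight), aggregate(a, scoreWeight));
            return byAggregate != 0 ? byAggregate : a.label.compareTo(b.label);
        };
    }

    /** Returns the labels of the items after sorting a copy with the given weight. */
    static List<String> sortedLabels(List<Item> items, double scoreWeight) {
        List<Item> copy = new ArrayList<>(items);
        copy.sort(byReversedWeightedScoreAndSize(scoreWeight));
        List<String> labels = new ArrayList<>();
        for (Item item : copy) {
            labels.add(item.label);
        }
        return labels;
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
            new Item("C", 2, 5.0),
            new Item("A", 10, 1.0),
            new Item("B", 2, 5.0));
        System.out.println(sortedLabels(items, 0.0)); // ordered by size only
        System.out.println(sortedLabels(items, 1.0)); // ordered by score only
    }
}
```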


assignClusterIds

public static void assignClusterIds(Collection<Cluster> clusters)
Assigns sequential identifiers to the provided clusters (and their sub-clusters). If a cluster already has an identifier, the identifier will not be changed.

Parameters:
clusters - Clusters to assign identifiers to.
Throws:
IllegalArgumentException - if the provided clusters contain non-unique identifiers
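
One way to satisfy this contract is sketched below: flatten the hierarchy, verify that existing identifiers are unique, then number the remaining clusters upward from the largest existing identifier. The numbering policy (starting at 0, continuing after the maximum) is an assumption of this sketch, and Node is a hypothetical stand-in for Cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ClusterIdSketch {
    /** Hypothetical stand-in for a cluster with an optional identifier. */
    static final class Node {
        Integer id; // null until an identifier is assigned
        final List<Node> subclusters = new ArrayList<>();

        Node(Integer id, Node... children) {
            this.id = id;
            subclusters.addAll(Arrays.asList(children));
        }
    }

    /** Assigns identifiers to nodes that lack one; existing identifiers are kept. */
    static void assignIds(Collection<Node> nodes) {
        List<Node> flattened = new ArrayList<>();
        flatten(nodes, flattened);

        // First pass: check uniqueness and find the largest existing identifier.
        Set<Integer> used = new HashSet<>();
        int maxId = -1;
        for (Node node : flattened) {
            if (node.id != null) {
                if (!used.add(node.id)) {
                    throw new IllegalArgumentException("Non-unique identifier: " + node.id);
                }
                maxId = Math.max(maxId, node.id);
            }
        }

        // Second pass: number the remaining nodes sequentially.
        for (Node node : flattened) {
            if (node.id == null) {
                node.id = ++maxId;
            }
        }
    }

    /** Depth-first flattening of nodes and their subclusters. */
    private static void flatten(Collection<Node> nodes, List<Node> out) {
        for (Node node : nodes) {
            out.add(node);
            flatten(node.subclusters, out);
        }
    }

    public static void main(String[] args) {
        Node a = new Node(null, new Node(7));
        Node b = new Node(null);
        assignIds(Arrays.asList(a, b));
        System.out.println(a.id + " " + a.subclusters.get(0).id + " " + b.id);
    }
}
```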

find

public static Cluster find(int id,
                           Collection<Cluster> clusters)
Locates the first cluster whose identifier equals id. The search covers all clusters in the input and their sub-clusters. Returns the first cluster with a matching identifier, or null if no such cluster is found.


buildOtherTopics

public static Cluster buildOtherTopics(List<Document> allDocuments,
                                       List<Cluster> clusters)
Builds an "Other Topics" cluster that groups those documents from allDocuments that were not referenced in any cluster in clusters.

Parameters:
allDocuments - all documents to check against
clusters - list of clusters with assigned documents
Returns:
the "Other Topics" cluster

buildOtherTopics

public static Cluster buildOtherTopics(List<Document> allDocuments,
                                       List<Cluster> clusters,
                                       String label)
Builds an "Other Topics" cluster that groups those documents from allDocuments that were not referenced in any cluster in clusters.

Parameters:
allDocuments - all documents to check against
clusters - list of clusters with assigned documents
label - label for the "Other Topics" group
Returns:
the "Other Topics" cluster

appendOtherTopics

public static void appendOtherTopics(List<Document> allDocuments,
                                     List<Cluster> clusters)
If there are unclustered documents, appends the "Other Topics" group to the clusters.

See Also:
buildOtherTopics(List, List)

appendOtherTopics

public static void appendOtherTopics(List<Document> allDocuments,
                                     List<Cluster> clusters,
                                     String label)
If there are unclustered documents, appends the "Other Topics" group to the clusters.

See Also:
buildOtherTopics(List, List, String)
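
The relationship between buildOtherTopics and appendOtherTopics can be sketched as follows. This is an illustration of the documented behavior only: Group is a hypothetical stand-in for Cluster, documents are plain strings, and clusters are assumed to be flat (no subclusters) to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OtherTopicsSketch {
    /** Hypothetical stand-in for a flat cluster of string documents. */
    static final class Group {
        final String label;
        final List<String> documents;
        final boolean otherTopics;

        Group(String label, List<String> documents, boolean otherTopics) {
            this.label = label;
            this.documents = documents;
            this.otherTopics = otherTopics;
        }
    }

    /** Builds a group holding the documents that no group in groups references. */
    static Group buildOtherTopics(List<String> allDocuments, List<Group> groups, String label) {
        Set<String> clustered = new HashSet<>();
        for (Group group : groups) {
            clustered.addAll(group.documents);
        }
        List<String> unclustered = new ArrayList<>();
        for (String document : allDocuments) {
            if (!clustered.contains(document)) {
                unclustered.add(document);
            }
        }
        return new Group(label, unclustered, true);
    }

    /** Appends the "other topics" group only if some documents are unclustered. */
    static void appendOtherTopics(List<String> allDocuments, List<Group> groups, String label) {
        Group other = buildOtherTopics(allDocuments, groups, label);
        if (!other.documents.isEmpty()) {
            groups.add(other);
        }
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("d1", "d2", "d3", "d4");
        List<Group> groups = new ArrayList<>(Arrays.asList(
            new Group("A", Arrays.asList("d1", "d2"), false),
            new Group("B", Arrays.asList("d2", "d3"), false)));
        appendOtherTopics(all, groups, "Other Topics");
        Group last = groups.get(groups.size() - 1);
        System.out.println(last.label + " " + last.documents);
    }
}
```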


Copyright (c) Dawid Weiss, Stanislaw Osinski