org.carrot2.core
Class Document

java.lang.Object
  extended by org.carrot2.core.Document

public final class Document
extends Object

A document to be processed by the framework. Each document is a collection of fields carrying different bits of information, e.g. TITLE or CONTENT_URL.


Nested Class Summary
static class Document.DocumentToId
          Transforms a Document to its identifier returned by getId().
 
Field Summary
static Comparator<Document> BY_ID_COMPARATOR
          Compares Documents by their identifiers getId(), which effectively gives the original order in which they were returned by the document source.
static String CLICK_URL
          Click URL.
static String CONTENT_URL
          Field name for a URL pointing to the full version of the document.
static String LANGUAGE
          Field name for the language in which the document is written.
static String PARTITIONS
          Identifiers of reference clustering partitions this document belongs to.
static String SIZE
          Document size.
static String SOURCES
          Field name for a list of sources the document was found in.
static String SUMMARY
          Field name for a short summary of the document, e.g. the snippet returned by the search engine.
static String THUMBNAIL_URL
          Field name for a URL pointing to the thumbnail image associated with the document.
static String TITLE
          Field name for the title of the document.
 
Constructor Summary
Document()
          Creates an empty document with no fields.
Document(String title)
          Creates a document with the provided title.
Document(String title, String summary)
          Creates a document with the provided title and summary.
Document(String title, String summary, LanguageCode language)
          Creates a document with the provided title, summary and language.
Document(String title, String summary, String contentUrl)
          Creates a document with the provided title, summary and contentUrl.
Document(String title, String summary, String contentUrl, LanguageCode language)
          Creates a document with the provided title, summary, contentUrl and language.
 
Method Summary
static void assignDocumentIds(Collection<Document> documents)
          Assigns sequential identifiers to the provided documents.
 String getContentUrl()
          Returns this document's CONTENT_URL field.
 <T> T getField(String name)
          Returns the value of the specified field of this document.
 Map<String,Object> getFields()
          Returns all fields of this document.
 Integer getId()
          A unique identifier of this document.
 LanguageCode getLanguage()
          Returns this document's LANGUAGE.
 List<String> getSources()
          Returns this document's SOURCES field.
 String getSummary()
          Returns this document's SUMMARY field.
 String getTitle()
          Returns this document's TITLE field.
 Document setContentUrl(String contentUrl)
          Sets this document's CONTENT_URL field.
 Document setField(String name, Object value)
          Sets a field in this document.
 Document setLanguage(LanguageCode language)
          Sets this document's LANGUAGE.
 Document setSources(List<String> sources)
          Sets this document's SOURCES field.
 Document setSummary(String summary)
          Sets this document's SUMMARY field.
 Document setTitle(String title)
          Sets this document's TITLE field.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TITLE

public static final String TITLE
Field name for the title of the document.

See Also:
Constant Field Values

SUMMARY

public static final String SUMMARY
Field name for a short summary of the document, e.g. the snippet returned by the search engine.

See Also:
Constant Field Values

CONTENT_URL

public static final String CONTENT_URL
Field name for a URL pointing to the full version of the document.

See Also:
Constant Field Values

CLICK_URL

public static final String CLICK_URL
Click URL. The URL that should be placed in the anchor to the document instead of the value returned in CONTENT_URL.

See Also:
Constant Field Values
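
One way a caller might choose the URL for a link is sketched below. A plain map stands in for the document's fields, the key strings are placeholders rather than the real constant values, and the fallback to CONTENT_URL when no click URL is set is an assumption of this sketch, not documented behavior.

```java
import java.util.HashMap;
import java.util.Map;

class AnchorUrlSketch {
    // Placeholder keys; the real values of Document.CLICK_URL and
    // Document.CONTENT_URL are listed under "Constant Field Values".
    static final String CLICK_URL = "click-url";
    static final String CONTENT_URL = "url";

    /** Returns the click URL when present, otherwise the content URL (may be null). */
    static String anchorUrl(Map<String, Object> fields) {
        Object click = fields.get(CLICK_URL);
        return (String) (click != null ? click : fields.get(CONTENT_URL));
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new HashMap<>();
        fields.put(CONTENT_URL, "http://example.com/doc");
        System.out.println(anchorUrl(fields));

        // Once a click URL is set, it takes precedence in the anchor.
        fields.put(CLICK_URL, "http://example.com/redirect?to=doc");
        System.out.println(anchorUrl(fields));
    }
}
```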

THUMBNAIL_URL

public static final String THUMBNAIL_URL
Field name for a URL pointing to the thumbnail image associated with the document.

See Also:
Constant Field Values

SIZE

public static final String SIZE
Document size.

See Also:
Constant Field Values

SOURCES

public static final String SOURCES
Field name for a list of sources the document was found in. Value type: List<String>

See Also:
Constant Field Values

LANGUAGE

public static final String LANGUAGE
Field name for the language in which the document is written. Value type: LanguageCode. If the language field is not defined or is null, it means the language of the document is unknown or it is outside of the list defined in LanguageCode.

See Also:
Constant Field Values

PARTITIONS

public static final String PARTITIONS
Identifiers of reference clustering partitions this document belongs to. Currently, this field is used only to calculate various clustering quality metrics. In the future, clustering algorithms may be able to use values of this field to increase the quality of clustering.

Value type: Collection<Object>. There is no constraint on the actual type of the partition identifier in the collection. Identifiers are assumed to correctly implement the Object.equals(Object) and Object.hashCode() methods.

See Also:
Constant Field Values
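
The equals/hashCode requirement matters as soon as partition identifiers are used as keys in a hash-based collection. The following sketch groups document positions by partition identifier; it works on plain collections rather than on Document instances, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PartitionSketch {
    /**
     * Maps each partition identifier to the positions of the documents that
     * carry it. Correct grouping depends on the identifiers' equals/hashCode.
     */
    static Map<Object, List<Integer>> groupByPartition(List<Collection<Object>> partitionsPerDocument) {
        Map<Object, List<Integer>> groups = new HashMap<>();
        for (int i = 0; i < partitionsPerDocument.size(); i++) {
            for (Object id : partitionsPerDocument.get(i)) {
                groups.computeIfAbsent(id, k -> new ArrayList<>()).add(i);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Collection<Object>> partitions = new ArrayList<>();
        // Identifier types may be mixed; here a String and an Integer.
        partitions.add(Arrays.<Object>asList("sports", 7));
        partitions.add(Arrays.<Object>asList("sports"));
        System.out.println(groupByPartition(partitions));
    }
}
```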

BY_ID_COMPARATOR

public static final Comparator<Document> BY_ID_COMPARATOR
Compares Documents by their identifiers getId(), which effectively gives the original order in which they were returned by the document source.
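
A rough illustration of ordering by identifier follows. The Doc class is a hypothetical stand-in for Document, and placing documents without an identifier first is an assumption of this sketch; this reference does not state how the real comparator treats null identifiers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ByIdOrderSketch {
    /** Hypothetical stand-in for Document: only an id and a title. */
    static final class Doc {
        final Integer id;
        final String title;

        Doc(Integer id, String title) {
            this.id = id;
            this.title = title;
        }

        Integer getId() {
            return id;
        }
    }

    // Orders by identifier; null identifiers sort first (an assumption).
    static final Comparator<Doc> BY_ID =
        Comparator.comparing(Doc::getId, Comparator.nullsFirst(Comparator.<Integer>naturalOrder()));

    /** Returns titles in identifier order without modifying the input list. */
    static List<String> titlesInSourceOrder(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(BY_ID);
        List<String> titles = new ArrayList<>();
        for (Doc d : sorted) {
            titles.add(d.title);
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Doc> shuffled = Arrays.asList(new Doc(2, "third"), new Doc(0, "first"), new Doc(1, "second"));
        System.out.println(titlesInSourceOrder(shuffled));
    }
}
```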

Constructor Detail

Document

public Document()
Creates an empty document with no fields.


Document

public Document(String title)
Creates a document with the provided title.


Document

public Document(String title,
                String summary)
Creates a document with the provided title and summary.


Document

public Document(String title,
                String summary,
                LanguageCode language)
Creates a document with the provided title, summary and language.


Document

public Document(String title,
                String summary,
                String contentUrl)
Creates a document with the provided title, summary and contentUrl.


Document

public Document(String title,
                String summary,
                String contentUrl,
                LanguageCode language)
Creates a document with the provided title, summary, contentUrl and language.

Method Detail

getId

public Integer getId()
A unique identifier of this document. The identifiers are assigned to documents before processing finishes. Note that two documents with equal contents will be assigned different identifiers.

Returns:
unique identifier of this document

getTitle

public String getTitle()
Returns this document's TITLE field.


setTitle

public Document setTitle(String title)
Sets this document's TITLE field.

Parameters:
title - title to set
Returns:
this document for convenience
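
Because the setters return the document itself, calls can be chained. The sketch below shows the pattern on a hypothetical stand-in class; its field keys are placeholders and it is not the Carrot2 implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FluentSetterSketch {
    /** Hypothetical stand-in mirroring the "returns this document" convention. */
    static final class Doc {
        private final Map<String, Object> fields = new LinkedHashMap<>();

        Doc setTitle(String title) {
            fields.put("title", title);
            return this; // returning this is what makes chaining possible
        }

        Doc setSummary(String summary) {
            fields.put("summary", summary);
            return this;
        }

        Doc setContentUrl(String contentUrl) {
            fields.put("url", contentUrl);
            return this;
        }

        Map<String, Object> getFields() {
            return fields;
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc()
            .setTitle("Example title")
            .setSummary("Example summary")
            .setContentUrl("http://example.com/doc");
        System.out.println(doc.getFields());
    }
}
```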

getSummary

public String getSummary()
Returns this document's SUMMARY field.


setSummary

public Document setSummary(String summary)
Sets this document's SUMMARY field.

Parameters:
summary - summary to set
Returns:
this document for convenience

getContentUrl

public String getContentUrl()
Returns this document's CONTENT_URL field.


setContentUrl

public Document setContentUrl(String contentUrl)
Sets this document's CONTENT_URL field.

Parameters:
contentUrl - content URL to set
Returns:
this document for convenience

getSources

public List<String> getSources()
Returns this document's SOURCES field.


setSources

public Document setSources(List<String> sources)
Sets this document's SOURCES field.

Parameters:
sources - the sources list to set
Returns:
this document for convenience

getLanguage

public LanguageCode getLanguage()
Returns this document's LANGUAGE.


setLanguage

public Document setLanguage(LanguageCode language)
Sets this document's LANGUAGE.

Parameters:
language - the language to set
Returns:
this document for convenience

getFields

public Map<String,Object> getFields()
Returns all fields of this document. The returned map is unmodifiable.

Returns:
all fields of this document
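
In practice, an unmodifiable map means that writes through the returned map fail and changes have to go through setField. The sketch below demonstrates this on a hypothetical stand-in class that uses Collections.unmodifiableMap; whether the real class returns a live view or a copy is not stated in this reference.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class UnmodifiableFieldsSketch {
    /** Hypothetical stand-in for Document. */
    static final class Doc {
        private final Map<String, Object> fields = new HashMap<>();

        // A read-only view is one way to meet the documented contract.
        Map<String, Object> getFields() {
            return Collections.unmodifiableMap(fields);
        }

        Doc setField(String name, Object value) {
            fields.put(name, value);
            return this;
        }
    }

    /** Returns true if writing through the returned map is rejected. */
    static boolean writeIsRejected(Doc doc) {
        try {
            doc.getFields().put("title", "changed");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc().setField("title", "original");
        System.out.println("write rejected: " + writeIsRejected(doc));
        System.out.println("title: " + doc.getFields().get("title"));
    }
}
```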

getField

public <T> T getField(String name)
Returns the value of the specified field of this document. If no field corresponds to the provided name, null is returned.

Parameters:
name - name of the field to be returned
Returns:
value of the field or null
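
With a signature of the form <T> T getField(String), the caller's target type decides T, so a mismatch shows up as a ClassCastException at the call site rather than inside the getter. The following sketch uses a hypothetical stand-in class to show this general Java behavior; it assumes an unchecked cast inside the getter, which is the usual way to implement such a signature.

```java
import java.util.HashMap;
import java.util.Map;

class TypedFieldSketch {
    /** Hypothetical stand-in for Document. */
    static final class Doc {
        private final Map<String, Object> fields = new HashMap<>();

        @SuppressWarnings("unchecked")
        <T> T getField(String name) {
            // Unchecked cast: the actual type check happens at the caller.
            return (T) fields.get(name);
        }

        Doc setField(String name, Object value) {
            fields.put(name, value);
            return this;
        }
    }

    /** Returns true if reading the field as an Integer fails with a ClassCastException. */
    static boolean failsAsInteger(Doc doc, String name) {
        try {
            Integer value = doc.getField(name);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc().setField("title", "Example title");
        String title = doc.getField("title");   // T is inferred as String
        Object missing = doc.getField("none");  // no such field: null
        System.out.println(title + ", " + missing + ", " + failsAsInteger(doc, "title"));
    }
}
```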

setField

public Document setField(String name,
                         Object value)
Sets a field in this document.

Parameters:
name - name of the field to set
value - value of the field
Returns:
this document for convenience

assignDocumentIds

public static void assignDocumentIds(Collection<Document> documents)
Assigns sequential identifiers to the provided documents. If a document already has an identifier, the identifier will not be changed.

Parameters:
documents - documents to assign identifiers to.
Throws:
IllegalArgumentException - if the provided documents contain non-unique identifiers
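
The documented contract (existing identifiers are kept, missing ones are filled in, duplicates are rejected) can be sketched as follows. The Doc class is a hypothetical stand-in, and continuing the numbering after the largest existing identifier is an assumption of this sketch; the reference only says the identifiers are sequential.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AssignIdsSketch {
    /** Hypothetical stand-in for Document holding only an identifier. */
    static final class Doc {
        Integer id;

        Doc(Integer id) {
            this.id = id;
        }
    }

    static void assignIds(Collection<Doc> docs) {
        Set<Integer> seen = new HashSet<>();
        int next = 0;
        // First pass: validate existing identifiers and find where to continue.
        for (Doc d : docs) {
            if (d.id == null) {
                continue;
            }
            if (!seen.add(d.id)) {
                throw new IllegalArgumentException("Non-unique identifier: " + d.id);
            }
            next = Math.max(next, d.id + 1);
        }
        // Second pass: fill in missing identifiers, leaving existing ones alone.
        for (Doc d : docs) {
            if (d.id == null) {
                d.id = next++;
            }
        }
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(new Doc(null), new Doc(5), new Doc(null));
        assignIds(docs);
        for (Doc d : docs) {
            System.out.println(d.id);
        }
    }
}
```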


Copyright (c) Dawid Weiss, Stanislaw Osinski