Package com.gengoai.hermes.format
Class CsvFormat
- java.lang.Object
  - com.gengoai.hermes.format.WholeFileTextFormat
    - com.gengoai.hermes.format.CsvFormat
- All Implemented Interfaces:
  DocFormat, Serializable
public class CsvFormat extends WholeFileTextFormat implements Serializable
Format Name: csv
Delimiter-separated files (e.g. CSV and TSV) in which each row represents a document. The following additional parameters are available when reading or writing in CSV format:
- columns=<list of column names>: The list of column names when file does not have a header (default: empty).
- content=<String>: Name of the content column (default: "content").
- id=<String>: Name of the id column.
- language=<String>: Name of the language column (default: "language").
- comment=<Character>: The character used for comments in the file (default: '#').
- delimiter=<Character>: The character used for delimiting columns in the file (default: ',').
- hasHeader=[true|false]: The file has a header naming the columns when true (default: false).
Note that column names are autogenerated as C0, C1, …, CN when no column names are given and the file has no header. Additional columns in the file that are not assigned to "id", "language", or "content" are treated as document-level attributes.
Note: Writing in csv only includes document id, language, content, and attributes. No annotations are written.
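The column rules above can be illustrated with a small, self-contained sketch. It does not use or reproduce the Hermes implementation: the class `CsvColumnSketch` and its methods `resolveColumns` and `attributes` are hypothetical names introduced only for illustration, and the sketch assumes that declared names (from the columns parameter or a header) simply take precedence over generated ones.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hermes: illustrates the documented column rules.
class CsvColumnSketch {

    // Use the declared names (columns parameter or header) when present;
    // otherwise generate C0, C1, ..., one name per column in the row.
    static List<String> resolveColumns(List<String> declared, int width) {
        if (declared != null && !declared.isEmpty()) {
            return declared;
        }
        List<String> generated = new ArrayList<>();
        for (int i = 0; i < width; i++) {
            generated.add("C" + i);
        }
        return generated;
    }

    // Every column that is not the id, language, or content column
    // is collected as a document-level attribute.
    static Map<String, String> attributes(List<String> columns, List<String> row,
                                          String idCol, String langCol, String contentCol) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (int i = 0; i < columns.size() && i < row.size(); i++) {
            String name = columns.get(i);
            if (!name.equals(idCol) && !name.equals(langCol) && !name.equals(contentCol)) {
                attrs.put(name, row.get(i));
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        // No declared names and no header: names are generated.
        System.out.println(resolveColumns(List.of(), 3)); // prints [C0, C1, C2]

        List<String> cols = List.of("id", "content", "author");
        List<String> row = List.of("doc-1", "Hello world", "Jane");
        System.out.println(attributes(cols, row, "id", "language", "content")); // prints {author=Jane}
    }
}
```

With generated names, none of the columns would match the default "content" name, so in practice the content parameter would presumably need to point at one of C0, …, CN.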
- See Also:
- Serialized Form
Nested Class Summary
Nested Classes (Modifier and Type, Class: Description)
- static class CsvFormat.CSVParameters: The type Csv parameters.
- static class CsvFormat.Provider: The type Provider.
-
Field Summary
Fields (Modifier and Type, Field: Description)
- static ParameterDef<List<String>> COLUMN_NAMES: List of strings representing the column names
- static ParameterDef<Character> COMMENT_CHAR: The character representing a commented line
- static ParameterDef<String> CONTENT_COLUMN: The name of the column containing the content
- static ParameterDef<Character> DELIMITER_CHAR: The character representing the column delimiter
- static ParameterDef<Boolean> HAS_HEADER: True when the CSV file has a header, False when not
- static ParameterDef<String> ID_COLUMN: The name of the column representing the id
- static ParameterDef<String> LANGUAGE_COLUMN: The name of the column representing the document language
-
Method Summary
Methods (Modifier and Type, Method: Description)
- protected Index<String> getColumnNames(): Gets the names of the columns specified.
- DocFormatParameters getParameters()
- protected Stream<Document> readSingleFile(String file): Converts the content of an entire file into one or more documents.
- void write(DocumentCollection corpus, Resource outputResource): Writes a corpus of documents in this format to the given output resource.
- void write(Document document, Resource outputResource): Writes the given document in this format to the given output resource.
Methods inherited from class com.gengoai.hermes.format.WholeFileTextFormat
read
Field Detail
-
COLUMN_NAMES
public static final ParameterDef<List<String>> COLUMN_NAMES
List of strings representing the column names
-
CONTENT_COLUMN
public static final ParameterDef<String> CONTENT_COLUMN
The name of the column containing the content
-
ID_COLUMN
public static final ParameterDef<String> ID_COLUMN
The name of the column representing the id
-
LANGUAGE_COLUMN
public static final ParameterDef<String> LANGUAGE_COLUMN
The name of the column representing the document language
-
COMMENT_CHAR
public static ParameterDef<Character> COMMENT_CHAR
The character representing a commented line
-
DELIMITER_CHAR
public static ParameterDef<Character> DELIMITER_CHAR
The character representing the column delimiter
-
HAS_HEADER
public static ParameterDef<Boolean> HAS_HEADER
True when the CSV file has a header, False when not
-
-
Method Detail
-
getColumnNames
protected Index<String> getColumnNames()
Gets the names of the columns specified.
- Returns:
- the column names
-
getParameters
public DocFormatParameters getParameters()
- Specified by:
getParameters in interface DocFormat
- Returns:
- the DocFormatParameters set for this instance of the format
-
readSingleFile
protected Stream<Document> readSingleFile(String file)
Description copied from class: WholeFileTextFormat
Converts the content of an entire file into one or more documents.
- Specified by:
readSingleFile in class WholeFileTextFormat
- Parameters:
file - the content
- Returns:
- the stream of documents.
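As a rough illustration of how whole-file content could be split into one row per document, the sketch below applies the comment, delimiter, and hasHeader parameters using only the Java standard library (Java 11 or later). It is not the Hermes implementation: `CsvReadSketch` and `rows` are hypothetical names, blank lines are skipped as an assumption, and quoted fields containing the delimiter are deliberately not handled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Hermes: naive row splitting without quote handling.
class CsvReadSketch {

    // Splits the content of an entire file into rows of cells.
    // Blank lines and lines starting with the comment character are skipped,
    // and the first remaining line is dropped when the file has a header.
    static List<List<String>> rows(String content, char delimiter, char comment, boolean hasHeader) {
        // Quote the delimiter so characters such as '|' or '\t' are matched literally.
        String separator = Pattern.quote(String.valueOf(delimiter));
        return content.lines()
                .filter(line -> !line.isBlank())
                .filter(line -> line.charAt(0) != comment)
                .skip(hasHeader ? 1 : 0)
                // A limit of -1 keeps trailing empty cells.
                .map(line -> Arrays.asList(line.split(separator, -1)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String content = "# exported corpus\nid,content\n1,Hello\n2,World\n";
        System.out.println(rows(content, ',', '#', true)); // prints [[1, Hello], [2, World]]
    }
}
```

Each resulting row would then be mapped to a document using the id, language, and content column names, with the remaining cells becoming attributes.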
-
write
public void write(DocumentCollection corpus, Resource outputResource) throws IOException
Description copied from interface: DocFormat
Writes a corpus of documents in this format to the given output resource.
- Specified by:
write in interface DocFormat
- Parameters:
corpus - the corpus
outputResource - the output resource
- Throws:
IOException - Something went wrong writing the corpus
-
write
public void write(Document document, Resource outputResource) throws IOException
Description copied from interface: DocFormat
Writes the given document in this format to the given output resource.
- Specified by:
write in interface DocFormat
- Parameters:
document - the document
outputResource - the output resource
- Throws:
IOException - Something went wrong writing the document
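Since writing only emits the document id, language, content, and attributes, one output row might be assembled roughly as in the sketch below. This is an assumption-laden illustration, not the Hermes writer: `CsvWriteSketch`, `escape`, and `toRow` are hypothetical names, the column order (id, language, content, then attributes) is assumed, and the quoting follows the common RFC 4180 convention rather than anything this page specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Hermes: builds one delimited row per document.
class CsvWriteSketch {

    // Wraps a cell in double quotes when it contains the delimiter, a quote,
    // or a line break; embedded quotes are doubled (RFC 4180 style).
    static String escape(String value, char delimiter) {
        boolean needsQuotes = value.indexOf(delimiter) >= 0
                || value.contains("\"")
                || value.contains("\n")
                || value.contains("\r");
        String escaped = value.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Emits id, language, and content first, then the attributes in a fixed order.
    // Missing attributes are written as empty cells so every row has the same width.
    static String toRow(String id, String language, String content,
                        Map<String, String> attributes, List<String> attributeOrder,
                        char delimiter) {
        List<String> cells = new ArrayList<>(List.of(id, language, content));
        for (String name : attributeOrder) {
            cells.add(attributes.getOrDefault(name, ""));
        }
        return cells.stream()
                .map(cell -> escape(cell, delimiter))
                .collect(Collectors.joining(String.valueOf(delimiter)));
    }

    public static void main(String[] args) {
        String row = toRow("doc-1", "ENGLISH", "Hello, \"world\"",
                Map.of("author", "Jane"), List.of("author"), ',');
        System.out.println(row); // prints doc-1,ENGLISH,"Hello, ""world""",Jane
    }
}
```

Annotations have no place in such a row, which is consistent with the note that they are not written in csv format.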
-
-