Class CsvFormat

  • All Implemented Interfaces:
    DocFormat, Serializable

    public class CsvFormat
    extends WholeFileTextFormat
    implements Serializable

    Format Name: csv

    Delimiter-separated files (e.g. CSV and TSV) in which each row represents a document. The following additional parameters are available when reading or writing in CSV format:

    • columns=<list of column names>: The list of column names to use when the file does not have a header (default: empty).
    • content=<String>: Name of the content column (default: "content").
    • id=
    • language=<String>: Name of the language column (default: "language").
    • comment=<Character>: The character used for comments in the file (default: '#').
    • delimiter=<Character>: The character used for delimiting columns in the file (default: ',').
    • hasHeader=[true|false]: The file has a header naming the columns when true (default: false).
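
    As a rough illustration of what the delimiter and comment parameters control, the sketch below splits raw lines into rows. It is not the library's parser: the class name DelimitedRowSketch is hypothetical, comments are assumed to occupy whole lines, and quoting and escaping are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the documented API.
final class DelimitedRowSketch {

    // Splits each line on the delimiter, skipping empty lines and lines
    // that begin with the comment character (assumption: whole-line comments).
    static List<List<String>> parse(List<String> lines, char delimiter, char comment) {
        List<List<String>> rows = new ArrayList<>();
        // Quote the delimiter so characters such as '|' are not read as regex syntax.
        String splitter = Pattern.quote(String.valueOf(delimiter));
        for (String line : lines) {
            if (line.isEmpty() || line.charAt(0) == comment) {
                continue;
            }
            // A limit of -1 keeps trailing empty fields.
            rows.add(Arrays.asList(line.split(splitter, -1)));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("# a comment", "1,en,Hello world", "2,de,Hallo Welt");
        System.out.println(parse(lines, ',', '#'));
    }
}
```

    Passing '\t' as the delimiter would cover TSV input in the same way.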

    Note that column names will be autogenerated as C0, C1, …, CN when no column names are given and the file has no header. Additional columns in the file that are not assigned to "id", "language", or "content" will be treated as document-level attributes.
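
    The naming and attribute rules described in this note can be sketched in plain Java. This is an illustration only, not the library's implementation: ColumnMappingSketch and its methods are hypothetical, and the id, language, and content column names are passed in explicitly rather than read from the format parameters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the documented API.
final class ColumnMappingSketch {

    // Returns the supplied names, or C0..C(width-1) when none were supplied.
    static List<String> resolveNames(List<String> given, int width) {
        if (given != null && !given.isEmpty()) {
            return given;
        }
        List<String> names = new ArrayList<>();
        for (int i = 0; i < width; i++) {
            names.add("C" + i);
        }
        return names;
    }

    // Collects every column that is not the id, language, or content column;
    // these are the values that would become document-level attributes.
    static Map<String, String> attributes(List<String> names, List<String> row,
                                          String idCol, String langCol, String contentCol) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (int i = 0; i < names.size() && i < row.size(); i++) {
            String name = names.get(i);
            if (!name.equals(idCol) && !name.equals(langCol) && !name.equals(contentCol)) {
                attrs.put(name, row.get(i));
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("id", "content", "source");
        List<String> row = Arrays.asList("1", "Hello world", "news");
        System.out.println(resolveNames(new ArrayList<>(), 3));
        System.out.println(attributes(names, row, "id", "language", "content"));
    }
}
```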

    Note: Writing in CSV format includes only the document id, language, content, and attributes. No annotations are written.
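
    To make the consequence of this note concrete, the sketch below builds one output row from a stand-in document. Everything here is hypothetical: SimpleDoc is not the library's Document type, the column order is an assumption, and the quoting follows RFC 4180 conventions, which may differ from what the library actually emits.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the documented API.
final class CsvRowWriterSketch {

    // Minimal stand-in for a document with attributes and annotations.
    static final class SimpleDoc {
        final String id;
        final String language;
        final String content;
        final Map<String, String> attributes = new LinkedHashMap<>();
        final List<String> annotations = new ArrayList<>();

        SimpleDoc(String id, String language, String content) {
            this.id = id;
            this.language = language;
            this.content = content;
        }
    }

    // Quotes a field when it contains the delimiter, a double quote, or a line break.
    static String escape(String field, char delimiter) {
        boolean needsQuotes = field.indexOf(delimiter) >= 0 || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0 || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Emits id, language, content, then attribute values; annotations are left out.
    static String toRow(SimpleDoc doc, char delimiter) {
        List<String> fields = new ArrayList<>();
        fields.add(doc.id);
        fields.add(doc.language);
        fields.add(doc.content);
        fields.addAll(doc.attributes.values());
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(delimiter);
            }
            sb.append(escape(fields.get(i), delimiter));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SimpleDoc doc = new SimpleDoc("d1", "en", "Hello, world");
        doc.attributes.put("source", "news");
        doc.annotations.add("TOKEN"); // never reaches the output row
        System.out.println(toRow(doc, ','));
    }
}
```

    A round trip through this format would therefore return documents without their annotations, so a format that preserves annotations is the safer choice when they matter.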

    See Also:
    Serialized Form
    • Field Detail

      • COLUMN_NAMES

        public static final ParameterDef<List<String>> COLUMN_NAMES
        List of strings representing the column names
      • CONTENT_COLUMN

        public static final ParameterDef<String> CONTENT_COLUMN
        The name of the column containing the content
      • ID_COLUMN

        public static final ParameterDef<String> ID_COLUMN
        The name of the column representing the id
      • LANGUAGE_COLUMN

        public static final ParameterDef<String> LANGUAGE_COLUMN
        The name of the column representing the document language
      • COMMENT_CHAR

        public static ParameterDef<Character> COMMENT_CHAR
        The character representing a commented line
      • DELIMITER_CHAR

        public static ParameterDef<Character> DELIMITER_CHAR
        The character representing the column delimiter
      • HAS_HEADER

        public static ParameterDef<Boolean> HAS_HEADER
        True when the CSV file has a header, False when not
    • Method Detail

      • getColumnNames

        protected Index<String> getColumnNames()
        Gets the names of the columns specified
        Returns:
        the column names
      • write

        public void write(DocumentCollection corpus,
                          Resource outputResource)
                   throws IOException
        Description copied from interface: DocFormat
        Writes a corpus of documents in this format to the given output resource
        Specified by:
        write in interface DocFormat
        Parameters:
        corpus - the corpus
        outputResource - the output resource
        Throws:
        IOException - Something went wrong writing the corpus
      • write

        public void write(Document document,
                          Resource outputResource)
                   throws IOException
        Description copied from interface: DocFormat
        Writes the given document in this format to the given output resource.
        Specified by:
        write in interface DocFormat
        Parameters:
        document - the document
        outputResource - the output resource
        Throws:
        IOException - Something went wrong writing the document