Comma-separated value ("CSV") format is a common semi-standard text-based format in which fields are delimited by commas. Spreadsheets and databases are often able to export data in some variant of it. The intention is to read tables in the version of the format spoken by MS Excel amongst other applications, though the documentation on which it was based was not obtained directly from Microsoft.
The rules for data which it understands are as follows:
Note that a "#" character (or anything else) cannot be used to introduce "comment" lines.
Because the CSV format contains no metadata beyond column names,
the handler is forced to guess the datatype of the values in each column.
It does this by reading the whole file through once and guessing
on the basis of what it has seen (though see the maxSample
configuration option). This has the disadvantages:
The delimiter option makes it possible to use non-comma
characters to separate fields. Depending on the character used this
may behave in surprising ways; in particular for space-separated fields
the ascii format may be a better choice.
The handler behaviour may be modified by specifying
one or more comma-separated name=value configuration options
in parentheses after the handler name, e.g.
"csv(header=true,delimiter=|)".
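The option syntax above can be illustrated with a small, self-contained Python sketch. It is not the handler's own parser: the function names parse_handler_spec and decode_delimiter are hypothetical, and the sketch assumes that option values never contain commas. The delimiter forms it accepts (a single character, a hexadecimal code, or one of the names comma, space and tab) are those listed under the delimiter option.

```python
import re

# Hypothetical helpers for illustration only; not the STIL implementation.
DELIMITER_NAMES = {"comma": ",", "space": " ", "tab": "\t"}


def parse_handler_spec(spec):
    """Split 'name(opt=val,...)' into (name, {opt: val})."""
    match = re.fullmatch(r"\s*(\w+)\s*(?:\((.*)\))?\s*", spec, re.DOTALL)
    if match is None:
        raise ValueError(f"not a handler specification: {spec!r}")
    name, body = match.group(1), match.group(2)
    options = {}
    if body:
        # Assumption: option values do not themselves contain commas.
        for item in body.split(","):
            key, sep, value = item.partition("=")
            if not sep:
                raise ValueError(f"expected name=value, got {item!r}")
            options[key.strip()] = value.strip()
    return name, options


def decode_delimiter(value):
    """Turn a delimiter option value into a single character."""
    if value in DELIMITER_NAMES:
        return DELIMITER_NAMES[value]
    if re.fullmatch(r"0x[0-9A-Fa-f]+", value):
        return chr(int(value, 16))
    if len(value) == 1:
        return value
    raise ValueError(f"unrecognised delimiter: {value!r}")


name, options = parse_handler_spec("csv(header=true,delimiter=|)")
print(name, options, repr(decode_delimiter(options["delimiter"])))
```

Keeping the option values as plain strings at this stage leaves their interpretation (boolean, character, integer) to whichever code consumes each option.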
The following options are available:
header = true|false|null
true: the first line is a header line containing column names
false: all lines are data lines, and column names will be assigned automatically
null: a guess will be made about whether the first line is a header or not depending on what it looks like
The default is null (auto-determination).
This usually works OK, but can get into trouble if
all the columns look like string values.
(Default: null)
delimiter = <char>|0xNN
The value may be a single character like "|", a hexadecimal character code like "0x7C", or one of the names "comma", "space" or "tab". Some choices of delimiter, for instance whitespace characters, might not work well or might behave in surprising ways.
(Default: ,)
maxSample = <int>
(Default: 0)
notypes = <type>[;<type>...]
The type names are blank, boolean, short, int, long, float, double, date, hms and dms. So if you want to make sure that all integer and floating-point columns are 64-bit (i.e. long and double respectively) you can set this value to "short;int;float".
This format cannot be automatically identified
by its content, so in general it is necessary
to specify that a table is in CSV format when reading it.
However, if the input file has the extension ".csv" (case insensitive)
an attempt will be made to read it using this format.
An example looks like this:
RECNO,SPECIES,NAME,LEGS,HEIGHT,MAMMAL
1,pig,Pigling Bland,4,0.8,true
2,cow,Daisy,4,2.0,true
3,goldfish,Dobbin,,0.05,false
4,ant,,6,0.001,false
5,ant,,6,0.001,false
6,queen ant,Ma'am,6,0.002,false
7,human,Mark,2,1.8,true
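As a rough illustration of the datatype guessing described earlier, the following Python sketch reads the example table once and assigns each column a type based on every value seen. It is a deliberately simplified assumption, not the handler's algorithm: the function guess_type is hypothetical, it distinguishes only boolean, int, double, string and blank columns, and it treats empty cells as nulls that do not influence the guess.

```python
import csv
import io

EXAMPLE = """RECNO,SPECIES,NAME,LEGS,HEIGHT,MAMMAL
1,pig,Pigling Bland,4,0.8,true
2,cow,Daisy,4,2.0,true
3,goldfish,Dobbin,,0.05,false
4,ant,,6,0.001,false
5,ant,,6,0.001,false
6,queen ant,Ma'am,6,0.002,false
7,human,Mark,2,1.8,true
"""


def _parses_as(converter, text):
    """Return True if converter(text) succeeds."""
    try:
        converter(text)
        return True
    except ValueError:
        return False


def guess_type(values):
    """Guess one type name for a column (hypothetical, simplified rules)."""
    present = [v for v in values if v.strip() != ""]  # empty cells are nulls
    if not present:
        return "blank"
    if all(v.lower() in ("true", "false") for v in present):
        return "boolean"
    if all(_parses_as(int, v) for v in present):
        return "int"
    if all(_parses_as(float, v) for v in present):
        return "double"
    return "string"


rows = list(csv.reader(io.StringIO(EXAMPLE)))
header, data = rows[0], rows[1:]
guessed = {name: guess_type(col) for name, col in zip(header, zip(*data))}
print(guessed)
```

Because every value in a column must agree with the guessed type, a single non-numeric cell anywhere in the file is enough to turn a numeric column into a string column, which is one reason the guess depends on seeing the data before the types are fixed.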
The handler class for this format is CsvTableBuilder.