7.1. CSV About

  • CSV - Comma/Character Separated Values

  • No CSV formal standard, just a good practice

  • Flat file (2D) without relations

  • Relations has to be flatten (serialization, additional columns, etc...)

  • Typically first line (header) represents column names

  • Rarely first line can also have a structure (nrows, ncols)

  • Internationalization: encoding

  • Localization: decimal separator, thousands separator, date format

  • Parameters: delimiter, quotechar, quoting, lineterminator, dialect

Example CSV file:

sepal_length, sepal_width, petal_length, petal_width, species
5.8, 2.7, 5.1, 1.9, virginica
5.1, 3.5, 1.4, 0.2, setosa
5.7, 2.8, 4.1, 1.3, versicolor
6.3, 2.9, 5.6, 1.8, virginica
6.4, 3.2, 4.5, 1.5, versicolor
4.7, 3.2, 1.3, 0.2, setosa
7.0, 3.2, 4.7, 1.4, versicolor
7.6, 3.0, 6.6, 2.1, virginica
4.9, 3.0, 1.4, 0.2, setosa
4.9, 2.5, 4.5, 1.7, virginica

7.1.2. Delimiter

  • csv module expects delimeter to be 1-character in length

delimiter=', ':

sepal_length, sepal_width, petal_length, petal_width, species
5.8, 2.7, 5.1, 1.9, virginica
5.1, 3.5, 1.4, 0.2, setosa
5.7, 2.8, 4.1, 1.3, versicolor

delimiter=',':

sepal_length,sepal_width,petal_length,petal_width,species
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor

delimiter=';':

sepal_length;sepal_width;petal_length;petal_width;species
5.8;2.7;5.1;1.9;virginica
5.1;3.5;1.4;0.2;setosa
5.7;2.8;4.1;1.3;versicolor

delimiter=':':

root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
nobody:x:99:99:Nobody:/:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
watney:x:1000:1000:Mark Watney:/home/watney:/bin/bash
lewis:x:1001:1001:Melissa Lewis:/home/lewis:/bin/bash
martinez:x:1002:1002:Rick Martinez:/home/martinez:/bin/bash

delimiter='|':

| Firstname | Lastname | Role      |
|-----------|----------|-----------|
| Mark      | Watney   | Botanist  |
| Melissa   | Lewis    | Commander |
| Rick      | Martinez | Pilot     |

delimiter='\t':

sepal_length        sepal_width     petal_length    petal_width     species
5.8 2.7     5.1     1.9     virginica
5.1 3.5     1.4     0.2     setosa
5.7 2.8     4.1     1.3     versicolor

7.1.3. Quotechar

  • " - quote char (best)

  • ' - apostrophe

quotechar='"':

"sepal_length", "sepal_width", "petal_length", "petal_width", "species"
"5.8", "2.7", "5.1", "1.9", "virginica"
"5.1", "3.5", "1.4", "0.2", "setosa"
"5.7", "2.8", "4.1", "1.3", "versicolor"

quotechar="'":

'sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'
'5.8', '2.7', '5.1', '1.9', 'virginica'
'5.1', '3.5', '1.4', '0.2', 'setosa'
'5.7', '2.8', '4.1', '1.3', 'versicolor'

quotechar='|':

|sepal_length|, |sepal_width|, |petal_length|, |petal_width|, |species|
|5.8|, |2.7|, |5.1|, |1.9|, |virginica|
|5.1|, |3.5|, |1.4|, |0.2|, |setosa|
|5.7|, |2.8|, |4.1|, |1.3|, |versicolor|

quotechar='/':

/sepal_length/, /sepal_width/, /petal_length/, /petal_width/, /species/
/5.8/, /2.7/, /5.1/, /1.9/, /virginica/
/5.1/, /3.5/, /1.4/, /0.2/, /setosa/
/5.7/, /2.8/, /4.1/, /1.3/, /versicolor/

7.1.4. Quoting

  • csv.QUOTE_ALL (safest)

  • csv.QUOTE_MINIMAL

  • csv.QUOTE_NONE

  • csv.QUOTE_NONNUMERIC

quoting=csv.QUOTE_ALL:

"sepal_length", "sepal_width", "petal_length", "petal_width", "species"
"5.8", "2.7", "5.1", "1.9", "virginica"
"5.1", "3.5", "1.4", "0.2", "setosa"
"5.7", "2.8", "4.1", "1.3", "versicolor"

quoting=csv.QUOTE_MINIMAL:

sepal_length, sepal_width, petal_length, petal_width, species
5.8, 2.7, 5.1, 1.9, virginica
5.1, 3.5, 1.4, 0.2, setosa
5.7, 2.8, 4.1, 1.3, versicolor

quoting=csv.QUOTE_NONE:

sepal_length, sepal_width, petal_length, petal_width, species
5.8, 2.7, 5.1, 1.9, virginica
5.1, 3.5, 1.4, 0.2, setosa
5.7, 2.8, 4.1, 1.3, versicolor

quoting=csv.QUOTE_NONNUMERIC:

"sepal_length", "sepal_width", "petal_length", "petal_width", "species"
5.8, 2.7, 5.1, 1.9, "virginica"
5.1, 3.5, 1.4, 0.2, "setosa"
5.7, 2.8, 4.1, 1.3, "versicolor"

7.1.5. Lineterminator

  • \r\n - New line on Windows

  • \n - New line on *nix

  • *nix operating systems: Linux, macOS, BSD and other POSIX compliant OSes (excluding Windows)

7.1.6. Decimal Separator

  • 0.1 - Decimal point

  • 0,1 - Decimal comma

../../_images/l10n-decimal-separator.png
sepal_length, sepal_width, petal_length, petal_width, species
5.8; 2.7; 5.1; 1.9; virginica
5.1; 3.5; 1.4; 0.2; setosa
5.7; 2.8; 4.1; 1.3; versicolor
sepal_length, sepal_width, petal_length, petal_width, species
5,8; 2,7; 5,1; 1,9; virginica
5,1; 3,5; 1,4; 0,2; setosa
5,7; 2,8; 4,1; 1,3; versicolor

7.1.7. Thousands Separator

  • 1000000 - None

  • 1'000'000 - Apostrophe

  • 1 000 000 - Space, the internationally recommended thousands separator

  • 1.000.000 - Period, used in many non-English speaking countries

  • 1,000,000 - Comma, used in most English-speaking countries

7.1.8. Date and Time

>>> date = '1961-04-12'
>>> date = '12.4.1961'
>>> date = '12.04.1961'
>>> date = '12-04-1961'
>>> date = '12/04/1961'
>>> date = '4/12/61'
>>> date = '4.12.1961'
>>> date = 'Apr 12, 1961'
>>> date = 'Apr 12th, 1961'
>>> time = '12:00:00'
>>> time = '12:00'
>>> time = '12:00 pm'
>>> duration = '04:30:00'
>>> duration = '4h 30m'
>>> duration = '4 hours 30 minutes'

7.1.9. Encoding

  • utf-8 - international standard (should be always used!)

  • iso-8859-1 - ISO standard for Western Europe and USA

  • iso-8859-2 - ISO standard for Central Europe (including Poland)

  • cp1250 or windows-1250 - Central European encoding on Windows

  • cp1251 or windows-1251 - Eastern European encoding on Windows

  • cp1252 or windows-1252 - Western European encoding on Windows

  • ASCII - ASCII characters only

with open(FILE, encoding='utf-8') as file:
    ...

7.1.10. Dialects

import csv

csv.list_dialects()
# ['excel', 'excel-tab', 'unix']
  • Microsoft Excel 2016-2020:

    • quoting=csv.QUOTE_MINIMAL

    • quotechar='"'

    • delimiter=',' or delimiter=';' depending on Windows locale decimal separator

    • lineterminator='\r\n'

    • encoding='...' - depends on Windows locale typically windows-*

  • Microsoft Excel macOS:

    • quoting=csv.QUOTE_MINIMAL

    • quotechar='"'

    • delimiter=','

    • lineterminator='\r\n'

    • encoding='utf-8'

  • Microsoft export options:

../../_images/csv-standard-dialects.png
$ file utf8.csv
utf8.csv: CSV text

$ cat utf8.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,zażółć gęślą jaźń
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment
$ file standard.csv
standard.csv: CSV text

$ cat standard.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,za_?__ g__l_ ja__
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment
$ file dos.csv
dos.csv: CSV text

$ cat dos.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,za_?__ g__l_ ja__
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment
$ file macintosh.csv
macintosh.csv: Non-ISO extended-ASCII text, with CR line terminators

$ cat macintosh.csv
,,"21,5",Some; Comment

7.1.11. Good Practices

Always specify:

  • delimiter=',' to csv.DictReader() object

  • quotechar='"' to csv.DictReader() object

  • quoting=csv.QUOTE_ALL to csv.DictReader() object

  • lineterminator='\n' to csv.DictReader() object

  • encoding='utf-8' to open() function (especially when working with Microsoft Excel)