Unexpected behaviour of csv module

John Machin sjmachin at lexicon.net
Mon Sep 25 03:37:27 CEST 2006


Andrew McLean wrote:
> I have a bunch of csv files that have the following characteristics:
>
> - field delimiter is a comma
> - all fields quoted with double quotes
> - lines terminated by a *space* followed by a newline
>
> What surprised me was that the csv reader included the trailing space in
> the final field value returned, even though it is outside of the quotes.
>
>
> I've produced a test program (see below) that demonstrates this. There
> is a workaround, which is to not pass the csv reader the file iterator,
> but rather a generator that returns lines from the file with the
> trailing space stripped.
>
> Interestingly, the same behaviour is seen if there are spaces before the
> field separator. They are also included in the preceding field value,
> even if they are outside the quotations. My workaround wouldn't help here.

A better workaround IMHO is to strip each *field* after it is received
from the csv reader. In fact, it is very rare that leading or trailing
space in CSV fields is of any significance at all. Multiple spaces
ditto. Just do this all the time:

row = [' '.join(x.split()) for x in row]
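
As a rough, self-contained sketch of that idea in current Python (the helper name normalise_row and the in-memory sample data are made up for illustration, they are not part of the original test script):

```python
import csv
import io

def normalise_row(row):
    """Strip leading/trailing whitespace and collapse internal runs to one space."""
    return [" ".join(field.split()) for field in row]

# Hypothetical sample: quoted fields followed by stray spaces, as in the report.
sample = io.StringIO('"Field1","Field2","Field3" \n"d" ,"e","f" \n', newline="")

rows = [normalise_row(row) for row in csv.reader(sample)]
print(rows)
```

This fixes up whatever the reader hands back, so it also covers the spaces-before-the-delimiter case. It is only appropriate when whitespace in the data really is insignificant.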

>
> Anyway is this a bug or a feature? If it is a feature then I'm curious
> as to why it is considered desirable behaviour.

IMHO, a bug. In that state (just after the closing quote), it should be
expecting another quotechar, a delimiter, or a lineterminator. A case
could be made for any of: (a) ignore space characters, (b) raise an
exception, or (c) do (a) or (b) depending on an argument, e.g. a
hypothetical ignore_trailing_space=False.
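
For what it's worth, something close to option (b) can be had from the dialect's strict flag, which makes the reader raise csv.Error on bad input. A minimal sketch in current Python, assuming that a character after a closing quote counts as bad input in strict mode (parse_strict is a made-up helper, and I am not trying to enumerate which other malformed inputs strict mode rejects):

```python
import csv
import io

def parse_strict(text):
    """Parse CSV text with strict=True; return (rows_read, error_message_or_None)."""
    rows = []
    try:
        for row in csv.reader(io.StringIO(text, newline=""), strict=True):
            rows.append(row)
    except csv.Error as exc:
        # The reader gave up on a malformed line; report what it said.
        return rows, str(exc)
    return rows, None

good_rows, good_err = parse_strict('"a","b","c"\n')
bad_rows, bad_err = parse_strict('"a","b","c" \n')  # space after the closing quote

print(good_rows, good_err)
print(bad_rows, bad_err)
```

The well-formed line parses normally, while the line with the stray space is reported as an error instead of being folded into the last field.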

But it gets even more bizarre; see output from revised test script:

DOS_prompt>cat amclean2.py
import csv
filename = "test_data.csv"

# Generate a test file - note the spaces before the newlines
fout = open(filename, "w")
fout.write('"Field1","Field2","Field3" \n')
fout.write('"a","b","c" \n')
fout.write('"d" ,"e","f" \n')
fout.write('"g"xxx,"h" yyy,"i"zzz \n')
fout.write('Fred "Supercoder" Nerk,p,q\n')
fout.write('Fred "Supercoder\' Nerk,p,q\n')
fout.write('Fred \'Supercoder" Nerk,p,q\n')
fout.write('"Fred "Supercoder" Nerk",p,q\n')
fout.write('"Fred "Supercoder\' Nerk",p,q\n')
fout.write('"Fred \'Supercoder" Nerk",p,q\n')
fout.write('"Emoh Ruo", 123 Smith St, Sometown,p,q\n')
fout.write('""Emoh Ruo", 123 Smith St, Sometown","p","q"\n')
fout.close()

# Function to test a reader
def read_and_print(reader):
     for line in reader:
         # print ",".join(['"%s"' % field for field in line])
         # sheesh
         print repr(line)

# Read the test file - and print the output
reader = csv.reader(open(filename, "rb"))
read_and_print(reader)

DOS_prompt>\python25\python amclean2.py
['Field1', 'Field2', 'Field3 ']
['a', 'b', 'c ']
['d ', 'e', 'f ']
['gxxx', 'h yyy', 'izzz ']
['Fred "Supercoder" Nerk', 'p', 'q']
['Fred "Supercoder\' Nerk', 'p', 'q']
['Fred \'Supercoder" Nerk', 'p', 'q']
['Fred Supercoder" Nerk"', 'p', 'q']
['Fred Supercoder\' Nerk"', 'p', 'q']
['Fred \'Supercoder Nerk"', 'p', 'q']
['Emoh Ruo', ' 123 Smith St', ' Sometown', 'p', 'q']
['Emoh Ruo"', ' 123 Smith St', ' Sometown"', 'p', 'q']

Input like the 4th and subsequent lines of the test file cannot have
been produced by code that follows the usual algorithm for quoting CSV
fields. Either it is *concatenating* properly-quoted segments
(unlikely), or it is not doing CSV quoting at all, or it is blindly
wrapping quotes around the field without doubling internal quotes.

IMHO such problems should not be silently ignored.
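
For comparison, here is what the usual quoting algorithm produces, sketched with the csv module's own writer in current Python. QUOTE_ALL is chosen to match the "all fields quoted" files in the original report, and the field value is borrowed from the test script:

```python
import csv
import io

buf = io.StringIO(newline="")
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)

# An embedded quotechar is doubled, then the whole field is wrapped in quotes.
writer.writerow(['Fred "Supercoder" Nerk', "p", "q"])
encoded = buf.getvalue()
print(repr(encoded))

# Reading the properly quoted line back recovers the original values.
decoded = next(csv.reader(io.StringIO(encoded, newline="")))
print(decoded)
```

Data written this way round-trips cleanly, which is why input with undoubled internal quotes points to a producer that is not doing CSV quoting properly.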

> # Try using lineterminator instead - it doesn't work
> reader = csv.reader(open("test_data.csv", "rb"), lineterminator=" \r\n")

The reader silently ignores lineterminator: it recognises '\r' and '\n'
as line endings regardless, so the setting only affects the writer.
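
A small sketch in current Python that shows both the ignored argument and the line-stripping workaround described in the original message (the sample data and the generator name without_trailing_space are made up for illustration):

```python
import csv
import io

DATA = '"a","b","c" \n"d","e","f" \n'

def without_trailing_space(lines):
    """Yield each line with trailing spaces and the line ending removed."""
    for line in lines:
        yield line.rstrip(" \r\n")

# Passing lineterminator to the reader is accepted but changes nothing.
ignored = list(csv.reader(io.StringIO(DATA, newline=""), lineterminator=" \r\n"))

# Stripping each line before the reader sees it removes the stray space.
stripped = list(csv.reader(without_trailing_space(io.StringIO(DATA, newline=""))))

print(ignored)
print(stripped)
```

Note that stripping whole lines is only safe when no quoted field contains an embedded newline and no unquoted final field has meaningful trailing spaces. It also does nothing for spaces before a delimiter, which is why stripping each field is the more general fix.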

Cheers,
John



