csv.reader has trouble with comma inside quotes inside brackets

Tue Jun 9 17:03:30 EDT 2009

Bret <bretrouse <at> gmail.com> writes:

> 
> i have a csv file like so:
> row1,field1,[field2][text in field2 "quote, quote"],field3,field
> row2,field1,[field2]text in field2 "quote, quote",field3,field
> 
> using csv.reader to read the file, the first row is broken into two
> fields:
> [field2][text in field2 "quote
> and
>  quote"
> 
> while the second row is read correctly with:
> [field2]text in field2 "quote, quote"
> being one field.
> 
> any ideas how to make csv.reader work correctly for the first case?
> the problem is the comma inside the quote inside the brackets, ie:
> [","]

I can't reproduce the behaviour that you describe based on these reasonable
assumptions: Python 2.6, delimiter=',', quotechar='"':

C:\junk\weird_csv>\python26\python
Python 2.6.2 (r262:71605, Apr 14 2009, 22:40:02) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
|>>> import pprint, csv
|>>> open('weird.csv', 'rb').read()
'row1,field1,[field2][text in field2 "quote, quote"],field3,field\r\nrow2,field1
,[field2]text in field2 "quote, quote",field3,field\r\n'
>>> pprint.pprint(list(csv.reader(open('weird.csv', 'rb'))))
[['row1',
  'field1',
  '[field2][text in field2 "quote',
  ' quote"]',
  'field3',
  'field'],
 ['row2',
  'field1',
  '[field2]text in field2 "quote',
  ' quote"',
  'field3',
  'field']]
|>>>

As you can see, it treats both lines in the same way. Opening the file with each
of Microsoft Excel 2003, OpenOffice.org Calc 3, and Gnumeric has exactly the
same effect.

The problem is that your data has not been prepared using the generally accepted
rules for quoting in CSV files:
[pseudocode]
   if field contains any of quotechar, delimiter, newline, maybe others:
      field = (quotechar
               + field.replace(quotechar, quotechar + quotechar)
               + quotechar)
which would change your first line from
  row1,field1,[field2][text in field2 "quote, quote"],field3,field
to
  row1,field1,"[field2][text in field2 ""quote, quote""]",field3,field

There's no option in the csv module to get around this, AFAICT. You'd have to
roll your own, something along these lines (assumes no embedded newlines etc):

8<--- parse_unquoted_csv.py
def parse_unquoted_csv(line, delimiter=',', quotechar='"'):
    line = line.strip()
    inside_quotes = False
    fields = []
    field = ''
    for c in line:
        if inside_quotes:
            if c == quotechar:
                inside_quotes = False
            field += c
        else:
            if c == delimiter:
                fields.append(field)
                field = ''
            else:
                if c == quotechar:
                    inside_quotes = True
                field += c
    if inside_quotes:
        print repr(line)
        print fields
        print repr(field)
        raise Exception("Quotes not balanced")
    fields.append(field)
    return fields

if __name__ == "__main__":
    tests = [
        'row1,field1,[field2][text in field2 "quote, quote"],field3,field',
        'row2,field1,[field2]text in field2 "quote, quote",field3,field',
        'a,b,c',
        'a,b,',
        '',
        'Look,here"s,a,typo',
        ]
    for test in tests:
        print repr(test)
        print parse_unquoted_csv(test)
        print
8<---

HTH,
John