crimes in Python

Tim Peters tim_one at email.msn.com
Thu Mar 9 04:53:02 CET 2000


[Kragen Sitaker]
> ...
> Well, the lines are things like this:
>
> 2911.02,"ROBBERY - FORCE, THR",foo,bar,baz
>
> And I want something like ['2911.02', 'ROBBERY - FORCE, THR', 'foo',
> 'bar', 'baz'] as a result.  Note that the comma in the robbery field
> doesn't split the field.

The problem with these kinds of formats is that nobody defines them
carefully enough to know what to do.  For example, what if there's a leading
comma?  A trailing comma?  Whitespace before or after commas?  Can a record
spill across lines?  Is there an escape convention to allow quotes embedded
*within* "..." strings?  And so on.  It's just vague.  Try this on for size:


import re

csvre = re.compile(r"""
    (?:  ^ | ,)      # start of string or field separator (comma)
    (   " [^"]* "    # simple double-quote delimited
    |   [^,]*        # or sequence of non-commas
    )
""", re.VERBOSE)

test = '2911.02,"ROBBERY - FORCE, THR",foo,bar,baz,, the ,end'
print(re.findall(csvre, test))

Which prints

['2911.02', '"ROBBERY - FORCE, THR"', 'foo', 'bar',
 'baz', '', ' the ', 'end']

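This guy's answers to the questions above are, in order:  a leading comma
creates an empty field; a trailing comma creates an empty field; whitespace
around a comma is included as part of the adjacent field; can a record spill
across lines?  more yes than no, but the full truth is subtle; and no, there
is no escape convention for embedded quotes.  Fiddle accordingly.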
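
One possible bit of fiddling:  the pattern above keeps the surrounding double
quotes on quoted fields, while the requested result has them stripped.  Here is
a minimal sketch of that post-processing step, assuming a quoted field is always
exactly "..." with nothing outside the quotes.  The helper name split_fields is
made up for illustration and is not part of the original reply.

```python
import re

csvre = re.compile(r"""
    (?:  ^ | ,)      # start of string or field separator (comma)
    (   " [^"]* "    # simple double-quote delimited
    |   [^,]*        # or sequence of non-commas
    )
""", re.VERBOSE)

def split_fields(line):
    """Split one line with csvre, then drop the quotes around quoted fields.

    Hypothetical helper: it assumes quotes only ever wrap a whole field.
    """
    fields = []
    for field in csvre.findall(line):
        # A field matched by the first alternative starts and ends with '"'.
        if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
            field = field[1:-1]
        fields.append(field)
    return fields

print(split_fields('2911.02,"ROBBERY - FORCE, THR",foo,bar,baz'))
# ['2911.02', 'ROBBERY - FORCE, THR', 'foo', 'bar', 'baz']
```

Everything else about the pattern is unchanged, so a trailing comma still yields
an empty field and whitespace around a comma still stays in the adjacent field.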

> ...
> (Note to self: try not to use regexen in Python unless you need them.
> Nobody will understand you.)

Na, it's really that regexps are so appallingly brittle that we *pretend*
not to understand you -- it's for your own good, you know <wink>.
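
For comparison, a regexp-free route:  Python releases newer than this post ship
a csv module in the standard library, which pins down its own answers to most
of the questions above.  A minimal sketch, assuming the module's default
"excel" dialect suits the data (an assumption, since the format in question was
never specified):

```python
import csv

line = '2911.02,"ROBBERY - FORCE, THR",foo,bar,baz,, the ,end'

# csv.reader accepts any iterable of lines, so a one-element list is enough.
# Assumption: the default dialect (comma delimiter, double-quote quoting).
fields = next(csv.reader([line]))
print(fields)
# ['2911.02', 'ROBBERY - FORCE, THR', 'foo', 'bar', 'baz', '', ' the ', 'end']
```

With the defaults, the quotes are stripped from the quoted field, and the
whitespace around "the" is kept, just as with the regexp.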

paternalistically y'rs  - tim





