crimes in Python
tim_one at email.msn.com
Thu Mar 9 04:53:02 CET 2000
> Well, the lines are things like this:
> 2911.02,"ROBBERY - FORCE, THR",foo,bar,baz
> And I want something like ['2911.02', 'ROBBERY - FORCE, THR', 'foo',
> 'bar', 'baz'] as a result. Note that the comma in the robbery field
> doesn't split the field.
The problem with these kinds of formats is that nobody defines them
carefully enough to know what to do. For example, what if there's a leading
comma? A trailing comma? Whitespace before or after commas? Can a record
spill across lines? Is there an escape convention to allow quotes embedded
*within* "..." strings? And so on. It's just vague. Try this on for size:
    import re

    csvre = re.compile(r"""
        (?: ^ | ,)      # start of string or field separator (comma)
        (  " [^"]* "    # simple double-quote delimited
        |  [^,]*        # or sequence of non-commas
        )
        """, re.VERBOSE)

    test = '2911.02,"ROBBERY - FORCE, THR",foo,bar,baz,, the ,end'
    print(re.findall(csvre, test))

    ['2911.02', '"ROBBERY - FORCE, THR"', 'foo', 'bar',
     'baz', '', ' the ', 'end']
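The output above keeps the double quotes around the quoted field, while the original question asked for them to be dropped. One possible follow-up step is sketched here. The helper name `split_fields` is hypothetical and not part of the post, and the sketch assumes a quoted field never contains an embedded quote.

```python
import re

# Same pattern as in the post, written out in full.
csvre = re.compile(r"""
    (?: ^ | ,)      # start of string or field separator (comma)
    (  " [^"]* "    # simple double-quote delimited
    |  [^,]*        # or sequence of non-commas
    )
    """, re.VERBOSE)


def split_fields(line):
    """Hypothetical helper: split one line, then drop enclosing quotes."""
    fields = []
    for field in csvre.findall(line):
        # Strip the quotes only when the field is fully wrapped in them.
        if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
            field = field[1:-1]
        fields.append(field)
    return fields


print(split_fields('2911.02,"ROBBERY - FORCE, THR",foo,bar,baz'))
# ['2911.02', 'ROBBERY - FORCE, THR', 'foo', 'bar', 'baz']
```

Stripping after the match keeps the regexp itself unchanged, so whatever fiddling the pattern needs for a particular file can happen separately.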
This guy's answers to the questions above are: creates an empty field;
creates an empty field; included as part of the adjacent field; more yes
than no, but the full truth is subtle; no. Fiddle accordingly.
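Those five answers can be checked by feeding the pattern one small probe string per question. This is only a sketch: the probe strings are made up for illustration, and the leading-comma result assumes Python 3.7 or later, where `findall` can report a non-empty match starting right after an empty one.

```python
import re

# Same pattern as in the post, written out in full.
csvre = re.compile(r"""
    (?: ^ | ,)      # start of string or field separator (comma)
    (  " [^"]* "    # simple double-quote delimited
    |  [^,]*        # or sequence of non-commas
    )
    """, re.VERBOSE)

# (question, made-up probe string), one pair per question in the post.
probes = [
    ('leading comma',       ',a'),                # ['', 'a'] on Python 3.7+
    ('trailing comma',      'a,'),                # ['a', '']
    ('whitespace',          'a , b'),             # ['a ', ' b']
    ('record across lines', 'a,"b\nc",d'),        # ['a', '"b\nc"', 'd']
    ('embedded quotes',     'a,"say ""hi""",b'),  # ['a', '"say "', 'b']
]

for question, text in probes:
    print('%-20s %r -> %r' % (question, text, csvre.findall(text)))
```

The last probe shows what the "no" costs: the quoted match stops at the first closing quote, and the text between that quote and the next comma is dropped without any error.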
> (Note to self: try not to use regexen in Python unless you need them.
> Nobody will understand you.)
Na, it's really that regexps are so appallingly brittle that we *pretend*
not to understand you -- it's for your own good, you know <wink>.
paternalistically y'rs - tim