crimes in Python
kragen at dnaco.net
Thu Mar 9 05:25:23 CET 2000
In article <000501bf897a$f7a71180$0d2d153f at tim>,
Tim Peters <tim_one at email.msn.com> wrote:
>> Well, the lines are things like this:
>> 2911.02,"ROBBERY - FORCE, THR",foo,bar,baz
>> And I want something like ['2911.02', 'ROBBERY - FORCE, THR', 'foo',
>> 'bar', 'baz'] as a result. Note that the comma in the robbery field
>> doesn't split the field.
>The problem with these kinds of formats is that nobody defines them
>carefully enough to know what to do. For example, what if there's a leading
>comma? A trailing comma? Whitespace before or after commas? Can a record
>spill across lines? Is there an escape convention to allow quotes embedded
>*within* "..." strings? And so on. It's just vague. Try this on for size:
Sure, it's definitely vague. The answers, for this data, are:
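- leading comma: creates a new field
- trailing comma: doesn't matter
- whitespace before or after commas: doesn't matter
- records spilling across lines: it could, but fortunately, it doesn't
- escape convention for embedded quotes: none appears in the file.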
Of course, this is output from Excel 97, so I'm sure you could look up
>csvre = re.compile(r"""
> (?: ^ | ,) # start of string or field separator (comma)
> ( " [^"]* " # simple double-quote delimited
> | [^,]* # or sequence of non-commas
This differs from my original expression in the following ways:
- it matches the comma before each field instead of after it
- it doesn't allow x"y,z"w to be treated as one field
- it probably runs lots faster because it uses * instead of (?:|)*
I think I like it better.
>test = '2911.02,"ROBBERY - FORCE, THR",foo,bar,baz,, the ,end'
>print re.findall(csvre, test)
>['2911.02', '"ROBBERY - FORCE, THR"', 'foo', 'bar',
> 'baz', '', ' the ', 'end']
>This guy's answers to the questions above are: creates an empty field;
>creates an empty field; included as part of the adjacent field; more yes
>than no, but the full truth is subtle; no. Fiddle accordingly.
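For anyone who wants to try this, here is a minimal sketch of the quoted pattern as it might run under Python 3. The closing parenthesis and the re.VERBOSE flag are assumptions, since the pattern is cut off in the quote above. split_fields is a hypothetical helper that also drops the enclosing double quotes, which the original findall call leaves in place.

```python
import re

# Assumption: the closing ")" and the re.VERBOSE flag are reconstructed;
# the quoted pattern is truncated before them.
csvre = re.compile(r'''
    (?: ^ | ,)        # start of string or field separator (comma)
    (  " [^"]* "      # simple double-quote delimited
    |  [^,]*          # or sequence of non-commas
    )
''', re.VERBOSE)

def split_fields(line):
    """Hypothetical helper: split one line and strip enclosing quotes."""
    fields = []
    for field in csvre.findall(line):
        # Only strip when the field is wrapped in a matching pair of quotes.
        if len(field) >= 2 and field[0] == field[-1] == '"':
            field = field[1:-1]
        fields.append(field)
    return fields

test = '2911.02,"ROBBERY - FORCE, THR",foo,bar,baz,, the ,end'
print(split_fields(test))
# ['2911.02', 'ROBBERY - FORCE, THR', 'foo', 'bar', 'baz', '', ' the ', 'end']
```

Note that x"y,z"w comes back as the two fields x"y and z"w, which is the second difference listed above.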
>> (Note to self: try not to use regexen in Python unless you need them.
>> Nobody will understand you.)
>Na, it's really that regexps are so appallingly brittle that we *pretend*
>not to understand you -- it's for your own good, you know <wink>.
>paternalistically y'rs - tim
Thank you, Python Papa. :)
It would be nice to make this a little less brittle --- e.g. by
complaining if the RE didn't match every character in the string. Is
there a way to do this, other than writing the corresponding state
machine in Python?
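
One possible approach, sketched here for Python 3.7 or later rather than offered as the definitive answer: keep the RE, walk the matches with finditer, and insist that each match begins where the previous one ended and that the last one ends at the end of the string. split_strict is a hypothetical name, and the pattern's closing parenthesis and re.VERBOSE flag are assumed, since the quoted pattern is cut off.

```python
import re

# Assumption: closing ")" and re.VERBOSE reconstructed from the truncated quote.
csvre = re.compile(r'''
    (?: ^ | ,)        # start of string or field separator (comma)
    (  " [^"]* "      # simple double-quote delimited
    |  [^,]*          # or sequence of non-commas
    )
''', re.VERBOSE)

def split_strict(line):
    """Hypothetical helper: split a line, complaining about skipped text."""
    fields = []
    expected = 0  # offset where the next match has to start
    for m in csvre.finditer(line):
        if m.start() != expected:
            # finditer searched past some characters to find this match.
            raise ValueError("unparsed text %r at offset %d"
                             % (line[expected:m.start()], expected))
        fields.append(m.group(1))
        expected = m.end()
    if expected != len(line):
        # The last match stopped short of the end of the string.
        raise ValueError("unparsed text %r at offset %d"
                         % (line[expected:], expected))
    return fields

print(split_strict('2911.02,"ROBBERY - FORCE, THR",foo,bar,baz'))
```

With this check, a line such as "abc"def,ghi raises ValueError because def is never matched, whereas a plain findall would quietly return ['"abc"', 'ghi']. It does not catch an unterminated quote, which the second alternative still accepts as ordinary non-comma text.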
<kragen at pobox.com> Kragen Sitaker <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08. Hurrah!
The power didn't go out on 2000-01-01 either. :)