crimes in Python

Kragen Sitaker kragen at dnaco.net
Wed Mar 8 23:25:23 EST 2000


In article <000501bf897a$f7a71180$0d2d153f at tim>,
Tim Peters <tim_one at email.msn.com> wrote:
>[Kragen Sitaker]
>> ...
>> Well, the lines are things like this:
>>
>> 2911.02,"ROBBERY - FORCE, THR",foo,bar,baz
>>
>> And I want something like ['2911.02', 'ROBBERY - FORCE, THR', 'foo',
>> 'bar', 'baz'] as a result.  Note that the comma in the robbery field
>> doesn't split the field.
>
>The problem with these kinds of formats is that nobody defines them
>carefully enough to know what to do.  For example, what if there's a leading
>comma?  A trailing comma?  Whitespace before or after commas?  Can a record
>spill across lines?  Is there an escape convention to allow quotes embedded
>*within* "..." strings?  And so on.  It's just vague.  Try this on for size:

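Sure, it's definitely vague.  The answers, for this data, are:
- leading comma: creates a new field
- trailing comma: doesn't matter
- whitespace around commas: doesn't matter
- records spilling across lines: it could, but fortunately, it doesn't
- escape convention for embedded quotes: none appears in the file.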

Of course, this is output from Excel 97, so I'm sure you could look up
the answers.

>import re
>
>csvre = re.compile(r"""
>    (?:  ^ | ,)      # start of string or field separator (comma)
>    (   " [^"]* "    # simple double-quote delimited
>    |   [^,]*        # or sequence of non-commas
>    )
>""", re.VERBOSE)

This differs from my original expression in the following ways:
- it matches the comma before each field instead of after it
- it doesn't allow x"y,z"w to be treated as one field
- it probably runs a lot faster because it repeats a character class,
  []*, instead of an alternation, (?:|)*

I think I like it better.
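
A quick sketch of the second difference, in Python 3 syntax.  The
pattern is the one quoted above; the sample line is made up for
illustration and does not come from the crime data:

```python
import re

# The expression quoted above, unchanged.
csvre = re.compile(r"""
    (?:  ^ | ,)      # start of string or field separator (comma)
    (   " [^"]* "    # simple double-quote delimited
    |   [^,]*        # or sequence of non-commas
    )
""", re.VERBOSE)

# Hypothetical input: a quote that opens in the middle of a field.
sample = 'x"y,z"w,foo'

# The quoted alternative only applies at the start of a field, so the
# comma inside the quotes splits the field here.
print(csvre.findall(sample))   # ['x"y', 'z"w', 'foo']
```

So x"y,z"w comes back as two fields, x"y and z"w, rather than one.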

>test = '2911.02,"ROBBERY - FORCE, THR",foo,bar,baz,, the ,end'
>print(re.findall(csvre, test))
>
>Which prints
>
>['2911.02', '"ROBBERY - FORCE, THR"', 'foo', 'bar',
> 'baz', '', ' the ', 'end']
>
>This guy's answers to the questions above are:  creates an empty field;
>creates an empty field; included as part of the adjacent field; more yes
>than no, but the full truth is subtle; no.  Fiddle accordingly.

Thanks!
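
The quoted field still carries its double quotes in that output, while
the result wanted at the top of this message has them removed.  One
possible bit of fiddling, as a sketch in Python 3 syntax (unquote is a
hypothetical helper name, not part of Tim's code):

```python
import re

csvre = re.compile(r"""
    (?:  ^ | ,)      # start of string or field separator (comma)
    (   " [^"]* "    # simple double-quote delimited
    |   [^,]*        # or sequence of non-commas
    )
""", re.VERBOSE)


def unquote(field):
    """Hypothetical helper: drop one pair of surrounding double quotes."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1]
    return field


line = '2911.02,"ROBBERY - FORCE, THR",foo,bar,baz'
fields = [unquote(f) for f in csvre.findall(line)]
print(fields)   # ['2911.02', 'ROBBERY - FORCE, THR', 'foo', 'bar', 'baz']
```

This assumes, as in this file, that there is no escape convention for
quotes inside a quoted field.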

>> (Note to self: try not to use regexen in Python unless you need them.
>> Nobody will understand you.)
>
>Na, it's really that regexps are so appallingly brittle that we *pretend*
>not to understand you -- it's for your own good, you know <wink>.
>
>paternalistically y'rs  - tim

Thank you, Python Papa.  :)

It would be nice to make this a little less brittle --- e.g. by
complaining if the RE didn't match every character in the string.  Is
there a way to do this, other than writing the corresponding state
machine in Python?
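
One possible approach, sketched here in Python 3 syntax and not
necessarily the best one: walk the matches with finditer and check that
each match starts exactly where the previous one ended, and that the
last one ends at the end of the string.  split_strict is a made-up
name, and the failing input is a hypothetical example:

```python
import re

csvre = re.compile(r"""
    (?:  ^ | ,)      # start of string or field separator (comma)
    (   " [^"]* "    # simple double-quote delimited
    |   [^,]*        # or sequence of non-commas
    )
""", re.VERBOSE)


def split_strict(line):
    """Hypothetical helper: split like findall, but raise ValueError
    if the matches leave any character of the line uncovered."""
    fields = []
    pos = 0  # index just past the previous match
    for m in csvre.finditer(line):
        if m.start() != pos:
            raise ValueError(
                f"unparsed text at column {pos}: {line[pos:m.start()]!r}")
        fields.append(m.group(1))
        pos = m.end()
    if pos != len(line):
        raise ValueError(f"unparsed text at column {pos}: {line[pos:]!r}")
    return fields


print(split_strict('2911.02,"ROBBERY - FORCE, THR",foo,bar,baz'))

try:
    split_strict('"a"b,c')   # the b after the closing quote is skipped
except ValueError as err:
    print("rejected:", err)
```

This only catches characters the expression skips over.  An
unterminated quote, for example, still matches as a sequence of
non-commas and goes through without complaint.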
-- 
<kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either.  :)


