[Python-Dev] csv module TODO list

Andrew McNamara andrewm at object-craft.com.au
Wed Jan 5 11:33:05 CET 2005


>> Yep, although that means we wear the cost of decoding and encoding for
>> all 8 bit input.
>
>Right, but it makes the code very clean and straightforward.

I agree it makes for a very clean solution, and 99% of the time I'd
choose that option.

>Again, it depends on what you need. If performance is critical
>then you probably need a C version written using the same trick
>as _sre.c...
>
>> What does the _sre.c code do?
>
>It comes in two versions: one for 8-bit, the other for Unicode.

That's what I thought. I think the motivations here are similar to those
that drove the _sre developers.

>> We are routinely dealing with multi-gigabyte csv files - which is why the
>> original 2001 vintage csv module was written as a C state machine. 
>
>I see, but are you sure that the typical Python user will have
>the same requirements to make it worth the effort (and
>complexity) ?

This is open source, so I scratch my own itch (and that of my employers) - 
we need fast csv parsing more than we need unicode... 8-)

Okay, assuming we go the "produce two versions via evil macro tricks"
path, it's still not quite the same situation as _sre.c, which only has
to deal with the internal unicode representation.

One way to approach this would be to add an "encoding" keyword argument
to the readers and writers. If given, the parser would decode the input
stream to the internal representation before passing it through the
unicode state machine, which would yield tuples of unicode objects.
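
A rough sketch of that idea, written against present-day Python (where
str is already the unicode type) rather than the C module under
discussion. The name decoding_reader is hypothetical and not part of
the csv module; it only illustrates "decode first, then parse":

```python
import csv


def decoding_reader(byte_lines, encoding, **fmtparams):
    """Hypothetical helper: decode each input line, then parse it.

    byte_lines is assumed to be an iterable of bytes objects, one per
    line; encoding is the name of the codec used to decode them.
    Rows come back as tuples of str (i.e. unicode) objects.
    """
    # Decode lazily so large inputs are never held in memory at once.
    text_lines = (line.decode(encoding) for line in byte_lines)
    for row in csv.reader(text_lines, **fmtparams):
        yield tuple(row)


if __name__ == "__main__":
    data = [b"name,qty\r\n", "caf\u00e9,1\r\n".encode("latin-1")]
    for row in decoding_reader(data, "latin-1"):
        print(row)
```

Every row pays for a decode step here, which is exactly the cost
mentioned at the top of this message.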

That leaves us with a bit of a problem where the source is already unicode
(e.g., a list of unicode strings)... hmm.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

