[Csv] Module question...

Kevin Altis altis at semi-retired.com
Thu Jan 30 09:54:16 CET 2003


> From: Andrew McNamara
>
> >> The way we've spec'ed it, the module only deals with file objects. I
> >> wonder if there's any need to deal with strings, rather than files?
>
> BTW, I'm asking this because it's something that will come back to haunt
> us if we get it wrong - it's something we need to make the right call on.

Agreed, in fact I'm now reconsidering my position.

> >One other possibility would be for the parser to only deal with one row
> >at a time, leaving it up to the user code to feed the parser the row
> >strings. But given the various possible line endings for a row of data
> >and the fact that a column of a row may contain a line ending, not to
> >mention all the other escape character issues we've discussed, this
> >would be error-prone.
>
> This is the way the Object Craft module has worked - it works well enough,
> and the universal end-of-line stuff in 2.3 makes it more seamless. Not
> saying I'm wedded to this scheme, but I'd just like to have clear why
> we've chosen one over the other.

I'm tempted to agree that your original way might be better, but I haven't
caught up on all of the discussion from the last couple of days. Skip and
Cliff can probably argue effectively against doing it that way if they
really want to.
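
For concreteness, here's a toy sketch of the row-at-a-time style (the
names and the quote-counting trick are made up for illustration, not the
actual Object Craft API); it shows why the caller can't assume one line
equals one record:

    class FeedParser:
        """Toy row-at-a-time parser: the caller feeds lines, and feed()
        returns a completed row, or None while a quoted field is still
        open across a line ending."""

        def __init__(self):
            self._pending = ''

        def feed(self, line):
            self._pending += line
            # An odd number of quote characters means a quoted field is
            # still open, so the record continues on the next line.
            if self._pending.count('"') % 2:
                return None
            record, self._pending = self._pending, ''
            return self._split(record)

        def _split(self, record):
            # Naive splitter: enough to show the interface, not the
            # escaping rules discussed elsewhere in this thread.
            fields, field, quoted = [], '', False
            for ch in record.rstrip('\r\n'):
                if ch == '"':
                    quoted = not quoted
                elif ch == ',' and not quoted:
                    fields.append(field)
                    field = ''
                else:
                    field += ch
            fields.append(field)
            return fields

    parser = FeedParser()
    for line in ['a,"multi\n', 'line",b\n']:
        row = parser.feed(line)
        if row is not None:
            print(row)      # -> ['a', 'multi\nline', 'b']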

> I'm trying to think of an example where operating on a file-like object
> would be too restricting, and I can't - oh, here's one: what if you
> wanted to do some pre-processing on the data (say it was uuencoded)?

That seems to be stretching things a bit, but even then wouldn't you simply
pass the uuencoded file-like object to uu.decode and then pass the out_file
file-like object to the parser? I haven't used uu myself, so maybe that
wouldn't work. Regardless, the csv module should be focused on one task.
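
For what it's worth, a minimal sketch of that hand-off (using the stdlib
uu module, with csv.reader standing in for whatever parser interface we
settle on):

    import csv
    import uu
    from io import BytesIO, TextIOWrapper

    def rows_from_uuencoded(path):
        # Decode the uuencoded input into an in-memory buffer...
        decoded = BytesIO()
        with open(path, 'rb') as encoded:
            uu.decode(encoded, decoded)
        decoded.seek(0)     # ...rewind the buffer...
        # ...and hand the decoded file-like object straight to the parser.
        return list(csv.reader(TextIOWrapper(decoded, encoding='utf-8',
                                             newline='')))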

> >The solution was to simply accept a file-like object and let the parser
> >do the interpretation of a record. By having the parser present an
> >iterable interface, the user code still gets the convenience of
> >processing per row if needed, or, if no processing is desired, a result
> >list can easily be obtained.
> >
> >This should provide the most flexibility while still being easy to use.
>
> Should the object just be defined as an iterable, and leave closing,
> etc, up to the user of the module? One downside of this is you can't
> rewind an iterator, so things like the sniffer would be SOL. We can't
> ensure that the passed file is rewindable either. Hmmm.

Given a file-like object, you might not be able to rewind anyway. This might
be another argument for just parsing line by line, but does that make using
the module too complex and error-prone?
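
One way around the rewind problem, at least for the sniffer, is to buffer
the sample lines and replay them ahead of the unread remainder; a sketch,
with csv.Sniffer and csv.reader standing in for our eventual interface:

    import csv
    import itertools

    def sniff_and_read(stream, sample_lines=20):
        # Buffer enough lines for the sniffer to inspect...
        sample = [line for _, line in zip(range(sample_lines), stream)]
        dialect = csv.Sniffer().sniff(''.join(sample))
        # ...then replay the buffered lines before the rest of the
        # stream, so nothing is lost even though it can't rewind.
        return csv.reader(itertools.chain(sample, stream), dialect)

This only works because the parser accepts any iterator of lines, not
just real file objects, which is itself an argument for the looser
interface.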

We probably need to provide some use-case examples. Putting the whole
operation in a try/finally block, with the file close in the finally
clause, is probably the safe way to do this type of operation.
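
A use-case sketch of that pattern (again with csv.reader as the
stand-in):

    import csv

    f = open('data.csv', newline='')
    try:
        for row in csv.reader(f):
            print(row)
    finally:
        f.close()   # runs even if the parser raises mid-record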

In the PEP we need to make clear the benefits of the csv module over a
user simply trying to use split(',') and the like, which I think Skip has
already done to a certain extent. We are also trying to address export,
which is actually quite important. If people try to export with only a
simplistic understanding of the edge cases, they can end up with unusable
csv files.
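
A short demonstration might be all the PEP needs to make the export point
(csv.writer standing in for our writer):

    import csv
    import sys

    row = ['Smith, John', 'said "hi"', 'line one\nline two']

    # Naive export: the embedded comma, quote, and newline silently
    # corrupt the record structure.
    print(','.join(row))

    # Proper export: the writer quotes and escapes per the chosen
    # dialect, so the row survives a round trip.
    csv.writer(sys.stdout).writerow(row)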

This is the same kind of thing you see with XML, where people start
writing out <tag>data</tag> or whatever, thinking that is all there is to
it, and then end up with something that isn't really XML. I wouldn't be
surprised if there is more invalid XML out there than valid.

In our case I think we are identifying some pretty clearly defined
dialects of csv; if you use those, you are going to be in good shape. We
will also be able to tell someone whether a file is in fact well-formed,
and/or throw an exception if it doesn't match the chosen dialect, which
again seems simple, but is a pretty big deal.
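
That check could be as simple as parsing the whole file in a strict mode
and letting any mismatch surface as an exception; a sketch, with
csv.reader's strict flag and csv.Error standing in for whatever we end up
exposing:

    import csv

    def is_well_formed(path, dialect='excel'):
        try:
            with open(path, newline='') as f:
                for row in csv.reader(f, dialect=dialect, strict=True):
                    pass    # parsing the whole file is the check
            return True
        except csv.Error as err:
            print('not well-formed:', err)
            return False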

Ugh, I need sleep, any stupidity above is just me being tired ;-)

ka


