First Cut at CSV PEP

Cliff Wells LogiplexSoftware at earthlink.net
Tue Jan 28 22:45:21 CET 2003


On Mon, 2003-01-27 at 21:50, Kevin Altis wrote:
> > From: Dave Cole
> >
> > >>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:
> >
> > I only have one issue with the PEP as it stands.  It is still aiming
> > too low.  One of the things that we support in our parser is the
> > ability to handle CSV without quote characters.
> >
> >         field1,field2,field3\, field3,field4
> 
> Excel certainly can't handle that, nor do I think Access can. If a field
> contains a comma, then the field must be quoted. Now, that isn't to say that
> we shouldn't be able to support the idea of escaped characters, but when
> exporting if you do want something that a tool like Excel could read, you
> would need to generate an exception if quoting wasn't specified. The same
> would probably apply for embedded newlines in a field without quoting.
> 
> Being able to generate exceptions on import and export operations could be
> one of the big benefits of this module. You won't accidentally export
> something that someone on the other end won't be able to use and you'll know
> on import that you have garbage before you try and use it. For example, when
> I first started trying to import Access data that was tab-separated, I
> didn't realize there were embedded newlines until much later, at which point
> I was able to go back and export as CSV with quote delimiters and the data
> became usable.

Perhaps a "strict" option?  I'm not sure this is necessary though.  It
seems that if a *programmer* specifies dialect="excel2000" and then
changes some other default, that's his problem.  The danger of too much
hand-holding is added complexity and arbitrary limitations.
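Just to make the idea concrete, a strict dialect could simply refuse to
write a field it can't represent.  This is only a rough sketch for the
sake of discussion -- the names (CSVError, Dialect, write_row, strict)
are invented, nothing here is settled by the PEP:

class CSVError(Exception):
    pass

class Dialect:
    delimiter = ','
    quotechar = '"'
    strict = False          # complain instead of writing something odd

class Excel2000(Dialect):
    strict = True

def write_row(fileobj, row, dialect=Excel2000):
    out = []
    for field in row:
        field = str(field)
        needs_quoting = (dialect.delimiter in field
                         or '\n' in field or '\r' in field
                         or (dialect.quotechar and dialect.quotechar in field))
        if needs_quoting:
            if dialect.quotechar is None:
                if dialect.strict:
                    raise CSVError("field %r needs quoting but no "
                                   "quotechar is set" % field)
                # non-strict: write it anyway and let the consumer cope
            else:
                field = '%s%s%s' % (dialect.quotechar,
                                    field.replace(dialect.quotechar,
                                                  dialect.quotechar * 2),
                                    dialect.quotechar)
        out.append(field)
    fileobj.write(dialect.delimiter.join(out) + '\r\n')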

> > I think that we need some way to handle a potentially different set of
> > options on each dialect.
> 
> I'm not real comfortable with the dialect idea, it doesn't seem to add any
> value over simply specifying a separator and delimiter.

Except that it gives a programmer a way to be certain that, if he does
nothing else, the file will be compatible with the specified dialect.
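To illustrate what I mean (a hypothetical registry, not a proposed API):
the dialect is just a named bundle of defaults, and the caller only
loses the compatibility guarantee by explicitly overriding something.

DIALECTS = {
    "excel": dict(delimiter=',', quotechar='"', lineterminator='\r\n'),
    "excel-tab": dict(delimiter='\t', quotechar='"', lineterminator='\r\n'),
}

def get_options(dialect, **overrides):
    opts = dict(DIALECTS[dialect])   # start from the dialect's defaults
    opts.update(overrides)           # explicit keyword arguments win
    return opts

# get_options("excel") gives an Excel-compatible configuration with no
# further effort; get_options("excel", quotechar=None) is the programmer
# knowingly stepping outside the dialect.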

> We aren't dealing with encodings, so anything other than 7-bit ASCII unless
> specified as a delimiter or separator would be undefined, yes? The only
> thing that really matters is the delimiter and separator and then how
> quoting is handled of either of those characters and embedded returns and
> newlines within a field. Correct me if I'm wrong, but I don't think the MS
> CSV formats can deal with embedded CR or LF unless fields are quoted and
> that will be done with a " character.

But then MS isn't the only potential target, just our initial (and
primary) one.  foobar87 may allow export of escaped newlines and put an
extraneous space after every delimiter, and we don't want someone to
have to write another CSV importer to deal with that.
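For illustration only, here is how such a format might be described --
an escape character instead of quoting, plus a stray space after each
delimiter -- and handled by the same machinery rather than by a new
importer.  The parameter names and the little parser are invented:

FOOBAR87 = dict(
    delimiter=',',
    quotechar=None,                    # this application never quotes
    escapechar='\\',                   # "field3\, field3" style escaping
    skip_space_after_delimiter=True,
)

def split_record(line, opts):
    fields, field, i = [], [], 0
    while i < len(line):
        c = line[i]
        if opts.get("escapechar") and c == opts["escapechar"] and i + 1 < len(line):
            field.append(line[i + 1])  # take the next character literally
            i += 2
            continue
        if c == opts["delimiter"]:
            fields.append(''.join(field))
            field = []
            i += 1
            if opts.get("skip_space_after_delimiter") and line[i:i+1] == ' ':
                i += 1
            continue
        field.append(c)
        i += 1
    fields.append(''.join(field))
    return fields

# split_record(r"field1,field2,field3\, field3,field4", FOOBAR87)
# -> ['field1', 'field2', 'field3, field3', 'field4']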

> Now with Access, you are actually given more control. See the attached
> screenshot. Ignoring everything except the top File format section you
> have:
> Delimited or Fixed Width. If Delimited you have a Field Delimiter choice of
> comma, semi-colon, tab and space or a user-specified character and the text
> qualifier can be double-quote, apostrophe, or None.

And this only deals with the variations the *user* is allowed to make. 
Applications themselves may introduce variations that we need to have
the flexibility to deal with.


> The universal readlines support in Python 2.3 may impact the use of a file
> reader/writer when processing different text files, but would returns or
> newlines within a field be impacted? Should the PEP and API specify that the
> record delimiter can be either CR, LF, or CR/LF, but use of those characters
> inside a field requires the field to be quoted or an exception will be
> thrown?

The idea of raising an exception brings up an interesting problem that I
had to deal with in DSV.  I've run across files with missing fields, so
I just provided a callback and let the programmer decide how to deal
with them.  Missing fields can be the result of corrupted data, but it's
also possible for an application to export only the fields that actually
contain data, for instance:

1,2,3,4,5
1,2,3
1,2,3,4

This could very well be a valid CSV file.  I'm not aware of any
requirement that all rows be the same length.  We'll need fairly
flexible error handling that allows this sort of thing when it's
legitimate, but raises an exception when it indicates corrupt or invalid
data.  In DSV I allowed custom error-handlers so the programmer could
indicate whether to process the line as normal, discard it, etc.
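Roughly what I mean, sketched with invented names (this isn't DSV's
actual API, just the shape of the idea):

def read_rows(lines, expected_fields, on_bad_row=None):
    for lineno, line in enumerate(lines):
        row = line.rstrip('\r\n').split(',')
        if len(row) != expected_fields:
            if on_bad_row is None:
                raise ValueError("row %d has %d fields, expected %d"
                                 % (lineno + 1, len(row), expected_fields))
            row = on_bad_row(lineno + 1, row)
            if row is None:              # handler chose to discard the row
                continue
        yield row

# A handler that treats short rows as valid and pads them out to the five
# fields expected in the example above, instead of raising:
def pad_row(lineno, row):
    return row + [''] * (5 - len(row))

# list(read_rows(["1,2,3,4,5", "1,2,3", "1,2,3,4"], 5, pad_row))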


> ka
-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308
