First Cut at CSV PEP

Dave Cole djc at object-craft.com.au
Wed Jan 29 00:28:49 CET 2003


>>>>> "Kevin" == Kevin Altis <altis at semi-retired.com> writes:

>> From: Dave Cole
>> 
>> >>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:
>> 
>> I only have one issue with the PEP as it stands.  It is still
>> aiming too low.  One of the things that we support in our parser is
>> the ability to handle CSV without quote characters.
>> 
>> field1,field2,field3\, field3,field4
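
For illustration, a minimal sketch of reading and writing that line with
the csv module as it later shipped in the standard library, using
quoting=csv.QUOTE_NONE together with an escapechar.  The helper names
parse_escaped and write_escaped are made up for this example.

```python
import csv
import io


def parse_escaped(text, delimiter=",", escapechar="\\"):
    """Parse CSV text that escapes delimiters instead of quoting fields."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quoting=csv.QUOTE_NONE,   # treat quote characters as ordinary data
        escapechar=escapechar,    # backslash protects the next character
    )
    return list(reader)


def write_escaped(rows, delimiter=",", escapechar="\\"):
    """Write rows without quotes, escaping any embedded delimiter."""
    buf = io.StringIO(newline="")
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        quoting=csv.QUOTE_NONE,
        escapechar=escapechar,
    )
    writer.writerows(rows)
    return buf.getvalue()


rows = parse_escaped("field1,field2,field3\\, field3,field4\n")
# rows == [['field1', 'field2', 'field3, field3', 'field4']]
```

The escaped comma ends up inside the third field rather than splitting it.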

Kevin> Excel certainly can't handle that, nor do I think Access
Kevin> can. If a field contains a comma, then the field must be
Kevin> quoted. Now, that isn't to say that we shouldn't be able to
Kevin> support the idea of escaped characters, but when exporting if
Kevin> you do want something that a tool like Excel could read, you
Kevin> would need to generate an exception if quoting wasn't
Kevin> specified. The same would probably apply for embedded newlines
Kevin> in a field without quoting.

Kevin> Being able to generate exceptions on import and export
Kevin> operations could be one of the big benefits of this module. You
Kevin> won't accidentally export something that someone on the other
Kevin> end won't be able to use and you'll know on import that you
Kevin> have garbage before you try and use it. For example, when I
Kevin> first started trying to import Access data that was
Kevin> tab-separated, I didn't realize there were embedded newlines
Kevin> until much later, at which point I was able to go back and
Kevin> export as CSV with quote delimiters and the data became
Kevin> usable.
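
As a point of reference, the csv module that eventually landed in the
standard library does refuse this kind of export: with quoting disabled
and no escape character, a field containing the delimiter raises
csv.Error.  A small sketch, where try_export is a hypothetical helper
written only for this example:

```python
import csv
import io


def try_export(row):
    """Return (text, None) on success or (None, error message) on failure."""
    buf = io.StringIO(newline="")
    # No quoting and no escapechar: a delimiter inside a field cannot be
    # represented, so the writer is expected to raise csv.Error.
    writer = csv.writer(buf, quoting=csv.QUOTE_NONE)
    try:
        writer.writerow(row)
    except csv.Error as exc:
        return None, str(exc)
    return buf.getvalue(), None


ok_text, ok_err = try_export(["a", "b"])        # plain fields export fine
bad_text, bad_err = try_export(["a,b", "c"])    # embedded comma is rejected
```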

I suppose that exporting should raise an exception if you specify any
variation on the dialect in the writer function.

    csvwriter = csv.writer(file("newnastiness.csv", "w"),
                           dialect='excel2000', delimiter='"')

That should raise an exception.

This probably shouldn't raise an exception though:

    csvwriter = csv.writer(file("newnastiness.csv", "w"),
                           dialect='excel2000')
    csvwriter.setparams(delimiter='"')
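
A rough sketch of those two behaviours, written against the csv module
as it later shipped.  StrictWriter and its setparams method are
hypothetical, they are not part of the module, and the built-in 'excel'
dialect stands in for 'excel2000', which is only a name used in this
thread.

```python
import csv
import io


class StrictWriter:
    """Hypothetical wrapper: a named dialect may not be varied at creation."""

    def __init__(self, fileobj, dialect="excel", **overrides):
        if overrides:
            # Mixing a named dialect with overrides is treated as a mistake.
            raise ValueError(
                "dialect %r given with overrides: %s"
                % (dialect, ", ".join(sorted(overrides)))
            )
        self._fileobj = fileobj
        self._dialect = dialect
        self._params = {}
        self._writer = csv.writer(fileobj, dialect=dialect)

    def setparams(self, **params):
        """Deliberately depart from the dialect after construction."""
        self._params.update(params)
        self._writer = csv.writer(
            self._fileobj, dialect=self._dialect, **self._params
        )

    def writerow(self, row):
        return self._writer.writerow(row)


buf = io.StringIO(newline="")
csvwriter = StrictWriter(buf, dialect="excel")
csvwriter.setparams(delimiter=";")      # explicit change, allowed
csvwriter.writerow(["a", "b"])

try:
    StrictWriter(io.StringIO(newline=""), dialect="excel", delimiter=";")
    rejected = False
except ValueError:
    rejected = True
```

The point of the split is that an override passed alongside a dialect
name looks accidental, while a separate setparams call is clearly
intentional.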

>> I think that we need some way to handle a potentially different set
>> of options on each dialect.

Kevin> I'm not real comfortable with the dialect idea, it doesn't seem
Kevin> to add any value over simply specifying a separator and
Kevin> delimiter.

It makes things *a lot* easier for module users who are not fully
conversant in the vagaries of CSV.
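
To make that concrete, here is a sketch using the Dialect class and
register_dialect function of the csv module as it later shipped.  The
dialect name "pipes" and the class PipeDialect are invented for this
example.  Someone who knows the format bundles the parameters once, and
everyone else only needs the name.

```python
import csv
import io


class PipeDialect(csv.Dialect):
    """Illustrative dialect: pipe-separated, double-quote quoted."""
    delimiter = "|"
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = "\r\n"
    quoting = csv.QUOTE_MINIMAL


# Register once; users then refer to the whole bundle by name.
csv.register_dialect("pipes", PipeDialect)

buf = io.StringIO(newline="")
csv.writer(buf, dialect="pipes").writerow(["a|b", "c"])
text = buf.getvalue()

rows = list(csv.reader(io.StringIO(text, newline=""), dialect="pipes"))
# rows == [['a|b', 'c']]
```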

Kevin> We aren't dealing with encodings, so anything other than 7-bit
Kevin> ASCII unless specified as a delimiter or separator would be
Kevin> undefined, yes? The only thing that really matters is the
Kevin> delimiter and separator and then how quoting is handled of
Kevin> either of those characters and embedded returns and newlines
Kevin> within a field. Correct me if I'm wrong, but I don't think the
Kevin> MS CSV formats can deal with embedded CR or LF unless fields
Kevin> are quoted and that will be done with a " character.

We are not just trying to deal with MS CSV formats though.

Kevin> "Note: If your workbook contains special font characters such as
Kevin> a copyright symbol (C), and you will be using the converted
Kevin> text file on a computer with a different operating system, save
Kevin> the workbook in the text file format appropriate for that
Kevin> system. For example, if you are using Windows and want to use
Kevin> the text file on a Macintosh computer, save the file in the CSV
Kevin> (Macintosh) format. If you are using a Macintosh computer and
Kevin> want to use the text file on a system running Windows or
Kevin> Windows NT, save the file in the CSV (Windows) format."

Kevin> The CR, CR/LF, and LF line endings probably have something to
Kevin> do with saving in Mac format, but it may also do some 8-bit
Kevin> character translation.

Should we be trying to handle Unicode?  I think we should, since Python
is now Unicode-capable.
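
For what it is worth, a sketch of how this looks in Python 3, where the
csv module works on text strings and the encoding is left to the file
layer.  The sample row and the choice of UTF-8 are assumptions made for
the example.

```python
import csv
import io

# Sample data with non-ASCII characters, including a copyright symbol.
row = ["Acme \u00a9 2003", "caf\u00e9", "\u65e5\u672c\u8a9e"]

# The writer deals only with text; encoding happens separately.
buf = io.StringIO(newline="")
csv.writer(buf).writerow(row)
encoded = buf.getvalue().encode("utf-8")

# Decode the bytes back to text before handing them to the reader.
decoded_rows = list(
    csv.reader(io.StringIO(encoded.decode("utf-8"), newline=""))
)
```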

Kevin> The universal newlines support in Python 2.3 may impact the
Kevin> use of a file reader/writer when processing different text
Kevin> files, but would returns or newlines within a field be
Kevin> impacted? Should the PEP and API specify that the record
Kevin> delimiter can be either CR, LF, or CR/LF, but use of those
Kevin> characters inside a field requires the field to be quoted or an
Kevin> exception will be thrown?

Should we raise an exception or just pass the data through?

If it is not a newline, then it is not a newline.
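
A sketch of the pass-through case, again using the csv module as it
later shipped: a newline inside a quoted field is treated as data and
comes back unchanged, while the record terminator stays separate.  The
roundtrip helper is hypothetical.

```python
import csv
import io


def roundtrip(row):
    """Write one row, then read it back; return (csv_text, parsed_row)."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerow(row)       # default quoting quotes the newline
    text = buf.getvalue()
    parsed = next(csv.reader(io.StringIO(text, newline="")))
    return text, parsed


text, parsed = roundtrip(["a", "line1\nline2"])
# text   == 'a,"line1\nline2"\r\n'
# parsed == ['a', 'line1\nline2']
```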

- Dave

-- 
http://www.object-craft.com.au



