First Cut at CSV PEP

Kevin Altis altis at semi-retired.com
Tue Jan 28 06:50:20 CET 2003


> From: Dave Cole
>
> >>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:
>
> I only have one issue with the PEP as it stands.  It is still aiming
> too low.  One of the things that we support in our parser is the
> ability to handle CSV without quote characters.
>
>         field1,field2,field3\, field3,field4

Excel certainly can't handle that, nor do I think Access can. If a field
contains a comma, then the field must be quoted. That isn't to say we
shouldn't support escaped characters, but if you are exporting something
that a tool like Excel should be able to read, the module would need to
raise an exception when quoting wasn't specified. The same would probably
apply to embedded newlines in a field without quoting.
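
For illustration only, here is a minimal sketch of the three behaviours
discussed above. It uses the csv module as it later shipped in the Python
standard library, which is not necessarily the API being proposed in this
thread. The variable names are made up for the example.

```python
import csv
import io

row = ["field1", "field2", "field3, field3", "field4"]

# No quoting and no escape character: the writer refuses the row,
# because the embedded comma cannot be represented.
strict_out = io.StringIO(newline="")
strict_writer = csv.writer(strict_out, quoting=csv.QUOTE_NONE)
try:
    strict_writer.writerow(row)
    refused = False
except csv.Error:
    refused = True

# No quoting, but a backslash escape character: the comma is escaped,
# giving the unquoted form from Dave's example.
escaped_out = io.StringIO(newline="")
escaped_writer = csv.writer(
    escaped_out, quoting=csv.QUOTE_NONE, escapechar="\\"
)
escaped_writer.writerow(row)
escaped_line = escaped_out.getvalue()

# Default (minimal) quoting: the field containing a comma is quoted,
# which is the form Excel expects.
quoted_out = io.StringIO(newline="")
csv.writer(quoted_out).writerow(row)
quoted_line = quoted_out.getvalue()
```

The first case is the "exception on export" idea: the caller finds out at
write time that the data cannot be represented without quoting or escaping.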

Being able to generate exceptions on import and export operations could be
one of the big benefits of this module. You won't accidentally export
something that someone on the other end won't be able to use and you'll know
on import that you have garbage before you try and use it. For example, when
I first started trying to import Access data that was tab-separated, I
didn't realize there were embedded newlines until much later, at which point
I was able to go back and export as CSV with quote delimiters and the data
became usable.
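
As a rough sketch of what "know on import that you have garbage" could look
like, the helper below rejects any record whose field count is wrong. The
name read_checked, its parameters, and the sample data are all hypothetical;
it is built on the csv module as later released, not on anything specified
in this thread.

```python
import csv
import io

def read_checked(text, expected_fields, delimiter="\t"):
    """Hypothetical helper: parse text, failing on the first bad record."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quoting=csv.QUOTE_NONE,  # tab-separated export with no quoting
    )
    rows = []
    for row in reader:
        if len(row) != expected_fields:
            raise ValueError(
                "record ending at line %d has %d fields, expected %d"
                % (reader.line_num, len(row), expected_fields)
            )
        rows.append(row)
    return rows

# Illustrative data: the second sample has a newline embedded in a field,
# which splits one record into two short ones.
good = "1\tAlice\tfirst note\n2\tBob\tsecond note\n"
bad = "1\tAlice\tfirst line\nsecond line\n2\tBob\tnote\n"

good_rows = read_checked(good, 3)
try:
    read_checked(bad, 3)
    bad_detected = False
except ValueError:
    bad_detected = True
```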

> I think that we need some way to handle a potentially different set of
> options on each dialect.

I'm not real comfortable with the dialect idea; it doesn't seem to add any
value over simply specifying a separator and delimiter.

We aren't dealing with encodings, so anything outside 7-bit ASCII would be
undefined unless it is specified as a delimiter or separator, yes? The only
things that really matter are the delimiter, the separator, and how quoting
is handled when either of those characters, or a return or newline, is
embedded in a field. Correct me if I'm wrong, but I don't think the MS CSV
formats can deal with an embedded CR or LF unless the field is quoted, and
the quoting is done with a " character.
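
To make the quoting rule concrete, here is a small round trip in which one
field contains an embedded CR/LF and another contains a comma. This is only
a sketch using the csv module as it later shipped; the record contents are
invented for the example.

```python
import csv
import io

record = ["id-1", "first line\r\nsecond line", "a, b"]

# Write with default (minimal) quoting: only fields that contain the
# separator, the quote character, or a line break are quoted.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(record)
text = buffer.getvalue()

# Read it back; newline="" keeps the embedded CR/LF untranslated.
rows = list(csv.reader(io.StringIO(text, newline="")))
```

The two physical lines in the output come back as a single record, because
the line break sits inside a quoted field.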

Now with Access, you are actually given more control. See the attached
screenshot. Ignoring everything except the top File format section, you
have:
Delimited or Fixed Width. If Delimited, you have a Field Delimiter choice of
comma, semi-colon, tab, space, or a user-specified character, and the text
qualifier can be double-quote, apostrophe, or None.
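
One way to picture those Access options is as two parameters to a parser.
The sketch below is an assumption, not anything Access or the PEP defines:
parse_delimited and its argument names are hypothetical, and it maps the
options onto the csv module as later released.

```python
import csv
import io

def parse_delimited(text, field_delimiter=",", text_qualifier='"'):
    """Hypothetical helper mirroring the Access export options."""
    options = {"delimiter": field_delimiter}
    if text_qualifier is None:
        # Text qualifier "None": quote characters are ordinary data.
        options["quoting"] = csv.QUOTE_NONE
    else:
        options["quotechar"] = text_qualifier
    return list(csv.reader(io.StringIO(text, newline=""), **options))

# Semi-colon delimiter with apostrophe as the text qualifier.
semicolon_rows = parse_delimited(
    "'a;b';c\n", field_delimiter=";", text_qualifier="'"
)

# Tab delimiter with no text qualifier: the double quotes stay in the data.
tab_rows = parse_delimited(
    'x\t"y"\tz\n', field_delimiter="\t", text_qualifier=None
)
```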

> When you CSV export from Excel, do you have the ability to use a
> delimiter other than comma?  Do you have the ability to change the
> quotechar?

No, but there are a variety of text formats supported.

The Excel 2000 help file for Text file formats:

"Text (Tab-delimited) (*.txt) (Windows)
Text (Macintosh)
Text (OS/2 or MS-DOS)
CSV (comma delimited) (*.csv) (Windows)
CSV (Macintosh)
CSV (OS/2 or MS-DOS)

If you are saving a workbook as a tab-delimited or comma-delimited text file
for use on another operating system, select the appropriate converter to
ensure that tab characters, line breaks, and other characters are
interpreted correctly."

The Excel 2000 help file for CSV:

"CSV (Comma delimited) format
The CSV (Comma delimited) file format saves only the text and values as they
are displayed in cells of the active worksheet. All rows and all characters
in each cell are saved. Columns of data are separated by commas, and each
row of data ends in a carriage return. If a cell contains a comma, the cell
contents are enclosed in double quotation marks.

If cells display formulas instead of formula values, the formulas are
converted as text. All formatting, graphics, objects, and other worksheet
contents are lost.

Note   If your workbook contains special font characters such as a copyright
symbol (C), and you will be using the converted text file on a computer with
a different operating system, save the workbook in the text file format
appropriate for that system. For example, if you are using Windows and want
to use the text file on a Macintosh computer, save the file in the CSV
(Macintosh) format. If you are using a Macintosh computer and want to use
the text file on a system running Windows or Windows NT, save the file in
the CSV (Windows) format."

The CR, CR/LF, and LF line endings probably have something to do with saving
in Mac format, but it may also do some 8-bit character translation.

The universal newlines support in Python 2.3 may affect the use of a file
reader/writer when processing different text files, but would returns or
newlines within a field be affected? Should the PEP and API specify that the
record delimiter can be CR, LF, or CR/LF, but that use of those characters
inside a field requires the field to be quoted or an exception will be
raised?
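
As a sketch of the "any of CR, LF, or CR/LF" idea, the snippet below parses
the same two records with each line ending. It relies on the csv module as
it later shipped, together with io.StringIO opened with newline="" so that
line endings reach the reader untranslated; the helper name parse is made up
for the example.

```python
import csv
import io

def parse(text):
    """Hypothetical helper: parse text without translating line endings."""
    return list(csv.reader(io.StringIO(text, newline="")))

variants = {
    "CR": "a,b\rc,d\r",
    "LF": "a,b\nc,d\n",
    "CRLF": "a,b\r\nc,d\r\n",
}

# Each variant should yield the same two records.
parsed = {name: parse(text) for name, text in variants.items()}
```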

ka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: access_export.png
Type: image/png
Size: 9504 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20030127/7594f034/attachment.png 

