First Cut at CSV PEP

Tue Jan 28 05:20:23 CET 2003

I'm ready to toddle off to bed, so I'm stopping here for tonight.  Attached
is what I've come up with so far in the way of a PEP.  Feel free to flesh
out, rewrite or add new sections.  After a brief amount of cycling, I'll
check it into CVS.

Skip

-------------- next part --------------
PEP: NNN
Title: CSV file API
Version: $Revision: $
Last-Modified: $Date: $
Author: Skip Montanaro <skip at pobox.com>,
        Kevin Altis <altis at semi-retired.com>,
        Cliff Wells <LogiplexSoftware at earthlink.net>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 26-Jan-2003
Python-Version: 2.3
Post-History: 

Abstract
========

The Comma Separated Values (CSV) file format is the most common import
and export format for spreadsheets and databases.  Although many CSV
files are simple to parse, the format is not formally defined by a
stable specification and is subtle enough that parsing lines of a CSV
file with something like ``line.split(",")`` is bound to fail.  This
PEP defines an API for reading and writing CSV files which should make
it possible for programmers to select a CSV module which meets their
requirements.
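
As a small illustration of the failure mode, consider a field that
contains a quoted comma.  The sketch below compares naive splitting with
the ``csv`` module found in current Python releases; that module
postdates this draft and is used here only as an assumption for
comparison, not as part of the proposal.  The sample data is made up.

```python
import csv
import io

# One record whose second field contains a comma inside quotes.
line = '1,"Smith, John",42'

# Naive splitting breaks the quoted field at its embedded comma.
naive_fields = line.split(",")

# A quote-aware parser keeps the quoted field intact.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(naive_fields)
print(parsed_fields)
```

The naive split yields four pieces, two of which still carry stray
quote characters, while the quote-aware parser yields three fields.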

Existing Modules
================

Three widely available modules enable programmers to read and write
CSV files:

- Dave Cole's csv module [1]_

- Cliff Wells's Python-DSV module [2]_

- Laurence Tratt's ASV module [3]_

They have different APIs, making it somewhat difficult for programmers
to switch between them.  More of a problem may be that they interpret
some of the CSV corner cases differently, so even after surmounting
the differences in the module APIs, the programmer has to also deal
with semantic differences between the packages.

Rationale
=========

By defining common APIs for reading and writing CSV files, we make it
easier for programmers to choose an appropriate module to suit their
needs, and make it easier to switch between modules if their needs
change.  This PEP also forms a set of requirements for creation of a
module which will hopefully be incorporated into the Python
distribution.

Module Interface
================

The module supports two basic APIs, one for reading and one for
writing.  The reading interface is::

    reader(fileobj [, dialect='excel2000']
                   [, quotechar='"']
                   [, delimiter=',']
                   [, skipinitialspace=False])

A reader object is an iterable.  Its constructor takes a file-like
object opened for reading as the sole required parameter and accepts
four optional parameters (discussed below).  Readers are typically used as
follows::

    csvreader = csv.reader(file("some.csv"))
    for row in csvreader:
        process(row)

The writing interface is similar::

    writer(fileobj [, dialect='excel2000']
                   [, quotechar='"']
                   [, delimiter=',']
                   [, skipinitialspace=False])

A writer object is a wrapper around a file-like object opened for
writing.  It accepts the same four optional parameters as the reader
constructor.  Writers are typically used as follows::

    csvwriter = csv.writer(file("some.csv", "w"))
    for row in someiterable:
        csvwriter.write(row)
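
The two interfaces are meant to round-trip.  A minimal sketch of that,
assuming the ``csv`` module shipped with current Python releases rather
than the API proposed here: in that module the per-row method is named
``writerow``, not ``write``, and an in-memory ``io.StringIO`` buffer
stands in for the file.  The sample rows are made up.

```python
import csv
import io

# Rows containing a comma and embedded quotes, which force quoting.
rows = [["name", "note"], ["widget", 'said "hi", left']]

# newline="" lets the csv module control line endings itself.
buffer = io.StringIO(newline="")
csvwriter = csv.writer(buffer)
for row in rows:
    csvwriter.writerow(row)

# Rewind and read the same data back through a reader.
buffer.seek(0)
round_tripped = list(csv.reader(buffer))

print(round_tripped)
```

If quoting is handled consistently on both sides, the rows read back
compare equal to the rows written.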

Optional Parameters
-------------------

Both the reader and writer constructors take four optional keyword
parameters:

- dialect is an easy way of specifying a complete set of format
  constraints for a reader or writer.  Most people will know what
  application generated a CSV file or what application will process
  the CSV file they are generating, but not the precise settings
  necessary.  The only dialect defined initially is "excel2000".  The
  dialect parameter is interpreted in a case-insensitive manner.

- quotechar specifies a one-character string to use as the quoting
  character.  It defaults to '"'.

- delimiter specifies a one-character string to use as the field
  separator.  It defaults to ','.

- skipinitialspace specifies how to interpret whitespace which
  immediately follows a delimiter.  It defaults to False, which means
  that whitespace immediately following a delimiter is part of the
  following field.
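
The effect of these parameters can be sketched with the ``csv`` module
from current Python releases, which happens to accept keyword
parameters with the same three names (``quotechar``, ``delimiter`` and
``skipinitialspace``).  Using it is an assumption made for illustration;
the behaviour of the module proposed here may differ in detail.

```python
import csv
import io

spaced = "a, b, c\n"

# Default: whitespace after the delimiter is kept as part of the field.
default_row = next(csv.reader(io.StringIO(spaced)))

# skipinitialspace=True drops whitespace immediately after a delimiter.
stripped_row = next(csv.reader(io.StringIO(spaced), skipinitialspace=True))

# A semicolon-delimited line that uses single quotes for quoting.
custom = "a;'b;c'\n"
custom_row = next(csv.reader(io.StringIO(custom),
                             delimiter=";", quotechar="'"))

print(default_row, stripped_row, custom_row)
```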

When processing a dialect setting and one or more of the other
optional parameters, the dialect parameter is processed first, then
the others are processed.  This makes it easy to choose a dialect,
then override one or more of the settings.  For example, if a CSV file
was generated by Excel 2000 using single quotes as the quote
character, you could create a reader like::

    csvreader = csv.reader(file("some.csv"), dialect="excel2000",
                           quotechar="'")
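
One way to picture the processing order is a lookup of the dialect's
defaults followed by the explicit overrides.  The sketch below is not
taken from any of the existing modules; the ``DIALECTS`` table and the
``resolve_settings`` helper are hypothetical names introduced only for
illustration, and the "excel2000" values are simply the defaults listed
above.

```python
# Hypothetical table of dialect defaults; only "excel2000" is defined.
DIALECTS = {
    "excel2000": {
        "quotechar": '"',
        "delimiter": ",",
        "skipinitialspace": False,
    },
}


def resolve_settings(dialect="excel2000", **overrides):
    """Return the dialect's defaults with keyword overrides applied."""
    try:
        # Dialect names are looked up case-insensitively.
        settings = dict(DIALECTS[dialect.lower()])
    except KeyError:
        raise ValueError("unknown dialect: %r" % (dialect,))
    # The dialect is processed first, then the other parameters.
    for name, value in overrides.items():
        if name not in settings:
            raise TypeError("unexpected parameter: %r" % (name,))
        settings[name] = value
    return settings


settings = resolve_settings("Excel2000", quotechar="'")
print(settings)
```

Copying the dialect's dictionary before applying overrides keeps the
shared defaults untouched between calls.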

Testing
=======

TBD.

Issues
======

- Should a parameter control how consecutive delimiters are
  interpreted?  (My thought is "no".)
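
To make the question concrete, the sketch below contrasts the two
possible readings of consecutive delimiters.  The first uses the ``csv``
module from current Python releases (an assumption, since it postdates
this draft); the second collapses runs of delimiters with a regular
expression and deliberately ignores quoting, so it is only a rough
stand-in for the alternative behaviour.

```python
import csv
import io
import re

line = "a,,c"

# Reading 1: each delimiter ends a field, so ",," encloses an empty field.
empty_kept = next(csv.reader(io.StringIO(line)))

# Reading 2: a run of delimiters counts as one separator (quoting ignored).
collapsed = re.split(",+", line)

print(empty_kept)
print(collapsed)
```

Under the second reading an empty column cannot be represented at all,
which is one argument for not offering such a parameter.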

References
==========

.. [1] csv module, Object Craft
   (http://www.object-craft.com.au/projects/csv) 

.. [2] Python-DSV module, Wells
   (http://sourceforge.net/projects/python-dsv/) 

.. [3] ASV module, Tratt
   (http://tratt.net/laurie/python/asv/)

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: