[Csv] Status

Skip Montanaro skip at pobox.com
Thu Jan 30 03:48:59 CET 2003


It would appear we are converging on dialects as data-only classes
(subclassable but with no methods).  I'll update the PEP.  Many other ideas
have been floating through the list, and while I haven't been deleting the
messages, I haven't been adding them to the PEP either.  Can someone help
with that?

I'd like to get the wording in the PEP to converge on our current thoughts
and announce it on c.l.py and python-dev sometime tomorrow.  I think we will
get a lot of feedback from both camps, hopefully some of it useful. ;-)

Sound like a plan?

I just finished making a pass through the messages I hadn't deleted (and
then saved them to a csv mbox file, since the list still doesn't appear
to be archiving).  Here's what I think we've concluded:

    * Dialects are a set of defaults, probably implemented as classes (which
      allows subclassing, whereas dicts wouldn't), with the default dialect
      named something like csv.dialects.excel, or "excel" if we allow
      string specifiers.  (I think strings work well at the API, simply
      because they are shorter and can more easily be presented in GUI
      tools.)  There's a rough sketch of both ideas after this list.

    * A csvutils module should at least be scoped out.  It might do a fair
      number of things:

      - Implement one or more sniffers for parameter types

      - Validate CSV files (e.g., constant number of columns, type
        constraints on column values, comparison against a given dialect)

      - Generate a sniffer from a CSV file

      (A sketch of a simple validator appears after this list.)

    * These individual parameters are necessary (hopefully the names are
      enough of a clue to their meaning): quote_char, quoting ("auto",
      "always", "nonnumeric", "never"), delimiter, line_terminator,
      skip_whitespace, escape_char, hard_return.  Are there others?  (See
      the override example after this list.)

    * We're still undecided about None (I certainly don't think it's a valid
      value to be writing to CSV files).

    * Rows can have variable numbers of columns and the application is
      responsible for deciding on and enforcing max_rows or max_cols.

    * Don't raise exceptions needlessly.  For example, specifying
      quoting="never" without a value for escape_char would be okay until,
      while writing, you encounter a field that contains the delimiter.
      (A sketch of this rule follows the list.)

    * Files have to be opened in binary mode (we can check the mode
      attribute, I believe) so we can do the right thing with line
      terminators.  (See the check sketched after this list.)

    * Data values should always be returned as strings, even if they are
      valid numbers.  Let the application do data conversion.  (A small
      example follows the list.)
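
To make the dialect idea concrete, here's a rough sketch.  The names
(Dialect, excel, get_dialect) and the defaults shown are placeholders,
not settled API:

    # Data-only dialect: attributes only, no methods.
    class Dialect:
        delimiter = ","
        quote_char = '"'
        quoting = "auto"          # "auto", "always", "nonnumeric", "never"
        escape_char = None
        line_terminator = "\r\n"
        skip_whitespace = False
        hard_return = False

    class excel(Dialect):
        """Defaults for Excel-generated files."""

    # Hypothetical registry so "excel" works as a string specifier:
    _dialects = {"excel": excel}

    def get_dialect(dialect):
        if isinstance(dialect, str):
            return _dialects[dialect]
        return dialect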
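
For validation, maybe something along these lines (the function name is
made up; a constant column count is also the kind of max_rows/max_cols
policing an application might do for itself):

    # Hypothetical csvutils helper: insist that every row has the same
    # number of columns as the first row.
    def validate_column_count(rows):
        width = None
        for n, row in enumerate(rows):
            if width is None:
                width = len(row)
            elif len(row) != width:
                raise ValueError("row %d has %d columns, expected %d"
                                 % (n + 1, len(row), width))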
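
Overriding an individual parameter would then just mean subclassing a
dialect, or perhaps passing a keyword argument at reader/writer creation
time.  Purely illustrative:

    # Tab-delimited variant of the excel dialect sketched above.
    class excel_tabs(excel):
        delimiter = "\t"

    # Possibly also, as keyword overrides (API still up in the air):
    # reader = csv.reader(fileobj, dialect="excel", delimiter="\t")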
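
For writing, the "don't raise needlessly" rule might look like this
(join_row and CSVError are stand-ins for whatever the writer actually
does):

    class CSVError(Exception):
        pass

    def join_row(row, delimiter=",", escape_char=None):
        # With quoting="never" there is no quoting at all, so a field
        # containing the delimiter is an error unless escape_char is set.
        out = []
        for field in row:
            field = str(field)
            if delimiter in field:
                if escape_char is None:
                    raise CSVError("field %r contains the delimiter and "
                                   "no escape_char is set" % field)
                field = field.replace(delimiter, escape_char + delimiter)
            out.append(field)
        return delimiter.join(out)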
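
The binary-mode check could be as simple as (assuming the file object
actually has a mode attribute):

    def check_binary_mode(fileobj):
        mode = getattr(fileobj, "mode", None)
        if mode is not None and "b" not in mode:
            raise ValueError("csv files must be opened in binary mode, "
                             "not %r" % mode)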
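
And since everything comes back as strings, conversion stays on the
application's side of the fence:

    row = ["widget", "3", "4.95"]     # what a reader would hand back
    name = row[0]
    count = int(row[1])
    price = float(row[2])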

Other stuff we haven't talked about much:

    * Unicode.  I think we punt on this for now and just pretend that
      passing codecs.open(csvfile, mode, encoding) is sufficient (see the
      snippet below).  I'm sure Martin von Löwis will let us know if it
      isn't. ;-) Dave said, "The low level parser (C code) is probably
      going to need to handle unicode."  Let's wait and see how well
      codecs.open() works for us.

    * We know we need tests but haven't talked much about them.  I vote for
      PyUnit as much as possible, though a certain amount of manual testing
      using existing spreadsheets and databases will be required.  (A
      skeletal example follows.)

    * Exceptions.  We know we need some.  We should start with CSVError and
      try to avoid getting carried away with things.  If need be, we can
      add a code field to the class.  I don't like the idea of having 17
      different subclasses of CSVError, though; it's too much complexity
      for most users.  (Sketch below.)
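
The codecs punt would look something like this (filename and encoding
are placeholders; whether the parser copes with the unicode strings it
gets handed is exactly the open question):

    import codecs

    csvfile = codecs.open("data.csv", "rb", "utf-8")
    # reader = csv.reader(csvfile, dialect="excel")   # assumed API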
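
A PyUnit test could start as small as this (the excel class here is a
stand-in for whatever dialect defaults we settle on):

    import unittest

    class excel:
        delimiter = ","
        quote_char = '"'

    class DialectTest(unittest.TestCase):
        def test_excel_defaults(self):
            self.assertEqual(excel.delimiter, ",")
            self.assertEqual(excel.quote_char, '"')

    if __name__ == "__main__":
        unittest.main()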
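
The single-exception approach with a code field could be as little as:

    class CSVError(Exception):
        def __init__(self, message, code=None):
            Exception.__init__(self, message)
            self.code = code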

Skip