[Csv] Status

Cliff Wells LogiplexSoftware at earthlink.net
Thu Jan 30 18:57:45 CET 2003


On Wed, 2003-01-29 at 18:48, Skip Montanaro wrote:
> It would appear we are converging on dialects as data-only classes
> (subclassable but with no methods).  I'll update the PEP.  Many other ideas
> have been floating through the list, and while I haven't been deleting the
> messages, I haven't been adding them to the PEP either.  Can someone help
> with that?

A comment on the dialect classes:  I think a validate() method would be
good in the base dialect class.  A separate validate() function would do
just as well, but it seems logical to make it part of the class.
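
Roughly what I have in mind (a sketch only -- parameter names borrowed
from Skip's list below, and nothing here is settled):

    class Dialect:
        # Data-only defaults; subclasses just override attributes.
        delimiter = ','
        quote_char = '"'
        quoting = "auto"        # "auto", "always", "nonnumeric", "never"
        line_terminator = '\r\n'
        escape_char = None
        skip_whitespace = False

        def validate(self):
            # Fail early on nonsense settings instead of letting the
            # parser blow up obscurely later.  (ValueError for now;
            # CSVError once it exists.)
            if len(self.delimiter) != 1:
                raise ValueError("delimiter must be one character")
            if self.quoting not in ("auto", "always", "nonnumeric",
                                    "never"):
                raise ValueError("unknown quoting mode: %r" % self.quoting)
            if self.quoting != "never" and len(self.quote_char) != 1:
                raise ValueError("quote_char must be one character")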

> I'd like to get the wording in the PEP to converge on our current thoughts
> and announce it on c.l.py and python-dev sometime tomorrow.  I think we will
> get a lot of feedback from both camps, hopefully some of it useful. ;-)

Undoubtedly Timothy Rue will inform us that we are wasting our time as
the VIC will solve this problem as well (after all, input->9
commands->output), but if you think you can live with that, sure.

> I just finished making a pass through the messages I hadn't deleted (and
> then saved them to a csv mbox file since the list appears to still not be
> archiving).  Here's what I think we've concluded:
> 
>     * Dialects are a set of defaults, probably implemented as classes (which
>       allows subclassing, whereas dicts wouldn't) and the default dialect
>       named as something like csv.dialects.excel or "excel" if we allow
>       string specifiers.  (I think strings work well at the API, simply
>       because they are shorter and can more easily be presented in GUI
>       tools.)

Agreed.  Just to clarify: these strings will still be keys in a
dictionary ("settings" or "dialects") mapping them to the dialect
classes?
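
Something like this, I assume (all names hypothetical, building on the
Dialect sketch above):

    class Excel(Dialect):
        # Comma-separated defaults inherited from Dialect.
        pass

    class ExcelTab(Excel):
        delimiter = '\t'

    # registry: string specifier -> dialect class
    dialects = {
        "excel": Excel,
        "excel-tab": ExcelTab,
    }

    def get_dialect(name):
        # Look up a string specifier, hand back the dialect class.
        try:
            return dialects[name]
        except KeyError:
            raise ValueError("unknown dialect: %r" % name)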

>     * A csvutils module should be at least scoped out which might do a fair
>       number of things:
> 
>       - Implements one or more sniffers for parameter types
> 
>       - Validates CSV files (e.g., constant number of columns, type
>         constraints on column values, compares against given dialect)
> 
>       - Generates a sniffer from a CSV file
> 
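For the record, the delimiter sniffer is the sort of thing I picture
looking like this (very rough, and the heuristic is debatable --
quoted fields containing the delimiter will fool it):

    def sniff_delimiter(sample, candidates=',;\t|'):
        # Guess the delimiter: prefer the candidate that occurs the
        # same number of times on every sample line, and most often.
        lines = [line for line in sample.split('\n') if line]
        best, best_count = None, 0
        for d in candidates:
            counts = [line.count(d) for line in lines]
            if (counts and min(counts) == max(counts)
                    and counts[0] > best_count):
                best, best_count = d, counts[0]
        return best
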
>     * These individual parameters are necessary (hopefully the names will be
>       enough clue as to their meaning): quote_char, quoting ("auto",
>       "always", "nonnumeric", "never"), delimiter, line_terminator,
>       skip_whitespace, escape_char, hard_return.  Are there others?
> 
>     * We're still undecided about None (I certainly don't think it's a valid
>       value to be writing to CSV files)

IMO, None should be mapped to '', so [None, None, None] would be
saved as ",," (or '"","",""' with quoting="always").  I can't think
of any reasonable alternative.  However, it is arguable whether
reading ",," should return [None, None, None] or ['', '', ''].  I'd
vote for the latter, since we are explicitly not doing conversions
between strings and Python types ('6' doesn't become 6).
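
In code, on the writing side, something like this (sketch):

    def encode_field(value):
        # None goes out as the empty string; everything else must
        # already be a string, since we do no type conversion.
        if value is None:
            return ''
        return value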

>     * Rows can have variable numbers of columns and the application is
>       responsible for deciding on and enforcing max_rows or max_cols.
> 
>     * Don't raise exceptions needlessly.  For example, specifying
>       quoting="never" and not specifying a value for escape_char would be
>       okay until, while writing, you encounter a field which contains
>       the delimiter.
>
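In other words, the check happens at the last possible moment --
something like this (sketch, reusing the hypothetical Dialect from
above):

    def escape_field(field, dialect):
        # With quoting="never", a missing escape_char only becomes
        # an error once a field actually needs escaping.
        if dialect.delimiter in field:
            if dialect.escape_char is None:
                raise ValueError("cannot write %r without quoting "
                                 "or an escape_char" % field)
            field = field.replace(dialect.delimiter,
                                  dialect.escape_char + dialect.delimiter)
        return field
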
>     * Files have to be opened in binary mode (we can check the mode
>       attribute, I believe) so we can do the right thing with line
>       terminators.
> 
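The mode check could be as simple as this (sketch; plain files have a
mode attribute, but arbitrary file-like objects may not):

    def check_binary_mode(fileobj):
        # Give file-like objects without a mode the benefit of the
        # doubt; refuse real files opened in text mode.
        mode = getattr(fileobj, 'mode', None)
        if mode is not None and 'b' not in mode:
            raise ValueError("file must be opened in binary mode, "
                             "not %r" % mode)
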
>     * Data values should always be returned as strings, even if they are
>       valid numbers.  Let the application do data conversion.
> 
> Other stuff we haven't talked about much:
> 
>     * Unicode.  I think we punt on this for now and just pretend that
>       passing codecs.open(csvfile, mode, encoding) is sufficient.  I'm sure
>       Martin von Löwis will let us know if it isn't. ;-) Dave said, "The low
>       level parser (C code) is probably going to need to handle unicode."
>       Let's wait and see how well codecs.open() works for us.
> 
>     * We know we need tests but haven't talked much about them.  I vote for
>       PyUnit as much as possible, though a certain amount of manual testing
>       using existing spreadsheets and databases will be required.

+1.  Testing all the corner cases is going to take some care.
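
A first cut at the shape of those tests (PyUnit; the roundtrip()
helper is hypothetical until the module API settles):

    import unittest

    def roundtrip(row):
        # Hypothetical helper: write one row through the writer and
        # parse it back with the reader.  Stubbed until the API exists.
        raise NotImplementedError

    class QuotingTestCase(unittest.TestCase):
        def test_embedded_delimiter(self):
            # Fields containing the delimiter must survive a round trip.
            self.assertEqual(roundtrip(['a,b', 'c']), ['a,b', 'c'])

        def test_embedded_quote(self):
            # Embedded quote characters get doubled on write.
            self.assertEqual(roundtrip(['he said "hi"']),
                             ['he said "hi"'])

        def test_empty_fields(self):
            # ",," comes back as three empty strings, not None.
            self.assertEqual(roundtrip(['', '', '']), ['', '', ''])

    if __name__ == "__main__":
        unittest.main()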

>     * Exceptions.  We know we need some.  We should start with CSVError and
>       try to avoid getting carried away with things.  If need be, we can add
>       a code field to the class.  I don't like the idea of having 17
>       different subclasses of CSVError though.  It's too much complexity for
>       most users.

I can only count to 12 (or was it 11?), so this would be good for me as
well.
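
For what it's worth, one class with an optional code field might be
as simple as:

    class CSVError(Exception):
        def __init__(self, message, code=None):
            Exception.__init__(self, message)
            self.code = code    # optional machine-readable detail

    # e.g. raise CSVError("bad quoting near line 12", code="quoting")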

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308


