[Csv] Status
Skip Montanaro
skip at pobox.com
Thu Jan 30 03:48:59 CET 2003
It would appear we are converging on dialects as data-only classes
(subclassable but with no methods). I'll update the PEP. Many other ideas
have been floating through the list, and while I haven't been deleting the
messages, I haven't been adding them to the PEP either. Can someone help
with that?
I'd like to get the wording in the PEP to converge on our current thoughts
and announce it on c.l.py and python-dev sometime tomorrow. I think we will
get a lot of feedback from both camps, hopefully some of it useful. ;-)
Sound like a plan?
I just finished making a pass through the messages I hadn't deleted (and
then saved them to a csv mbox file since the list still appears not to be
archiving). Here's what I think we've concluded:
* Dialects are a set of defaults, probably implemented as classes (which
allows subclassing, whereas dicts wouldn't) and the default dialect
named as something like csv.dialects.excel or "excel" if we allow
string specifiers. (I think strings work well at the API, simply
because they are shorter and can more easily be presented in GUI
tools.)
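To make that concrete, here's a straw-man sketch of the data-only class
idea plus a string registry (every name in it is a placeholder, not a
decision):

    class Dialect:
        # nothing but defaults; subclass and override to taste
        delimiter = ","
        quote_char = '"'
        quoting = "auto"        # "auto" | "always" | "nonnumeric" | "never"
        line_terminator = "\r\n"
        skip_whitespace = False
        escape_char = None
        hard_return = "\n"      # stand-in: embedded-newline representation

    class excel(Dialect):
        pass                    # Excel happens to match the base defaults

    # a registry is all it takes to accept "excel" as a string specifier
    dialects = {"excel": excel}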
* A csvutils module should at least be scoped out; it might do a fair
number of things:
- Implements one or more sniffers for parameter types
- Validates CSV files (e.g., constant number of columns, type
constraints on column values, compares against given dialect)
- Generates a sniffer from a CSV file
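For flavor, a toy version of the delimiter-sniffing piece might look
like this (purely illustrative; a real sniffer would also weigh quoting
and per-row consistency):

    def sniff_delimiter(sample, candidates=",;\t|"):
        # crude: pick the candidate appearing most often in the sample
        counts = {c: sample.count(c) for c in candidates}
        return max(counts, key=counts.get)

    print(sniff_delimiter("a;b;c\n1;2;3\n"))    # -> ';'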
* These individual parameters are necessary (hopefully the names will be
enough clue as to their meaning): quote_char, quoting ("auto",
"always", "nonnumeric", "never"), delimiter, line_terminator,
skip_whitespace, escape_char, hard_return. Are there others?
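Whatever we end up calling them, combining a dialect's defaults with
per-call overrides presumably boils down to something like this (sketch
only):

    PARAMETERS = ("delimiter", "quote_char", "quoting", "line_terminator",
                  "skip_whitespace", "escape_char", "hard_return")

    def resolve_options(dialect, **overrides):
        # start from the dialect's defaults, then layer on the overrides
        opts = {name: getattr(dialect, name) for name in PARAMETERS}
        opts.update(overrides)
        return opts

    # e.g. resolve_options(excel, delimiter=";") with the sketch above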
* We're still undecided about None (I certainly don't think it's a valid
value to be writing to CSV files).
* Rows can have variable numbers of columns and the application is
responsible for deciding on and enforcing max_rows or max_cols.
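That puts the enforcement code in the application, roughly like so
(names hypothetical):

    def check_width(rows, max_cols):
        # rows is any iterable of parsed rows; the application, not the
        # parser, decides what "too wide" means
        for lineno, row in enumerate(rows, 1):
            if len(row) > max_cols:
                raise ValueError("row %d has %d columns (max %d)"
                                 % (lineno, len(row), max_cols))

    check_width([["a", "b"], ["a", "b", "c"]], max_cols=2)  # raises on row 2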
* Don't raise exceptions needlessly. For example, specifying
quoting="never" and not specifying a value for escape_char would be
okay until you encounter a field when writing which contains the
delimiter.
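Roughly this shape, where the error only fires once it actually matters
(sketch; CSVError is the exception class discussed below):

    class CSVError(Exception):
        pass

    def format_field(field, delimiter=",", quoting="never", escape_char=None):
        # quoting="never" with no escape_char is fine by itself; it only
        # becomes an error when a field really contains the delimiter
        if quoting == "never" and delimiter in field:
            if escape_char is None:
                raise CSVError("delimiter in field but no escape_char set")
            field = field.replace(delimiter, escape_char + delimiter)
        return field

    # format_field("ab") is fine; format_field("a,b") raises CSVError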
* Files have to be opened in binary mode (we can check the mode
attribute I believe) so we can do the right thing with line
terminators.
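A sketch of that check (hedged: not every file-like object carries a
.mode attribute, so we can only complain when one is present):

    def check_binary_mode(f):
        mode = getattr(f, "mode", None)
        if mode is not None and "b" not in mode:
            raise ValueError("CSV files must be opened in binary mode")

    # check_binary_mode(open("data.csv", "rb")) passes; mode "r" would raise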
* Data values should always be returned as strings, even if they are
valid numbers. Let the application do data conversion.
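Application-side conversion then layers on trivially, e.g.:

    def coerce(value):
        # the reader hands back strings; the app decides what's a number
        for converter in (int, float):
            try:
                return converter(value)
            except ValueError:
                pass
        return value

    print([coerce(v) for v in ["3", "4.5", "hello"]])   # [3, 4.5, 'hello']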
Other stuff we haven't talked about much:
* Unicode. I think we punt on this for now and just pretend that
passing codecs.open(csvfile, mode, encoding) is sufficient. I'm sure
Martin von Löwis will let us know if it isn't. ;-) Dave said, "The low
level parser (C code) is probably going to need to handle unicode."
Let's wait and see how well codecs.open() works for us.
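So the wait-and-see version is just this (codecs.open is real stdlib;
the line splitting is obviously a stand-in for the real parser, and
"data.csv" is a made-up file name):

    import codecs

    # codecs.open wraps the file so reads yield decoded unicode strings
    f = codecs.open("data.csv", "rb", "utf-8")
    try:
        for line in f:
            fields = line.rstrip("\r\n").split(",")  # parser stand-in
            print(fields)
    finally:
        f.close()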
* We know we need tests but haven't talked much about them. I vote for
PyUnit as much as possible, though a certain amount of manual testing
using existing spreadsheets and databases will be required.
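For instance, a PyUnit test might look like the following (it exercises
plain string logic, since we have no reader/writer to import yet):

    import unittest

    class QuotingTest(unittest.TestCase):
        def test_field_with_delimiter(self):
            # a field containing the delimiter has to come out quoted
            row = ["a", "b,c", "d"]
            line = ",".join('"%s"' % f if "," in f else f for f in row)
            self.assertEqual(line, 'a,"b,c",d')

    if __name__ == "__main__":
        unittest.main()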
* Exceptions. We know we need some. We should start with CSVError and
try to avoid getting carried away with things. If need be, we can add
a code field to the class. I don't like the idea of having 17
different subclasses of CSVError though. It's too much complexity for
most users.
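Something along these lines (sketch):

    class CSVError(Exception):
        # one class; an optional code distinguishes failure kinds
        # without a zoo of subclasses
        def __init__(self, message, code=None):
            Exception.__init__(self, message)
            self.code = code

    try:
        raise CSVError("quoting disabled but delimiter found", code="quoting")
    except CSVError as e:
        print(e, e.code)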
Skip