DSVWizard.py
Skip Montanaro
skip at pobox.com
Mon Jan 27 18:36:26 CET 2003
(Dave, should we continue to use the csv at object-craft address for you or
your djc email?)
>> I suppose the first step would be to catalogue all of the common CSV
>> variations and give them names. Naming variations after the
>> applications which produce them could be the best way.
Cliff> That doesn't sound like a bad idea, but the task of cataloging
Cliff> all those applications seems a bit daunting, especially since I
Cliff> suspect between all of us we can probably only account for a
Cliff> handful of them.
I think we should aim for Excel2000 compatibility as a bare minimum, and at
least document any supported extensions and try to tie them to specific
other applications. It is indeed unfortunate that the CSV file format is
only operationally defined.
Wild-ass idea: Maybe the API should include a query function or a data
attribute which lists (as strings) the variants of CSV supported by a module
(which should be supported by test cases)? The default variant would be
listed first, and the constructor would take any of the listed variants as
an optional argument. Something like:
    variants = csv.get_variants()
    csvl = csv.parser(variant="lotus123")
    csve = csv.parser(variant="excel2000")
We could create an informal "registry" of valid variant names. If support
for an existing variant is added, you use that name. If support for an
unknown variant is added, you register a string.
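
A rough sketch of how such a registry might look in pure Python. The names
register_variant, get_variants and _variants are made up for illustration;
none of this is an existing API.

```python
# Hypothetical variant registry; all names here are illustrative only.
_variants = ["excel2000"]  # the default variant is kept first


def register_variant(name):
    """Add a variant name; registering an existing name is a no-op."""
    if not isinstance(name, str):
        raise TypeError("variant name must be a string")
    if name not in _variants:
        _variants.append(name)


def get_variants():
    """Return a copy of the supported variant names, default first."""
    return list(_variants)


register_variant("lotus123")
print(get_variants())
```

Returning a copy keeps callers from reordering the list and accidentally
changing which variant is the default.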
Cliff> ... despite there being no real standard, there seem to be only
Cliff> minor differences between each format: delimiter, quote style,
Cliff> allowed spaces around quotes.
That's true. Perhaps selecting by variant name would do nothing more than
set those specific values behind the scenes, much the same way that when you
choose a particular C coding style in Emacs a number of low-level variable
values are set.
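
To make that concrete, here is a minimal sketch in which a variant name is
nothing more than a key into a table of low-level settings. The Variant
class, the VARIANTS table, settings_for() and the "example-semicolon" entry
are all assumptions for illustration, and the field values are placeholders
rather than verified behaviour of any application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """The three knobs mentioned in the thread (hypothetical field names)."""
    delimiter: str = ","
    quotechar: str = '"'
    skipinitialspace: bool = False


# Placeholder table: values are illustrative, not checked against real apps.
VARIANTS = {
    "excel2000": Variant(),
    "example-semicolon": Variant(delimiter=";", skipinitialspace=True),
}


def settings_for(name):
    """Look up the bundle of settings that a variant name stands for."""
    try:
        return VARIANTS[name]
    except KeyError:
        raise ValueError("unknown CSV variant: %r" % (name,)) from None


print(settings_for("example-semicolon"))
```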
Cliff> Another problem with specifying styles by application name is
Cliff> that many apps allow the user to specify portions of the style
Cliff> (usually the delimiter), so that's not set in stone either.
Yes, but there's still usually a default. Some of the stuff (like space
after delimiters, newlines inside fields or CRLF/LF/CR line endings) isn't
user-settable and isn't obvious without inspecting the CSV file. You might
have
    csve2 = csv.parser(variant="excel2000", delimiter=';')
to specify user-settable parameters or use "sniffing" code like DSV does to
figure out what the best choice is.
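
Both options could be sketched roughly as below. resolve() layers explicit
keyword overrides on top of a variant's defaults, and guess_delimiter() is a
deliberately naive sniffer that picks the candidate character appearing the
same nonzero number of times on every line. Both function names are
hypothetical, the heuristic is not DSV's actual algorithm, and it ignores
quoting entirely.

```python
def resolve(variant_defaults, **overrides):
    """Return the variant's settings with user-supplied overrides applied."""
    unknown = set(overrides) - set(variant_defaults)
    if unknown:
        raise TypeError("unknown parameter(s): %s" % ", ".join(sorted(unknown)))
    settings = dict(variant_defaults)
    settings.update(overrides)
    return settings


def guess_delimiter(sample, candidates=(",", ";", "\t", "|")):
    """Naive sniffing: prefer the candidate with a consistent per-line count.

    Returns None when no candidate occurs consistently. Quoted fields that
    contain a delimiter will confuse this; it is only a sketch.
    """
    lines = [line for line in sample.splitlines() if line]
    best, best_count = None, 0
    for cand in candidates:
        counts = {line.count(cand) for line in lines}
        if len(counts) == 1:
            count = counts.pop()
            if count > best_count:
                best, best_count = cand, count
    return best


excel_defaults = {"delimiter": ",", "quotechar": '"'}  # placeholder values
sample = "name;qty;price\nwidget;2;3.50\n"
print(resolve(excel_defaults, delimiter=guess_delimiter(sample) or ","))
```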
Cliff> I think what I'm leaning towards at this time, if everyone is in
Cliff> agreement, is for Dave or myself to reimplement Dave's code (and
Cliff> API) in Python so that there is a pure Python implementation, and
Cliff> then provide Dave's C module as a faster alternative (much like
Cliff> Pickle and cPickle). The heuristics of DSV would be an optional
Cliff> feature, along with the GUI.
This sounds like a reasonable idea. I also agree the GUI stuff will
probably not make it into the core.
Cliff> As far as DSV's current API, I'm not too attached to it, and I
Cliff> think that it could be mimicked sufficiently by adding a
Cliff> parser.parseall() method to Dave's API so the programmer would
Cliff> have the option of getting the entire file as a list without
Cliff> having to write a loop.
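
For what it's worth, a parseall() of that sort could be a thin wrapper over
the per-line call. The Parser class below is a toy stand-in written for this
message, not Dave's actual API: its parse() just splits on the delimiter and
knows nothing about quoting.

```python
import io


class Parser:
    """Toy line-oriented parser (hypothetical; no quoting support)."""

    def __init__(self, delimiter=","):
        self.delimiter = delimiter

    def parse(self, line):
        """Split one line into fields, dropping the line terminator."""
        return line.rstrip("\r\n").split(self.delimiter)

    def parseall(self, lines):
        """Parse every non-blank line and return the rows as a list."""
        return [self.parse(line) for line in lines if line.strip()]


rows = Parser().parseall(io.StringIO("a,b\n1,2\n"))
print(rows)
```

Anything iterable over lines works as input, so an open file can be passed
in directly.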
Skip