DSVWizard.py

Mon Jan 27 18:36:26 CET 2003

(Dave, should we continue to use the csv at object-craft address for you or
your djc email?)

    >> I suppose the first step would be to catalogue all of the common CSV
    >> variations and give them names.  Naming variations after the
    >> applications which produce them could be the best way.

    Cliff> That doesn't sound like a bad idea, but the task of cataloging
    Cliff> all those applications seems a bit daunting, especially since I
    Cliff> suspect between all of us we can probably only account for a
    Cliff> handful of them.

I think we should aim for Excel2000 compatibility as a bare minimum, and at
least document any supported extensions and try to tie them to specific
other applications.  It is indeed unfortunate that the CSV file format is
only operationally defined.

Wild-ass idea: Maybe the API should include a query function or a data
attribute which lists (as strings) the variants of CSV supported by a module
(which should be supported by test cases)?  The default variant would be
listed first, and the constructor would take any of the listed variants as
an optional argument.  Something like:

    variants = csv.get_variants()

    csvl = csv.parser(variant="lotus123")
    csve = csv.parser(variant="excel2000")

We could create an informal "registry" of valid variant names.  If support
for an existing variant is added, you use that name.  If support for an
unknown variant is added, you register a string.
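
To make the registry idea a bit more concrete, here is a rough sketch.
Everything in it is hypothetical: get_variants() and the variant argument
are the names proposed above, register_variant() is a helper invented for
this illustration, and the parser class is a placeholder that does no
parsing at all.

```python
# Hypothetical sketch: none of these names exist in a released module.
_VARIANTS = ["excel2000", "lotus123"]   # default variant listed first


def get_variants():
    """Return the supported variant names, default first."""
    return list(_VARIANTS)


def register_variant(name):
    """Add a new variant name to the informal registry (invented helper)."""
    if name in _VARIANTS:
        raise ValueError("variant already registered: %r" % name)
    _VARIANTS.append(name)


class parser:
    """Placeholder that only records which variant was chosen."""

    def __init__(self, variant=None):
        if variant is None:
            variant = _VARIANTS[0]          # fall back to the default
        if variant not in _VARIANTS:
            raise ValueError("unknown variant: %r" % variant)
        self.variant = variant


variants = get_variants()
csvl = parser(variant="lotus123")
csve = parser()                             # uses the default variant
```

Rejecting unknown names in the constructor keeps the registry honest: a
variant can only be requested once somebody has registered a string for it.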

    Cliff> ... despite there being no real standard, there seem to be only
    Cliff> minor differences between each format: delimiter, quote style,
    Cliff> allowed spaces around quotes.

That's true.  Perhaps selecting by variant name would do nothing more than
set those specific values behind the scenes, much the same way that when you
choose a particular C coding style in Emacs a number of low-level variable
values are set.
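
A minimal sketch of what "setting values behind the scenes" might look like
follows.  The attribute names and the per-variant values are placeholders
made up for illustration; they have not been checked against what Excel 2000
or Lotus 1-2-3 actually write.

```python
# Placeholder settings for illustration only; the values are NOT verified
# against the real applications.
VARIANT_SETTINGS = {
    "excel2000": {
        "delimiter": ",",
        "quotechar": '"',
        "space_after_delimiter": False,
    },
    "lotus123": {
        "delimiter": ",",
        "quotechar": '"',
        "space_after_delimiter": True,
    },
}


class parser:
    """Placeholder parser: choosing a variant just sets low-level attributes."""

    def __init__(self, variant="excel2000"):
        self.variant = variant
        # Copy each low-level setting onto the instance, much as picking
        # a C style in Emacs sets a batch of individual variables.
        for name, value in VARIANT_SETTINGS[variant].items():
            setattr(self, name, value)


csvl = parser(variant="lotus123")
```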

    Cliff> Another problem with specifying styles by application name is
    Cliff> that many apps allow the user to specify portions of the style
    Cliff> (usually the delimiter), so that's not set in stone either.

Yes, but there's still usually a default.  Some of the stuff (like space
after delimiters, newlines inside fields or CRLF/LF/CR line endings) isn't
user-settable and isn't obvious without inspecting the CSV file.  You might
have

    csve2 = csv.parser(variant="excel2000", delimiter=';')

to specify user-settable parameters or use "sniffing" code like DSV does to
figure out what the best choice is.
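
Both options can be sketched roughly as below.  The function names and the
default values are hypothetical, and the sniffing shown here is a
deliberately naive stand-in (it ignores quoting entirely), not DSV's actual
heuristic.

```python
# Hypothetical defaults; the values are placeholders for illustration.
DEFAULTS = {"excel2000": {"delimiter": ",", "quotechar": '"'}}


def make_settings(variant="excel2000", **overrides):
    """Start from a variant's defaults, then apply user-settable overrides."""
    settings = dict(DEFAULTS[variant])
    unknown = set(overrides) - set(settings)
    if unknown:
        raise TypeError("unknown settings: %s" % ", ".join(sorted(unknown)))
    settings.update(overrides)
    return settings


def sniff_delimiter(sample, candidates=",;\t|"):
    """Guess the delimiter: the candidate that appears the same nonzero
    number of times on every non-empty line, preferring the most frequent.
    Returns None if no candidate qualifies.  Quoting is ignored."""
    lines = [line for line in sample.splitlines() if line]
    best, best_count = None, 0
    for ch in candidates:
        counts = {line.count(ch) for line in lines}
        if len(counts) == 1:                # same count on every line
            n = counts.pop()
            if n > best_count:
                best, best_count = ch, n
    return best


sample = "name;city;age\nAnn;Oslo;31\nBob;Rome;45\n"
guessed = sniff_delimiter(sample)
settings = make_settings("excel2000", delimiter=guessed)
```

Sniffing and explicit overrides compose naturally this way: whatever the
sniffer guesses is passed in exactly as a user-supplied value would be.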

    Cliff> I think what I'm leaning towards at this time, if everyone is in
    Cliff> agreement, is for Dave or myself to reimplement Dave's code (and
    Cliff> API) in Python so that there is a pure Python implementation, and
    Cliff> then provide Dave's C module as a faster alternative (much like
    Cliff> pickle and cPickle).  The heuristics of DSV would be an optional
    Cliff> feature, along with the GUI.

This sounds like a reasonable idea.  I also agree the GUI stuff will
probably not make it into the core.
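
The usual way to arrange the fast/slow pairing is an import fallback,
roughly as below.  The module name _csv_accel_hypothetical and the function
split_record are invented for this sketch, and the pure Python version is a
naive split with no quote handling.

```python
try:
    # Hypothetical C accelerator; this module name is made up.
    from _csv_accel_hypothetical import split_record
except ImportError:
    def split_record(line, delimiter=","):
        """Pure Python fallback: naive split, no quote handling."""
        return line.rstrip("\r\n").split(delimiter)


fields = split_record("a,b,c\n")
```

Callers import one name and never need to know which implementation they
got.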

    Cliff> As far as DSV's current API, I'm not too attached to it, and I
    Cliff> think that it could be mimicked sufficiently by adding a
    Cliff> parser.parseall() method to Dave's API so the programmer would
    Cliff> have the option of getting the entire file as a list without
    Cliff> having to write a loop.
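
For what it's worth, parseall() could be a very thin layer over a
line-at-a-time parse() method.  The sketch below assumes such a method
exists; LineParser is a stand-in written for this example, not Dave's
parser, and its parse() does a naive split with no quote handling.

```python
class LineParser:
    """Stand-in for a line-at-a-time parser (naive split, no quoting)."""

    def __init__(self, delimiter=","):
        self.delimiter = delimiter

    def parse(self, line):
        """Return the fields of one line as a list of strings."""
        return line.rstrip("\r\n").split(self.delimiter)

    def parseall(self, lines):
        """Parse every line of an iterable and return a list of rows."""
        return [self.parse(line) for line in lines]


rows = LineParser().parseall(["a,b\n", "1,2\n"])
```

Since parseall() accepts any iterable of lines, an open file object can be
passed to it directly.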

Skip