DSVWizard.py

Mon Jan 27 20:42:22 CET 2003

On Mon, 2003-01-27 at 09:36, Skip Montanaro wrote:
> (Dave, should we continue to use the csv at object-craft address for you or
> your djc email?)
> 
>     >> I suppose the first step would be to catalogue all of the common CSV
>     >> variations and give them names.  Naming variations after the
>     >> applications which produce them could be the best way.
> 
>     Cliff> That doesn't sound like a bad idea, but the task of cataloging
>     Cliff> all those applications seems a bit daunting, especially since I
>     Cliff> suspect between all of us we can probably only account for a
>     Cliff> handful of them.
> 
> I think we should aim for Excel2000 compatibility as a bare minimum, and at
> least document any supported extensions and try to tie them to specific
> other applications.  It is indeed unfortunate that the CSV file format is
> only operationally defined.
> 
> Wild-ass idea: Maybe the API should include a query function or a data
> attribute which lists (as strings) the variants of CSV supported by a module
> (which should be supported by test cases)?  The default variant would be
> listed first, and the constructor would take any of the listed variants as
> an optional argument.  Something like:
> 
>     variants = csv.get_variants()
> 
>     csvl = csv.parser(variant="lotus123")
>     csve = csv.parser(variant="excel2000")
> 
> We could create an informal "registry" of valid variant names.  If support
> for an existing variant is added, you use that name.  If support for an
> unknown variant is added, you register a string.

Sounds reasonable, but I think the variant should be customizable in the
method call:

    csvl = csv.parser(variant="lotus123", delimiter='\t')

So assuming "lotus123" was defined to use commas by default, it would
follow all the rules of the lotus variant except for the delimiter. 
This would allow for some flexibility in case the user saved the csv
file from Lotus but changed an option or two.

>     Cliff> ... despite there being no real standard, there seems to be only
>     Cliff> minor differences between each format: delimiter, quote style,
>     Cliff> allowed spaces around quotes.
> 
> That's true.  Perhaps selecting by variant name would do nothing more than
> set those specific values behind the scenes, much the same way that when you
> choose a particular C coding style in Emacs a number of low-level variable
> values are set.

That's what I was thinking.  In this case the "variant" could just be a
dictionary or simple class with a few attributes.
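
As a rough sketch of that idea, a variant can be a plain dictionary of
settings, with keyword arguments overriding individual values.  Everything
named here (VARIANTS, DEFAULT_VARIANT, resolve_variant, and the settings
themselves) is hypothetical and only for illustration; it is not an
existing API, and the values are not a survey of what these applications
really write.

```python
# Hypothetical variant table: names and settings are illustrative only.
VARIANTS = {
    "excel2000": {"delimiter": ",", "quotechar": '"', "skipinitialspace": False},
    "lotus123": {"delimiter": ",", "quotechar": '"', "skipinitialspace": True},
}
DEFAULT_VARIANT = "excel2000"


def get_variants():
    """List the known variant names, default first."""
    others = sorted(name for name in VARIANTS if name != DEFAULT_VARIANT)
    return [DEFAULT_VARIANT] + others


def resolve_variant(variant=DEFAULT_VARIANT, **overrides):
    """Return the settings of a named variant with overrides applied."""
    if variant not in VARIANTS:
        raise ValueError("unknown CSV variant: %r" % (variant,))
    settings = dict(VARIANTS[variant])  # copy, so the table is not mutated
    unknown = set(overrides) - set(settings)
    if unknown:
        raise TypeError("unknown parameter(s): %s" % ", ".join(sorted(unknown)))
    settings.update(overrides)
    return settings


# Lotus rules, except that the file was saved with tabs.
settings = resolve_variant("lotus123", delimiter="\t")
```

A parser constructor could then accept the same arguments and read its
low-level options from the resulting dictionary.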

>     Cliff> Another problem with specifying styles by application name is
>     Cliff> that many apps allow the user to specify portions of the style
>     Cliff> (usually the delimiter), so that's not set in stone either.
> 
> Yes, but there's still usually a default.  Some of the stuff (like space
> after delimiters, newlines inside fields or CRLF/LF/CR line endings) isn't
> user-settable and isn't obvious without inspecting the CSV file.  You might
> have
> 
>     csve2 = csv.parser(variant="excel2000", delimiter=';')

Oh.  Guess I should have read the entire message before replying ;)  At
least it looks like we are on the same page =)

> to specify user-settable parameters or use "sniffing" code like DSV does to
> figure out what the best choice is.

The "sniffing" code in DSV is best used in conjunction with some sort of
confirmation from the user.  I've seen it guess incorrectly on some
files (although not very often), mostly ones that have repeating
patterns of other characters (colons and slashes in dates and times).
For these types of files, however, it defaults to the more common
delimiter (i.e. given a file that has both repeating colons and commas,
the comma will be chosen), which weeds out the majority of false
positives.  Even so, it would seem foolhardy for a programmer to
rely on it without some sort of user intervention.  It could perhaps be
made a little smarter, but it's a difficult problem and I'd be reluctant
to use it alone.  This is why the GUI code is rather part-and-parcel
with the heuristics.  Still, having a separate project for
maintaining the GUI solves this, and the programmer can always roll his
own if need be.
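
To illustrate the kind of tie-breaking described above, here is a
deliberately simple sketch.  It is not DSV's actual algorithm: the
candidate list, its ordering, and the name guess_delimiter are assumptions
made for the example, and it ignores quoting entirely.  A candidate
qualifies if it appears the same nonzero number of times on every
non-empty line, and the more common delimiter wins when several qualify.

```python
# Candidate delimiters, most common first.  The ordering is an assumption.
PREFERRED = [",", "\t", ";", "|", ":"]


def guess_delimiter(lines, candidates=PREFERRED):
    """Return the first candidate that occurs equally often (and at least
    once) on every non-empty line, or None if no candidate qualifies."""
    rows = [line for line in lines if line.strip()]
    if not rows:
        return None
    for delim in candidates:
        counts = set(row.count(delim) for row in rows)
        # Exactly one distinct count, and that count is not zero.
        if len(counts) == 1 and 0 not in counts:
            return delim
    return None


# Colons repeat just as regularly as commas here, but the comma is preferred.
sample = ["2003-01-27,20:42:22,ok", "2003-01-28,09:36:00,ok"]
print(guess_delimiter(sample))  # prints ","
```

Because a guess like this can still be wrong, the result is better treated
as a suggestion to be confirmed by the user than as a final answer.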

>     Cliff> I think what I'm leaning towards at this time, if everyone is in
>     Cliff> agreement, is for Dave or myself to reimplement Dave's code (and
>     Cliff> API) in Python so that there is a pure Python implementation, and
>     Cliff> then provide Dave's C module as a faster alternative (much like
>     Cliff> Pickle and cPickle).  The heuristics of DSV would be an optional
>     Cliff> feature, along with the GUI.
> 
> This sounds like a reasonable idea.  I also agree the GUI stuff will
> probably not make it into the core.

Anyone else?  BTW, where are we planning on hosting this project?  Under
one of the existing projects or somewhere else?

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308