[CSV] Re: First Cut at CSV PEP

Wed Jan 29 03:01:27 CET 2003

>>>>> "Andrew" == Andrew McNamara <andrewm at object-craft.com.au> writes:

>> Here we go again with a potentially bad idea...

Andrew> *-)

>> I think that there are two things we need to have for each dialect;
>> a set of low level parser configuration, and a set of user
>> tweakables (which correspond to options presented by the
>> application).  The set of user tweakables may not necessarily map
>> one-to-one with low level parser configuration items.

Andrew> This seems to add a fair bit of complexity to the
Andrew> implementation, without simplifying the interface much. In
Andrew> particular, it makes it difficult for the user to move to an
Andrew> alternate dialect (because they'll need to change all the
Andrew> config options). It also makes it harder for third parties to
Andrew> implement their own dialects (or maintain the base ones). And
Andrew> it makes the documentation and tests harder. KISS.

OK.  Yes, it was a bad idea which achieved full potential.

>> Any sniffer would have to be able to traverse the set of dialects
>> implemented in the CSV module and look inside them to understand
>> which options are available to a dialect.

Andrew> It might be enough to look at the first N lines of the file,
Andrew> and do some basic stats (tabs per line, commas per line,
Andrew> etc). Whether it guesses a dialect, or just tries to set
Andrew> individual options is another question.
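
For what it's worth, the "basic stats" approach could be sketched
roughly as below.  This is only an illustration, not code from the
module: the function name guess_delimiter, the candidate list and the
scoring rule are all my own assumptions.  It counts each candidate
character per line and prefers the one whose non-zero count is the
same on the most lines.

```python
from collections import Counter


def guess_delimiter(sample_lines, candidates=",\t;|"):
    """Guess a delimiter from the first few lines of a file.

    Hypothetical helper: picks the candidate character whose per-line
    count is non-zero and most consistent across the sample.
    """
    best, best_score = None, 0.0
    for ch in candidates:
        # Occurrences of this candidate on each non-blank line.
        counts = [line.count(ch) for line in sample_lines if line.strip()]
        if not counts:
            continue
        typical, freq = Counter(counts).most_common(1)[0]
        if typical == 0:
            # The character is absent from most lines, so skip it.
            continue
        # Fraction of lines that agree on the typical count.
        score = freq / len(counts)
        if score > best_score:
            best, best_score = ch, score
    return best
```

Note that this only guesses one option.  It counts raw characters, so
quoted fields containing the delimiter will skew the statistics.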

Just to make your heads hurt a bit more...

In a previous job (at a stock broker) I had to read some CSV data
which had been exported by the MS SQL Server BCP program.  The
excellent BCP program happily exported comma-separated data without
quoting fields which contained commas.  Nasty!

I ended up writing some code which post-processed the parsed records
based upon the number of fields.  The post-processing had high-level
knowledge of the type of each column, so it applied heuristics to join
fields back together to get the correct field count.

I remember that the code knew which columns were text, numeric, dates,
times and bit.  The code worked from left to right and tried joining
text columns with trailing fields then asserted that the remaining
fields were consistent with their respective columns.  This continued
until the field count matched the table column count.
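
A minimal sketch of that kind of repair, reconstructed from memory
rather than taken from the original code.  The names repair and
CHECKS, the type labels and the validation patterns (in particular
the date format) are assumptions made for the example.  It walks the
columns left to right, lets a text column absorb trailing fields, and
only accepts a join when the remaining fields fit their columns.

```python
import re

# Assumed per-type validators; the real formats depend on the export.
CHECKS = {
    "text": lambda s: True,
    "numeric": lambda s: re.fullmatch(r"-?\d+(\.\d+)?", s) is not None,
    "date": lambda s: re.fullmatch(r"\d{4}-\d{2}-\d{2}", s) is not None,
    "bit": lambda s: s in ("0", "1"),
}


def repair(fields, column_types):
    """Rejoin over-split fields so they match column_types, or return None."""
    if not column_types:
        return [] if not fields else None
    extra = len(fields) - len(column_types)
    if extra < 0:
        return None  # too few fields left to fill the remaining columns
    kind = column_types[0]
    # Only text columns may swallow the surplus fields.
    widths = range(1, extra + 2) if kind == "text" else (1,)
    for width in widths:
        value = ",".join(fields[:width])
        if not CHECKS[kind](value):
            continue
        # Check that the remaining fields are consistent with their columns.
        rest = repair(fields[width:], column_types[1:])
        if rest is not None:
            return [value] + rest
    return None


row = "42,Smith, John,2003-01-29,1".split(",")
fixed = repair(row, ["numeric", "text", "date", "bit"])
```

Here the five raw fields come back as four, with "Smith, John"
rejoined into the text column.  A row with two adjacent text columns
can still be ambiguous; the sketch simply takes the first split that
validates.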

All of this was complicated further by the fact that it had to handle
archived data and the table definition changed over time...

- Dave

-- 
http://www.object-craft.com.au