[Csv] What's our status?

Thu Feb 27 18:07:58 CET 2003

On Thu, 2003-02-27 at 04:12, John Machin wrote:
> On 26 Feb 2003 14:10:48 -0800, Cliff Wells <LogiplexSoftware at earthlink.net> 
> wrote:
> 
> >
> > However, for the following I am so far unable to come up with a way to
> > determine the delimiter:
> >
> > all,work,and,no,play,makes,jack,a,dull,boy
> > all,work,and,no,play,makes,jack,a,dull
> > boy
> > all,work,and,no,play,makes,jack,a
> [snip]
> 
> >
> > Anyone have a suggestion?  All work and no play makes jack a dull boy.
> 
> [Warning: late at night, OTTOMH, may contain babblings]

I started babbling yesterday while working on this.  Luckily the interns
came and gave me my injection.  However, it's difficult to type with
these leather straps on and I can't quite reach the buckles with my
teeth...

> Errrmmm, maybe I've missed the plot or lost the point or whatever, but a 
> good start would be assuming that only in pathological cases would the 
> delimiter or the quote be an alphanumeric character i.e. the file has been 
> produced by an ordinary user, not a red-team tester.

I'm willing to make that assumption for this case, but read on...

> Try the most frequent two non-alphanumeric characters as the candidates for 
> the delimiter and the quotechar? If there's only 1 non-alphanumeric 
> character, then it's the delimiter.

If we have a quotechar, then the problem is solved.  Unfortunately the
situation I expect here is that there will be more than one
non-alphanumeric character per line.  It's quite common to see
dates/timestamps in *every* row of a csv file:

data,2003/02/27,08:51:00
data,2003/02/27,08:52:00
data,2003/02/27,08:53:00
data,2003/02/27,08:54:00

In this case it is difficult to know whether ,/ or : is the delimiter. 
It's not entirely unreasonable to use a "preferred" list of delimiters
but it's not entirely safe either ;)  In fact, the current
implementation will resort to a preferred list in this example and
return , as the delimiter.  However, given the following:

2003/02/27,08:51:00
data,2003/02/27,08:52:00
08:53:00
data,2003/02/27,08:54:00

It would most likely (without testing) return ":" as the delimiter as it
occurs equally consistently with "/", but is higher in the preferred
list.  This is wrong as the delimiter is clearly ",".  That being said,
I would simply consider this file as being unsniffable as it has no real
pattern.

> If there aren't any non-AN chars [an example in one of your messages], then 
> there's only one field per record.

Hm.  That might actually be useful.

> Where there are two or more candidates for the delimiter and quotechar, you 
> could use some plausibility  heuristics e.g. " and ' are more likely to be 
> quotes than delimiters however tab, comma, semicolon, colon, vertical bar, 
> and tilde are plausible delimiters.

As I mentioned earlier, quotes are already handled.  If quotes are
present, I think the current implementation is good enough to handle
most files.

> Some cautions:
> 
> (1) "Warning -- Europeans here";1,234;5,678

So you see my point =)

> (2) Joe Blow~'The Vaults',456 Main 
> St,Snowtown,SA,5999~31/12/1999~01/04/2000
> # delimiter (tilde) occurs 3 times, no quotechar at all, data characters 
> comma and slash occur 4 times each (more than delimiter).

Yes, I've already decided that frequency by itself isn't a useful
measurement.  This particular example is invalid though, as 

~'The Vaults',456

is an error (IMHO).  'The Vaults' appears quoted but isn't followed by a
delimiter or a space.

> In any case, it appears to me that you can't pronounce on the result until 
> you've parsed a large chunk of the file with each plausible hypothesis, 
> especially if the hypothesis admits (quoted) newlines inside the data. Some 
> possible decision criteria are (1) percentage of syntax errors (2) standard 
> deviation of number of columns ...

Actually, the existing implementation is able to make a pronouncement
after sniffing only a small portion of the file.  I'm going to get it
into the sandbox today so others can take a look at it.  The only real
snag is the exact scenario I mentioned earlier (no quoted data with
varying numbers of fields per row).

BTW, I'm +1 on Skip's suggestion to make the utils a package (cvs.utils)
and will check it into CVS as such.  Anyone object?

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308