[Csv] What's our status?
Cliff Wells
LogiplexSoftware at earthlink.net
Thu Feb 27 18:07:58 CET 2003
On Thu, 2003-02-27 at 04:12, John Machin wrote:
> On 26 Feb 2003 14:10:48 -0800, Cliff Wells <LogiplexSoftware at earthlink.net>
> wrote:
>
> >
> > However, for the following I am so far unable to come up with a way to
> > determine the delimiter:
> >
> > all,work,and,no,play,makes,jack,a,dull,boy
> > all,work,and,no,play,makes,jack,a,dull
> > boy
> > all,work,and,no,play,makes,jack,a
> [snip]
>
> >
> > Anyone have a suggestion? All work and no play makes jack a dull boy.
>
> [Warning: late at night, OTTOMH, may contain babblings]
I started babbling yesterday while working on this. Luckily the interns
came and gave me my injection. However, it's difficult to type with
these leather straps on and I can't quite reach the buckles with my
teeth...
> Errrmmm, maybe I've missed the plot or lost the point or whatever, but a
> good start would be assuming that only in pathological cases would the
> delimiter or the quote be an alphanumeric character i.e. the file has been
> produced by an ordinary user, not a red-team tester.
I'm willing to make that assumption for this case, but read on...
> Try the most frequent two non-alphanumeric characters as the candidates for
> the delimiter and the quotechar? If there's only 1 non-alphanumeric
> character, then it's the delimiter.
If we have a quotechar, then the problem is solved. Unfortunately the
situation I expect here is that there will be more than one
non-alphanumeric character per line. It's quite common to see
dates/timestamps in *every* row of a csv file:
data,2003/02/27,08:51:00
data,2003/02/27,08:52:00
data,2003/02/27,08:53:00
data,2003/02/27,08:54:00
In this case it is difficult to know whether ,/ or : is the delimiter.
It's not entirely unreasonable to use a "preferred" list of delimiters
but it's not entirely safe either ;) In fact, the current
implementation will resort to a preferred list in this example and
return , as the delimiter. However, given the following:
2003/02/27,08:51:00
data,2003/02/27,08:52:00
08:53:00
data,2003/02/27,08:54:00
It would most likely (without testing) return ":" as the delimiter as it
occurs equally consistently with "/", but is higher in the preferred
list. This is wrong as the delimiter is clearly ",". That being said,
I would simply consider this file as being unsniffable as it has no real
pattern.
> If there aren't any non-AN chars [an example in one of your messages], then
> there's only one field per record.
Hm. That might actually be useful.
> Where there are two or more candidates for the delimiter and quotechar, you
> could use some plausibility heuristics e.g. " and ' are more likely to be
> quotes than delimiters however tab, comma, semicolon, colon, vertical bar,
> and tilde are plausible delimiters.
As I mentioned earlier, quotes are already handled. If quotes are
present, I think the current implementation is good enough to handle
most files.
> Some cautions:
>
> (1) "Warning -- Europeans here";1,234;5,678
So you see my point =)
> (2) Joe Blow~'The Vaults',456 Main
> St,Snowtown,SA,5999~31/12/1999~01/04/2000
> # delimiter (tilde) occurs 3 times, no quotechar at all, data characters
> comma and slash occur 4 times each (more than delimiter).
Yes, I've already decided that frequency by itself isn't a useful
measurement. This particular example is invalid though, as
~'The Vaults',456
is an error (IMHO). 'The Vaults' appears quoted but isn't followed by a
delimiter or a space.
> In any case, it appears to me that you can't pronounce on the result until
> you've parsed a large chunk of the file with each plausible hypothesis,
> especially if the hypothesis admits (quoted) newlines inside the data. Some
> possible decision criteria are (1) percentage of syntax errors (2) standard
> deviation of number of columns ...
Actually, the existing implementation is able to make a pronouncement
after sniffing only a small portion of the file. I'm going to get it
into the sandbox today so others can take a look at it. The only real
snag is the exact scenario I mentioned earlier (no quoted data with
varying numbers of fields per row).
BTW, I'm +1 on Skip's suggestion to make the utils a package (cvs.utils)
and will check it into CVS as such. Anyone object?
--
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308 (800) 735-0555 x308
More information about the Csv
mailing list