[Csv] What's our status?

Thu Feb 27 13:12:16 CET 2003

On 26 Feb 2003 14:10:48 -0800, Cliff Wells <LogiplexSoftware at earthlink.net> 
wrote:

>
> However, for the following I am so far unable to come up with a way to
> determine the delimiter:
>
> all,work,and,no,play,makes,jack,a,dull,boy
> all,work,and,no,play,makes,jack,a,dull
> boy
> all,work,and,no,play,makes,jack,a
[snip]

>
> Anyone have a suggestion?  All work and no play makes jack a dull boy.

[Warning: late at night, OTTOMH, may contain babblings]

Errrmmm, maybe I've missed the plot or lost the point or whatever, but a 
good start would be to assume that only in pathological cases would the 
delimiter or the quote be an alphanumeric character, i.e. assume the file 
has been produced by an ordinary user, not a red-team tester.

Try the two most frequent non-alphanumeric characters as the candidates for 
the delimiter and the quotechar? If there's only one non-alphanumeric 
character, then it's the delimiter.
If there aren't any non-alphanumeric chars [as in an example in one of your 
messages], then there's only one field per record.

Where there are two or more candidates for the delimiter and quotechar, you 
could use some plausibility heuristics, e.g. " and ' are more likely to be 
quotes than delimiters, whereas tab, comma, semicolon, colon, vertical bar, 
and tilde are plausible delimiters.
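
A minimal Python sketch of the frequency-plus-plausibility idea above. The 
names candidate_chars, guess_roles, LIKELY_QUOTES and LIKELY_DELIMITERS are 
made up for illustration, and the two character sets are just the ones named 
in the previous paragraph; none of this is part of the csv module.

```python
from collections import Counter

# Assumed plausibility sets, taken from the characters named above.
LIKELY_QUOTES = "\"'"
LIKELY_DELIMITERS = "\t,;:|~"


def candidate_chars(sample, n=2):
    """Return the n most frequent non-alphanumeric characters."""
    counts = Counter(
        ch for ch in sample if not ch.isalnum() and ch not in "\r\n"
    )
    return [ch for ch, _ in counts.most_common(n)]


def guess_roles(sample):
    """Return a (delimiter, quotechar) guess; either may be None."""
    cands = candidate_chars(sample)
    if not cands:
        return None, None  # no candidates: one field per record
    # Prefer a plausible delimiter, then any non-quote, then the top one.
    plausible = [c for c in cands if c in LIKELY_DELIMITERS]
    non_quotes = [c for c in cands if c not in LIKELY_QUOTES]
    delimiter = (plausible or non_quotes or cands)[0]
    quotes = [c for c in cands if c in LIKELY_QUOTES and c != delimiter]
    return delimiter, (quotes[0] if quotes else None)
```

Note that this counts spaces like any other non-alphanumeric character, so 
free text in the data can skew the guess; it is only a first cut.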

Some cautions:

(1) "Warning -- Europeans here";1,234;5,678

(2) Joe Blow~'The Vaults',456 Main St,Snowtown,SA,5999~31/12/1999~01/04/2000
# delimiter (tilde) occurs 3 times, no quotechar at all, data characters 
comma and slash occur 4 times each (more than delimiter).
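
To check the counts in caution (2), here is a short Python snippet. The 
variable names are mine; the record is the example above joined onto one 
line.

```python
from collections import Counter

record = (
    "Joe Blow~'The Vaults',456 Main St,Snowtown,SA,5999"
    "~31/12/1999~01/04/2000"
)

# Tally every non-alphanumeric character in the record.
counts = Counter(ch for ch in record if not ch.isalnum())

# The real delimiter is outnumbered by ordinary data characters.
tilde, comma, slash = counts["~"], counts[","], counts["/"]

# Splitting on the tilde still yields the intended four fields.
fields = record.split("~")
```

So a pure frequency count would rank comma and slash above the tilde here, 
even though splitting on the tilde gives the four intended fields.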

In any case, it appears to me that you can't pronounce on the result until 
you've parsed a large chunk of the file with each plausible hypothesis, 
especially if the hypothesis admits (quoted) newlines inside the data. Some 
possible decision criteria are (1) the percentage of syntax errors and 
(2) the standard deviation of the number of columns ...
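
A rough Python sketch of scoring one hypothesis by those two criteria, 
using the standard csv and statistics modules. The function name 
score_hypothesis and the way the two numbers are returned are my own 
assumptions, not an agreed design. Strict mode is switched on so that 
malformed quoting raises csv.Error instead of being silently accepted.

```python
import csv
import io
import statistics


def score_hypothesis(sample, delimiter, quotechar=None):
    """Parse sample under one hypothesis.

    Returns (error_rate, column_stdev); lower is better for both.
    """
    kwargs = {"delimiter": delimiter, "strict": True}
    if quotechar is None:
        kwargs["quoting"] = csv.QUOTE_NONE  # hypothesis: no quoting at all
    else:
        kwargs["quotechar"] = quotechar
    reader = csv.reader(io.StringIO(sample, newline=""), **kwargs)

    widths, errors = [], 0
    while True:
        try:
            row = next(reader)
        except StopIteration:
            break
        except csv.Error:
            errors += 1  # count the bad record and carry on
            continue
        if row:  # ignore blank lines
            widths.append(len(row))

    total = errors + len(widths)
    error_rate = errors / total if total else 0.0
    stdev = statistics.pstdev(widths) if widths else 0.0
    return error_rate, stdev
```

One caveat: a wrong delimiter that never occurs in the data gives exactly 
one column per record, which scores a perfect standard deviation of zero, 
so these numbers need to be combined with the plausibility checks rather 
than trusted on their own.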

Hope this helps,
John