[Csv] What's our status?
John Machin
sjmachin at lexicon.net
Thu Feb 27 13:12:16 CET 2003
On 26 Feb 2003 14:10:48 -0800, Cliff Wells <LogiplexSoftware at earthlink.net>
wrote:
>
> However, for the following I am so far unable to come up with a way to
> determine the delimiter:
>
> all,work,and,no,play,makes,jack,a,dull,boy
> all,work,and,no,play,makes,jack,a,dull
> boy
> all,work,and,no,play,makes,jack,a
[snip]
>
> Anyone have a suggestion? All work and no play makes jack a dull boy.
[Warning: late at night, OTTOMH, may contain babblings]
Errrmmm, maybe I've missed the plot or lost the point or whatever, but a
good start would be assuming that only in pathological cases would the
delimiter or the quote be an alphanumeric character, i.e. assume the file
has been produced by an ordinary user, not a red-team tester.
Try the most frequent two non-alphanumeric characters as the candidates for
the delimiter and the quotechar? If there's only 1 non-alphanumeric
character, then it's the delimiter.
If there aren't any non-AN chars [an example in one of your messages], then
there's only one field per record.
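
A rough sketch of that frequency count, in Python. The function name
candidate_chars and its parameters are made up for illustration; skipping
only line terminators (and not spaces) is my assumption, since nothing
above says how whitespace should be treated.

```python
from collections import Counter


def candidate_chars(sample, top=2, ignore="\r\n"):
    """Return up to `top` of the most frequent non-alphanumeric characters.

    `ignore` lists characters that are never candidates. Defaulting it to
    line terminators only is an assumption, not something the post settles.
    """
    counts = Counter(
        ch for ch in sample
        if not ch.isalnum() and ch not in ignore
    )
    return [ch for ch, _ in counts.most_common(top)]
```

An empty result would mean one field per record, a single character would
be taken as the delimiter, and two characters still need to be sorted into
delimiter and quotechar by some other rule.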
Where there are two or more candidates for the delimiter and quotechar, you
could use some plausibility heuristics, e.g. " and ' are more likely to be
quotes than delimiters, whereas tab, comma, semicolon, colon, vertical bar,
and tilde are plausible delimiters.
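
One way to write that plausibility rule down, as a sketch only. The two
character sets are exactly the ones listed above; the name assign_roles and
the choice to drop candidates that fall in neither set are assumptions.

```python
LIKELY_QUOTES = "\"'"
LIKELY_DELIMITERS = "\t,;:|~"


def assign_roles(candidates):
    """Split candidate characters into (delimiters, quotechars) guesses.

    Candidates found in neither set are dropped, which is an assumption of
    this sketch rather than a rule stated in the post.
    """
    delimiters = [c for c in candidates if c in LIKELY_DELIMITERS]
    quotechars = [c for c in candidates if c in LIKELY_QUOTES]
    return delimiters, quotechars
```

This only narrows the field. It cannot tell which of two plausible
delimiters is the real one.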
Some cautions:
(1) "Warning -- Europeans here";1,234;5,678
(2) Joe Blow~'The Vaults',456 Main St,Snowtown,SA,5999~31/12/1999~01/04/2000
# delimiter (tilde) occurs 3 times, no quotechar at all, data characters
comma and slash occur 4 times each (more than delimiter).
In any case, it appears to me that you can't pronounce on the result until
you've parsed a large chunk of the file with each plausible hypothesis,
especially if the hypothesis admits (quoted) newlines inside the data. Some
possible decision criteria are (1) percentage of syntax errors (2) standard
deviation of number of columns ...
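
A minimal sketch of criterion (2): parse the sample under each hypothesis
and compare the spread in fields per record. It leans on the csv.reader
found in today's Python standard library purely as a convenient parser; the
names column_spread and best_hypothesis are hypothetical, and criterion (1)
is not covered. Note that a delimiter which never occurs also gives a
spread of zero (one field per record every time), so this score cannot be
the only test.

```python
import csv
import io
import statistics


def column_spread(sample, delimiter, quotechar=None):
    """Population std. deviation of fields per record under one hypothesis."""
    source = io.StringIO(sample, newline="")
    if quotechar is None:
        # Hypothesis: no quoting at all.
        reader = csv.reader(source, delimiter=delimiter,
                            quoting=csv.QUOTE_NONE)
    else:
        # Hypothesis: quoted fields may contain delimiters and newlines.
        reader = csv.reader(source, delimiter=delimiter,
                            quotechar=quotechar)
    widths = [len(row) for row in reader if row]  # skip blank lines
    return statistics.pstdev(widths) if widths else float("inf")


def best_hypothesis(sample, hypotheses):
    """Return the (delimiter, quotechar) pair with the smallest spread.

    Ties go to whichever hypothesis comes first in `hypotheses`.
    """
    return min(hypotheses, key=lambda h: column_spread(sample, *h))
```

Calling best_hypothesis(text, [("~", None), (",", "'")]) on a large chunk of
the file would then pick whichever pairing gives the most regular column
count.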
Hope this helps,
John