From skip at pobox.com  Wed Dec 28 17:15:25 2005
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 28 Dec 2005 10:15:25 -0600
Subject: [Csv] Sniffer empty delimiter
Message-ID: <17330.47645.32970.405332@montanaro.dyndns.org>


In this bug report:

    http://python.org/sf/1157169

Neil Schemenauer reports a problem with this code:

    >>> d = csv.Sniffer().sniff('abc', ['\t', ','])
    >>> csv.reader(['abc'], d) 
    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    TypeError: bad argument type for built-in operation

In his Sniffer case it is clear that neither TAB nor comma is an explicit
delimiter.  It's also not clear what the delimiter is.  The generated
dialect ends up with an empty delimiter.  I can see three possible remedies:

    1. raise csv.Error from Sniffer.sniff

    2. return comma as the "standard" delimiter or because the sample
       appears to only have a single column

    3. return TAB as it's first in the delimiters list.

I'm sure there are other candidates ("b" because it separates "a" and "c"?)

Any thoughts about the best "remedy" to this problem?  It's clear that
letting the empty delimiter escape into the wild is a problem.

Skip


From sjmachin at lexicon.net  Wed Dec 28 23:24:15 2005
From: sjmachin at lexicon.net (John Machin)
Date: Thu, 29 Dec 2005 09:24:15 +1100
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <17330.47645.32970.405332@montanaro.dyndns.org>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
Message-ID: <43B3108F.2010403@lexicon.net>

skip at pobox.com wrote:
> In this bug report:
> 
>     http://python.org/sf/1157169
> 
> Neil Schemenauer reports a problem with this code:
> 
>     >>> d = csv.Sniffer().sniff('abc', ['\t', ','])
>     >>> csv.reader(['abc'], d) 
>     Traceback (most recent call last):
>     File "<stdin>", line 1, in ?
>     TypeError: bad argument type for built-in operation
> 
> In his Sniffer case it is clear that neither TAB nor comma is an explicit
> delimiter.  It's also not clear what the delimiter is.  The generated
> dialect ends up with an empty delimiter.  I can see three possible remedies:
> 
>     1. raise csv.Error from Sniffer.sniff
> 
>     2. return comma as the "standard" delimiter or because the sample
>        appears to only have a single column
> 
>     3. return TAB as it's first in the delimiters list.
> 
> I'm sure there are other candidates ("b" because it separates "a" and "c"?)
> 
> Any thoughts about the best "remedy" to this problem?  It's clear that
> letting the empty delimiter escape into the wild is a problem.
> 
> Skip


Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> d = csv.Sniffer().sniff('a|b|c|d|e', ['\t', ','])
>>> d.delimiter
''
>>> d = csv.Sniffer().sniff('a|b|c|d|e')
>>> d.delimiter
'a'
>>>

Skip,

Some thoughts:

(1) IMHO it should *NEVER* return an alphabetic or numeric character as
the delimiter.

(2) If there is insufficient sample to determine the dialect's
attributes, then it shouldn't pluck them out of the air, with no
indication to the caller that there might be a problem. IOW I don't like 
the "remedies" of "return standard delimiter" and "return first 
delimiter". It should raise csv.Error; the discerning caller can then 
take appropriate action.

(3) Some documentation on how the 2nd arg is used would be a good idea,
as would be an explanation of the relationship with the undocumented
"preferred" attribute:

>>> csv.Sniffer().preferred
[',', '\t', ';', ' ', ':']
>>>

(4) Too late to change now, but having a class with no args to its
constructor and only one other method has a whiff of some other language :-)

(5) But the doco is not correct, there are 2 non-constructor methods:

>>> csv.Sniffer().has_header("x")
True
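
Points (1) and (2) can also be enforced from the caller's side without
changing the module.  A minimal sketch for a current Python 3, assuming two
hypothetical helpers, is_plausible_delimiter() and sniff_strict() (neither
exists in the csv module), and assuming csv.Error is the desired failure
signal:

```python
import csv


def is_plausible_delimiter(delim):
    """Return True for a single non-alphanumeric character (hypothetical rule)."""
    return isinstance(delim, str) and len(delim) == 1 and not delim.isalnum()


def sniff_strict(sample, delimiters=None):
    """Sniff a dialect, but refuse empty or alphanumeric delimiters.

    Hypothetical wrapper, not part of the csv module.
    """
    dialect = csv.Sniffer().sniff(sample, delimiters)
    if not is_plausible_delimiter(dialect.delimiter):
        raise csv.Error("implausible delimiter: %r" % (dialect.delimiter,))
    return dialect
```

With this, a caller never sees '' or 'a' as a delimiter; it sees csv.Error
instead and can decide what to do next.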

Cheers,
John




From skip at pobox.com  Thu Dec 29 01:07:50 2005
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 28 Dec 2005 18:07:50 -0600
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <43B3108F.2010403@lexicon.net>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
	<43B3108F.2010403@lexicon.net>
Message-ID: <17331.10454.131254.852426@montanaro.dyndns.org>


    Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import csv
    >>> d = csv.Sniffer().sniff('a|b|c|d|e', ['\t', ','])
    >>> d.delimiter
    ''
    >>> d = csv.Sniffer().sniff('a|b|c|d|e')
    >>> d.delimiter
    'a'

Both of these seem wrong to me at some level.  I tend to agree with you that
if delimiter detection fails it should raise an exception, certainly if the
delimiters argument defines a set of characters from which the actual
delimiter must be chosen (does it?).  The second has to be considered a bug,
doesn't it?

    John> (1) IMHO it should *NEVER* return an alphabetic or numeric
    John>     character as the delimiter.

Probably a good rule of thumb.

    John> (2) If there is insufficient sample to determine the dialect's
    John>     attributes, then it shouldn't pluck them out of the air, with
    John>     no indication to the caller that there might be a problem. IOW
    John>     I don't like the "remedies" of "return standard delimiter" and
    John>     "return first delimiter". It should raise csv.Error; the
    John>     discerning caller can then take appropriate action.

If I have a csv file that happens to only have one column and I'm using the
sniffer (presumably because I have an app that processes somewhat arbitrary
csv files) I'd hate for it to fail in that one case.  For that case maybe we
can define an optional default arg that is a single character.  Failing all
other tests, the default is returned.

    John> (3) Some documentation on how the 2nd arg is used would be a good
    John>     idea, as would be an explanation of the relationship with the
    John>     undocumented "preferred" attribute:

Agreed.  I seem to recall you're the author.  Got some text? <wink>

    >>> csv.Sniffer().preferred
    [',', '\t', ';', ' ', ':']

    John> (4) Too late to change now, but having a class with no args to its
    John>     constructor and only one other method has a whiff of some
    John>     other language :-)

It's not too late to add an optional preferred arg to the constructor.

    John> (5) But the doco is not correct, there are 2 non-constructor
    John>     methods:

Yeah, I already noticed and fixed that.  That was easy. ;-)

Skip


From fdrake at acm.org  Thu Dec 29 03:19:49 2005
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 28 Dec 2005 21:19:49 -0500
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <17331.10454.131254.852426@montanaro.dyndns.org>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
	<43B3108F.2010403@lexicon.net>
	<17331.10454.131254.852426@montanaro.dyndns.org>
Message-ID: <200512282119.50120.fdrake@acm.org>

On Wednesday 28 December 2005 19:07, skip at pobox.com wrote:
 > arbitrary csv files) I'd hate for it to fail in that one case.  For that
 > case maybe we can define an optional default arg that is a single
 > character.  Failing all other tests, the default is returned.

The default shouldn't be type-checked (including string length), but should 
simply be returned if provided.  This allows the caller to determine the 
significance of getting back the passed-in value.  I guess you could think of 
it as similar to the third argument of getattr().  :-)
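
The getattr() analogy can be sketched as a caller-side wrapper.  This is only
an illustration: sniff_delimiter() and the _MISSING sentinel are hypothetical
names, and the sketch assumes a sniffer that raises csv.Error when it cannot
decide (the behavior proposed in this thread, not what 2.4.2 does).

```python
import csv

_MISSING = object()  # hypothetical sentinel meaning "no default supplied"


def sniff_delimiter(sample, delimiters=None, default=_MISSING):
    """Return the sniffed delimiter, or `default` if sniffing fails.

    Like the third argument of getattr(), `default` is returned as-is,
    with no type or length check; without it the csv.Error propagates.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters).delimiter
    except csv.Error:
        if default is _MISSING:
            raise
        return default
```

A caller could pass default=None and test for None afterwards, which keeps
the "no delimiter found" marker distinct from any usable delimiter.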


  -Fred

-- 
Fred L. Drake, Jr.   <fdrake at acm.org>



From skip at pobox.com  Thu Dec 29 05:59:14 2005
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 28 Dec 2005 22:59:14 -0600
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <200512282119.50120.fdrake@acm.org>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
	<43B3108F.2010403@lexicon.net>
	<17331.10454.131254.852426@montanaro.dyndns.org>
	<200512282119.50120.fdrake@acm.org>
Message-ID: <17331.27938.293758.849779@montanaro.dyndns.org>


    >> For that case maybe we can define an optional default arg that is a
    >> single character.  Failing all other tests, the default is returned.

    Fred> The default shouldn't be type-checked (including string length),
    Fred> but should simply be returned if provided.  This allows the caller
    Fred> to determine the significance of getting back the passed-in value.

Hmmm...  To preserve current (incorrect?) behavior I think the default
almost has to be "".  To be useful though, it has to be a single-character
string given the current limitations of the module.

Skip


From fdrake at acm.org  Thu Dec 29 07:23:28 2005
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 29 Dec 2005 01:23:28 -0500
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <17331.27938.293758.849779@montanaro.dyndns.org>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
	<200512282119.50120.fdrake@acm.org>
	<17331.27938.293758.849779@montanaro.dyndns.org>
Message-ID: <200512290123.29482.fdrake@acm.org>

On Wednesday 28 December 2005 23:59, skip at pobox.com wrote:
 > Hmmm...  To preserve current (incorrect?) behavior I think the default
 > almost has to be "".  To be useful though, it has to be a single-character
 > string given the current limitations of the module.

That's a reasonable requirement for a delimiter used for parsing, and I'm not 
suggesting that that not be a requirement for that.  But if it's a marker 
object so the caller can determine that no delimiter was determined, then 
it's still up to the caller to check for that and either not parse or deal 
with it some other way (ask the user, for instance).

I'm not sure it's a big deal, but that's my thought on the matter at any 
rate.  ;-)


  -Fred

-- 
Fred L. Drake, Jr.   <fdrake at acm.org>


From sjmachin at lexicon.net  Thu Dec 29 08:25:28 2005
From: sjmachin at lexicon.net (John Machin)
Date: Thu, 29 Dec 2005 18:25:28 +1100
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <17331.10454.131254.852426@montanaro.dyndns.org>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
	<43B3108F.2010403@lexicon.net>
	<17331.10454.131254.852426@montanaro.dyndns.org>
Message-ID: <43B38F68.9030603@lexicon.net>

skip at pobox.com wrote:
>     Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
>     Type "help", "copyright", "credits" or "license" for more information.
>     >>> import csv
>     >>> d = csv.Sniffer().sniff('a|b|c|d|e', ['\t', ','])
>     >>> d.delimiter
>     ''
>     >>> d = csv.Sniffer().sniff('a|b|c|d|e')
>     >>> d.delimiter
>     'a'
> 
> Both of these seem wrong to me at some level.  I tend to agree with you that
> if delimiter detection fails it should raise an exception, certainly if the
> delimiters argument defines a set of characters from which the actual
> delimiter must be chosen (does it?). 

I've got no idea what the delimiters argument is for. That's why I 
suggested it be documented. Contrary to your recollection, I am *not* 
the author of any part of the csv module.


> The second has to be considered a bug
> doesn't it?

Yes. I regard the notion of an alphanumeric character being a delimiter 
as utterly preposterous.


> 
>     John> (1) IMHO it should *NEVER* return an alphabetic or numeric
>     John>     character as the delimiter.
> 
> Probably a good rule of thumb.
> 
>     John> (2) If there is insufficient sample to determine the dialect's
>     John>     attributes, then it shouldn't pluck them out of the air, with
>     John>     no indication to the caller that there might be a problem. IOW
>     John>     I don't like the "remedies" of "return standard delimiter" and
>     John>     "return first delimiter". It should raise csv.Error; the
>     John>     discerning caller can then take appropriate action.
> 
> If I have a csv file that happens to only have one column and I'm using the
> sniffer (presumably because I have an app that processes somewhat arbitrary
> csv files) I'd hate for it to fail in that one case.  For that case maybe we
> can define an optional default arg that is a single character.  Failing all
> other tests, the default is returned.

Optional default arg *plus* an exception? Holy redundancy, Batman!

Caller can do this:

try:
     d = csv.Sniffer().sniff(sample)
except csv.Error:
     d = my_default_dialect

> 
>     John> (3) Some documentation on how the 2nd arg is used would be a good
>     John>     idea, as would be an explanation of the relationship with the
>     John>     undocumented "preferred" attribute:
> 
> Agreed.  I seem to recall you're the author.  Got some text? <wink>

Not so. In fact I'd not even used the sniffer before today.

> 
>     >>> csv.Sniffer().preferred
>     [',', '\t', ';', ' ', ':']
> 
>     John> (4) Too late to change now, but having a class with no args to its
>     John>     constructor and only one other method has a whiff of some
>     John>     other language :-)
> 
> It's not too late to add an optional preferred arg to the constructor.


Maybe it's even not too late to get some feedback from the actual users and
to spec out the sniffer a bit more rigorously and then ensure it meets 
that spec.

Cheers,
John


From skip at pobox.com  Thu Dec 29 14:32:47 2005
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 29 Dec 2005 07:32:47 -0600
Subject: [Csv] Sniffer empty delimiter
In-Reply-To: <43B38F68.9030603@lexicon.net>
References: <17330.47645.32970.405332@montanaro.dyndns.org>
	<43B3108F.2010403@lexicon.net>
	<17331.10454.131254.852426@montanaro.dyndns.org>
	<43B38F68.9030603@lexicon.net>
Message-ID: <17331.58751.825698.196812@montanaro.dyndns.org>


    John> Contrary to your recollection, I am *not* the author of any part
    John> of the csv module.

Ah, sorry about that.  I'm not the sniffer's author (never used it in fact).

    John> Optional default arg *plus* an exception? Holy redundancy, Batman!

Well, yeah, it's an either/or sort of thing.  I was thinking out loud.

    John> Caller can do this:

    John> try:
    John>      d = csv.Sniffer().sniff(sample)
    John> except csv.Error:
    John>      d = my_default_dialect

Yeah, but today that code would be written

    d = csv.Sniffer().sniff(sample)
    try:
        rdr = csv.reader(f, d)
    except TypeError:
        blah blah blah

so there's a backwards compatibility problem since the exception is raised
by the reader class, not the sniffer.
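
Code that has to cope with both behaviors could guard both calls.  A rough
sketch for a current Python 3, assuming a hypothetical helper reader_for()
and a fallback to the built-in "excel" dialect (the fallback choice is an
assumption, not something proposed above):

```python
import csv


def reader_for(lines, sample, delimiters=",\t", fallback="excel"):
    """Build a reader from a sniffed dialect, falling back if that fails.

    csv.Error covers a sniffer that refuses to guess; TypeError covers a
    reader that rejects the sniffed dialect (for example an empty delimiter).
    """
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters)
        return csv.reader(lines, dialect)
    except (csv.Error, TypeError):
        return csv.reader(lines, fallback)
```

Catching both exception types means the caller does not need to know which
of the two layers reported the problem.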

    John> (3) Some documentation on how the 2nd arg is used would be a good
    John> idea, as would be an explanation of the relationship with the
    John> undocumented "preferred" attribute:
    >> 
    >> Agreed.  I seem to recall you're the author.  Got some text? <wink>

    John> Not so. In fact I'd not even used the sniffer before today.

Unfortunately, neither have I.

    John> Maybe it's even not too late to get some feedback from the actual
    John> users and to spec out the sniffer a bit more rigorously and then
    John> ensure it meets that spec.

That sounds good as well.  If the API is going to change, might as well
change it in a useful, non-speculative direction.

Skip



From skip at pobox.com  Fri Dec 30 06:19:28 2005
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 29 Dec 2005 23:19:28 -0600
Subject: [Csv] improvement(?) to Sniffer._guess_delimiter()
Message-ID: <17332.50016.909334.960552@montanaro.dyndns.org>

I just checked in a change to csv.py (svn revision 41849).  Previously, the
sniffer returned "a" as the delimiter for this sample

    a|b|c\r\nd|e|f\r\n

Now it correctly returns "|", but I don't know if my code is any better than
the original.  The description of what _guess_delimiter() does (which I
don't really understand) is in its doc string.  The key change is in the
"punt" section of the code at the end.  All other attempts to select a
delimiter have failed.  I just punt differently than the original code:

-        # finally, just return the first damn character in the list
-        delim = delims.keys()[0]
+        # nothing else indicates a preference, pick the character that
+        # dominates(?)
+        items = [(v,k) for (k,v) in delims.items()]
+        items.sort()
+        delim = items[-1][1]

Is my change an actual improvement or just serendipity?
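
The difference between the two punts can be looked at in isolation.  The
sketch below is written for a current Python 3 and assumes that delims maps
each candidate character to a (frequency, line count) pair, as the modes in
_guess_delimiter() appear to.  The function names are hypothetical and the
sample values are made up for illustration, not taken from a real sniffer
run.

```python
def punt_first(delims):
    """Old punt: whichever key the dict yields first (an arbitrary choice)."""
    return list(delims.keys())[0]


def punt_dominant(delims):
    """New punt: the character whose value tuple sorts highest."""
    items = [(v, k) for (k, v) in delims.items()]
    items.sort()
    return items[-1][1]


# Made-up candidates: character -> (frequency per line, number of lines).
candidates = {"a": (1, 1), "|": (2, 2)}
```

Here punt_dominant(candidates) picks "|", while punt_first() depends only on
dict ordering.  Ties between equal tuples are broken by the character
itself, so the result is deterministic but not necessarily meaningful.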

Thx,

Skip