From skip at pobox.com  Sat Feb  1 00:18:13 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 31 Jan 2003 17:18:13 -0600
Subject: [Csv] RE: [Python-Dev] PEP 305 - CSV File API
In-Reply-To: <000101c2c97d$19cf82c0$8901a8c0@ERICDESKTOP>
References: <15930.61900.995242.11815@montanaro.dyndns.org>
        <000101c2c97d$19cf82c0$8901a8c0@ERICDESKTOP>
Message-ID: <15931.1077.597442.713603@montanaro.dyndns.org>


    eric> Travis Oliphant made a nice package for reading and writing
    eric> numeric arrays in scipy called scipy.io....  I wanted everyone
    eric> aware of the available alternative solutions so we can minimize
    eric> duplicated effort.

Eric,

Thanks for the heads up.  Travis, why don't you subscribe to the
csv at mail.mojam.com mailing list and join the fun?  We're already considering
how the csv module will interface with DB-API-based modules, and of course,
Excel is central to our thoughts.  It would be good to have the perspective
of someone used to slinging scientific data around.

The csv mailing list page is at

    http://manatee.mojam.com/mailman/listinfo/csv

Skip

From djc at object-craft.com.au  Sat Feb  1 06:20:23 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 01 Feb 2003 16:20:23 +1100
Subject: [Csv] csv.QUOTE_NEVER?
In-Reply-To: <15930.60672.18719.407166@montanaro.dyndns.org>
References: <15930.60672.18719.407166@montanaro.dyndns.org>
Message-ID: <m38yx0shk8.fsf@ferret.object-craft.com.au>

>>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:

Skip> The three quoting constants are currently defined as
Skip> QUOTE_MINIMAL, QUOTE_ALL and QUOTE_NONNUMERIC.  Didn't we decide
Skip> there would be a QUOTE_NEVER constant as well?

I was going to define QUOTE_NEVER, then realised that all you have to
do is set quotechar to None.  Why go to the effort of implementing two
ways to achieve the same thing?
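[For the archive: the module as eventually shipped grew a QUOTE_NONE
constant alongside quotechar.  A minimal sketch of the never-quote
behaviour with today's csv module - a modernised illustration, not the
2003 _csv prototype under discussion here:]

```python
import csv
import io

# With QUOTE_NONE, nothing is ever quoted; an escapechar must then be
# supplied so delimiters embedded in a field can still round-trip.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar='\\')
writer.writerow(['a', 2, 'hello, there'])
print(buf.getvalue())  # a,2,hello\, there
```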

- Dave

-- 
http://www.object-craft.com.au


From djc at object-craft.com.au  Sat Feb  1 06:26:11 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 01 Feb 2003 16:26:11 +1100
Subject: [Csv] Access Products sample
In-Reply-To: <1044037040.15753.190.camel@software1.logiplex.internal>
References: <KJEOLDOPMIDKCMJDCNDPGEDDCNAA.altis@semi-retired.com>
	<1043957410.16012.122.camel@software1.logiplex.internal>
	<15929.37687.44696.305338@montanaro.dyndns.org>
	<m3adhif596.fsf@ferret.object-craft.com.au>
	<1044037040.15753.190.camel@software1.logiplex.internal>
Message-ID: <m34r7oshak.fsf@ferret.object-craft.com.au>

>>>>> "Cliff" == Cliff Wells <LogiplexSoftware at earthlink.net> writes:

Cliff> On Thu, 2003-01-30 at 18:00, Dave Cole wrote:
>> >>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:
>> 
>> >>> The currency column in the table is actually written out with
>> >>> formatting ($5.66 instead of just 5.66). Note that when Excel
>> >>> exports this column it has a trailing space for some reason
>> >>> (,$5.66 ,).
>> 
Cliff> So we've actually found an application that puts an extraneous
Cliff> space around the data, and it's our primary target.  Figures.
>>
Skip> So we just discovered we need an "access" dialect. ;-)
>>  Not really.  Python has no concept of currency types (last time I
>> looked).  The '$5.66 ' thing is an artifact of converting currency
>> to string, not float to string.

Cliff> I'm not sure what you mean.  A trailing space is a trailing
Cliff> space, regardless of data type.  In this case, it isn't too
Cliff> important as the data isn't quoted (we can just consider the
Cliff> space part of the data), but it shows that extraneous spaces
Cliff> might not be outside the scope of our problem.

In my typically clumsy way I was trying to say that Excel has more
type information available to it regarding the data being exported.
The fact that the data has been formatted as currency tells Excel that
it is not just a float, it is money.  Python does not have a money
type.

It seems that Excel then exports the money in a way that allows it to
restore the formatting/type on import.  Mind you, I have not tried
export/import in Excel; I am just guessing that the type is restored
on import.

- Dave

-- 
http://www.object-craft.com.au


From skip at pobox.com  Sat Feb  1 16:05:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 1 Feb 2003 09:05:27 -0600
Subject: [Csv] csv.QUOTE_NEVER?
In-Reply-To: <m38yx0shk8.fsf@ferret.object-craft.com.au>
References: <15930.60672.18719.407166@montanaro.dyndns.org>
        <m38yx0shk8.fsf@ferret.object-craft.com.au>
Message-ID: <15931.57911.857151.359281@montanaro.dyndns.org>


    Dave> I was going to define QUOTE_NEVER then realised that all you have
    Dave> to do is set quotechar to None.  Why add the effort of
    Dave> implementing two ways to achieve the same thing.

I think there's a certain uniformity in having the full spectrum of quote
behaviors defined (from QUOTE_ALL ... QUOTE_NEVER).  I skimmed the _csv.c
source quickly just now but didn't see self->quoting used anywhere.

Skip


From skip at pobox.com  Sat Feb  1 16:12:00 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 1 Feb 2003 09:12:00 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15931.58304.338512.44007@montanaro.dyndns.org>


Fool that I am, when I announced PEP 305 I didn't set my Reply-To header to
this list.  I'm forwarding a few responses that have turned up on c.l.py.

Skip

-------------- next part --------------
An embedded message was scrubbed...
From: Andrew Dalke <adalke at mindspring.com>
Subject: Re: PEP 305 - CSV File API
Date: Fri, 31 Jan 2003 17:17:48 -0700
Size: 7894
Url: http://mail.python.org/pipermail/csv/attachments/20030201/6af945ac/attachment.mht 

From skip at pobox.com  Sat Feb  1 16:12:09 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 1 Feb 2003 09:12:09 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15931.58313.445471.334543@montanaro.dyndns.org>

An embedded message was scrubbed...
From: Ian Bicking <ianb at colorstudy.com>
Subject: Re: PEP 305 - CSV File API
Date: 31 Jan 2003 20:03:10 -0600
Size: 5991
Url: http://mail.python.org/pipermail/csv/attachments/20030201/a519514f/attachment.mht 

From skip at pobox.com  Sat Feb  1 16:14:01 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 1 Feb 2003 09:14:01 -0600
Subject: [Csv] RE: [Python-Dev] PEP 305 - CSV File API (fwd)
Message-ID: <15931.58425.521404.154286@montanaro.dyndns.org>


Passing this along as well.  Travis Oliphant from the SciPy bunch joined the
group.  He's the author of scipy.io which includes facilities to read and
write data in various formats.  I haven't looked at the package.  I'll let
Travis summarize its relevant capabilities.

Skip

-------------- next part --------------
An embedded message was scrubbed...
From: Travis Oliphant <oliphant.travis at ieee.org>
Subject: RE: [Python-Dev] PEP 305 - CSV File API
Date: 31 Jan 2003 19:55:33 -0700
Size: 5100
Url: http://mail.python.org/pipermail/csv/attachments/20030201/924db701/attachment.mht 

From skip at pobox.com  Sat Feb  1 16:14:11 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 1 Feb 2003 09:14:11 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15931.58435.789383.690793@montanaro.dyndns.org>

An embedded message was scrubbed...
From: Max M <maxm at mxm.dk>
Subject: Re: PEP 305 - CSV File API
Date: Sat, 01 Feb 2003 13:43:01 +0100
Size: 4518
Url: http://mail.python.org/pipermail/csv/attachments/20030201/0592e1ed/attachment.mht 

From skip at pobox.com  Sat Feb  1 16:14:18 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 1 Feb 2003 09:14:18 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15931.58442.615341.935260@montanaro.dyndns.org>

An embedded message was scrubbed...
From: Roman Suzi <rnd at onego.ru>
Subject: Re: PEP 305 - CSV File API
Date: Sat, 1 Feb 2003 16:50:12 +0300 (MSK)
Size: 4160
Url: http://mail.python.org/pipermail/csv/attachments/20030201/397dff5b/attachment.mht 

From djc at object-craft.com.au  Sun Feb  2 10:42:59 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 02 Feb 2003 20:42:59 +1100
Subject: [Csv] csv.QUOTE_NEVER?
In-Reply-To: <15931.57911.857151.359281@montanaro.dyndns.org>
References: <15930.60672.18719.407166@montanaro.dyndns.org>
	<m38yx0shk8.fsf@ferret.object-craft.com.au>
	<15931.57911.857151.359281@montanaro.dyndns.org>
Message-ID: <m3r8arkogs.fsf@ferret.object-craft.com.au>

>>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:

Dave> I was going to define QUOTE_NEVER then realised that all you
Dave> have to do is set quotechar to None.  Why add the effort of
Dave> implementing two ways to achieve the same thing.

Skip> I think there's a certain uniformity in having the full spectrum
Skip> of quote behaviors defined (from QUOTE_ALL ... QUOTE_NEVER).  I
Skip> skimmed the _csv.c source quickly just now but didn't see
Skip> self->quoting used anywhere.

Not implemented yet.  The options on quoting are extensions to the
current module behaviour.

- Dave

-- 
http://www.object-craft.com.au


From djc at object-craft.com.au  Sun Feb  2 12:04:42 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 02 Feb 2003 22:04:42 +1100
Subject: [Csv] Added code to implement quoting styles
Message-ID: <m3isw3kkol.fsf@ferret.object-craft.com.au>

>>> import _csv
>>> 
>>> p = _csv.parser(escapechar='\\')
>>> l = ('a',2,'hello, there')
>>> 
>>> for i in range(4):
...     p.quoting = i
...     print p.join(l)
... 
a,2,"hello, there"
"a","2","hello, there"
"a",2,"hello, there"
a,2,hello\, there
>>> p.escapechar = None
>>> print p.join(l)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
_csv.Error: delimter must be quoted or escaped

Ooops - just noticed the spelling error - I will fix that.

- Dave

-- 
http://www.object-craft.com.au


From djc at object-craft.com.au  Sun Feb  2 12:15:45 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 02 Feb 2003 22:15:45 +1100
Subject: [Csv] Implemented skipinitialspace
Message-ID: <m365s3kk66.fsf@ferret.object-craft.com.au>

Well that was easy, just one extra test.

>>> import _csv
>>> p = _csv.parser()
>>> s = '"quoted", "not quoted, but this ""field"" has delimiters and quotes"'
>>> p.skipinitialspace = 0
>>> p.parse(s)
['quoted', ' "not quoted', ' but this ""field"" has delimiters and quotes"']
>>> p.skipinitialspace = 1
>>> p.parse(s)
['quoted', 'not quoted, but this "field" has delimiters and quotes']
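[For the archive: the same demonstration through the modern reader
interface - an updated sketch, since the 2003 prototype exposed this
via _csv.parser instead:]

```python
import csv

# A leading space keeps the second field from being treated as quoted
# unless skipinitialspace is on.
row = '"quoted", "not quoted, but this ""field"" has delimiters and quotes"'
kept = list(csv.reader([row], skipinitialspace=False))[0]
skipped = list(csv.reader([row], skipinitialspace=True))[0]
print(kept)
print(skipped)
```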

- Dave

-- 
http://www.object-craft.com.au


From djc at object-craft.com.au  Sun Feb  2 13:00:08 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 02 Feb 2003 23:00:08 +1100
Subject: [Csv] Implemented lineterminator
Message-ID: <m31y2qlwon.fsf@ferret.object-craft.com.au>

The _csv.parser.join() now appends the lineterminator to the resulting
record.

>>> import _csv
>>> p = _csv.parser()
>>> p.join([1,2,3])
'1,2,3\r\n'
>>> p.lineterminator = '\n'
>>> p.join([1,2,3])
'1,2,3\n'

I have not put any code into the parser to detect and report/fix
fields containing newlines that do not match the lineterminator.

What should be happening there?
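[For the archive: in the csv module as it eventually shipped, nothing
special happens - lineterminator only terminates records, and a field
containing a newline is simply quoted.  A sketch with today's module,
not the 2003 prototype:]

```python
import csv
import io

buf = io.StringIO()
w = csv.writer(buf, lineterminator='\n')
w.writerow([1, 2, 3])
# The embedded '\n' does not match any record boundary; the field is
# quoted so the newline survives a later parse.
w.writerow(['multi\nline', 'b'])
print(repr(buf.getvalue()))  # '1,2,3\n"multi\nline",b\n'
```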

- Dave

-- 
http://www.object-craft.com.au


From djc at object-craft.com.au  Sun Feb  2 13:26:11 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 02 Feb 2003 23:26:11 +1100
Subject: [Csv] Made some small changes to the PEP
Message-ID: <m3wukikgws.fsf@ferret.object-craft.com.au>

Here is the commit message:

Changed the csv.reader() fileobj argument to an iterable.  This gives
us much more flexibility in processing filtered data.
Made the example excel dialect match the dialect in csv.py.
Added explanation of doublequote.
Added explanation of csv.QUOTE_NONE.

- Dave

-- 
http://www.object-craft.com.au


From skip at pobox.com  Sun Feb  2 15:22:58 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 08:22:58 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.10690.465556.671158@montanaro.dyndns.org>

Passing along so we get it in the list archive...

Skip

-------------- next part --------------
An embedded message was scrubbed...
From: Tyler Eaves <tyler at cg1.org>
Subject: Re: PEP 305 - CSV File API
Date: Sun, 02 Feb 2003 03:14:17 GMT
Size: 7256
Url: http://mail.python.org/pipermail/csv/attachments/20030202/cf2350b1/attachment.mht 

From skip at pobox.com  Sun Feb  2 15:25:37 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 08:25:37 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.10849.138459.812300@montanaro.dyndns.org>

Another for the archive.

-------------- next part --------------
An embedded message was scrubbed...
From: Jack Diederich <jack at performancedrivers.com>
Subject: Re: PEP 305 - CSV File API
Date: Sat, 1 Feb 2003 22:43:37 -0500
Size: 4895
Url: http://mail.python.org/pipermail/csv/attachments/20030202/5fecd21c/attachment.mht 

From skip at pobox.com  Sun Feb  2 18:35:02 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 11:35:02 -0600
Subject: [Csv] Re: [Python-Dev] PEP 305 - CSV File API
Message-ID: <15933.22214.952419.308149@montanaro.dyndns.org>

I don't see this one in the archives.  I think Travis sent it to me but
meant to send to the entire list.

Skip

-------------- next part --------------
An embedded message was scrubbed...
From: Travis Oliphant <oliphant.travis at ieee.org>
Subject: Re: [Csv] RE: [Python-Dev] PEP 305 - CSV File API (fwd)
Date: 01 Feb 2003 22:27:35 -0700
Size: 7775
Url: http://mail.python.org/pipermail/csv/attachments/20030202/1d774a74/attachment.mht 

From skip at pobox.com  Sun Feb  2 19:31:07 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 12:31:07 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.25579.554586.898615@montanaro.dyndns.org>

Pushing over to archive

-------------- next part --------------
An embedded message was scrubbed...
From: Jarek Zgoda <jzgoda at usun.gazeta.pl>
Subject: Re: PEP 305 - CSV File API
Date: Sun, 2 Feb 2003 07:42:32 +0000 (UTC)
Size: 5335
Url: http://mail.python.org/pipermail/csv/attachments/20030202/36a3a238/attachment.mht 

From skip at pobox.com  Sun Feb  2 19:32:18 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 12:32:18 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.25650.724660.838965@montanaro.dyndns.org>

for the archives

-------------- next part --------------
An embedded message was scrubbed...
From: Dave Cole <djc at object-craft.com.au>
Subject: Re: PEP 305 - CSV File API
Date: 02 Feb 2003 20:04:20 +1100
Size: 6567
Url: http://mail.python.org/pipermail/csv/attachments/20030202/9e36fe4f/attachment.mht 

From skip at pobox.com  Mon Feb  3 00:20:10 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 17:20:10 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.42922.225695.699946@montanaro.dyndns.org>

archive...
-------------- next part --------------
An embedded message was scrubbed...
From: Dave Cole <djc at object-craft.com.au>
Subject: Re: PEP 305 - CSV File API
Date: 02 Feb 2003 20:40:34 +1100
Size: 6638
Url: http://mail.python.org/pipermail/csv/attachments/20030202/5c6d20a7/attachment.mht 

From skip at pobox.com  Mon Feb  3 00:22:46 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 17:22:46 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.43078.350784.454286@montanaro.dyndns.org>

archive...
-------------- next part --------------
An embedded message was scrubbed...
From: Ian Bicking <ianb at colorstudy.com>
Subject: Re: PEP 305 - CSV File API
Date: 02 Feb 2003 04:01:42 -0600
Size: 7540
Url: http://mail.python.org/pipermail/csv/attachments/20030202/f7d53ca3/attachment.mht 

From skip at pobox.com  Mon Feb  3 00:28:04 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 17:28:04 -0600
Subject: [Csv] Added code to implement quoting styles
In-Reply-To: <m3isw3kkol.fsf@ferret.object-craft.com.au>
References: <m3isw3kkol.fsf@ferret.object-craft.com.au>
Message-ID: <15933.43396.469377.935888@montanaro.dyndns.org>

Looks good.  To avoid the feeling that we have two ways to specify
"don't quote", I think of quotechar as the character to quote with
when quoting isn't QUOTE_NEVER.

Skip


From skip at pobox.com  Mon Feb  3 00:17:58 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 17:17:58 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.42790.923347.973573@montanaro.dyndns.org>

More for the archives...

-------------- next part --------------
An embedded message was scrubbed...
From: Dave Cole <djc at object-craft.com.au>
Subject: Re: PEP 305 - CSV File API
Date: 02 Feb 2003 20:25:38 +1100
Size: 7760
Url: http://mail.python.org/pipermail/csv/attachments/20030202/90188c61/attachment.mht 

From skip at pobox.com  Mon Feb  3 03:13:09 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 20:13:09 -0600
Subject: [Csv] weird default dialects
Message-ID: <15933.53301.891154.795964@montanaro.dyndns.org>

I know the behavior is reasonable, but this code

    class Dialect:
        delimiter = ','
        quotechar = '"'
        escapechar = None
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL

    class excel(Dialect):
        pass

looks really weird to me.  I'd prefer it if the Dialect class simply
defined the various parameters but gave them invalid values like None or
NotImplemented, and then had the excel class fill in its values:

    class Dialect:
        delimiter = None
        quotechar = None
        escapechar = None
        doublequote = None
        skipinitialspace = None
        lineterminator = None
        quoting = None

    class excel(Dialect):
        delimiter = ','
        quotechar = '"'
        escapechar = None
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL

I know that's a bit more verbose, but people probably shouldn't be able to
use Dialect directly, and if they subclass incompletely from Dialect, I
think they should get exceptions.  If what they want is "just like Excel
except ...", they shouldn't be able to get away with subclassing Dialect.
They should have to subclass excel.

I suggested NotImplemented as a possible default value because None *is* a
valid value for at least one of the parameters.
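[A sketch of the proposal above - hypothetical helper code, not the
module as committed: the base class holds placeholder values and
validates on instantiation, so an incompletely specified subclass
fails loudly.]

```python
QUOTE_MINIMAL = 0  # stand-in for the module-level constant

class Dialect:
    # Placeholders only.  escapechar legitimately defaults to None,
    # which is why NotImplemented may be the better sentinel overall;
    # this sketch just skips escapechar in the completeness check.
    delimiter = None
    quotechar = None
    escapechar = None
    doublequote = None
    skipinitialspace = None
    lineterminator = None
    quoting = None

    def __init__(self):
        for attr in ('delimiter', 'quotechar', 'doublequote',
                     'skipinitialspace', 'lineterminator', 'quoting'):
            if getattr(self, attr) is None:
                raise TypeError("incomplete dialect: %s not set" % attr)

class excel(Dialect):
    delimiter = ','
    quotechar = '"'
    escapechar = None
    doublequote = True
    skipinitialspace = False
    lineterminator = '\r\n'
    quoting = QUOTE_MINIMAL
```

Instantiating excel() succeeds; instantiating Dialect() directly, or a
subclass that forgets a parameter, raises TypeError.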

Make sense?

Skip

From andrewm at object-craft.com.au  Mon Feb  3 03:46:45 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 03 Feb 2003 13:46:45 +1100
Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv
	csv.py,1.4,1.5 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15930.26577.952898.246807@montanaro.dyndns.org> 
References: <E18eSSt-0003fu-00@sc8-pr-cvs1.sourceforge.net>
	<15930.26577.952898.246807@montanaro.dyndns.org> 
Message-ID: <20030203024645.70B6E3C1F4@coffee.object-craft.com.au>

>    andrew> Rename dialects from excel2000 to excel. Rename Error to be
>    andrew> CSVError.  Explicitly fetch iterator in reader class, rather than
>    andrew> simply calling next() (which only works for self-iterators).
>
>Minor nit.  I think Error was fine.  That's the standard for most extension
>modules.  I would normally import csv then reference its objects through it.
>csv.CSVError looks redundant to me.  I'm not a "from csv import CSVError"
>kind of guy however, so I can understand the desire to make the name more
>explicit when considered alone.

I'm inclined to agree, although "Error" tends to be a bit of a
show-stopper for people who want to do "from csv import ..."

Anyone object to me changing it back to "Error"?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Mon Feb  3 04:14:10 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 21:14:10 -0600
Subject: [Csv] Various changes
Message-ID: <15933.56962.107343.919881@montanaro.dyndns.org>

Folks, 

I made a number of changes this evening.

    * renamed set_dialect() to register_dialect()

    * defined the public API using csv.__all__

    * hid "dialects" and "OCcsv" with leading underscores so it's clear
      (even without __all__) that they are not part of the public API

    * added a first stab at a section for the library reference manual

    * added a couple conditional macro def'ns to _csv.c so it would compile
      using Python 2.2.2

    * added a few test cases for dialects and writing array.array objects

You might want to "csv up". ;-)

Skip

From skip at pobox.com  Mon Feb  3 04:23:30 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 21:23:30 -0600
Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv _csv.c,1.8,1.9
In-Reply-To: <E18fIit-0003ET-00@sc8-pr-cvs1.sourceforge.net>
References: <E18fIit-0003ET-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <15933.57522.334574.306027@montanaro.dyndns.org>


    dave> Modified Files:
    dave>       _csv.c 
    dave> Log Message:
    dave> Fixed refcount bug in constructor regarding lineterminator string.
    dave> Implemented lineterminator functionality - appends lineterminator
    dave> to end of joined record.  Not sure what to do with \n which do not
    dave> match the lineterminator string...

I'm not sure what you mean with that last sentence.  Are you worried about
distinguishing the line terminator from a hard return?

Skip

From skip at pobox.com  Mon Feb  3 04:24:28 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 21:24:28 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.57580.49194.72218@montanaro.dyndns.org>

for the archive...
-------------- next part --------------
An embedded message was scrubbed...
From: Roman Suzi <rnd at onego.ru>
Subject: Re: PEP 305 - CSV File API
Date: Sun, 2 Feb 2003 13:54:57 +0300 (MSK)
Size: 5250
Url: http://mail.python.org/pipermail/csv/attachments/20030202/a449ff4f/attachment.mht 

From skip at pobox.com  Mon Feb  3 04:28:52 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 21:28:52 -0600
Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv
	_csv.c,1.9,1.10
In-Reply-To: <E18fJ2y-0004VH-00@sc8-pr-cvs1.sourceforge.net>
References: <E18fJ2y-0004VH-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <15933.57844.794060.305738@montanaro.dyndns.org>


    dave> Oops - forgot to check for '+-.' when quoting is QUOTE_NONNUMERIC.

Looking at the code, I wonder whether, when quoting is set to
QUOTE_NONNUMERIC, a single call to PyFloat_FromString(field) should be
made and the result used to decide whether the field is numeric.  (Not
for performance, but for accuracy of the setting.)
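[A sketch of the float-based test being suggested - an illustration in
pure Python with a hypothetical helper name, not the committed C code:]

```python
def is_numeric(field):
    """A field is numeric iff float() accepts it - this subsumes the
    manual '+-.' prefix check mentioned in the commit."""
    try:
        float(field)
    except ValueError:
        return False
    return True

print(is_numeric('5.66'))   # True
print(is_numeric('+.5'))    # True: sign/point prefixes come for free
print(is_numeric('$5.66'))  # False: currency strings stay non-numeric
```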

Skip

From skip at pobox.com  Mon Feb  3 04:29:35 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 21:29:35 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.57887.991404.553688@montanaro.dyndns.org>

for the archive.
-------------- next part --------------
An embedded message was scrubbed...
From: Dave Cole <djc at object-craft.com.au>
Subject: Re: PEP 305 - CSV File API
Date: 02 Feb 2003 23:46:26 +1100
Size: 7499
Url: http://mail.python.org/pipermail/csv/attachments/20030202/cf5c4281/attachment.mht 

From skip at pobox.com  Mon Feb  3 04:32:39 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 21:32:39 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.58071.69450.29922@montanaro.dyndns.org>

archive...
-------------- next part --------------
An embedded message was scrubbed...
From: Skip Montanaro <skip at pobox.com>
Subject: Re: PEP 305 - CSV File API
Date: Sat, 1 Feb 2003 20:41:59 -0600
Size: 5208
Url: http://mail.python.org/pipermail/csv/attachments/20030202/ca6223ad/attachment.mht 

From skip at pobox.com  Mon Feb  3 04:35:25 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 21:35:25 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.58237.872242.701037@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: Andrew Dalke <adalke at mindspring.com>
Subject: Re: PEP 305 - CSV File API
Date: Sun, 02 Feb 2003 11:51:47 -0700
Size: 10225
Url: http://mail.python.org/pipermail/csv/attachments/20030202/450bddf7/attachment.mht 

From skip at pobox.com  Mon Feb  3 04:38:24 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 21:38:24 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.58416.721311.156005@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: Ian Bicking <ianb at colorstudy.com>
Subject: Re: PEP 305 - CSV File API
Date: 02 Feb 2003 15:01:04 -0600
Size: 6300
Url: http://mail.python.org/pipermail/csv/attachments/20030202/19030e85/attachment.mht 

From skip at pobox.com  Mon Feb  3 04:39:28 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 21:39:28 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.58480.217758.128918@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: Alex Martelli <aleax at aleax.it>
Subject: Re: PEP 305 - CSV File API
Date: Sun, 02 Feb 2003 22:41:02 GMT
Size: 4159
Url: http://mail.python.org/pipermail/csv/attachments/20030202/f17c3392/attachment.mht 

From skip at pobox.com  Mon Feb  3 04:40:01 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 21:40:01 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.58513.404485.292538@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: Dennis Lee Bieber <wlfraed at ix.netcom.com>
Subject: Re: PEP 305 - CSV File API
Date: Sun, 02 Feb 2003 13:55:21 -0800
Size: 6560
Url: http://mail.python.org/pipermail/csv/attachments/20030202/4e9b6fe0/attachment.mht 

From skip at pobox.com  Mon Feb  3 04:40:23 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 21:40:23 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.58535.543168.298725@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: Ian Bicking <ianb at colorstudy.com>
Subject: Re: PEP 305 - CSV File API
Date: 02 Feb 2003 17:07:44 -0600
Size: 5146
Url: http://mail.python.org/pipermail/csv/attachments/20030202/0273ee91/attachment.mht 

From skip at pobox.com  Mon Feb  3 04:42:01 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 21:42:01 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15933.58633.201457.636414@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: Carlos Ribeiro <cribeiro at mail.inet.com.br>
Subject: Re: PEP 305 - CSV File API
Date: Mon, 3 Feb 2003 00:22:52 +0000
Size: 5956
Url: http://mail.python.org/pipermail/csv/attachments/20030202/7a67e779/attachment.mht 

From andrewm at object-craft.com.au  Mon Feb  3 04:51:02 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 03 Feb 2003 14:51:02 +1100
Subject: [Csv] csv.QUOTE_NEVER? 
In-Reply-To: Message from Dave Cole <djc@object-craft.com.au> 
	<m38yx0shk8.fsf@ferret.object-craft.com.au> 
References: <15930.60672.18719.407166@montanaro.dyndns.org>
	<m38yx0shk8.fsf@ferret.object-craft.com.au> 
Message-ID: <20030203035102.34B183C1F4@coffee.object-craft.com.au>

>Skip> The three quoting constants are currently defined as
>Skip> QUOTE_MINIMAL, QUOTE_ALL and QUOTE_NONNUMERIC.  Didn't we decide
>Skip> there would be a QUOTE_NEVER constant as well?
>
>I was going to define QUOTE_NEVER then realised that all you have to
>do is set quotechar to None.  Why add the effort of implementing two
>ways to achieve the same thing.

"quotechar" as None probably should be illegal in the new module, and the 
"quoting" parameter used exclusively. This would be consistent with the
direction we've taken with other parameters.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Mon Feb  3 05:15:01 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 03 Feb 2003 15:15:01 +1100
Subject: [Csv] The writer class
Message-ID: <20030203041501.3F6FC3C1F4@coffee.object-craft.com.au>

We document this as a wrapper around a file-like object.  I'd assumed
it should provide a file-like interface itself (in particular, I gave
it a close() method and attempted to close the file when the
destructor was called), but I now think this is wrong.  I propose to
remove the following code from the writer class:

    def close(self):
        self.fileobj.close()
        del self.fileobj

    def __del__(self):
        if hasattr(self, 'fileobj'):
            try:
                self.close()
            except:
                pass

Comments?

I also noticed some negative comments regarding the choice of the name
"write" for the method that writes fields.  The comments essentially
said that this method name is used by other classes where strings are
being written.  I agree - we probably should call it something like
"writefields" or "write_fields".  Comments?

What should we call the "writelines" method (that accepts an iterable
and writes multiple "lines") in this case?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Mon Feb  3 06:03:29 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 2 Feb 2003 23:03:29 -0600
Subject: [Csv] The writer class
In-Reply-To: <20030203041501.3F6FC3C1F4@coffee.object-craft.com.au>
References: <20030203041501.3F6FC3C1F4@coffee.object-craft.com.au>
Message-ID: <15933.63521.813128.412444@montanaro.dyndns.org>


    Andrew> .... I propose to remove the following code from the writer
    Andrew> class:
    ...
    Andrew> Comments?

Agreed.  This bothered me as well.

    Andrew> .... we probably should call it something like "writefields" or
    Andrew> "write_fields". Comments?

Someone on c.l.py suggested writerow(s).  I sort of liked that.  As you
noted about write(), it and append() both carry too much baggage from
other usage.

    Andrew> What should we call the "writelines" method (that accepts an
    Andrew> iterable and writes multiple "lines") in this case?

How about "writerow" for the singular and "writerows" for the plural?
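[For the archive: these are the names that stuck.  A quick sketch with
the csv module as it ships today:]

```python
import csv
import io

buf = io.StringIO()
w = csv.writer(buf)
w.writerow(['a', 'b'])         # one record
w.writerows([[1, 2], [3, 4]])  # any iterable of records
print(buf.getvalue())
```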

Skip


From andrewm at object-craft.com.au  Mon Feb  3 06:35:16 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 03 Feb 2003 16:35:16 +1100
Subject: [Csv] The writer class 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15933.63521.813128.412444@montanaro.dyndns.org> 
References: <20030203041501.3F6FC3C1F4@coffee.object-craft.com.au>
	<15933.63521.813128.412444@montanaro.dyndns.org> 
Message-ID: <20030203053516.2BD573C1F4@coffee.object-craft.com.au>

>    Andrew> .... I propose to remove the following code from the writer
>    Andrew> class:
>    ...
>    Andrew> Comments?
>
>Agreed.  This bothered me as well.

Done (damn, forgot to mention that in the check-in comment).

>    Andrew> .... we probably should call it something like "writefields" or
>    Andrew> "write_fields". Comments?
>
>Someone on c.l.py suggested writerow(s).  I sort of liked that.  As you
>noted about write(), both it and append() both carry enough baggage from
>other usage.

I like that. Done.

>    Andrew> What should we call the "writelines" method (that accepts an
>    Andrew> iterable and writes multiple "lines") in this case?
>
>How about "writerow" for the singular and "writerows" for the plural?

Yep. Done.

I've also changed CSVError back to just Error for the sake of consistency,
if nothing else.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Mon Feb  3 13:46:51 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 06:46:51 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15934.25787.133963.848679@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: sjmachin at lexicon.net (John Machin)
Subject: Re: PEP 305 - CSV File API
Date: 3 Feb 2003 02:15:17 -0800
Size: 4476
Url: http://mail.python.org/pipermail/csv/attachments/20030203/338781c8/attachment.mht 

From skip at pobox.com  Mon Feb  3 16:31:40 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 09:31:40 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15934.35676.989162.259027@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: "John Roth" <johnroth at ameritech.net>
Subject: Re: PEP 305 - CSV File API
Date: Mon, 3 Feb 2003 09:28:16 -0500
Size: 5305
Url: http://mail.python.org/pipermail/csv/attachments/20030203/f701eda4/attachment.mht 

From skip at pobox.com  Mon Feb  3 16:38:04 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 09:38:04 -0600
Subject: [Csv] Re: PEP 305 - CSV File API
In-Reply-To: <v3sv11esgjpje2@news.supernews.com>
References: <mailman.1044050476.7110.python-list@python.org>
        <9ng0h-nh3.ln1@beastie.ix.netcom.com>
        <mailman.1044231844.17120.python-list@python.org>
        <c76ff6fc.0302030215.82eb073@posting.google.com>
        <v3sv11esgjpje2@news.supernews.com>
Message-ID: <15934.36060.434826.450769@montanaro.dyndns.org>


I think I have the attributions right.

    Carlos> that happen to be problematic, and that are locale-related:
    Carlos> - reading dates from a CSV file

    JohnM> Certainly dates are a problem ... however, in what way is reading
    JohnM> dates from a CSV-format file any different to reading them from
    JohnM> any other format?

    JohnR> It's not particularly different. What is needed is the ability to
    JohnR> associate the necessary parameters with a date column to do the
    JohnR> application dependent "correct" transformation, based on the
    JohnR> available date libraries.

I will note that the csv module under development makes *no* attempt at any
kind of data conversion when reading CSV files.  Even ints and floats are
returned as strings.  It's left up to the application programmer to perform
type conversions.  On output, the situation is similar.  For some passing
compatibility with the DB-API (which represents SQL NULL values as None),
None is currently being written as the empty string (though this is perhaps
still subject to change).  Other than that, str() is simply called for all
data being written to the file.
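A short illustration of that behavior, using the csv module as it
eventually shipped in the standard library (a sketch of the semantics
described above, not the code from this thread):

```python
import csv
import io

# Reading: every field comes back as a string; no type conversion is done.
rows = list(csv.reader(io.StringIO("1,2.5,hello\n")))
# rows == [['1', '2.5', 'hello']] - the int and float are still strings

# Writing: str() is applied to each value, and None is written as the
# empty string (for passing compatibility with the DB-API's SQL NULL).
buf = io.StringIO()
csv.writer(buf).writerow([1, 2.5, None, "x"])
# buf.getvalue() == "1,2.5,,x\r\n"
```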

Skip

From skip at pobox.com  Mon Feb  3 17:38:18 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 10:38:18 -0600
Subject: [Csv] passing dialects directly - class or instance?
Message-ID: <15934.39674.453212.506375@montanaro.dyndns.org>

I thought users were supposed to pass dialect classes when not using
strings.  I see, however, that _OCcsv.__init__ calls isinstance() instead of
issubclass().  Which is it supposed to be?

Skip
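For reference, the two checks accept different things.  Illustrative
classes only - this is not the module's actual dialect machinery:

```python
class Dialect:
    """Hypothetical stand-in for the module's dialect base class."""
    delimiter = ','

class excel(Dialect):
    """A dialect defined as a subclass, per the PEP."""
    pass

# issubclass() accepts the class itself; isinstance() wants an instance.
assert issubclass(excel, Dialect)       # passing the class: OK
assert not isinstance(excel, Dialect)   # the class is not an instance
assert isinstance(excel(), Dialect)     # an instance passes isinstance()
```

So if users are meant to pass dialect *classes*, isinstance() would reject
them unless they instantiate first.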

From skip at pobox.com  Mon Feb  3 17:57:04 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 10:57:04 -0600
Subject: [Csv] test coverage
Message-ID: <15934.40800.760370.712420@montanaro.dyndns.org>

Attached is the output of running a gcov-instrumented version of Python and
the _csv module against the current test suite.

FYI.

Skip

-------------- next part --------------
                /* TODO:
                   + Add reader() and writer() functions which return CSV
                     reader/writer objects which implement the PEP interface:
                
                     csvreader = csv.reader(file("blah.csv", "rb"), kwargs)
                     for row in csvreader:
                         process(row)
                
                     csvwriter = csv.writer(file("some.csv", "wb"), kwargs)
                     for row in someiter:
                         csvwriter.write(row)
                
                   + Add CsvWriter.writelines(someiter)
                */
                
                #include "Python.h"
                #include "structmember.h"
                
                /* begin 2.2 compatibility macros */
                #ifndef PyDoc_STRVAR
                /* Define macros for inline documentation. */
                #define PyDoc_VAR(name) static char name[]
                #define PyDoc_STRVAR(name,str) PyDoc_VAR(name) = PyDoc_STR(str)
                #ifdef WITH_DOC_STRINGS
                #define PyDoc_STR(str) str
                #else
                #define PyDoc_STR(str) ""
                #endif
                #endif /* ifndef PyDoc_STRVAR */
                
                #ifndef PyMODINIT_FUNC
                #       if defined(__cplusplus)
                #               define PyMODINIT_FUNC extern "C" void
                #       else /* __cplusplus */
                #               define PyMODINIT_FUNC void
                #       endif /* __cplusplus */
                #endif
                /* end 2.2 compatibility macros */
                
                static PyObject *error_obj;     /* CSV exception */
                
                typedef enum {
                        START_RECORD, START_FIELD, ESCAPED_CHAR, IN_FIELD, 
                        IN_QUOTED_FIELD, ESCAPE_IN_QUOTED_FIELD, QUOTE_IN_QUOTED_FIELD
                } ParserState;
                
                typedef enum {
                        QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC, QUOTE_NONE
                } QuoteStyle;
                
                typedef struct {
                        PyObject_HEAD
                
                        int doublequote;        /* is " represented by ""? */
                        char delimiter;         /* field separator */
                        int have_quotechar;     /* is a quotechar defined */
                        char quotechar;         /* quote character */
                        int have_escapechar;    /* is an escapechar defined */
                        char escapechar;        /* escape character */
                        int skipinitialspace;   /* ignore spaces following delimiter? */
                        PyObject *lineterminator; /* string to write between records */
                        QuoteStyle quoting;     /* style of quoting to write */
                
                        ParserState state;      /* current CSV parse state */
                        PyObject *fields;       /* field list for current record */
                
                        int autoclear;          /* should fields be cleared on next
                                                   parse() after exception? */
                        int strict;             /* raise exception on bad CSV */
                
                        int had_parse_error;    /* did we have a parse error? */
                
                        char *field;            /* build current field in here */
                        int field_size;         /* size of allocated buffer */
                        int field_len;          /* length of current field */
                
                        char *rec;              /* buffer for parser.join */
                        int rec_size;           /* size of allocated record */
                        int rec_len;            /* length of record */
                        int num_fields;         /* number of fields in record */
                } ParserObj;
                
                staticforward PyTypeObject Parser_Type;
                
                static PyObject *
                raise_exception(char *fmt, ...)
           1    {
           1            va_list ap;
           1            char msg[512];
           1            PyObject *pymsg;
                
           1            va_start(ap, fmt);
                #ifdef _WIN32
                        _vsnprintf(msg, sizeof(msg), fmt, ap);
                #else
           1            vsnprintf(msg, sizeof(msg), fmt, ap);
                #endif
           1            va_end(ap);
           1            pymsg = PyString_FromString(msg);
           1            PyErr_SetObject(error_obj, pymsg);
           1            Py_XDECREF(pymsg);
                
           1            return NULL;
                }
                
                static void
                parse_save_field(ParserObj *self)
          39    {
          39            PyObject *field;
                
          39            field = PyString_FromStringAndSize(self->field, self->field_len);
          39            if (field != NULL) {
          39                    PyList_Append(self->fields, field);
          39                    Py_XDECREF(field);
                        }
          39            self->field_len = 0;
                }
                
                static int
                parse_grow_buff(ParserObj *self)
          12    {
          12            if (self->field_size == 0) {
          12                    self->field_size = 4096;
          12                    self->field = PyMem_Malloc(self->field_size);
                        }
                        else {
      ######                    self->field_size *= 2;
      ######                    self->field = PyMem_Realloc(self->field, self->field_size);
                        }
          12            if (self->field == NULL) {
      ######                    PyErr_NoMemory();
      ######                    return 0;
                        }
          12            return 1;
                }
                
                static void
                parse_add_char(ParserObj *self, char c)
         192    {
         192            if (self->field_len == self->field_size && !parse_grow_buff(self))
      ######                    return;
         192            self->field[self->field_len++] = c;
                }
                
                static void
                parse_prepend_char(ParserObj *self, char c)
      ######    {
      ######            if (self->field_len == self->field_size && !parse_grow_buff(self))
      ######                    return;
      ######            memmove(self->field + 1, self->field, self->field_len);
      ######            self->field[0] = c;
      ######            self->field_len++;
                }
                
                static void
                parse_process_char(ParserObj *self, char c)
         262    {
         262            switch (self->state) {
                        case START_RECORD:
                                /* start of record */
          17                    if (c == '\0')
                                        /* empty line - return [] */
      ######                            break;
                                /* normal character - handle as START_FIELD */
          17                    self->state = START_FIELD;
                                /* fallthru */
                        case START_FIELD:
                                /* expecting field */
          39                    if (c == '\0') {
                                        /* save empty field - return [fields] */
           3                            parse_save_field(self);
           3                            self->state = START_RECORD;
                                }
          36                    else if (c == self->quotechar) {
                                        /* start quoted field */
          12                            self->state = IN_QUOTED_FIELD;
                                }
          24                    else if (c == self->escapechar) {
                                        /* possible escaped character */
      ######                            self->state = ESCAPED_CHAR;
                                }
          24                    else if (c == self->delimiter) {
                                        /* save empty field */
           2                            parse_save_field(self);
                                }
          22                    else if (c == ' ' && self->skipinitialspace)
                                        /* ignore space at start of field */
                                        ;
                                else {
                                        /* begin new unquoted field */
          22                            parse_add_char(self, c);
          22                            self->state = IN_FIELD;
                                }
          22                    break;
                
                        case ESCAPED_CHAR:
      ######                    if (c != self->escapechar && c != self->delimiter &&
                                    c != self->quotechar)
      ######                            parse_add_char(self, self->escapechar);
      ######                    parse_add_char(self, c);
      ######                    self->state = IN_FIELD;
      ######                    break;
                
                        case IN_FIELD:
                                /* in unquoted field */
          42                    if (c == '\0') {
                                        /* end of line - return [fields] */
           8                            parse_save_field(self);
           8                            self->state = START_RECORD;
                                }
          34                    else if (c == self->escapechar) {
                                        /* possible escaped character */
      ######                            self->state = ESCAPED_CHAR;
                                }
          34                    else if (c == self->delimiter) {
                                        /* save field - wait for new field */
          16                            parse_save_field(self);
          16                            self->state = START_FIELD;
                                }
                                else {
                                        /* normal character - save in field */
          18                            parse_add_char(self, c);
                                }
          18                    break;
                
                        case IN_QUOTED_FIELD:
                                /* in quoted field */
         162                    if (c == '\0') {
                                        /* end of line - save '\n' in field */
           2                            parse_add_char(self, '\n');
                                }
         160                    else if (c == self->escapechar) {
                                        /* Possible escape character */
      ######                            self->state = ESCAPE_IN_QUOTED_FIELD;
                                }
         160                    else if (c == self->quotechar) {
          19                            if (self->doublequote) {
                                                /* doublequote; " represented by "" */
          19                                    self->state = QUOTE_IN_QUOTED_FIELD;
                                        }
                                        else {
                                                /* end of quote part of field */
      ######                                    self->state = IN_FIELD;
                                        }
                                }
                                else {
                                        /* normal character - save in field */
         141                            parse_add_char(self, c);
                                }
         141                    break;
                
                        case ESCAPE_IN_QUOTED_FIELD:
      ######                    if (c != self->escapechar && c != self->delimiter &&
                                    c != self->quotechar)
      ######                            parse_add_char(self, self->escapechar);
      ######                    parse_add_char(self, c);
      ######                    self->state = IN_QUOTED_FIELD;
      ######                    break;
                
                        case QUOTE_IN_QUOTED_FIELD:
                        /* doublequote - seen a quote in a quoted field */
          19                    if (self->have_quotechar && c == self->quotechar) {
                                        /* save "" as " */
           7                            parse_add_char(self, c);
           7                            self->state = IN_QUOTED_FIELD;
                                }
          12                    else if (c == self->delimiter) {
                                        /* save field - wait for new field */
           4                            parse_save_field(self);
           4                            self->state = START_FIELD;
                                }
           8                    else if (c == '\0') {
                                        /* end of line - return [fields] */
           6                            parse_save_field(self);
           6                            self->state = START_RECORD;
                                }
           2                    else if (!self->strict) {
           2                            parse_add_char(self, c);
           2                            self->state = IN_FIELD;
                                }
                                else {
                                        /* illegal */
      ######                            self->had_parse_error = 1;
      ######                            raise_exception("%c expected after %c", 
                                                        self->delimiter, self->quotechar);
                                }
                                break;
                
                        }
                }
                
                static void
                clear_fields_and_status(ParserObj *self)
      ######    {
      ######            if (self->fields) {
      ######                    Py_XDECREF(self->fields);
                        }
      ######            self->fields = PyList_New(0);
      ######            self->field_len = 0;
      ######            self->state = START_RECORD;
                
      ######            self->had_parse_error = 0;
                }
                
                /* ---------------------------------------------------------------- */
                
                PyDoc_STRVAR(Parser_parse_doc,
                "parse(s) -> list of strings\n"
                "\n"
                "CSV parse the single line in the string s and return a\n"
                "list of string fields.  If the CSV record contains multi-line\n"
                "fields, the function will return None until all lines of the\n"
                "record have been parsed.");
                
                static PyObject *
                Parser_parse(ParserObj *self, PyObject *args)
          19    {
          19            char *line;
                
          19            if (!PyArg_ParseTuple(args, "s", &line))
      ######                    return NULL;
                
          19            if (self->autoclear && self->had_parse_error)
      ######                    clear_fields_and_status(self);
                
                        /* Process line of text - send '\0' to processing code to
                           represent end of line.  End of line which is not at end of
                           string is an error. */
         262            while (*line) {
         246                    char c;
                
         246                    c = *line++;
         246                    if (c == '\r') {
      ######                            c = *line++;
      ######                            if (c == '\0')
                                                /* macintosh end of line */
      ######                                    break;
      ######                            if (c == '\n') {
      ######                                    c = *line++;
      ######                                    if (c == '\0')
                                                        /* DOS end of line */
      ######                                            break;
                                        }
      ######                            self->had_parse_error = 1;
      ######                            return raise_exception("newline inside string");
                                }
         246                    if (c == '\n') {
           3                            c = *line++;
           3                            if (c == '\0')
                                                /* unix end of line */
           3                                    break;
      ######                            self->had_parse_error = 1;
      ######                            return raise_exception("newline inside string");
                                }
         243                    parse_process_char(self, c);
         243                    if (PyErr_Occurred())
      ######                            return NULL;
                        }
          19            parse_process_char(self, '\0');
                
          19            if (self->state == START_RECORD) {
          17                    PyObject *fields = self->fields;
          17                    self->fields = PyList_New(0);
          17                    return fields;
                        }
                
           2            Py_INCREF(Py_None);
           2            return Py_None;
                }
                
                /* ---------------------------------------------------------------- */
                
                PyDoc_STRVAR(Parser_clear_doc,
                "clear() -> None\n"
                "\n"
                "Discard partially parsed record.  This must be called to reset\n"
                "parser state after an exception.");
                
                static PyObject *
                Parser_clear(ParserObj *self)
      ######    {
      ######            clear_fields_and_status(self);
                
      ######            Py_INCREF(Py_None);
      ######            return Py_None;
                }
                
                /* ---------------------------------------------------------------- */
                static void
                join_reset(ParserObj *self)
          11    {
          11            self->rec_len = 0;
          11            self->num_fields = 0;
                }
                
                #define MEM_INCR 32768
                
                /* Calculate new record length or append field to record.  Return new
                 * record length.
                 */
                static int
                join_append_data(ParserObj *self, char *field, int quote_empty,
                                 int *quoted, int copy_phase)
         270    {
         270            int i, rec_len;
                
         270            rec_len = self->rec_len;
                
                        /* If this is not the first field we need a field separator.
                         */
         270            if (self->num_fields > 0) {
         248                    if (copy_phase)
         124                            self->rec[rec_len] = self->delimiter;
         248                    rec_len++;
                        }
                        /* Handle preceding quote.
                         */
         270            switch (self->quoting) {
                        case QUOTE_ALL:
      ######                    *quoted = 1;
      ######                    if (copy_phase)
      ######                            self->rec[rec_len] = self->quotechar;
      ######                    rec_len++;
      ######                    break;
                        case QUOTE_MINIMAL:
                        case QUOTE_NONNUMERIC:
                                /* We only know about quoted in the copy phase.
                                 */
         270                    if (copy_phase && *quoted) {
           3                            self->rec[rec_len] = self->quotechar;
           3                            rec_len++;
                                }
                                break;
                        case QUOTE_NONE:
         270                    break;
                        }
                        /* Copy/count field data.
                         */
        1090            for (i = 0;; i++) {
        1090                    char c = field[i];
                
        1090                    if (c == '\0')
         270                            break;
                                /* If in doublequote mode we escape quote chars with a
                                 * quote.
                                 */
         820                    if (self->have_quotechar
                                    && c == self->quotechar && self->doublequote) {
           4                            if (copy_phase)
           2                                    self->rec[rec_len] = self->quotechar;
           4                            *quoted = 1;
           4                            rec_len++;
         816                    } else if (self->quoting == QUOTE_NONNUMERIC && !*quoted
                                           && !(isdigit(c) || c == '+' || c == '-' || c == '.'))
      ######                            *quoted = 1;
                
                                /* Some special characters need to be escaped.  If we have a
                                 * quote character switch to quoted field instead of escaping
                                 * individual characters.
                                 */
         820                    if (!*quoted
                                    && (c == self->delimiter || c == self->escapechar
                                        || c == '\n' || c == '\r')) {
           2                            if (self->have_quotechar
                                            && self->quoting != QUOTE_NONE)
           2                                    *quoted = 1;
      ######                            else if (self->escapechar) {
      ######                                    if (copy_phase)
      ######                                            self->rec[rec_len] = self->escapechar;
      ######                                    rec_len++;
                                        }
                                        else {
      ######                                    raise_exception("delimiter must be quoted or escaped");
      ######                                    return -1;
                                        }
                                }
                                /* Copy field character into record buffer.
                                 */
         820                    if (copy_phase)
         410                            self->rec[rec_len] = c;
         820                    rec_len++;
                        }
                
                        /* If field is empty check if it needs to be quoted.
                         */
         270            if (i == 0 && quote_empty && self->have_quotechar)
      ######                    *quoted = 1;
                
                        /* Handle final quote character on field.
                         */
         270            if (*quoted) {
           6                    if (copy_phase)
           3                            self->rec[rec_len] = self->quotechar;
                                else
                                        /* Didn't know about leading quote until we found it
                                         * necessary in field data - compensate for it now.
                                         */
           3                            rec_len++;
           6                    rec_len++;
                        }
                
         270            return rec_len;
                }
                
                static int
                join_check_rec_size(ParserObj *self, int rec_len)
         146    {
         146            if (rec_len > self->rec_size) {
          11                    if (self->rec_size == 0) {
          11                            self->rec_size = (rec_len / MEM_INCR + 1) * MEM_INCR;
          11                            self->rec = PyMem_Malloc(self->rec_size);
                                }
                                else {
      ######                            char *old_rec = self->rec;
                
      ######                            self->rec_size = (rec_len / MEM_INCR + 1) * MEM_INCR;
      ######                            self->rec = PyMem_Realloc(self->rec, self->rec_size);
      ######                            if (self->rec == NULL)
      ######                                    free(old_rec);
                                }
          11                    if (self->rec == NULL) {
      ######                            PyErr_NoMemory();
      ######                            return 0;
                                }
                        }
         146            return 1;
                }
                
                static int
                join_append(ParserObj *self, char *field, int quote_empty)
         135    {
         135            int rec_len, quoted;
                
         135            quoted = 0;
         135            rec_len = join_append_data(self, field, quote_empty, &quoted, 0);
         135            if (rec_len < 0)
      ######                    return 0;
                
                        /* grow record buffer if necessary */
         135            if (!join_check_rec_size(self, rec_len))
      ######                    return 0;
                
         135            self->rec_len = join_append_data(self, field, quote_empty, &quoted, 1);
         135            self->num_fields++;
                
         135            return 1;
                }
                
                static int
                join_append_lineterminator(ParserObj *self)
          11    {
          11            int terminator_len;
                
          11            terminator_len = PyString_Size(self->lineterminator);
                
                        /* grow record buffer if necessary */
          11            if (!join_check_rec_size(self, self->rec_len + terminator_len))
      ######                    return 0;
                
          11            memmove(self->rec + self->rec_len,
                                PyString_AsString(self->lineterminator), terminator_len);
          11            self->rec_len += terminator_len;
                
          11            return 1;
                }
                
                static PyObject *
                join_string(ParserObj *self)
          11    {
          11            return PyString_FromStringAndSize(self->rec, self->rec_len);
                }
                
                PyDoc_STRVAR(Parser_join_doc,
                "join(sequence) -> string\n"
                "\n"
                "Construct a CSV record from a sequence of fields.  Non-string\n"
                "elements will be converted to string.");
                
                static PyObject *
                Parser_join(ParserObj *self, PyObject *seq)
          12    {
          12            int len, i;
                
          12            if (!PySequence_Check(seq))
           1                    return raise_exception("sequence expected");
                
          11            len = PySequence_Length(seq);
          11            if (len < 0)
      ######                    return NULL;
                
                        /* Join all fields in internal buffer.
                         */
          11            join_reset(self);
         146            for (i = 0; i < len; i++) {
         135                    PyObject *field;
         135                    int append_ok;
                
         135                    field = PySequence_GetItem(seq, i);
         135                    if (field == NULL)
      ######                            return NULL;
                
         135                    if (PyString_Check(field)) {
          59                            append_ok = join_append(self, PyString_AsString(field), len == 1);
          59                            Py_DECREF(field);
                                }
          76                    else if (field == Py_None) {
      ######                            append_ok = join_append(self, "", len == 1);
      ######                            Py_DECREF(field);
                                }
                                else {
          76                            PyObject *str;
                
          76                            str = PyObject_Str(field);
          76                            Py_DECREF(field);
          76                            if (str == NULL)
      ######                                    return NULL;
                
          76                            append_ok = join_append(self, PyString_AsString(str), len == 1);
          76                            Py_DECREF(str);
                                }
         135                    if (!append_ok)
      ######                            return NULL;
                        }
                
                        /* Add line terminator.
                         */
          11            if (!join_append_lineterminator(self))
      ######                    return 0;
                
          11            return join_string(self);
                }
                
                static struct PyMethodDef Parser_methods[] = {
                        { "parse", (PyCFunction)Parser_parse, METH_VARARGS,
                          Parser_parse_doc },
                        { "clear", (PyCFunction)Parser_clear, METH_NOARGS,
                          Parser_clear_doc },
                        { "join", (PyCFunction)Parser_join, METH_O,
                          Parser_join_doc },
                        { NULL, NULL }
                };
                
                static void
                Parser_dealloc(ParserObj *self)
          30    {
          30            if (self->field)
          12                    free(self->field);
          30            Py_XDECREF(self->fields);
          30            Py_XDECREF(self->lineterminator);
                
          30            if (self->rec)
          11                    free(self->rec);
                
          30            PyMem_DEL(self);
                }
                
                #define OFF(x) offsetof(ParserObj, x)
                
                static struct memberlist Parser_memberlist[] = {
                        { "quotechar",        T_CHAR,   OFF(quotechar) },
                        { "delimiter",        T_CHAR,   OFF(delimiter) },
                        { "escapechar",       T_CHAR,   OFF(escapechar) },
                        { "skipinitialspace", T_INT,    OFF(skipinitialspace) },
                        { "lineterminator",   T_OBJECT, OFF(lineterminator) },
                        { "quoting",          T_INT,    OFF(quoting) },
                        { "doublequote",      T_INT,    OFF(doublequote) },
                        { "fields",           T_OBJECT, OFF(fields) },
                        { "autoclear",        T_INT,    OFF(autoclear) },
                        { "strict",           T_INT,    OFF(strict) },
                        { "had_parse_error",  T_INT,    OFF(had_parse_error), RO },
                        { NULL }
                };
                
                static PyObject *
                Parser_getattr(ParserObj *self, char *name)
          48    {
          48            PyObject *rv;
                
          48            if ((strcmp(name, "quotechar") == 0 && !self->have_quotechar)
                            || (strcmp(name, "escapechar") == 0 && !self->have_escapechar)) {
      ######                    Py_INCREF(Py_None);
      ######                    return Py_None;
                        }
                
          48            rv = PyMember_Get((char *)self, Parser_memberlist, name);
          48            if (rv)
      ######                    return rv;
          48            PyErr_Clear();
          48            return Py_FindMethod(Parser_methods, (PyObject *)self, name);
                }
                
                static int
                _set_char_attr(char *attr, int *have_attr, PyObject *v)
          60    {
                        /* Special case for constructor - NULL == use default.
                         */
          60            if (v == NULL)
      ######                    return 0;
                
          60            if (v == Py_None) {
          30                    *have_attr = 0;
          30                    *attr = 0;
          30                    return 0;
                        }
          30            else if (PyString_Check(v) && PyString_Size(v) == 1) {
          30                    *attr = PyString_AsString(v)[0];
          30                    *have_attr = 1;
          30                    return 0;
                        }
                        else {
      ######                    PyErr_BadArgument();
      ######                    return -1;
                        }
                }
                
                static int
                Parser_setattr(ParserObj *self, char *name, PyObject *v)
      ######    {
      ######            if (v == NULL) {
      ######                    PyErr_SetString(PyExc_AttributeError, "Cannot delete attribute");
      ######                    return -1;
                        }
      ######            if (strcmp(name, "quotechar") == 0)
      ######                    return _set_char_attr(&self->quotechar,
                                                      &self->have_quotechar, v);
      ######            else if (strcmp(name, "escapechar") == 0)
      ######                    return _set_char_attr(&self->escapechar,
                                                      &self->have_escapechar, v);
      ######            else if (strcmp(name, "quoting") == 0 && PyInt_Check(v)) {
      ######                    int n = PyInt_AsLong(v);
                
      ######                    if (n < 0 || n > QUOTE_NONE) {
      ######                            PyErr_BadArgument();
      ######                            return -1;
                                }
      ######                    if (n == QUOTE_NONE)
      ######                            self->have_quotechar = 0;
      ######                    self->quoting = n;
      ######                    return 0;
                        }
      ######            else if (strcmp(name, "lineterminator") == 0 && !PyString_Check(v)) {
      ######                    PyErr_BadArgument();
      ######                    return -1;
                        }
                        else
      ######                    return PyMember_Set((char *)self, Parser_memberlist, name, v);
                }
                
                static PyObject *
                csv_parser(PyObject *module, PyObject *args, PyObject *keyword_args);
                
                PyDoc_STRVAR(Parser_Type_doc, "CSV parser");
                
                static PyTypeObject Parser_Type = {
                        PyObject_HEAD_INIT(0)
                        0,                      /*ob_size*/
                        "_csv.parser",          /*tp_name*/
                        sizeof(ParserObj),      /*tp_basicsize*/
                        0,                      /*tp_itemsize*/
                        /* methods */
                        (destructor)Parser_dealloc, /*tp_dealloc*/
                        (printfunc)0,           /*tp_print*/
                        (getattrfunc)Parser_getattr, /*tp_getattr*/
                        (setattrfunc)Parser_setattr, /*tp_setattr*/
                        (cmpfunc)0,             /*tp_compare*/
                        (reprfunc)0,            /*tp_repr*/
                        0,                      /*tp_as_number*/
                        0,                      /*tp_as_sequence*/
                        0,                      /*tp_as_mapping*/
                        (hashfunc)0,            /*tp_hash*/
                        (ternaryfunc)0,         /*tp_call*/
                        (reprfunc)0,            /*tp_str*/
                
                        0L, 0L, 0L, 0L,
                        Parser_Type_doc
                };
                
                PyDoc_STRVAR(csv_parser_doc,
                "parser(delimiter=',', quotechar='\"', escapechar=None,\n"
                "       doublequote=1, lineterminator='\\r\\n', quoting='minimal',\n"
                "       autoclear=1, strict=0) -> Parser\n"
                "\n"
                "Constructs a CSV parser object.\n"
                "\n"
                "    delimiter\n"
                "        Defines the character that will be used to separate\n"
                "        fields in the CSV record.\n"
                "\n"
                "    quotechar\n"
                "        Defines the character used to quote fields that\n"
                "        contain the field separator or newlines.  If set to None\n"
                "        special characters will be escaped using the escapechar.\n"
                "\n"
                "    escapechar\n"
                "        Defines the character used to escape special\n"
                "        characters.  Only used if quotechar is None.\n"
                "\n"
                "    doublequote\n"
                "        When True, quotes in a field must be doubled up.\n"
                "\n"
                "    skipinitialspace\n"
                "        When True spaces following the delimiter are ignored.\n"
                "\n"
                "    lineterminator\n"
                "        The string used to terminate records.\n"
                "\n"
                "    quoting\n"
                "        Controls the generation of quotes around fields when writing\n"
                "        records.  This is only used when quotechar is not None.\n"
                "\n"
                "    autoclear\n"
                "        When True, calling parse() will automatically call\n"
                "        the clear() method if the previous call to parse() raised an\n"
                "        exception during parsing.\n"
                "\n"
                "    strict\n"
                "        When True, the parser will raise an exception on\n"
                "        malformed fields rather than attempting to guess the right\n"
                "        behavior.\n");
                
                static PyObject *
                csv_parser(PyObject *module, PyObject *args, PyObject *keyword_args)
          30    {
                        static char *keywords[] = {
                                "quotechar", "delimiter", "escapechar", "skipinitialspace",
                                "lineterminator", "quoting", "doublequote",
                                "autoclear", "strict", 
                                NULL
          30            };
          30            PyObject *quotechar, *escapechar;
          30            ParserObj *self = PyObject_NEW(ParserObj, &Parser_Type);
                
          30            if (self == NULL)
      ######                    return NULL;
                
          30            self->quotechar = '"';
          30            self->have_quotechar = 1;
          30            self->delimiter = ',';
          30            self->escapechar = '\0';
          30            self->have_escapechar = 0;
          30            self->skipinitialspace = 0;
          30            self->lineterminator = NULL;
          30            self->quoting = QUOTE_MINIMAL;
          30            self->doublequote = 1;
          30            self->autoclear = 1;
          30            self->strict = 0;
                
          30            self->state = START_RECORD;
          30            self->fields = PyList_New(0);
          30            if (self->fields == NULL) {
      ######                    Py_DECREF(self);
      ######                    return NULL;
                        }
                
          30            self->had_parse_error = 0;
          30            self->field = NULL;
          30            self->field_size = 0;
          30            self->field_len = 0;
                
          30            self->rec = NULL;
          30            self->rec_size = 0;
          30            self->rec_len = 0;
          30            self->num_fields = 0;
                
          30            quotechar = escapechar = NULL;
          30            if (PyArg_ParseTupleAndKeywords(args, keyword_args, "|OcOiSiiii",
                                                        keywords,
                                                        &quotechar, &self->delimiter,
                                                        &escapechar, &self->skipinitialspace,
                                                        &self->lineterminator, &self->quoting,
                                                        &self->doublequote,
                                                        &self->autoclear, &self->strict)
                            && !_set_char_attr(&self->quotechar,
                                               &self->have_quotechar, quotechar)
                            && !_set_char_attr(&self->escapechar,
                                               &self->have_escapechar, escapechar)) {
          30                    if (self->lineterminator == NULL)
      ######                            self->lineterminator = PyString_FromString("\r\n");
                                else {
          30                            Py_INCREF(self->lineterminator);
                                }
                
          30                    if (self->quoting < 0 || self->quoting > QUOTE_NONE)
      ######                            PyErr_SetString(PyExc_ValueError, "bad quoting value");
                                else {
          30                            if (self->quoting == QUOTE_NONE)
      ######                                    self->have_quotechar = 0;
          30                            else if (!self->have_quotechar)
      ######                                    self->quoting = QUOTE_NONE;
          30                            return (PyObject*)self;
                                }
                        }
                
      ######            Py_DECREF(self);
      ######            return NULL;
                }
                
                static struct PyMethodDef csv_methods[] = {
                        { "parser", (PyCFunction)csv_parser, METH_VARARGS | METH_KEYWORDS,
                          csv_parser_doc },
                        { NULL, NULL }
                };
                
                PyDoc_STRVAR(csv_module_doc,
                "This module provides class for performing CSV parsing and writing.\n"
                "\n"
                "The CSV parser object (returned by the parser() function) supports the\n"
                "following methods:\n"
                "    clear()\n"
                "        Discards all fields parsed so far.  If autoclear is set to\n"
                "        zero. You should call this after a parser exception.\n"
                "\n"
                "    parse(string) -> list of strings\n"
                "        Extracts fields from the (partial) CSV record in string.\n"
                "        Trailing end of line characters are ignored, so you do not\n"
                "        need to strip the string before passing it to the parser. If\n"
                "        you pass more than a single line of text, a _csv.Error\n"
                "        exception will be raised.\n"
                "\n"
                "    join(sequence) -> string\n"
                "        Construct a CSV record from a sequence of fields. Non-string\n"
                "        elements will be converted to string.\n"
                "\n"
                "Typical usage:\n"
                "\n"
                "    import _csv\n"
                "    p = _csv.parser()\n"
                "    fp = open('afile.csv', 'U')\n"
                "    for line in fp:\n"
                "        fields = p.parse(line)\n"
                "        if not fields:\n"
                "            # multi-line record\n"
                "            continue\n"
                "        # process the fields\n");
                
                PyMODINIT_FUNC
                init_csv(void)
           1    {
           1            PyObject *mod;
           1            PyObject *dict;
           1            PyObject *rev;
                
           1            if (PyType_Ready(&Parser_Type) < 0)
      ######                    return;
                
                        /* Create the module and add the functions */
           1            mod = Py_InitModule3("_csv", csv_methods, csv_module_doc);
           1            if (mod == NULL)
      ######                    return;
                
                        /* Add version to the module. */
           1            dict = PyModule_GetDict(mod);
           1            if (dict == NULL)
      ######                    return;
           1            rev = PyString_FromString("1.0");
           1            if (rev == NULL)
      ######                    return;
           1            if (PyDict_SetItemString(dict, "__version__", rev) < 0)
      ######                    return;
                
                        /* Add the CSV exception object to the module. */
           1            error_obj = PyErr_NewException("_csv.Error", NULL, NULL);
           1            if (error_obj == NULL)
      ######                    return;
                
           1            PyDict_SetItemString(dict, "Error", error_obj);
                
           1            Py_XDECREF(rev);
           1            Py_XDECREF(error_obj);
                }

From skip at pobox.com  Mon Feb  3 21:23:46 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 14:23:46 -0600
Subject: [Csv] Something's fishy...
Message-ID: <15934.53202.757331.426317@montanaro.dyndns.org>

I rearranged the dialect initialization stuff a bit earlier today, then
started getting segfaults from the interpreter.  From gdb I was able to
track it down to what looked like an invalid keyword_args parameter passed
to csv_parser():

    (gdb) p args
    $1 = (PyObject *) 0x418030
    (gdb) pyo args
    object  : ()
    type    : tuple
    refcount: 1911
    address : 0x418030
    $2 = void
    (gdb) p keyword_args
    $3 = (PyObject *) 0xb3b0c0
    (gdb) pyo keyword_args
    object  : {'delimiter': ',', 'escapechar': None, 'lineterminator': 
    Program received signal EXC_BAD_INSTRUCTION, Illegal instruction/operand.
    0x0067d940 in ?? ()
    The program being debugged was signaled while in a function called from GDB.
    GDB remains in the frame where the signal was received.
    To change this behavior use "set unwindonsignal on"
    Evaluation of the expression containing the function (_PyObject_Dump) will be abandoned.

pyo is a user-defined gdb command:

    define pyo
    print _PyObject_Dump($arg0)
    end

Figuring there was maybe something wrong there, I stuck a print statement in
csv.py:_OCcsv:__init__ just before the last line of the method:

    print ">>", parser_options

The segfault went away.  I was left with a lot of output and one error:

    % python test_csv.py
    >> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    ..>> {'delimiter': '\t', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    ...E>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': None, 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 0, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': '\\', 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 3, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': '\\', 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 3, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': '\\', 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 2, 'doublequote': True}
    .>> {'delimiter': ',', 'escapechar': '\\', 'lineterminator': '\r\n', 'skipinitialspace': False, 'quotechar': '"', 'quoting': 2, 'doublequote': True}
    .
    ======================================================================
    ERROR: test_register (__main__.TestDialects)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "test_csv.py", line 200, in test_register
        csv.unregister_dialect("myexceltsv")
    AttributeError: 'module' object has no attribute 'unregister_dialect'

    ----------------------------------------------------------------------
    Ran 38 tests in 0.314s

    FAILED (errors=1)

I added the missing function.  Everything seems fine once again.  I'm
checking in what I have, but _csv.c should probably be carefully inspected
to see if there's an argument out of place, a missing INCREF, an array
bounds violation, or something similar.  Errors don't just magically go
away.  Whatever was wrong is still wrong, just hiding.

Skip

From skip at pobox.com  Mon Feb  3 21:52:05 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 14:52:05 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15934.54901.429816.726580@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: sjmachin at lexicon.net (John Machin)
Subject: Re: PEP 305 - CSV File API
Date: 3 Feb 2003 12:12:50 -0800
Size: 5533
Url: http://mail.python.org/pipermail/csv/attachments/20030203/63ca6a7f/attachment.mht 

From skip at pobox.com  Mon Feb  3 21:53:03 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 14:53:03 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15934.54959.153280.689174@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: sjmachin at lexicon.net (John Machin)
Subject: Re: PEP 305 - CSV File API
Date: 3 Feb 2003 12:33:25 -0800
Size: 6295
Url: http://mail.python.org/pipermail/csv/attachments/20030203/331df17e/attachment.mht 

From andrewm at object-craft.com.au  Mon Feb  3 23:57:48 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 04 Feb 2003 09:57:48 +1100
Subject: [Csv] passing dialects directly - class or instance? 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15934.39674.453212.506375@montanaro.dyndns.org> 
References: <15934.39674.453212.506375@montanaro.dyndns.org> 
Message-ID: <20030203225748.766A63C1F4@coffee.object-craft.com.au>

>I thought users were supposed to pass dialect classes when not using
>strings.  I see, however, that _OCcsv.__init__ calls isinstance() instead of
>issubclass().  Which is it supposed to be?

An instance, I think - the PEP needs to be updated.

The code started to look really messy when I allowed it to accept either
a class, an instance, or a string. It seemed a small cost to lose the
"class" option. Thoughts?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Tue Feb  4 00:00:21 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 04 Feb 2003 10:00:21 +1100
Subject: [Csv] test coverage 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15934.40800.760370.712420@montanaro.dyndns.org> 
References: <15934.40800.760370.712420@montanaro.dyndns.org> 
Message-ID: <20030203230021.61E103C1F4@coffee.object-craft.com.au>

>Attached is the output of running a gcov-instrumented version of Python and
>the _csv module against the current test suite.

That's very handy - I'll have to build a gcov python for myself.

My thought was there would be dialect tests, and a set of tests for the
underlying module. The underlying module tests would probably be in a better
position to get coverage.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Tue Feb  4 00:38:14 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 17:38:14 -0600
Subject: [Csv] Re: Something's fishy...
Message-ID: <15934.64870.901035.20516@montanaro.dyndns.org>

FYI, after another little bit of dialect reshuffling, the segfault is back:

    % python test_csv.py
    ESegmentation fault

Skip

From skip at pobox.com  Tue Feb  4 05:00:20 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 22:00:20 -0600
Subject: [Csv] Re: PEP 305 - CSV File API
In-Reply-To: <200302040146.27744.cribeiro@mail.inet.com.br>
References: <mailman.1044050476.7110.python-list@python.org>
        <v3sv11esgjpje2@news.supernews.com>
        <c76ff6fc.0302031212.e21706e@posting.google.com>
        <200302040146.27744.cribeiro@mail.inet.com.br>
Message-ID: <15935.15060.878910.808643@montanaro.dyndns.org>


    Carlos> The problem is, almost all my intermediate files have both
    Carlos> 'date' and 'float' columns. This is highly common in business,
    Carlos> specially if you are looking at sales figures and stuff like
    Carlos> that.

    Carlos> To compound my problem, Python writes floats with a period (.)
    Carlos> as a decimal separator. However, my copy of Excel is configured
    Carlos> for the brazilian locale, and it expects a comma (,) as the
    Carlos> decimal separator.

Can't you simply set the locale in your scripts so Python and Excel agree?
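
Something along these lines (a sketch only -- locale names are
platform-specific, pt_BR may not be installed everywhere, and the modern
spelling is locale.format_string):

```python
import locale

# Format a float with the locale's decimal separator before handing it
# to the CSV writer, so Excel in a comma-decimal locale reads it back
# as a number.
try:
    locale.setlocale(locale.LC_NUMERIC, "pt_BR.UTF-8")
except locale.Error:
    # Fall back to the environment's default locale if pt_BR is absent.
    locale.setlocale(locale.LC_NUMERIC, "")

formatted = locale.format_string("%.4f", 3.1416)
# '3,1416' under a comma-decimal locale, '3.1416' otherwise
```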

    Carlos> Now for the real issue. If I convert my floats to strings
    Carlos> *before* writing the CSV file, it will end up quoted (for
    Carlos> example, '3,1416') - assuming that the CSV library will work as
    Carlos> Skip said. This is not what I would expect, and in fact, it's
    Carlos> not what anyone working with different locale settings would
    Carlos> say.

It would only be quoted if you had comma as the delimiter or had set the
quoting parameter to QUOTE_ALWAYS.  What delimiter do you use in your CSV
files? 

    Carlos> Last, even if Python just wrote floats with the 'right' decimal
    Carlos> separator - comma, in my case - there still would be other
    Carlos> software packages that would expect to get periods. 

How would you like us to handle this?  Sounds like a case of being "damned if
we do, damned if we don't".

    Carlos> Or worse, I could try to send my data files to people in other
    Carlos> countries that would be unable to read it. In any event, there
    Carlos> is no automatic solution, but the ability to quickly adjust the
    Carlos> CSV library to get the correct behavior would be highly useful.

We have to come back to the fundamental issue that CSV files as commonly
understood contain no data type information.  It's possible that type
information could be passed in during write operations which would govern
the way the data is formatted when written.  (We've discussed it, but it's
not likely to be in the first release.)

Even if we solve the formatting issue, once the data is written out to the
file, if you ship it out of your locale, no information remains in the file
to indicate that 3,1416 is a number instead of a string containing digits
and a comma.  Similarly, if you choose to write dates out in an ambiguous
format, at the receiving end, the reader won't be able to tell what date
"02/03/03" represents.

Skip

From skip at pobox.com  Tue Feb  4 04:09:18 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 21:09:18 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15935.11998.292771.881378@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: Carlos Ribeiro <cribeiro at mail.inet.com.br>
Subject: Re: PEP 305 - CSV File API
Date: Tue, 4 Feb 2003 01:46:27 +0000
Size: 7000
Url: http://mail.python.org/pipermail/csv/attachments/20030203/94e8d73d/attachment.mht 

From skip at pobox.com  Tue Feb  4 05:04:13 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 22:04:13 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15935.15293.42587.634709@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: Carlos Ribeiro <cribeiro at mail.inet.com.br>
Subject: Re: PEP 305 - CSV File API
Date: Tue, 4 Feb 2003 02:18:03 +0000
Size: 5135
Url: http://mail.python.org/pipermail/csv/attachments/20030203/d37d2220/attachment.mht 

From skip at pobox.com  Tue Feb  4 06:05:31 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 3 Feb 2003 23:05:31 -0600
Subject: [Csv] There's definitely something going on
Message-ID: <15935.18971.504878.413266@montanaro.dyndns.org>

A version of Python from CVS configured using --with-pydebug complains
mightily about the test suite.  Here are some messages:

*** malloc[7685]: Deallocation of a pointer not malloced: 0x437008; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug
[... roughly two dozen more identical "Deallocation of a pointer not malloced" messages, differing only in the pointer value and interleaved with the test runner's progress dots ...]

Setting MallocHelp displayed a little help about other settable Malloc
variables, but nothing gave any more useful info.

I had a version of _csv.c which exported the QUOTE_* constants (safer than
defining them in two places, I think).  That barfed as well, though with a
negative reference count while (I think) setting the lineterminator
attribute of a Dialect instance.

I'm going to take another look in the morning.

Skip


From andrewm at object-craft.com.au  Tue Feb  4 07:07:57 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 04 Feb 2003 17:07:57 +1100
Subject: [Csv] Re: Something's fishy... 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15934.64870.901035.20516@montanaro.dyndns.org> 
References: <15934.64870.901035.20516@montanaro.dyndns.org> 
Message-ID: <20030204060757.8EE953CA89@coffee.object-craft.com.au>

>FYI, after another little bit of dialect reshuffling, the segfault is back:
>
>    % python test_csv.py
>    ESegmentation fault

Of course, it's working fine here (isn't that always the way). The source
is most likely the C module - what I'd suggest you do is try a dummy
replacement. Presuming that stops the crash, then start exercising
_csv's interface, bit by bit (create parser object, create parser object
with options, etc).

I'll build a version of python with pydebug on.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Tue Feb  4 07:26:53 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 04 Feb 2003 17:26:53 +1100
Subject: [Csv] Re: Something's fishy... 
In-Reply-To: Your message of "Tue, 04 Feb 2003 17:07:57 +1100."
             <20030204060757.8EE953CA89@coffee.object-craft.com.au> 
Message-ID: <20030204062653.60B8A3CA89@coffee.object-craft.com.au>

>I'll build a version of python with pydebug on.

Do I need to do any more than this?

$ python2.3-pydebug test_csv.py 
......................................
----------------------------------------------------------------------
Ran 38 tests in 0.066s

OK
[10916 refs]
$

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From ianb at colorstudy.com  Tue Feb  4 09:07:04 2003
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 4 Feb 2003 02:07:04 -0600
Subject: [Csv] Re: PEP 305 - CSV File API
In-Reply-To: <15935.15060.878910.808643@montanaro.dyndns.org>
Message-ID: <A598420A-3817-11D7-966A-000393C2D67E@colorstudy.com>

On Monday, February 3, 2003, at 10:00 PM, Skip Montanaro wrote:
> We have to come back to the fundamental issue that CSV files as
> commonly understood contain no data type information.  It's possible
> that type information could be passed in during write operations which
> would govern the way the data is formatted when written.  (We've
> discussed it, but it's not likely to be in the first release.)

I think plain strings should be the basic implementation.  I see two 
ways to provide specialization: for most cases you'd use wrappers, like 
a reader that uses the first row as column names.  You could even do 
some type conversion that way, but the exception would be a place where 
you wanted to distinguish between:

"1","Bob"
1,"Bob"

A wrapper could potentially handle some conversion, e.g., a CSV reader 
from Webware reads column headers like "id:int", and then converts that 
column to an integer.  Or it could try to convert everything, and those 
that fail get left as strings.
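The Webware-style header convention Ian describes can be sketched as a thin wrapper over a plain string reader. Here typed_reader is a hypothetical helper, not part of any shipped module:

```python
import csv
import io

def typed_reader(lines):
    # Hypothetical wrapper: a header like "id:int" makes that column an
    # integer; every other column is left as a string.
    rows = csv.reader(lines)
    names, converters = [], []
    for header in next(rows):
        name, _, kind = header.partition(":")
        names.append(name)
        converters.append(int if kind == "int" else str)
    for row in rows:
        yield dict(zip(names, (conv(v) for conv, v in zip(converters, row))))

records = list(typed_reader(io.StringIO("id:int,name\r\n1,Bob\r\n")))
```

The base reader stays dumb strings-only; all policy lives in the wrapper.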

I guess the alternatives I see for dealing more directly with quotes 
would be (a) having an option to return the string complete with 
quotes, and force quotes in the output or (b) if the reader/writer were 
implemented with some sort of class interface, a subclass could 
override the hypothetical quote/unquote methods.

Except for the quoting issue, I think all other customizations would 
best be done with wrappers anyway.  You can't magically get locale 
information into the file, or any other indication of how to 
handle the file -- providing a robust reader is the best you can do.

   Ian


From skip at pobox.com  Tue Feb  4 14:17:41 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 07:17:41 -0600
Subject: [Csv] Re: Something's fishy... 
In-Reply-To: <20030204062653.60B8A3CA89@coffee.object-craft.com.au>
References: <20030204060757.8EE953CA89@coffee.object-craft.com.au>
        <20030204062653.60B8A3CA89@coffee.object-craft.com.au>
Message-ID: <15935.48501.938077.703028@montanaro.dyndns.org>

    Andrew> Do I need to do any more than this?

I don't believe so.  Did you cvs up your Python tree?  Maybe it's not us at
all.

Skip


From skip at pobox.com  Tue Feb  4 14:19:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 07:19:27 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15935.48607.467509.289945@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: Dennis Lee Bieber <wlfraed at ix.netcom.com>
Subject: Re: PEP 305 - CSV File API
Date: Mon, 03 Feb 2003 20:56:40 -0800
Size: 5016
Url: http://mail.python.org/pipermail/csv/attachments/20030204/09e9e76c/attachment.mht 

From skip at pobox.com  Tue Feb  4 14:25:24 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 07:25:24 -0600
Subject: [Csv] Re: PEP 305 - CSV File API
In-Reply-To: <A598420A-3817-11D7-966A-000393C2D67E@colorstudy.com>
References: <15935.15060.878910.808643@montanaro.dyndns.org>
        <A598420A-3817-11D7-966A-000393C2D67E@colorstudy.com>
Message-ID: <15935.48964.695258.758710@montanaro.dyndns.org>


    Ian> Or it could try to convert everything, and those that fail get left
    Ian> as strings.

I think that would be a disaster.  What if data in one column consisted of
hex numbers?  Some would be evaluated as numbers, others left as strings.
The programmer would have to defend against that.

The only way I see to reliably ask the csv module to convert data is to
provide a list of type converters which take a single string as an argument.
There are performance implications of this approach, so it can't be the
default.  One reason I've used Object Craft's csv module up to now is that
it's written in C and is at minimum 5-10x faster than the other options
available.  I routinely read and write 5-10MB CSV files, so I'm sensitive to
performance degradation.
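The explicit-converter idea can be sketched in a few lines. Here convert_rows is a hypothetical helper; note how a hex column is handled deliberately rather than guessed at:

```python
import csv
import io

def convert_rows(reader, converters):
    # The caller supplies one converter (a callable taking a string) per
    # column, so nothing is guessed: a hex column gets a hex converter
    # and no value is silently left behind as a string.
    for row in reader:
        yield [conv(field) for conv, field in zip(converters, row)]

reader = csv.reader(io.StringIO("deadbeef,42\r\n"))
parsed = list(convert_rows(reader, [lambda s: int(s, 16), int]))
```

Since conversion is opt-in and per-column, readers that skip it pay no overhead.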

Skip


From skip at pobox.com  Tue Feb  4 15:55:24 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 08:55:24 -0600
Subject: [Csv] RE: PEP 305 - CSV File API (fwd)
Message-ID: <15935.54364.997807.967672@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: Simon Brunning <SBrunning at trisystems.co.uk>
Subject: RE: PEP 305 - CSV File API
Date: Tue, 4 Feb 2003 14:27:50 -0000
Size: 5396
Url: http://mail.python.org/pipermail/csv/attachments/20030204/9890e58d/attachment.mht 

From skip at pobox.com  Tue Feb  4 15:56:03 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 08:56:03 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15935.54403.326360.118880@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: Carlos Ribeiro <cribeiro at mail.inet.com.br>
Subject: Re: PEP 305 - CSV File API
Date: Tue, 4 Feb 2003 14:38:49 +0000
Size: 4819
Url: http://mail.python.org/pipermail/csv/attachments/20030204/e759d7fd/attachment.mht 

From skip at pobox.com  Tue Feb  4 16:02:29 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 09:02:29 -0600
Subject: [Csv] bug fixed
Message-ID: <15935.54789.655700.175234@montanaro.dyndns.org>


I should have paid closer attention to the errors malloc was giving me:

    *** malloc[1859]: Deallocation of a pointer not malloced: 0x4aa5e8; This
        could be a double free(), or free() called with the middle of an
        allocated block; Try setting environment variable MallocHelp to see
        tools to help debug

especially the bit about "free() called with the middle of an allocated
block".  Memory allocated with PyMem_Malloc() was being freed with free().
Since Python now uses its own custom allocator layered on top of malloc,
those calls really need to be balanced.  I've no idea what free() on the
platforms you were using was doing (maybe ignoring, maybe scribbling), but
thankfully free() on my Mac OS X machine complained.
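The middle-of-a-block failure mode is easy to reproduce outside Python. In this sketch, my_malloc/my_free are hypothetical stand-ins for an allocator like PyMem_Malloc/PyMem_Free that offsets into a larger block:

```c
#include <stdlib.h>
#include <assert.h>

#define HEADER 16  /* bookkeeping space before the user's bytes */

/* An allocator that returns a pointer offset into a larger block (as
   pymalloc effectively does) cannot have its results handed to the
   system free(). */
void *my_malloc(size_t n) {
    unsigned char *raw = malloc(n + HEADER);
    if (raw == NULL)
        return NULL;
    return raw + HEADER;  /* caller sees the middle of the real block */
}

void my_free(void *p) {
    /* Undo the offset first -- calling free(p) directly would be exactly
       the "free() called with the middle of an allocated block" error. */
    if (p != NULL)
        free((unsigned char *)p - HEADER);
}
```

Whether the mismatch crashes, leaks, or silently corrupts the heap depends entirely on the platform's malloc, which is why it showed up on Mac OS X and nowhere else.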

Please "cvs up".

BTW, the Object Craft csv module has the same problem.  Time to release 1.1?
;-)

Cheers,

Skip

From skip at pobox.com  Tue Feb  4 17:37:11 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 10:37:11 -0600
Subject: [Csv] Re: PEP 305 - CSV File API (fwd)
Message-ID: <15935.60471.601265.17928@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: Carlos Ribeiro <cribeiro at mail.inet.com.br>
Subject: Re: PEP 305 - CSV File API
Date: Tue, 4 Feb 2003 01:10:43 +0000
Size: 7825
Url: http://mail.python.org/pipermail/csv/attachments/20030204/9816f639/attachment.mht 

From andrewm at object-craft.com.au  Tue Feb  4 23:08:15 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 09:08:15 +1100
Subject: [Csv] Re: Something's fishy... 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15935.48501.938077.703028@montanaro.dyndns.org> 
References: <20030204060757.8EE953CA89@coffee.object-craft.com.au>
	<20030204062653.60B8A3CA89@coffee.object-craft.com.au>
	<15935.48501.938077.703028@montanaro.dyndns.org> 
Message-ID: <20030204220815.5509A3CA92@coffee.object-craft.com.au>

>    Andrew> Do I need to do any more than this?
>
>I don't believe so.  Did you cvs up your Python tree?  

I did.

>Maybe it's not us at all.

That's what I'm wondering. Are you testing on an x86 platform (the
endianness could affect obscure bugs)?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Tue Feb  4 23:36:36 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 09:36:36 +1100
Subject: [Csv] bug fixed 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15935.54789.655700.175234@montanaro.dyndns.org> 
References: <15935.54789.655700.175234@montanaro.dyndns.org> 
Message-ID: <20030204223636.850A93CA92@coffee.object-craft.com.au>

>especially the bit about "free() called with the middle of an allocated
>block".  Memory allocated with PyMem_Malloc() was being freed with free().
>Since Python now uses its own custom allocator layered on top of malloc,
>those calls really need to be balanced.  I've no idea what free() on the
>platforms you were using was doing (maybe ignoring, maybe scribbling), but
>thankfully free() on my Mac OS X machine complained.

It could even be something that endianness triggered, but most likely a
different malloc library. I guess you've answered my previous question
(re x86).

If I remember correctly, Python 2.3 and Python 2.2 are very different with
regard to memory allocation - most of our testing has been done with 2.2,
PyMem_Malloc is a thin layer on top of the system malloc, is it not?

>BTW, the Object Craft csv module has the same problem.  Time to release 1.1?
>;-)

Funnily enough, I had a weird heap corruption problem in a python
application that used csv a while back - because csv was the only
extension module I was using, I immediately assumed csv was the source,
and spent many hours trying to find a mishandled memory allocation. I
eventually decided the problem was elsewhere (can't remember details).
Looks like it might have been csv after all.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Tue Feb  4 23:41:03 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 16:41:03 -0600
Subject: [Csv] Re: Something's fishy... 
In-Reply-To: <20030204220815.5509A3CA92@coffee.object-craft.com.au>
References: <20030204060757.8EE953CA89@coffee.object-craft.com.au>
        <20030204062653.60B8A3CA89@coffee.object-craft.com.au>
        <15935.48501.938077.703028@montanaro.dyndns.org>
        <20030204220815.5509A3CA92@coffee.object-craft.com.au>
Message-ID: <15936.16767.338077.255345@montanaro.dyndns.org>


    >> Maybe it's not us at all.

    Andrew> That's what I'm wondering. Are you testing on an x86 platform
    Andrew> (the endianness could affect obscure bugs)?

Well, it was us. ;-)

At any rate, my day-to-day computer is a Ti Powerbook running Mac OS X.
Hopefully that's different enough from what you all run that we get
reasonable platform coverage.

Skip

From skip at pobox.com  Tue Feb  4 23:56:58 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 16:56:58 -0600
Subject: [Csv] bug fixed 
In-Reply-To: <20030204223636.850A93CA92@coffee.object-craft.com.au>
References: <15935.54789.655700.175234@montanaro.dyndns.org>
        <20030204223636.850A93CA92@coffee.object-craft.com.au>
Message-ID: <15936.17722.20563.750609@montanaro.dyndns.org>


    Andrew> If I remember correctly, Python 2.3 and Python 2.2 are very
    Andrew> different with regard to memory allocation - most of our testing
    Andrew> has been done with 2.2, PyMem_Malloc is a thin layer on top of
    Andrew> the system malloc, is it not?

No, it's the portal into PyMalloc, an efficient, generic small block
allocator.

    Andrew> Funnily enough, I had a weird heap corruption problem in a
    Andrew> python application that used csv a while back - because csv was
    Andrew> the only extension module I was using, I immediately assumed csv
    Andrew> was the source, and spent many hours trying to find a
    Andrew> mishandled memory allocation. I eventually decided the problem
    Andrew> was elsewhere (can't remember details).  Looks like it might
    Andrew> have been csv after all.

And the behavior will be different with different mallocs.  free() on Mac OS
X is apparently smart enough to realize it was being handed bad memory and
refused to really free() it.  Other free()'s might blindly charge ahead,
corrupting PyMalloc's memory.  In my case, I think it mostly just caused
memory leaks because those chunks were never freed as far as PyMalloc was
concerned.

Looking at .../include/python2.N/pymem.h, it looks like PyMem_Malloc and
PyMem_Free have needed to be paired up for a while whenever PyMalloc was
enabled.  From 2.1/2.2:

    extern DL_IMPORT(void *) PyMem_Malloc(size_t);
    extern DL_IMPORT(void *) PyMem_Realloc(void *, size_t);
    extern DL_IMPORT(void) PyMem_Free(void *);

From 2.3:

    PyAPI_FUNC(void *) PyMem_Malloc(size_t);
    PyAPI_FUNC(void *) PyMem_Realloc(void *, size_t);
    PyAPI_FUNC(void) PyMem_Free(void *);

The difference between 2.1/2.2 and 2.3 is that PyMalloc is enabled by
default in 2.3.

Skip


From andrewm at object-craft.com.au  Wed Feb  5 00:16:22 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 10:16:22 +1100
Subject: [Csv] bug fixed 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15936.17722.20563.750609@montanaro.dyndns.org> 
References: <15935.54789.655700.175234@montanaro.dyndns.org>
	<20030204223636.850A93CA92@coffee.object-craft.com.au>
	<15936.17722.20563.750609@montanaro.dyndns.org> 
Message-ID: <20030204231622.BE98E3CA92@coffee.object-craft.com.au>

>No, it's the portal into PyMalloc, an efficient, generic small block
>allocator.

But only when python is built with --enable-pymalloc - this only became
the default in 2.3 - prior to that, the default was to use the system
malloc.

>From 2.3:
>
>    PyAPI_FUNC(void *) PyMem_Malloc(size_t);
>    PyAPI_FUNC(void *) PyMem_Realloc(void *, size_t);
>    PyAPI_FUNC(void) PyMem_Free(void *);
>
>The difference between 2.1/2.2 and 2.3 is that PyMalloc is enabled by
>default in 2.3.

And not enabled by default in 2.2... which is approximately the point I was
trying to make in my previous e-mail... 8-)

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb  5 00:36:48 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 10:36:48 +1100
Subject: [Csv] QUOTE_* constants
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15935.18971.504878.413266@montanaro.dyndns.org> 
References: <15935.18971.504878.413266@montanaro.dyndns.org> 
Message-ID: <20030204233648.BFCD83CA92@coffee.object-craft.com.au>

>I had a version of _csv.c which exported the QUOTE_* constants (safer than
>defining it in two places I think).  That barfed as well, though with a
>negative reference count trying to (I think) set the lineterminator
>attribute of a Dialect instance.
>
>I'm going to take another look in the morning.

I definitely think this change is a good idea - let me know if you need a
hand to make it work.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb  5 02:52:10 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 12:52:10 +1100
Subject: [Csv] _csv bug
Message-ID: <20030205015210.2714C3CA92@coffee.object-craft.com.au>

This is okay:

    >>> p=_csv.parser()
    >>> p.join(['1','2','3,4'])
    '1,2,"3,4"\r\n'
    >>> p=_csv.parser()

As is this:

    >>> p=_csv.parser()
    >>> p.quotechar=None
    >>> p.join(['1','2','3,4'])
    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    _csv.Error: delimiter must be quoted or escaped

But this is broken:

    >>> p=_csv.parser(quotechar=None)
    >>> p.quoting
    3
    >>> p.quoting=1
    >>> p.join(['1','2','3,4'])
    '\x001\x00,\x002\x00,\x003,4\x00\r\n'

The obvious fix is to add an additional test to Parser_setattr to disallow
this combination. I've added this:

                if (!self->have_quotechar && n != QUOTE_NONE) {
                    PyErr_BadArgument();
                    return -1;
                }

But I don't entirely like the idea of raising such a generic error. If
anyone has a better suggestion, let me know.
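For comparison, the csv module as it eventually shipped handles both halves of this: the inconsistent quotechar/quoting combination is rejected up front, and QUOTE_NONE without an escapechar fails only when a field actually needs escaping:

```python
import csv
import io

buf = io.StringIO()

# quotechar=None together with an active quoting mode is rejected
# at construction time.
try:
    csv.writer(buf, quotechar=None, quoting=csv.QUOTE_ALL)
    inconsistent_ok = True
except TypeError:
    inconsistent_ok = False

# QUOTE_NONE with no escapechar is legal until a field actually
# contains the delimiter, at which point writing raises csv.Error.
w = csv.writer(buf, quoting=csv.QUOTE_NONE)
try:
    w.writerow(["1", "2", "3,4"])
    unescaped_ok = True
except csv.Error:
    unescaped_ok = False
```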

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb  5 03:05:14 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 13:05:14 +1100
Subject: [Csv] another _csv question
Message-ID: <20030205020514.DC52D3CA92@coffee.object-craft.com.au>

In csv_parser, while validating keyword arguments, we set quoting
to QUOTE_NONE if quotechar is not set - I think we should be raising
an exception in this case (but it must be deferred until all keyword
arguments have been parsed). Any objections?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Wed Feb  5 03:07:57 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 20:07:57 -0600
Subject: [Csv] _csv bug
In-Reply-To: <20030205015210.2714C3CA92@coffee.object-craft.com.au>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au>
Message-ID: <15936.29181.916055.866700@montanaro.dyndns.org>

    Andrew> But this is broken:

    Andrew> [quotechar=None && p.quoting==1]

    Andrew> The obvious fix is to add an additional test to Parser_setattr
    Andrew> to disallow this combination. I've added this:
    ...
    Andrew> But I don't entirely like the idea of raising such a generic
    Andrew> error. If anyone has a better suggestion, let me know.

We never really decided on the split between sanity checks in csv.py
vs. sanity checks in _csv.c, did we?  I've got a change to csv.py ready to
check in which adds __init__ and _validate methods to the Dialect class.  If
we do more elaborate checks there, I think we can get away with coarser
checks and exceptions in _csv.c, basically just stuff to keep the
interpreter from dumping core.

Skip


From skip at pobox.com  Wed Feb  5 03:10:06 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 20:10:06 -0600
Subject: [Csv] another _csv question
In-Reply-To: <20030205020514.DC52D3CA92@coffee.object-craft.com.au>
References: <20030205020514.DC52D3CA92@coffee.object-craft.com.au>
Message-ID: <15936.29310.765282.986147@montanaro.dyndns.org>


    Andrew> In csv_parser, while validating keyword arguments ...

Before we add a bunch of checks to _csv.c why don't we decide the split
between the Python and C levels as far as validation is concerned?

I have Dialect validation happening at instantiation time.  I suspect we
should provide a __setattr__ that forces Dialect instances to be read-only.

Skip


From andrewm at object-craft.com.au  Wed Feb  5 03:16:19 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 13:16:19 +1100
Subject: [Csv] _csv bug 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15936.29181.916055.866700@montanaro.dyndns.org> 
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au>
	<15936.29181.916055.866700@montanaro.dyndns.org> 
Message-ID: <20030205021619.6AC2C3CA92@coffee.object-craft.com.au>

>We never really decided on the split between sanity checkes in csv.py
>vs. sanity checks in _csv.c did we?  I've got a change to csv.py ready to
>check in which adds __init__ and _validate methods to the Dialect class.  If
>we do more elaborate checks there, I think we can get away with coarser
>checks and exceptions in _csv.c, bascially just stuff to keep the
>interpreter from dumping core.

I'd like the underlying _csv module to be sane in its own right -
I'd really rather these tests were kept in _csv. It's also where the
parameters have meaning - if you're adding a new parameter to _csv,
then you're more likely to add appropriate tests than if you also have
to update csv.py.

I also suspect we can move more functionality from csv.py into _csv to
reduce overhead further. Some benchmarking is required - it might be that
we can become significantly faster by having _csv talk directly to the
fileobj when writing, etc.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb  5 03:20:35 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 13:20:35 +1100
Subject: [Csv] another _csv question 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15936.29310.765282.986147@montanaro.dyndns.org> 
References: <20030205020514.DC52D3CA92@coffee.object-craft.com.au>
	<15936.29310.765282.986147@montanaro.dyndns.org> 
Message-ID: <20030205022036.F3E3C3CA92@coffee.object-craft.com.au>

>I suspect we should provide a __setattr__ that forces Dialect instances to
>be read-only.

I think this is an unnecessary restriction. You might want to do something
like:

    class SnifferDialect(csv.Dialect):
        pass

    def sniff(...):
        dialect = SnifferDialect()
        ... try stuff ...
        dialect.delimiter = '\t'
        ... try more stuff ...

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Wed Feb  5 03:35:40 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 20:35:40 -0600
Subject: [Csv] _csv bug 
In-Reply-To: <20030205021619.6AC2C3CA92@coffee.object-craft.com.au>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au>
        <15936.29181.916055.866700@montanaro.dyndns.org>
        <20030205021619.6AC2C3CA92@coffee.object-craft.com.au>
Message-ID: <15936.30844.628733.452308@montanaro.dyndns.org>

    Andrew> I'd like the underlying _csv module to be sane in its own right
    Andrew> - I'd really rather these tests were kept in _csv.

No argument here.  I'm just thinking that the _csv module only has to defend
against rotten inputs.  It can raise a generic error as far as I'm
concerned.

    Andrew> I also suspect we can move more functionality from csv.py into
    Andrew> _csv to reduce overhead further. Some benchmarking is required -
    Andrew> it might be that we can become significantly faster by having
    Andrew> _csv talk directly to the fileobj when writing, etc.

What I'm talking about happens once, at Dialect instantiation time, so I
doubt performance is going to be a big issue.  It's also easier to give more
comprehensive feedback in Python.

Skip

From skip at pobox.com  Wed Feb  5 03:39:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 20:39:27 -0600
Subject: [Csv] another _csv question 
In-Reply-To: <20030205022036.F3E3C3CA92@coffee.object-craft.com.au>
References: <20030205020514.DC52D3CA92@coffee.object-craft.com.au>
        <15936.29310.765282.986147@montanaro.dyndns.org>
        <20030205022036.F3E3C3CA92@coffee.object-craft.com.au>
Message-ID: <15936.31071.461489.580370@montanaro.dyndns.org>


    >> I suspect we should provide a __setattr__ that forces Dialect
    >> instances to be read-only.

    Andrew> I think this is an unnecessary restriction. You might want to do
    Andrew> something like:

    Andrew>     class SnifferDialect(csv.Dialect):
    Andrew>         pass

    Andrew>     def sniff(...):
    Andrew>         dialect = SnifferDialect()
    Andrew>         ... try stuff ...
    Andrew>         dialect.delimiter = '\t'
    Andrew>         ... try more stuff ...

I can buy that.  Maybe what we need then is some way to force validation
after changes are made, but before the dialect info is tossed over the wall
to the low-level module.

Skip


From andrewm at object-craft.com.au  Wed Feb  5 04:26:17 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 14:26:17 +1100
Subject: [Csv] another _csv question 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15936.31071.461489.580370@montanaro.dyndns.org> 
References: <20030205020514.DC52D3CA92@coffee.object-craft.com.au>
	<15936.29310.765282.986147@montanaro.dyndns.org>
	<20030205022036.F3E3C3CA92@coffee.object-craft.com.au>
	<15936.31071.461489.580370@montanaro.dyndns.org> 
Message-ID: <20030205032617.2D5333CA92@coffee.object-craft.com.au>

>    Andrew> I think this is an unnecessary restriction. You might want to do
>    Andrew> something like:
>
>    Andrew>     class SnifferDialect(csv.Dialect):
>    Andrew>         pass
>
>    Andrew>     def sniff(...):
>    Andrew>         dialect = SnifferDialect()
>    Andrew>         ... try stuff ...
>    Andrew>         dialect.delimiter = '\t'
>    Andrew>         ... try more stuff ...
>
>I can buy that.  Maybe what we need then is some way to force validation
>after changes are made, but before the dialect info is tossed over the wall
>to the low-level module.

I think we're trying too hard - it's acceptable for the validation to
occur only when the reader or writer factories are called.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb  5 04:27:31 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 14:27:31 +1100
Subject: [Csv] _csv bug 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15936.30844.628733.452308@montanaro.dyndns.org> 
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au>
	<15936.29181.916055.866700@montanaro.dyndns.org>
	<20030205021619.6AC2C3CA92@coffee.object-craft.com.au>
	<15936.30844.628733.452308@montanaro.dyndns.org> 
Message-ID: <20030205032732.049303CA92@coffee.object-craft.com.au>

>    Andrew> I also suspect we can move more functionality from csv.py into
>    Andrew> _csv to reduce overhead further. Some benchmarking is required -
>    Andrew> it might be that we can become significantly faster by having
>    Andrew> _csv talk directly to the fileobj when writing, etc.
>
>What I'm talking about happens once, at Dialect instantiation time, so I
>doubt performance is going to be a big issue.  It's also easier to give more
>comprehensive feedback in Python.

What sort of comprehensive feedback did you have in mind?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Wed Feb  5 04:41:59 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 4 Feb 2003 21:41:59 -0600
Subject: [Csv] _csv bug 
In-Reply-To: <20030205032732.049303CA92@coffee.object-craft.com.au>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au>
        <15936.29181.916055.866700@montanaro.dyndns.org>
        <20030205021619.6AC2C3CA92@coffee.object-craft.com.au>
        <15936.30844.628733.452308@montanaro.dyndns.org>
        <20030205032732.049303CA92@coffee.object-craft.com.au>
Message-ID: <15936.34823.558627.998337@montanaro.dyndns.org>


    >> It's also easier to give more comprehensive feedback in Python.

    Andrew> What sort of comprehensive feedback did you have in mind?

Stuff like:

    class myexcel(csv.excel):
        quotechar = ','
    ...
    quotechar and delimiter must be different

or

    class myexcel(csv.excel):
        lineterminator = '\n'
    ...
    lineterminator and the hard return character should be different

That sort of thing.  (Speaking of which, we should probably allow the user
to specify the hard (embedded) return character.)  It's tough enough in C to
generate really good messages (because it often requires pasting strings
together on-the-fly to provide the necessary context) that it frequently
doesn't get done.  For example, if I pass None instead of an int for
parameters with 'i' format characters, all PyArg_PTAK says is "int was
required".  However, there are nine args to the constructor, five of which
are ints.

Skip

From andrewm at object-craft.com.au  Wed Feb  5 04:49:32 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 14:49:32 +1100
Subject: [Csv] more tests
Message-ID: <20030205034932.3A1C83CA92@coffee.object-craft.com.au>

I've checked in some more tests - while not comprehensive, they get us
close to 90% coverage, as calculated by gcov. The remaining untested
lines are mainly checks for failed memory allocations.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb  5 04:54:24 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Feb 2003 14:54:24 +1100
Subject: [Csv] _csv bug 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15936.34823.558627.998337@montanaro.dyndns.org> 
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au>
	<15936.29181.916055.866700@montanaro.dyndns.org>
	<20030205021619.6AC2C3CA92@coffee.object-craft.com.au>
	<15936.30844.628733.452308@montanaro.dyndns.org>
	<20030205032732.049303CA92@coffee.object-craft.com.au>
	<15936.34823.558627.998337@montanaro.dyndns.org> 
Message-ID: <20030205035424.48D253CA92@coffee.object-craft.com.au>

>That sort of thing.  (Speaking of which, we should probably allow the user
>to specify the hard (embedded) return character.)  It's tough enough in C to
>generate really good messages (because it often requires pasting strings
>together on-the-fly to provide the necessary context) that it frequently
>doesn't get done.  For example, if I pass None instead of an int for
>parameters with 'i' format characters, all PyArg_PTAK says is "int was
>required".  However, there are nine args to the constructor, five of which
>are ints.

I'm not sure this is a good enough reason to move the checks away from the
"coalface" - with a little more work, we can generate friendly messages
from the C level, while at the same time keeping them tightly coupled
to the implementation. I'd certainly agree the PyArg_PTAK validation
is less than useful in our context - but I think it highlights a more
fundamental problem in the way the C code is structured. I'll talk to
Dave tonight and see if we can come up with something better.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From djc at object-craft.com.au  Wed Feb  5 11:11:21 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 05 Feb 2003 21:11:21 +1100
Subject: [Csv] _csv bug
In-Reply-To: <20030205035424.48D253CA92@coffee.object-craft.com.au>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au>
	<15936.29181.916055.866700@montanaro.dyndns.org>
	<20030205021619.6AC2C3CA92@coffee.object-craft.com.au>
	<15936.30844.628733.452308@montanaro.dyndns.org>
	<20030205032732.049303CA92@coffee.object-craft.com.au>
	<15936.34823.558627.998337@montanaro.dyndns.org>
	<20030205035424.48D253CA92@coffee.object-craft.com.au>
Message-ID: <m3hebjghpy.fsf@ferret.object-craft.com.au>

>>>>> "Andrew" == Andrew McNamara <andrewm at object-craft.com.au> writes:

>> That sort of thing.  (Speaking of which, we should probably allow
>> the user to specify the hard (embedded) return character.)  It's tough
>> enough in C to generate really good messages (because it often
>> requires pasting strings together on-the-fly to provide the
>> necessary context) that it frequently doesn't get done.  For
>> example, if I pass None instead of an int for parameters with 'i'
>> format characters, all PyArg_PTAK says is "int was required".
>> However, there are nine args to the constructor, five of which are
>> ints.

Andrew> I'm not sure this is a good enough reason to move the checks
Andrew> away from the "coalface" - with a little more work, we can
Andrew> generate friendly messages from the C level, while at the same
Andrew> time keeping them tightly coupled to the implementation. I'd
Andrew> certainly agree the PyArg_PTAK validation is less than useful
Andrew> in our context - but I think it highlights a more fundamental
Andrew> problem in the way the C code is structured. I'll talk to Dave
Andrew> tonight and see if we can come up with something better.

We spoke for a short while and decided that it might make more sense
to remove the PyArg_PTAK stuff altogether and just use the __setattr__
stuff in _csv.

One of the problems in this approach is that PyArg_PTAK allows you to
set multiple attributes simultaneously while __setattr__ is one
attribute at a time.  This means that it is not really feasible to
validate settings in the __setattr__ method - the user would have to
work out a sequence of __setattr__ steps to go from one dialect to the
next without ever having illegal parameter settings.

There are two obvious ways around this that I can see.

1.  Mark the parser dirty whenever __setattr__ is called then check
    the dirty flag on the next method call which uses the parser.  If
    parser is dirty, check that the parameter set is valid.

2.  Only check the legality of the parameter set when the user calls
    the check_attrs() (or whatever) method.
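Option 1 might look roughly like the following (a hedged sketch; the names
`_dirty` and `_check_attrs` are invented for illustration, and the "parser"
here is a trivial stand-in for the real C implementation):

```python
class Parser:
    def __init__(self, delimiter=',', quotechar='"'):
        # Bypass __setattr__ for initial setup, then mark dirty once.
        self.__dict__['delimiter'] = delimiter
        self.__dict__['quotechar'] = quotechar
        self.__dict__['_dirty'] = True

    def __setattr__(self, name, value):
        # Record the change but defer validation, so the user can move
        # between dialects without hitting transient illegal states.
        self.__dict__[name] = value
        self.__dict__['_dirty'] = True

    def _check_attrs(self):
        # Validate the parameter set as a whole.
        if self.delimiter == self.quotechar:
            raise ValueError("delimiter and quotechar must differ")
        self.__dict__['_dirty'] = False

    def parse(self, line):
        if self._dirty:
            self._check_attrs()   # lazy validation on next use
        return line.split(self.delimiter)
```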

- Dave

-- 
http://www.object-craft.com.au


From djc at object-craft.com.au  Wed Feb  5 11:14:04 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 05 Feb 2003 21:14:04 +1100
Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv
	_csv.c,1.9,1.10
In-Reply-To: <15933.57844.794060.305738@montanaro.dyndns.org>
References: <E18fJ2y-0004VH-00@sc8-pr-cvs1.sourceforge.net>
	<15933.57844.794060.305738@montanaro.dyndns.org>
Message-ID: <m3d6m7ghlf.fsf@ferret.object-craft.com.au>

>>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:

dave> Oops - forgot to check for '+-.' when quoting is
dave> QUOTE_NONNUMERIC.

Skip> Looking at the code, I wonder if when quoting is set to
Skip> NONNUMERIC a single attempt to call PyFloat_FromString(field)
Skip> should be made and the result used to identify the field as
Skip> numeric or not.  (Not for performance, but for accuracy of the
Skip> setting.)

You are probably right.  The current code is completely ignorant of
locale settings.

- Dave

-- 
http://www.object-craft.com.au


From djc at object-craft.com.au  Wed Feb  5 11:30:12 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 05 Feb 2003 21:30:12 +1100
Subject: [Csv] csv.QUOTE_NEVER?
In-Reply-To: <20030203035102.34B183C1F4@coffee.object-craft.com.au>
References: <15930.60672.18719.407166@montanaro.dyndns.org>
	<m38yx0shk8.fsf@ferret.object-craft.com.au>
	<20030203035102.34B183C1F4@coffee.object-craft.com.au>
Message-ID: <m38ywvgguj.fsf@ferret.object-craft.com.au>

>>>>> "Andrew" == Andrew McNamara <andrewm at object-craft.com.au> writes:

Skip> The three quoting constants are currently defined as
Skip> QUOTE_MINIMAL, QUOTE_ALL and QUOTE_NONNUMERIC.  Didn't we decide
Skip> there would be a QUOTE_NEVER constant as well?
>>  I was going to define QUOTE_NEVER then realised that all you have
>> to do is set quotechar to None.  Why add the effort of implementing
>> two ways to achieve the same thing.

Andrew> "quotechar" as None probably should be illegal in the new
Andrew> module, and the "quoting" parameter used exclusively. This
Andrew> would be consistent with the direction we've taken with other
Andrew> parameters.

OK.

I have made the changes to the _csv module.  I am not sure what to do
with the tests.  It seems a shame to delete them - can you have a look
and see if there is some way you can change the failing tests to
meaningful tests which succeed with the new module?

- Dave

-- 
http://www.object-craft.com.au


From skip at pobox.com  Wed Feb  5 15:47:28 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 5 Feb 2003 08:47:28 -0600
Subject: [Csv] _csv bug
In-Reply-To: <m3hebjghpy.fsf@ferret.object-craft.com.au>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au>
        <15936.29181.916055.866700@montanaro.dyndns.org>
        <20030205021619.6AC2C3CA92@coffee.object-craft.com.au>
        <15936.30844.628733.452308@montanaro.dyndns.org>
        <20030205032732.049303CA92@coffee.object-craft.com.au>
        <15936.34823.558627.998337@montanaro.dyndns.org>
        <20030205035424.48D253CA92@coffee.object-craft.com.au>
        <m3hebjghpy.fsf@ferret.object-craft.com.au>
Message-ID: <15937.9216.15907.519181@montanaro.dyndns.org>


    Dave> We spoke for a short while and decided that it might make more
    Dave> sense to remove the PyArg_PTAK stuff altogether and just use the
    Dave> __setattr__ stuff in _csv.

Why not replace the 'i' and 'S' flags with 'O' flags in PyArg_PTAK, then
validate on the resulting group of Python objects?

Skip

From andrewm at object-craft.com.au  Wed Feb  5 23:35:48 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 06 Feb 2003 09:35:48 +1100
Subject: [Csv] csv.QUOTE_NEVER? 
In-Reply-To: Message from Dave Cole <djc@object-craft.com.au> 
	<m38ywvgguj.fsf@ferret.object-craft.com.au> 
References: <15930.60672.18719.407166@montanaro.dyndns.org>
	<m38yx0shk8.fsf@ferret.object-craft.com.au>
	<20030203035102.34B183C1F4@coffee.object-craft.com.au>
	<m38ywvgguj.fsf@ferret.object-craft.com.au> 
Message-ID: <20030205223548.6E3E73CA92@coffee.object-craft.com.au>

>I have made the changes to the _csv module.  I am not sure what to do
>with the tests.  It seems a shame to delete them - can you have a look
>and see if there is some way you can change the failing tests to
>meaningful tests which succeed with the new module?

Nearly all of those tests were no longer relevant when the parameters
were rationalised - they were doing things like checking that "quoting"
was changed to something reasonable when "quotechar" was changed,
or that an exception was raised when their values conflicted. The new
scheme prevents that happening in a far more reasonable manner.

BTW, I think we've introduced a bug when we split some of the variables
into "have_<blah>" and "<blah>", where <blah> could then contain a null -
in many places we assume we're dealing with null-terminated C strings,
and I suspect the user might now be able to inject a null in places we
don't expect it.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From cribeiro at mail.inet.com.br  Wed Feb  5 23:52:18 2003
From: cribeiro at mail.inet.com.br (Carlos Ribeiro)
Date: Wed, 5 Feb 2003 22:52:18 +0000
Subject: [Csv] PEP 305 - Comments (really long post)
Message-ID: <200302052252.18694.cribeiro@mail.inet.com.br>

To the CSV'ers,

I was discussing the CSV implementation with Skip 'et alii' on the Python 
list, but I decided to wait a little bit to put my ideas in better form 
before contributing to the PEP. Please bear with me as this message is 
rather long and may be confusing at times, but I sincerely hope it helps.

MOTIVATION

I use CSV files (almost) daily. I have written a few CSV parsers in my life, 
and for reasons that should be apparent to anyone who has ever worked with 
them, I never managed to put everything into a single generic library. The 
problem is simple yet 'catchy' at the same time - just when you think you've 
got it right, along comes some other piece of software that handles things 
differently.


CONCEPTUAL ISSUES

------------
[1] I know that localization is a hard problem, and that it will probably 
not be directly supported, but let me explain why it is important and what 
the issues at hand are.

One of the biggest issues with reading and writing CSV data is localization. 
In countries where the comma is used as a decimal separator, it is common to 
have some other character serve as the delimiter. The semicolon is the 
standard choice for MS packages using the Brazilian locale; I'm not aware of 
the settings for other countries. So far, fine, because the csv library can 
be configured with alternative field delimiters. But parsing numbers is a real 
problem. For example, look at these lines:

"row 1";10
"row 2";3,1416

--> I assume that the csv library will parse the first line as ("row 1",10); 
the number 10 will probably be returned as an integer (which is not the 
correct interpretation for this particular file - more on this in item [2]).

--> The second line will probably be parsed as ("row 2","3,1416"); it may even 
raise an exception, depending on the implementation details! What do you 
intend to do in this case?

Another point that you should bear in mind: even here in Brazil, some programs 
will use the standard (US) delimiters and number formats, while others will 
use the localized ones. So we end up needing to read/write both formats - for 
example, when reading data from Excel, and then exporting the same data to 
some scientific package that is not locale-aware. So any localization-related 
parameters have to be flexible and easily customizable.

------------
[2] I assume that the csv library will convert any numbers read from the csv 
file to one of the numeric types available in Python upon reading. There are 
some issues here.

In most cases, it is important to keep a regular conversion rule inside a 
given 'column' of the csv file. For example, in this file:

"row 1";10;1
"row 2";3,1416;2
"row 3";-1;3

The obvious choice is to parse column 1 as strings, column 2 as floats, 
and column 3 as integers. But the problem is, how is the csv library supposed 
to know that the second column holds float values, and not integers? Look 
ahead is out of the question - after all, the only line containing a decimal 
point may be the last one in a 10 GB file.

For this problem, I propose the following semantics:

a) Numbers will be interpreted according to a parameter set in the dialect:
- NUMBER_AS_AUTO: (default value) numeric values will be converted to the 
simplest type available.
- NUMBER_AS_FLOAT: all numeric values will be converted to floats.
- NUMBER_AS_INT: all numeric values will be converted to ints.

b) assuming that the default column types were not supplied, the csv library 
will try to detect the correct types from the values read from the first line 
of the file, while respecting the parameters mentioned above. If the first line 
contains column headers, then it will use the second line for this purpose.

c) from the second line onwards, the csv library will keep the same conversion 
used for the first line. In case of error (for example, a float is found in an 
integer-only column), the library may take one of these actions:

- raise an exception
- coerce the value to the standard type for that particular column
- return the value as read, even if using a different type
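One hedged sketch of how rules (b) and (c) could work - inferring each
column's type from the first data row, then coercing later rows to the same
type (the function names are invented; this is an illustration of the
proposal, not the csv module's behaviour):

```python
def infer_types(first_row):
    # Pick the simplest conversion that succeeds for each field.
    types = []
    for field in first_row:
        for conv in (int, float):
            try:
                conv(field)
                types.append(conv)
                break
            except ValueError:
                pass
        else:
            types.append(str)
    return types

def convert_rows(rows):
    rows = iter(rows)
    first = next(rows)
    types = infer_types(first)
    yield [conv(f) for conv, f in zip(types, first)]
    for row in rows:
        # Coerce to the column's established type (one of the proposed
        # error-handling choices; could instead raise, or return as read).
        yield [conv(f) for conv, f in zip(types, row)]
```

Note that this sketch still trips over exactly the case described above: if
the first data row holds "10" in a column that later contains "3,1416", the
column is locked to int and the coercion fails.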

------------
[3] I really liked the concept of the csvreader interface. I liked the fact 
that the constructor takes a file object, not a file name, for the simple 
reason that it leads to a very nice design pattern. It makes it easier to 
compose objects and to reuse the csvreader with things other than files 
(just one idea: why not read CSV values directly from the clipboard? it's a 
good application of this design).

------------
[4] That said, I have one concern: setting the line terminator in the CSV 
library (using the dialect class) does not seem right. If I pass a generic 
iterable object as the CSV file parameter, then it implies that the iterable 
itself should bear the choice of how to break lines. For instance, one should 
be able to write code like this:

# export a file with CR/LF linebreaks (DOS/Windows style)
csvwriter = csv.writer(file_crlf("some.csv", "w"))
for row in mydata:
    csvwriter.write(row)
# export a file with LF linebreaks (Unix style)
csvwriter = csv.writer(file_lf_only("some.csv", "w"))
for row in mydata:
    csvwriter.write(row)

... where 'file_crlf' and 'file_lf_only' are subclasses of 'file' that 
implement different line terminators (it has to be independent of the 
underlying OS and/or C library, of course). 
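As a hedged, present-day sketch of that idea - wrapping a binary stream
rather than subclassing `file`, with the `file_crlf` name borrowed from the
example above:

```python
import io

class file_crlf:
    """File-like wrapper that owns the line terminator: it translates
    LF to CRLF on write, so the CSV writer stays terminator-agnostic."""

    def __init__(self, raw):
        self.raw = raw          # any binary stream

    def write(self, s):
        self.raw.write(s.replace('\n', '\r\n').encode('ascii'))
```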

[Of course, this is religious stuff - the old Unix vs. DOS/Windows line break 
debate.  But please let us avoid it and focus on the problem at hand.]

My point here is that the line terminator in the CSV library will end up being 
useless, as it depends ultimately on the ability of the csvwriter.write() 
method to convince the file object to use the 'correct' line terminator. I'm 
not sure if this can be done in a generic fashion, unless more restrictions 
are placed on the 'file-like' object that can be passed to the constructor.

In other words: IF the csv library takes a file object as a parameter, IN SUCH 
A WAY that all that the csv library sees are entire lines (as strings), THEN 
it has to delegate line termination to the file object. On the other hand, if 
the csv library wants to have full control of line delimiters, then it should 
take a file name only and treat the file as a binary stream.

------------
[5] It is not clear to me what is returned as a row in the example given:

csvreader = csv.reader(file("some.csv"))
for row in csvreader:
    process(row)

It is natural to assume that 'row' is a sequence, probably a tuple. In any 
case, it should be clearly stated in the PEP.

------------
[6] Empty strings can be mistaken for NULL fields (or None values in Python). 
How do you intend to manage this case, both when reading and writing? Please 
note that, depending on the choice of quoting behavior and due to some 
side effects, it may be impossible for the reader to discern the two cases; 
so the library will need to be informed of the default choice.

For example, for a given quote and delimiter choice, what does the reader do?

Example (a): [quotechar=", quoting=QUOTE_ALL]
  "",,""  -->  ("", None, "")

In this case, the reader can assume safely that the empty field holds 'None', 
because empty strings should be quoted in this case.

Example (b): [quotechar=", quoting=QUOTE_NONE]
  ,,  -->  (None, None, None) or ("","","")?
  "",,""  -->  ("", None, "") or ("", "", "")?

The example shows the ambiguity of empty fields when quotes are not 
mandatory (as with QUOTE_NONE or QUOTE_MINIMAL). In this example, I still 
think that the reader should interpret empty fields as None; the second 
case is easier to guess, but the first case is open for debate.

My suggestion is to add a parameter in the dialect to set the correct 
behavior:

class excel:
    delimiter = ','
    quotechar = '"'
    escapechar = None
    doublequote = True
    skipinitialspace = False
    lineterminator = '\r\n'
    quoting = QUOTE_MINIMAL
    emptyfield = None              # add this definition to the dialect
    # emptyfield = ""            # another possible choice

BTW, the same reasoning may be applied to the decision between returning 
'None' or 'zero' when reading an empty numeric field.
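On the reading side, one hedged sketch of how such an `emptyfield` setting
could be applied as a post-processing step (the function name is invented,
and the name `emptyfield` is taken from the suggestion above):

```python
def apply_emptyfield(row, emptyfield=None):
    # Map empty fields to the dialect's chosen value.  Under
    # QUOTE_NONE/QUOTE_MINIMAL the quoted-vs-unquoted distinction is
    # already lost by this point, which is exactly the ambiguity above.
    return [emptyfield if f == '' else f for f in row]
```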

------------
[7] A minor suggestion: why not number the items in the "Issues" section? It 
would make it easier to reference comments... For example, 'issue #1', etc...

------------
[8] My comments on the last issue of the list - rows of different lengths:

It depends on the goals of the csv library. If the library is intended to be 
small and simple, and not to do anything automatically, then the reader 
should simply return a sequence of fields read, independent of the length. It 
should be left to the programmer to handle special cases.

On the other hand, if the csv library is being proposed as a more generic 
solution, it may be interesting to study the options presented (raise an 
exception, fill in short lines, etc.). However, in this case, the csv reader 
will need to know more about the particular structure of the csv file in 
order to be able to make the correct choice. It may include things as knowing 
the column type, etc. That's complex, and it is one of the reasons why 
special treatment for floats and dates is being left out of this 
implementation.

------------
[9] A very similar architecture can be used to handle fixed-width text files. 
It can be done in a separate library, but using a similar interface; or it 
could be part of the csv library, either as another class, or by means of a 
proper selection of the parameters passed to the constructor. It would be 
useful, as some applications may prefer fixed-width files to delimited ones 
(old COBOL programs are likely to behave this way; this format is still 
common when passing data to/from mainframes).


Carlos Ribeiro
cribeiro at mail.inet.com.br

From andrewm at object-craft.com.au  Thu Feb  6 02:04:09 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 06 Feb 2003 12:04:09 +1100
Subject: [Csv] Dialect passing...
Message-ID: <20030206010409.BE0F23CA92@coffee.object-craft.com.au>

Just to throw the cat amongst the pigeons, it occurred to me that my
logic for making the dialect an instance rather than a dict was slightly
bogus: the inheritance can still be done with a dictionary, simply by
copying it:

excel = { 'delimiter': ',' }

excel_tab = excel.copy()
excel_tab['delimiter'] = '\t'

That said, the instance approach looks a little more natural to my eye.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From djc at object-craft.com.au  Thu Feb  6 02:29:04 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 06 Feb 2003 12:29:04 +1100
Subject: [Csv] Dialect passing...
In-Reply-To: <20030206010409.BE0F23CA92@coffee.object-craft.com.au>
References: <20030206010409.BE0F23CA92@coffee.object-craft.com.au>
Message-ID: <m3of5qyz6n.fsf@ferret.object-craft.com.au>

>>>>> "Andrew" == Andrew McNamara <andrewm at object-craft.com.au> writes:

Andrew> Just to throw the cat amongst the pigeons, it occurred to me
Andrew> that my logic for making the dialect an instance rather than a
Andrew> dict was slightly bogus: the inheritance can still be done
Andrew> with a dictionary, simply by copying it:

Andrew> excel = { 'delimiter': ',' }

Andrew> excel_tab = excel.copy()
Andrew> excel_tab['delimiter'] = '\t'

Noooo.....

I hereby sentence you to 2 weeks hard labour writing Perl so you can
learn the error of your ways!

Andrew> That said, the instance approach looks a little more natural
Andrew> to my eye.

I can see you have expressed remorse.  I commute your punishment
to community work - to be served online.

- Dave

-- 
http://www.object-craft.com.au


From djc at object-craft.com.au  Thu Feb  6 02:29:36 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 06 Feb 2003 12:29:36 +1100
Subject: [Csv] _csv bug
In-Reply-To: <15937.9216.15907.519181@montanaro.dyndns.org>
References: <20030205015210.2714C3CA92@coffee.object-craft.com.au>
	<15936.29181.916055.866700@montanaro.dyndns.org>
	<20030205021619.6AC2C3CA92@coffee.object-craft.com.au>
	<15936.30844.628733.452308@montanaro.dyndns.org>
	<20030205032732.049303CA92@coffee.object-craft.com.au>
	<15936.34823.558627.998337@montanaro.dyndns.org>
	<20030205035424.48D253CA92@coffee.object-craft.com.au>
	<m3hebjghpy.fsf@ferret.object-craft.com.au>
	<15937.9216.15907.519181@montanaro.dyndns.org>
Message-ID: <m3k7geyz5r.fsf@ferret.object-craft.com.au>

>>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:

Dave> We spoke for a short while and decided that it might make more
Dave> sense to remove the PyArg_PTAK stuff altogether and just use the
Dave> __setattr__ stuff in _csv.

Skip> Why not replace the 'i' and 'S' flags with 'O' flags in
Skip> PyArg_PTAK, then validate on the resulting group of Python
Skip> objects?

That idea is even more gooder.

- Dave

-- 
http://www.object-craft.com.au


From skip at pobox.com  Thu Feb  6 05:54:11 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 5 Feb 2003 22:54:11 -0600
Subject: [Csv] Please check this out...
Message-ID: <15937.60019.749774.278638@montanaro.dyndns.org>


Gang,

I just checked in an update to csv.py and test/test_csv.py which allows
csv.reader objects to return dicts.  In much the same way that the writer
can write a dict if told what the field name order is, the reader, if given
a list of fieldnames to use as keys, can map the incoming list to a
dictionary.
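A hedged sketch of the mapping just described (the `fieldnames`/`restfield`
names mirror this message; the real reader wraps the low-level parser rather
than a plain iterable of rows):

```python
def dict_rows(rows, fieldnames, restfield=None):
    for row in rows:
        # zip truncates at the shorter sequence, so only the named
        # leading fields become keyed entries.
        d = dict(zip(fieldnames, row))
        if restfield is not None:
            # Extra fields beyond the named ones land in a list.
            d[restfield] = row[len(fieldnames):]
        yield d
```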

There's just one little hitch.  I see a negative ref count abort in a CVS
debug build *if* there is a typo in the call to csv.reader().  This
short_test.py script demonstrates it on my Mac:

    import sys
    import unittest
    from StringIO import StringIO
    import csv

    class TestDictFields(unittest.TestCase):
        def test_read_short_with_rest(self):
            reader = csv.reader(StringIO("1,2,abc,4,5,6\r\n"), dialect="excel",
                                fieldnames=["f1", "f2"], restfields="_rest")
            self.assertEqual(reader.next(), {"f1": '1', "f2": '2',
                                             "_rest": ["abc", "4", "5", "6"]})

    def _testclasses():
        mod = sys.modules[__name__]
        return [getattr(mod, name) for name in dir(mod) if name.startswith('Test')]

    def suite():
        suite = unittest.TestSuite()
        for testclass in _testclasses():
            suite.addTest(unittest.makeSuite(testclass))
        return suite

    if __name__ == '__main__':
        unittest.main(defaultTest='suite')

Compare the csv.reader() call with the declaration of the __init__ method.
You'll see I've misspelled "restfield", giving it a needless 's'.  This
pushes it into the **options dict, and since that's not an understood
keyword arg, _csv.parser() complains, like so:

    Traceback (most recent call last):
      File "short_test.py", line 9, in test_read_short_with_rest
        fieldnames=["f1", "f2"], restfields="_rest")
      File "/Users/skip/local/lib/python2.2/site-packages/csv.py", line 102, in __init__
        _OCcsv.__init__(self, dialect, **options)
      File "/Users/skip/local/lib/python2.2/site-packages/csv.py", line 93, in __init__
        self.parser = _csv.parser(**parser_options)
    TypeError: 'restfields' is an invalid keyword argument for this function

Under 2.2, all I get is the above traceback (haven't yet tried a 2.2 debug
build).  With the latest CVS and a debug build I get:


    % /usr/local/bin/python short_test.py
    E
    ======================================================================
    ERROR: test_read_short_with_rest (__main__.TestDictFields)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "short_test.py", line 9, in test_read_short_with_rest
        fieldnames=["f1", "f2"], restfields="_rest")
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 102, in __init__
        _OCcsv.__init__(self, dialect, **options)
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 93, in __init__
        self.parser = _csv.parser(**parser_options)
    TypeError: 'restfields' is an invalid keyword argument for this function

    ----------------------------------------------------------------------
    Ran 1 test in 0.029s

    FAILED (errors=1)
    Fatal Python error: Objects/dictobject.c:686 object at 0x476e98 has negative ref count -606348326
    Abort trap

"-606348326" expressed as hex is '0xdbdbdbda' which looks suspiciously like
the 0xdb bytes which debug Pythons scribble in freed memory.

It's time for a long winter's nap here.  I'm sure you'll have it figured out
by the time I check my mail in the morning.  Actually, I'm suspicious
there's a refcounting bug in 2.3a1...

Thx,

Skip

From andrewm at object-craft.com.au  Thu Feb  6 05:56:11 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 06 Feb 2003 15:56:11 +1100
Subject: [Csv] dict argument to writer.writerow
Message-ID: <20030206045611.930023CA92@coffee.object-craft.com.au>

I don't think this belongs in writer.writerow - I'd suggest it belongs
in the as yet unwritten csv.util module. The problem is that it's going
to have an appreciable impact on the normal case of writing a tuple.
There's no need to have the code auto-detect a dictionary - the user
of the module will know before-hand whether they have a dict or tuple,
and can use the appropriate layer.

        # if fields is a dict, we need a valid fieldnames list
        # if self.fieldnames is None we'll get a TypeError in the for stmt
        # if fields is not a dict we'll get an AttributeError on .get()
        try:
            flist = []
            for k in self.fieldnames:
                flist.append(fields.get(k, ""))
            fields = flist
        except (TypeError, AttributeError):
            pass


-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Thu Feb  6 06:59:02 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 5 Feb 2003 23:59:02 -0600
Subject: [Csv] Re: dict argument to writer.writerow
In-Reply-To: <20030206045611.930023CA92@coffee.object-craft.com.au>
References: <20030206045611.930023CA92@coffee.object-craft.com.au>
Message-ID: <15937.63910.387970.198505@montanaro.dyndns.org>


    Andrew> I don't think this belongs in writer.writerow - I'd suggest it
    Andrew> belongs in the as yet unwritten csv.util module. The problem is
    Andrew> that it's going to have an appreciable impact on the normal case
    Andrew> of writing a tuple.

Hmmm...  I think of reading/writing dicts as more integration with the DB
API.  I rarely use plain fetchall() when getting rows from a table.
Dictionaries are much saner objects.  Accordingly, I'd like it to be as
painless as possible for people to write them out to CSV files.

Also, one can frequently think of a CSV file as a file of dicts with the
simple optimization that the dictionary keys are only written once, in the
first row.
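That write-the-keys-once idea is what the csv.DictWriter interface eventually
provides (its writeheader() method arrived later, in Python 2.7).  A minimal
sketch:

```python
import csv
import io

rows = [{"name": "widget", "cost": "10"},
        {"name": "gadget", "cost": "20"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "cost"])
writer.writeheader()    # the dictionary keys, written once, in the first row
writer.writerows(rows)  # each dict is mapped onto the fieldname order
# The default "excel" dialect terminates lines with \r\n.
assert buf.getvalue() == "name,cost\r\nwidget,10\r\ngadget,20\r\n"
```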

That's not to say my code couldn't have been done differently.  I was trying
hard to avoid testing the type of the object being written.  In retrospect
the code I have will cause an exception to be raised and caught most of the
time.  Perhaps it would be better as:

    if hasattr(fields, "has_key"):
        # if fields is a dict, we need a valid fieldnames list
        # if self.fieldnames is None we'll get a TypeError in the for stmt
        # if fields is not a dict we'll get an AttributeError on .get()
        try:
            flist = []
            for k in self.fieldnames:
                flist.append(fields.get(k, ""))
            fields = flist
        except (TypeError, AttributeError):
            pass

That should lessen the load in the common case (call to hasattr() vs raised
and caught exception).  Alternatively, perhaps a writedict() method makes
sense.  It would be extremely rare (and nearly insane) for a user to write a
mixture of lists and dicts.  The user could either know what type of row to
write and call the proper method or test the type of the first row outside
the loop and assign a variable to the appropriate method.

In any case, I'd like it to be as easy as possible for people to write dicts
to CSV files and read rows into dicts.

Skip

From skip at pobox.com  Thu Feb  6 08:06:02 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 6 Feb 2003 01:06:02 -0600
Subject: [Csv] Re: PEP 305 - Comments (really long post)
In-Reply-To: <200302052252.18694.cribeiro@mail.inet.com.br>
References: <200302052252.18694.cribeiro@mail.inet.com.br>
Message-ID: <15938.2394.88405.59280@montanaro.dyndns.org>


    Carlos> I was discussing the CSV implementation with Skip 'et alii' over
    Carlos> the Python list, but i decided to wait a little bit to put my
    Carlos> ideas in a better form and then contribute to the PEP. Please
    Carlos> bear with me as this message is rather long and may be confusing
    Carlos> at times, but I sincerely hope it helps.

Don't worry.  Most of us read the list, and I've been forwarding all
messages which were sent to c.l.py but not cc'd to the csv list so we at
least have them archived.  Accordingly, we're already familiar with your
plight. ;-)

    Carlos> For example, look at these lines:

    Carlos> "row 1";10
    Carlos> "row 2";3,1416

    Carlos> I assume that the csv library will parse the first line as ("row
    Carlos> 1",10); the number 10 will probably be returned as an integer
    Carlos> (which is not the correct interpretation for this particular
    Carlos> file - more on this item [2]).

You'd be wrong to assume that.  The csv reader will return a list of two
strings, "row 1" and "10".  How to interpret the contents of the strings is
completely up to you.

    Carlos> The second line will probably be parsed as ("row 2","3,1416");
    Carlos> it may even raise an exception, depending on the implementation
    Carlos> details! What do you intend to do in this case?

No exception will be raised.  Assuming you have the quotechar set to '"' and
the delimiter set to ';', you will, as you surmised, get the pair of strings
you indicated.  You are completely free to pass "3,1416" to locale.atof().
As long as your locale is set correctly, it will work.

    Carlos> Another point that you should bear in mind: even here in Brazil,
    Carlos> some programs will use the standard (US) delimiters and number
    Carlos> formats, while others will use the localized ones. So we end up
    Carlos> needing to read/write both formats - for example, when reading
    Carlos> data from Excel, and then exporting the same data to some
    Carlos> scientific package that is not locale-aware. So any
    Carlos> localization-related parameters have to be flexible and easily
    Carlos> customizable.

I understand this is going to be a problem, however I have no way of solving
it for you in a way that will make everybody happy, so I'm not going to even
try.  The csv module is about abstracting away all the little weirdnesses
which crop up in different dialects of delimited files.

You, as the application programmer, have to be sensitive to the locales in
which your data will be interpreted.  If you expect to dump an Excel
spreadsheet to a CSV file for analysis by a colleague in the US, everyone's
going to be a lot happier if you send the data encoded for either the en_US
or C locales.  If that's not possible, you need to transmit locale
information along with the data.

If you have your locale set appropriately, when writing numeric data, the
csv module should just do the right thing.  It calls str() on all numeric
data to write it out.  I believe str() is locale-sensitive.
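As it turns out, the builtin str() is not locale-aware for floats; the
locale module's own helpers are the dependable route for localized numbers.
A short sketch (the de_DE locale is an assumption and may not be installed):

```python
import locale

# locale.str() formats a float using the current LC_NUMERIC setting;
# the builtin str() always uses a period, regardless of locale.
locale.setlocale(locale.LC_NUMERIC, "C")
assert locale.str(3.1416) == "3.1416"
assert locale.atof("3.1416") == 3.1416

# With a comma-decimal locale available, the same calls honor the
# localized decimal point; locale.atof() parses it back.
try:
    locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")
    print(locale.str(3.1416))     # 3,1416
    print(locale.atof("3,1416"))  # 3.1416
except locale.Error:
    pass  # locale not installed on this system
```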

    Carlos> [2] I assume that the csv library will convert any numbers read
    Carlos> from the csv file to some of numeric types available in Python
    Carlos> upon reading. There are some issues here.

Nope.  You get strings.

    Carlos> "row 1";10;1
    Carlos> "row 2";3,1416;2
    Carlos> "row 3";-1;3

    Carlos> The obvious choice is to parse the column 1 as strings; column 2
    Carlos> as floats; and column 3 as integers. But the problem is, how is
    Carlos> the csv library supposed to know that the second column hold
    Carlos> float values, and not integers? Look ahead is out of question -
    Carlos> after all, the only line containing a decimal point may be the
    Carlos> last on a 10 GB file.

    Carlos> For this problem, I propose the following semantics:

    ...

Just apply the necessary semantics yourself.  Here's a suggestion.  Suppose
you know you want the first column to be strings, the second floats and the
third ints.  Code your read loop something like so:

    types = (str, float, int)
    reader = csv.reader(myfile)
    for row in reader:
        row = [t(v) for (t,v) in zip(types, row)]
        process(row)

That way you have complete control over the interpretation of the data.
Nobody guesses.  No decisions have to be made at the csv level when a piece
of data doesn't fit the mold.

    Carlos> b) assuming that the default column types were not supplied, the
    Carlos>    csv library will try to detect the correct values from the
    Carlos>    ones read from the first line of the file, but respecting the
    Carlos>    parameters mentioned above. If the first line contains column
    Carlos>    headers, then it will use the second line for this purpose.

This is bound to fail.  You showed an example where floats and ints were
mixed up.  What if I had a column containing hex digits?  Heck, make it more
likely to guess wrong and make them base 9 or base 11 digits.  Most of the
time, base 11 numbers will consist only of the digits 0 through
9.  No 'a' will appear.  With base 9 numbers it's even worse.  They can
always be interpreted as decimal numbers, but that interpretation will
always be incorrect.

    Carlos> c) from the second line onwards, the csv library will keep the
    Carlos>    same conversion used for the first line. In case of error
    Carlos>    (for example, a float is found in a integer-only column), the
    Carlos>    library may take one of these actions:

    Carlos> - raise an exception
    Carlos> - coerce the value to the standard type for that particular column
    Carlos> - return the value as read, even if using a different type

You're asking us to do way too much.  It is just not going to work in the
general case, and you can do a much better job much more simply at the
application level, because you know the properties of your data.  If we
attempted to do something very elaborate, we'd probably get it wrong.  Even
if we managed to get it right, it would probably be slow.

    Carlos> [4] That said, I have one concern: setting the line terminator
    Carlos> in the CSV library (using the dialect class) does not seem
    Carlos> right.

One thing (among many) that's still missing from the PEP is the admonition
that you have to pass in files opened in binary mode.  That lets the csv
module have complete control over line endings using the lineterminator
attribute. 

    Carlos> My point here is that the line terminator in the CSV library
    Carlos> will end up being useless, as it depends ultimately on the
    Carlos> ability of the csvwriter.write() method to convince the file
    Carlos> object to use the 'correct' line terminator. 

That's why we expect you to open files in binary mode.  I plan to make
another pass through the PEP tomorrow.  I will make sure I add this.

    Carlos> ------------
    Carlos> [5] It is not clear to me what is returned as a row in the
    Carlos> example given: 

    Carlos> csvreader = csv.reader(file("some.csv"))
    Carlos> for row in csvreader:
    Carlos>     process(row)

    Carlos> It is obvious to assume that 'row' is a sequence, probably a
    Carlos> tuple. Anyway, it should be clearly stated in the PEP.

Thanks, will do.  I'm trying to twist my colleagues' arms into letting the
reader return dicts and the writer accept dicts under the proper
circumstances, but the default case is that the reader will return lists and
the writer will accept sequences (lists, tuples, strings, unicode objects
and arrays from the standard library, though any other sequence should do as
well).


    Carlos> [6] Empty strings can be mistaken for NULL fields (or None
    Carlos> values in Python).  How do you think to manage this case, both
    Carlos> when reading and writing? Please note that, depending on the
    Carlos> selection of the quote behavior and due to some side effects, it
    Carlos> may be impossible for the reader to discern the two cases; so
    Carlos> the library will need to be informed about the default choice.

I don't like writing None out at all, but my colleagues assure me the SQL
people want SQL's NULL to map to None and that the most reasonable text
representation of None is the empty string.  Quoting doesn't count.  We have
no intention to imply semantics using quotes.  I believe we still have some
thinking to do about whether to allow the user to specify the actual string
representation of None.
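Pending that decision, the mapping is easy to apply at the application layer.
A sketch, where treating the empty string as NULL is purely an application
convention, not something the reader or writer enforces:

```python
def nulls_to_none(row):
    """Treat empty fields as SQL NULL (application convention)."""
    return [None if field == "" else field for field in row]

def none_to_nulls(row):
    """Render None as an empty field before handing the row to a writer."""
    return ["" if field is None else field for field in row]

assert nulls_to_none(["a", "", "3"]) == ["a", None, "3"]
assert none_to_nulls(["a", None, 3]) == ["a", "", 3]
```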

    ...

    Carlos> BTW, the same reasoning may be applied to the decision between
    Carlos> returning 'None' or 'zero' when reading an empty numeric field.

Again, don't forget that the csv module does not infer types.  When you read
a row you get a list of strings.  It's up to the application to decide how
to interpret it.

    Carlos> [7] A minor suggestion, why not number the items in the "Issues"
    Carlos> section? It would make easier to reference comments... For
    Carlos> example, 'issue #1', etc...

Thanks, that's a good idea.

    Carlos> [8] My comments on the last issue of the list - rows of
    Carlos> different lengths:

Rows of different lengths can be returned.  How to deal with short or long
rows is the job of the application.

    Carlos> [9] A very similar architecture can be used to handle
    Carlos> fixed-width text files.  It can be done in a separate library,
    Carlos> but using a similar interface; or it could be part of the csv
    Carlos> library, either as another class, or by means of a proper
    Carlos> selection of the parameters passed to the constructor. It would
    Carlos> be useful as some applications may like best the fixed-width
    Carlos> files instead of the delimited ones (old COBOL programs are
    Carlos> likely to behave this way; this format is still common when
    Carlos> passing data to/from mainframes).

We thought about this briefly, but fixed-width data is not what CSV files
are all about.  The csv module is about parsing tabular data which uses
various delimiters, quoting and escaping techniques.  In addition,
fixed-width data is pretty trivial to read anyway, and probably doesn't
deserve a module of its own.  There are no issues of quoting or delimiters.
You just need to read the file in chunks of the row size and split each row
along chunks of the element size.
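A sketch of that chunk-and-slice approach (the field widths are assumed to
be known in advance; the helper name is illustrative):

```python
def fixed_width_reader(lines, widths):
    """Yield one list of stripped fields per line, sliced by column width."""
    # Precompute the [start, end) slice for each column.
    slices, start = [], 0
    for width in widths:
        slices.append(slice(start, start + width))
        start += width
    for line in lines:
        yield [line[s].strip() for s in slices]

record = "00123Widget    000500"
rows = list(fixed_width_reader([record], [5, 10, 6]))
assert rows == [["00123", "Widget", "000500"]]
```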

Thanks for your comments,

Skip

From andrewm at object-craft.com.au  Thu Feb  6 14:01:09 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Feb 2003 00:01:09 +1100
Subject: [Csv] Re: dict argument to writer.writerow 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15937.63910.387970.198505@montanaro.dyndns.org> 
References: <20030206045611.930023CA92@coffee.object-craft.com.au>
	<15937.63910.387970.198505@montanaro.dyndns.org> 
Message-ID: <20030206130109.16FD23CA92@coffee.object-craft.com.au>

>    Andrew> I don't think this belongs in writer.writerow - I'd suggest it
>    Andrew> belongs in the as yet unwritten csv.util module. The problem is
>    Andrew> that it's going to have an appreciable impact on the normal case
>    Andrew> of writing a tuple.

There's an even better reason now - I've almost completely re-written
the C module so that the reader and writer classes are implemented in C.
I haven't checked the changes in yet, because I need to do some cleaning
up, and I'm too tired - don't make any conflicting changes to _csv.c or
they will be lost.

This should help performance slightly, but the real reason was to sweep
a whole bunch of giblets out of the dialect parsing - my feeling is that
it's a lot cleaner now, but only time will tell.

>Hmmm...  I think of reading/writing dicts as more integration with the DB
>API.  I rarely use plain fetchall() when getting rows from a table.
>Dictionaries are much saner objects.  Accordingly, I'd like it to be as
>painless as possible for people to write them out to CSV files.

Sure. I don't think it's too big an ask that they use an alternate
interface, however.

>Also, one can frequently think of CSV files as a file of dicts with the
>simple optimization that the dictionary keys are only written once, in the
>first row.

Yeah, but then it's something more than a CSV file, isn't it.. 8-)

>That's not to say my code couldn't have been done differently.  I was trying
>hard to avoid testing the type of the object being written.  In retrospect
>the code I have will cause an exception to be raised and caught most of the
>time.  Perhaps it would be better as:
>
>    if hasattr(fields, "has_key"):

The hasattr is about 4 times faster, but by having two interfaces,
we don't even have to pay that cost.

>In any case, I'd like it to be as easy as possible for people to write dicts
>to CSV files and read rows into dicts as possible.

Sure.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From cribeiro at mail.inet.com.br  Thu Feb  6 14:36:21 2003
From: cribeiro at mail.inet.com.br (Carlos Ribeiro)
Date: Thu, 6 Feb 2003 13:36:21 +0000
Subject: [Csv] Re: PEP 305 - Comments (really long post)
In-Reply-To: <15938.2394.88405.59280@montanaro.dyndns.org>
References: <200302052252.18694.cribeiro@mail.inet.com.br>
	<15938.2394.88405.59280@montanaro.dyndns.org>
Message-ID: <200302061336.21724.cribeiro@mail.inet.com.br>

Skip,

Well, nobody can say that I didn't try :-) I'm almost giving up on my crusade 
to convince you that numbers should be converted by the csv library. It seems 
that we started from different assumptions, but now I think I've understood 
what your objectives are.

I still have a few points to make, though:

1) There is one reason left to convert numbers before returning them, and this 
has a lot to do with information that is discarded in the process. Let us 
follow this example:

"row 1";10   -->  ("row 1", "10")

The second item of the returned tuple is a string, as you stated in your 
answer. The problem is that my application has no way to know if the value 
was originally written in the csv file with or without quotes; this 
information is lost because all values are 'normalized' by the csv library.

If I know the structure of the csv file, then it's fine, but it's not so nice 
when you're trying to detect the structure of an arbitrary csv file. Take a 
look at another example, where the first column is called 'code', the second 
column is 'description', and the third one is 'cost'. Note that this example 
is similar to the structure used for files exported from project management 
software:

"1", "Project phase", 2000
"1.1", "Requirement analysis", 1000
"1.1", "Architectural design", 1000

In this case, MS Excel will detect the first column as a string, but will 
convert values in the third one to numeric format. It can do that because it 
knows that the first column's values were quoted, and the third one's weren't. 
Now, when you return a tuple of strings, the user has no way to know whether 
the quotes were present in the original file.

There are a few solutions for this problem, none of them fully satisfactory:

a) return the strings as proposed by you, which leaves the library unusable 
for situations as described above;

b) return strings in such a way that the original quotes are preserved. Then 
it will be up to the user to remove the extra quotes from the "real" strings;

c) convert unquoted numeric values to native numbers (ints or floats) when 
returning the row (as proposed by myself in my previous messages);

d) provide an alternative method to retrieve more information - for example, a 
second tuple with a more detailed description of how the line was analysed. 
While more complex, this approach has some advantages: (1) it does not make 
the usual code any more complex, and (2) the extra information will help to 
implement 'smarter' csvreaders.

Other alternatives may exist, but I think that the list above sums up very 
well the practical options.


2) In your answer, you cite the case where some numeric values can be hex, or 
in some other base. Well, I don't agree with your argument. One of Python's 
mottos is "to make simple things simple". The simplest case is base 10 
integers; if the library can deal with them in a sane way, you're solving the 
problems of the vast majority of the users. Special cases are just that, 
special, and will be treated in a special fashion anyway.


3) I'm not sure if str() is localized for floats. Using the standard 
installation of PythonWin with a fully localized copy of Windows, it still 
uses a period as the decimal point - not a comma. I didn't try to change the 
locale 
manually (I never did that before for Python); I'll try and tell you what 
happens.

BTW, I'm sure that repr() isn't localized, because the syntax for floats is 
not locale-dependent, but you are probably aware of this fact. But I'm  
afraid that str() and repr() calls may end up calling the same function in 
the case of floats.


4) I'm not convinced that passing a binary file is a good idea. Reading the 
PEP I assumed that the csvreader constructor just takes any object that can 
return lines. Well, binary file objects do not meet this definition. It would 
make the system much less flexible, making it more difficult to pass 
arbitrary iterables to the csv library.

For the sake of simplicity and clarity, why not leave the line termination 
option out of the csv library, in such a way that it can be implemented in 
the file object passed to the reader? The csv file would be less dependent on 
implementation details of the file, focusing more on how to interpret the 
content of the lines.


5) I agree that fixed width text files are different beasts. Anyway, it should 
be possible to implement it using the same interface (or API, whatever you 
like to call it). Things like that make the learning curve smoother. But we 
can leave this discussion for a later time.


Thanks for your comments, and please forgive my insistence :-)


Carlos Ribeiro
cribeiro at mail.inet.com.br

From skip at pobox.com  Thu Feb  6 17:07:01 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 6 Feb 2003 10:07:01 -0600
Subject: [Csv] Re: PEP 305 - Comments (really long post)
In-Reply-To: <200302061336.21724.cribeiro@mail.inet.com.br>
References: <200302052252.18694.cribeiro@mail.inet.com.br>
        <15938.2394.88405.59280@montanaro.dyndns.org>
        <200302061336.21724.cribeiro@mail.inet.com.br>
Message-ID: <15938.34853.575324.942183@montanaro.dyndns.org>


    Carlos> 1) There is one reason left to convert numbers before returning
    Carlos> them, and this has a lot to do with information that is
    Carlos> discarded in the process. Let us follow this example:

    Carlos> "row 1";10   -->  ("row 1", "10")

    Carlos> The second item of the returned tuple is a string, as you stated
    Carlos> in your answer. The problem is that my application has no way to
    Carlos> know if the value was originally written in the csv file with or
    Carlos> without quotes; this information is lost because all values are
    Carlos> 'normalized' by the csv library.

Carlos,

You're interpreting the quote character incorrectly.  Quotes are necessary
only to disambiguate fields which contain the delimiter character.  There is
no restriction that they be used minimally, however.  Your example can just
as easily (and just as correctly) have been written as any of the following:

    "row 1";10
    "row 1";"10"
    row 1;10
    row 1;"10"

All have precisely the same meaning.

We do have plans to implement a csvutils module.  One of the things it will
contain is a "sniffer" (actually, it may contain multiple sniffers to sniff
out different properties of the file).  One thing a sniffer might do is try
to determine column types by looking at a relatively short prefix of a CSV
file (20 rows or so).  This may be helpful to you in situations where your
application doesn't know the type information, but in general, your
application should know column types better than the csv module.
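A rough sketch of what such a column-type sniffer might look like (the
function names and the 20-row cutoff are illustrative, not a committed API):

```python
def classify(value):
    """Return the narrowest of int/float/str that can represent value."""
    for kind in (int, float):
        try:
            kind(value)
            return kind
        except ValueError:
            pass
    return str

def sniff_column_types(rows, sample=20):
    """Guess a type per column from a short prefix of already-parsed rows."""
    order = {int: 0, float: 1, str: 2}  # widening order
    guesses = None
    for row in rows[:sample]:
        kinds = [classify(value) for value in row]
        if guesses is None:
            guesses = kinds
        else:
            # Keep the wider of the two guesses for each column.
            guesses = [a if order[a] >= order[b] else b
                       for a, b in zip(guesses, kinds)]
    return guesses

assert sniff_column_types([["1", "3.14", "x"],
                           ["2", "7", "y"]]) == [int, float, str]
```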

    Carlos> "1", "Project phase", 2000
    Carlos> "1.1", "Requirement analysis", 1000
    Carlos> "1.1", "Architectural design", 1000

    Carlos> In this case, MS Excel will detect the first column as a
    Carlos> string, but will convert values in the third one to numeric
    Carlos> format. 

Perhaps, but Microsoft has the advantage of arrogance. ;-) MS is the
800-pound gorilla, and can thus assume that any CSV data which is fed to
Excel must be in a format Excel understands.  We don't have that luxury.  We
want to make sure people can read CSV data generated by many different
applications, many of which are incompatible with Excel's assumptions.

    Carlos> There are few solutions for this problem, none of them fully
    Carlos> satisfactory:
    ...

There's the key: "none of them fully satisfactory".  If there were a
satisfactory solution, we'd be more open to extracting type information from
the raw data.  Since there isn't, we will limit this csv module to just
parsing the data.

    Carlos> 2) In your answer, you cite the case where some numeric values
    Carlos> can be hex, or whatever base it is. Well, I don't agree with
    Carlos> your argument. One of Python's mottos is "to make simple
    Carlos> things simple". The simplest case is base 10 integers; if the
    Carlos> library can deal with them in a sane way, you're solving the
    Carlos> problems of the vast majority of the users. Special cases are
    Carlos> just that, special, and will be treated in a special fashion
    Carlos> anyway.

True, the simplest case is base 10.  However, like I said above, many
different applications may be the source of this data (or may want to read
the CSV data we write).  It's just not possible to be all things to all
people.  We're doing what we feel we can do better than anyone else.

    Carlos> 3) I'm not sure if str() is localized for floats. Using the
    Carlos> standard installation of PythonWin with a fully localized copy
    Carlos> of Windows, it still uses periods as decimal point - not
    Carlos> commas. I didn't try to change the locale manually (I never did
    Carlos> that before for Python); I'll try and tell you what happens.

That would be much appreciated.  Another area we need to deal with but which
we have avoided so far is Unicode.

    Carlos> 4) I'm not convinced that passing a binary file is a good
    Carlos> idea. Reading the PEP I assumed that the csvreader constructor
    Carlos> just takes any object that can return lines. Well, binary file
    Carlos> objects do not meet this definition. It would make the system
    Carlos> much less flexible, making it more difficult to pass arbitrary
    Carlos> iterables to the csv library.

The reader takes an iterable object.  If that object is a file, we expect it
to have been opened in binary mode.  This stuff all works fine now.  I don't
anticipate changes.

    Carlos> For the sake of simplicity and clarity, why not leave the line
    Carlos> termination option out of the csv library, in such a way that it
    Carlos> can be implemented in the file object passed to the reader? 

Because we might be generating CSV files on a Linux system (LF line
terminator) which is supposed to be consumed by a user on a Mac OS 8 system
running ClarisWorks 4 which (being the feeble tool it was) doesn't know
diddley squat about LF line terminators.  Accordingly, we have to set the
lineterminator to CR.  We can't do that with text mode files.  Nor can we
assume that a person still running CW4 and Mac OS 8 will have any sort of
file conversion tools available.

    Carlos> 5) I agree that fixed width text files are different beasts.
    Carlos> Anyway, it should be possible to implement it using the same
    Carlos> interface (or API, whatever you like calling it). Things like
    Carlos> that make the learning curve smoother. But we can leave this
    Carlos> discussion for a later time.

Sure, but "same API" != "same module". ;-)

    Carlos> Thanks for your comments, and please forgive my insistence :-)

No problem.  Just don't move to New Zealand and change your name to
Graham. ;-) [see the recent python-dev flamefest about a native code
compiler for Python]

Skip

From skip at pobox.com  Thu Feb  6 23:39:19 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 6 Feb 2003 16:39:19 -0600
Subject: [Csv] multi-character delimiters, take two
Message-ID: <15938.58391.603139.209913@montanaro.dyndns.org>


At work I've been installing Firewall-1.  I finally got it installed and
enabled today, protecting a single test machine.

Of course, the bad guys are knocking on the door, so I have a growing
logfile.  "Hmmm, 'twould be nice to pop this data into Python and see what
it looks like," I thought.  I dumped the logfile on the firewall itself.  No
dice, it's just plain, undifferentiated text.  Damn.  So I tried exporting
it through the export interface on the management client which runs on
Windows.  Lo and behold, there was a fairly nice looking CSV file, all
fields quoted with '"', except...  the delimiter is two spaces.  I popped it
up in XEmacs to be sure it wasn't a TAB.

What are these people thinking?

So now I've encountered two examples (including my old client in Austria) of
honest-to-goodness tabular data (that is, not fabricated by mad perl hackers
out to trip us up with "well, what if?" games) where the delimiter between
fields is more than a single character.  There are probably others out
there, just waiting to be discovered.  Any chance the len(delimiter) == 1
restriction could be relaxed?

Skip

From andrewm at object-craft.com.au  Thu Feb  6 23:52:59 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Feb 2003 09:52:59 +1100
Subject: [Csv] multi-character delimiters, take two 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15938.58391.603139.209913@montanaro.dyndns.org> 
References: <15938.58391.603139.209913@montanaro.dyndns.org> 
Message-ID: <20030206225259.0E0EB3CA92@coffee.object-craft.com.au>

>So I tried exporting
>it through the export interface on the management client which runs on
>Windows.  Lo and behold, there was a fairly nice looking CSV file, all
>fields quoted with '"', except...  the delimiter is two spaces.  I popped it
>up in XEmacs to be sure it wasn't a TAB.

You might find that two spaces never appear in the data fields, in which
case this might work:

        fields = line.split('  ')

BTW, have you tried using the csv parser with delimiter set to space,
and skipinitialspace set to true?
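
A sketch of that combination against today's stdlib csv module, where skipinitialspace is checked before the delimiter in START_FIELD, so the second space is swallowed rather than read as an empty field (sample line fabricated to match the export above):

```python
import csv
import io

line = '"842"  "6Feb2003"  "16:22:42"\r\n'
# delimiter=' ' plus skipinitialspace=True: spaces immediately following a
# delimiter are ignored, so the two-space gap reads as one field boundary
row = next(csv.reader(io.StringIO(line), delimiter=' ', skipinitialspace=True))
```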

>So now I've encountered two examples (including my old client in Austria) of
>honest-to-goodness tabular data (that is, not fabricated by mad perl hackers
>out to trip us up with "well, what if?" games) where the delimiter between
>fields is more than a single character.  There are probably others out
>there, just waiting to be discovered.  Any chance the len(delimiter) == 1
>restriction could be relaxed?

Not without some hairy work on the state machine. The more complicated
we make the state machine, the more likely we are to let a nasty bug
slip through, so I'm rather reluctant.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From LogiplexSoftware at earthlink.net  Thu Feb  6 23:55:18 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 06 Feb 2003 14:55:18 -0800
Subject: [Csv] multi-character delimiters, take two
In-Reply-To: <15938.58391.603139.209913@montanaro.dyndns.org>
References: <15938.58391.603139.209913@montanaro.dyndns.org>
Message-ID: <1044572117.23236.1359.camel@software1.logiplex.internal>

On Thu, 2003-02-06 at 14:39, Skip Montanaro wrote:

> What are these people thinking?

<homer>
"Mmm, spaces..."
</homer>

> So now I've encountered two examples (including my old client in Austria) of
> honest-to-goodness tabular data (that is, not fabricated by mad perl hackers
> out to trip us up with "well, what if?" games) where the delimiter between
> fields is more than a single character.  There are probably others out
> there, just waiting to be discovered.  Any chance the len(delimiter) == 1
> restriction could be relaxed?

Or, in this case, the "treat consecutive delimiters as one" might have
been useful.  This *is* an option (on import) in Excel.
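
Absent that option in the module, the collapse can be approximated outside it. A rough sketch for the fully quoted firewall log (assumes no field ever contains two consecutive spaces; sample line fabricated):

```python
import re

line = '"842"  "6Feb2003"  "16:22:42"'
# split on runs of two or more spaces, then peel off the surrounding quotes
fields = [f.strip('"') for f in re.split(r' {2,}', line.strip())]
```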

BTW, sorry I've gone missing for a while.  I've been putting out fires
on our customers' systems.  Some of them are still smoking, but I had
time to chime in on this =)

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308


From skip at pobox.com  Fri Feb  7 00:07:02 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 6 Feb 2003 17:07:02 -0600
Subject: [Csv] multi-character delimiters, take two 
In-Reply-To: <20030206225259.0E0EB3CA92@coffee.object-craft.com.au>
References: <15938.58391.603139.209913@montanaro.dyndns.org>
        <20030206225259.0E0EB3CA92@coffee.object-craft.com.au>
Message-ID: <15938.60054.445005.692672@montanaro.dyndns.org>


    >> Lo and behold, there was a fairly nice looking CSV file, all fields
    >> quoted with '"', except...  the delimiter is two spaces.

    Andrew> You might find that two spaces never appear in the data fields,
    Andrew> in which case this might work:

    Andrew>         fields = line.split('  ')

Sure, it might, but then I'm back to hackish wing-and-a-prayer parsing.

    Andrew> BTW, have you tried using the csv parser with delimiter set to
    Andrew> space, and skipinitialspace set to true?

Not yet.  Good suggestion though.  I will give it a try later.

    >> So now I've encountered two examples .... Any chance the
    >> len(delimiter) == 1 restriction could be relaxed?

    Andrew> Not without some hairy work on the state machine. The more
    Andrew> complicated we make the state machine, the more likely we are to
    Andrew> let a nasty bug slip through, so I'm rather reluctant.

Point taken, and since you guys on summer vacation are the BDFLs of that
code, your word is law.  Still, don't be surprised to hear someone ask for
it just after 2.3 is out. ;-)

Skip

From skip at pobox.com  Fri Feb  7 00:07:44 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 6 Feb 2003 17:07:44 -0600
Subject: [Csv] multi-character delimiters, take two
In-Reply-To: <1044572117.23236.1359.camel@software1.logiplex.internal>
References: <15938.58391.603139.209913@montanaro.dyndns.org>
        <1044572117.23236.1359.camel@software1.logiplex.internal>
Message-ID: <15938.60096.822480.773111@montanaro.dyndns.org>


    Cliff> Or, in this case, the "treat consecutive delimiters as one" might
    Cliff> have been useful.  This *is* an option (on import) in Excel.

Hmmm...  another useful suggestion.

Thx,

Skip

From skip at pobox.com  Fri Feb  7 00:59:39 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 6 Feb 2003 17:59:39 -0600
Subject: [Csv] multi-character delimiters, take two 
In-Reply-To: <15938.60054.445005.692672@montanaro.dyndns.org>
References: <15938.58391.603139.209913@montanaro.dyndns.org>
        <20030206225259.0E0EB3CA92@coffee.object-craft.com.au>
        <15938.60054.445005.692672@montanaro.dyndns.org>
Message-ID: <15938.63211.957812.39094@montanaro.dyndns.org>


    Andrew> BTW, have you tried using the csv parser with delimiter set to
    Andrew> space, and skipinitialspace set to true?

    Skip> Not yet.  Good suggestion though.  I will give it a try later.

Here's the result.  Inputs look like this:

    "842"  "6Feb2003"  "16:22:42"  "ce0"  "log"  "drop"  "1433"  "pD955C67D.dip.t-dialin.net"  "stonewall"  "2"  ""  
    "843"  "6Feb2003"  "16:25:21"  "ce0"  "log"  "drop"  "325"  "powered.by.bgames.be"  "129.105.117.83"  ""  " th_flags 14 message_info TCP packet out of state"  
    "844"  "6Feb2003"  "16:28:13"  "ce0"  "log"  "drop"  "nbname"  "200.212.86.130"  "stonewall"  "2"  ""  

The dialect class was defined as:

    class spc(csv.excel):
        delimiter=' '
        skipinitialspace=1

The resulting output looks like:

    ['842', '', '6Feb2003', '', '16:22:42', '', 'ce0', '', 'log', '', 'drop', '', '1433', '', 'pD955C67D.dip.t-dialin.net', '', 'stonewall', '', '2', '', '', '', '']
    ['843', '', '6Feb2003', '', '16:25:21', '', 'ce0', '', 'log', '', 'drop', '', '325', '', 'powered.by.bgames.be', '', '129.105.117.83', '', '', '', ' th_flags 14 message_info TCP packet out of state', '', '']
    ['844', '', '6Feb2003', '', '16:28:13', '', 'ce0', '', 'log', '', 'drop', '', 'nbname', '', '200.212.86.130', '', 'stonewall', '', '2', '', '', '', '']

It didn't actually skip the space, but the data is fairly regular, so I can
live with it.
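
Since the two-space delimiter parses as two one-space delimiters, a spurious empty string lands between every pair of real fields, so the real fields sit at the even indices. A post-processing sketch (hypothetical row trimmed from the output above; note the data's genuine quoted "" fields mean the tail still needs care):

```python
# the real fields are the even-indexed elements of the parsed row
row = ['842', '', '6Feb2003', '', '16:22:42', '', 'ce0']
real = row[::2]
```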

Thanks again for the suggestion.

Skip

From skip at pobox.com  Fri Feb  7 01:17:29 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 6 Feb 2003 18:17:29 -0600
Subject: [Csv] Check this out...
Message-ID: <15938.64281.99865.883746@montanaro.dyndns.org>

I filed a bug report about a negative refcount problem I've been seeing
recently.  Neal Norwitz replied that he thinks it's a bug in the csv
module.  This simple example demonstrates the problem:

    % /usr/local/bin/python
    Python 2.3a1 (#2, Feb  5 2003, 20:57:52) 
    [GCC 3.1 20020420 (prerelease)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import csv
    [28609 refs]
    >>> csv.reader("1", fieldnames=["f1"], restfields=["_rest"], dialect="excel")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 102, in __init__
        _OCcsv.__init__(self, dialect, **options)
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 93, in __init__
        self.parser = _csv.parser(**parser_options)
    TypeError: 'restfields' is an invalid keyword argument for this function
    [28713 refs]
    >>> csv.reader("1", fieldnames=["f1"], restfields=["_rest"], dialect="excel")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 102, in __init__
        _OCcsv.__init__(self, dialect, **options)
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 93, in __init__
        self.parser = _csv.parser(**parser_options)
    TypeError: 'restfields' is an invalid keyword argument for this function
    [28712 refs]
    >>> csv.reader("1", fieldnames=["f1"], restfields=["_rest"], dialect="excel")
    Fatal Python error: Objects/dictobject.c:373 object at 0x532648 has negative ref count -606348326
    Abort trap

Note that the total number of references decreases each time the reader is
instantiated.  If I fix the typo in the code ("restfields" -> "restfield"),
I don't see the problem:

    >>> import csv
    [28609 refs]
    >>> rdr = csv.reader("1", fieldnames=["f1"], restfield=["_rest"], dialect="excel")
    [28633 refs]
    >>> rdr = csv.reader("1", fieldnames=["f1"], restfield=["_rest"], dialect="excel")
    [28633 refs]
    >>> rdr = csv.reader("1", fieldnames=["f1"], restfield=["_rest"], dialect="excel")
    [28633 refs]
    >>> rdr = csv.reader("1", fieldnames=["f1"], restfield=["_rest"], dialect="excel")
    [28633 refs]

It would appear there is a bug somewhere in the parameter parsing when an
invalid keyword parameter is passed.  Since I didn't modify the C code to
add the dict support, I don't think that's where the problem lies.  In fact,
it would appear that any bogus arg causes the abort:

    >>> rdr = csv.reader("1", bogus="hi bob")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 102, in __init__
        _OCcsv.__init__(self, dialect, **options)
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 93, in __init__
        self.parser = _csv.parser(**parser_options)
    TypeError: 'bogus' is an invalid keyword argument for this function
    [28728 refs]
    >>> rdr = csv.reader("1", bogus="hi bob")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 102, in __init__
        _OCcsv.__init__(self, dialect, **options)
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 93, in __init__
        self.parser = _csv.parser(**parser_options)
    TypeError: 'bogus' is an invalid keyword argument for this function
    [28727 refs]
    >>> rdr = csv.reader("1", bogus="hi bob")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 102, in __init__
        _OCcsv.__init__(self, dialect, **options)
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 93, in __init__
        self.parser = _csv.parser(**parser_options)
    TypeError: 'bogus' is an invalid keyword argument for this function
    [28726 refs]
    >>> rdr = csv.reader("1", bogus="hi bob")
    Fatal Python error: Objects/dictobject.c:373 object at 0x532648 has negative ref count -3
    Abort trap

I can provoke this with both debug and non-debug builds of Python CVS as
well as Python 2.2 (non-debug).  I'll try to take a look at the
PyArg_ParseTupleAndKeywords code.  I suspect it's in that region.

Skip

From skip at pobox.com  Fri Feb  7 01:19:02 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 6 Feb 2003 18:19:02 -0600
Subject: [Csv] bye bye 2.1?
Message-ID: <15938.64374.523505.231835@montanaro.dyndns.org>

Dave & Andrew,

I notice I can't build w/ Python 2.1 (I thought I was able to early on):

    % python2.1 setup.py install
    running install
    running build
    running build_py
    creating build/lib.darwin-6.3-Power Macintosh-2.1
    copying csv.py -> build/lib.darwin-6.3-Power Macintosh-2.1
    running build_ext
    building '_csv' extension
    creating build/temp.darwin-6.3-Power Macintosh-2.1
    gcc -g -O2 -Wall -Wstrict-prototypes -no-cpp-precomp -I/Users/skip/local/include/python2.1 -c _csv.c -o build/temp.darwin-6.3-Power Macintosh-2.1/_csv.o
    _csv.c:654: `METH_NOARGS' undeclared here (not in a function)
    _csv.c:654: initializer element is not constant
    _csv.c:654: (near initialization for `Parser_methods[1].ml_flags')
    _csv.c:655: initializer element is not constant
    _csv.c:655: (near initialization for `Parser_methods[1]')
    _csv.c:656: `METH_O' undeclared here (not in a function)
    _csv.c:656: initializer element is not constant
    _csv.c:656: (near initialization for `Parser_methods[2].ml_flags')
    _csv.c:657: initializer element is not constant
    _csv.c:657: (near initialization for `Parser_methods[2]')
    _csv.c:658: initializer element is not constant
    _csv.c:658: (near initialization for `Parser_methods[3]')
    _csv.c: In function `init_csv':
    _csv.c:950: warning: implicit declaration of function `PyType_Ready'
    error: command 'gcc' failed with exit status 1

I don't know how far back you want this code supported.  It's up to you
guys.

Skip

From andrewm at object-craft.com.au  Fri Feb  7 01:32:36 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Feb 2003 11:32:36 +1100
Subject: [Csv] multi-character delimiters, take two 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15938.63211.957812.39094@montanaro.dyndns.org> 
References: <15938.58391.603139.209913@montanaro.dyndns.org>
	<20030206225259.0E0EB3CA92@coffee.object-craft.com.au>
	<15938.60054.445005.692672@montanaro.dyndns.org>
	<15938.63211.957812.39094@montanaro.dyndns.org> 
Message-ID: <20030207003236.5D2DE3CA92@coffee.object-craft.com.au>

>Here's the result.  Inputs look like this:
>
>    "842"  "6Feb2003"  "16:22:42"  "ce0"  "log"  "drop"  "1433"  "pD955C67D.dip.t-dialin.net"  "stonewall"  "2"  ""  
>    "843"  "6Feb2003"  "16:25:21"  "ce0"  "log"  "drop"  "325"  "powered.by.bgames.be"  "129.105.117.83"  ""  " th_flags 14 message_info TCP packet out of state"  
>    "844"  "6Feb2003"  "16:28:13"  "ce0"  "log"  "drop"  "nbname"  "200.212.86.130"  "stonewall"  "2"  ""  

Everything is quoted? Then this will work like a charm:

    line[1:-1].split('"  "')

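One wrinkle: the exported lines carry trailing spaces and the line terminator, so stripping first keeps the slice-and-split honest. A quick sketch (hypothetical line shaped like the log above):

```python
# hypothetical log line with the trailing spaces the export leaves behind
line = '"842"  "6Feb2003"  "16:22:42"  \n'
# strip the trailing whitespace, slice off the outer quotes, then split
fields = line.strip()[1:-1].split('"  "')
```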
>It didn't actually skip the space, but the data is fairly regular, so I can
>live with it.

Okay - looks like the skipinitialspace stuff needs more testing - I doubt
Dave coded it with delimiter=' ' in mind - it's a pretty pathological
case... 8-)

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Fri Feb  7 01:34:27 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Feb 2003 11:34:27 +1100
Subject: [Csv] Check this out... 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15938.64281.99865.883746@montanaro.dyndns.org> 
References: <15938.64281.99865.883746@montanaro.dyndns.org> 
Message-ID: <20030207003427.E8A373CA92@coffee.object-craft.com.au>

>I can provoke this with both debug and non-debug builds of Python CVS as
>well as Python 2.2 (non-debug).  I'll try to take a look at the PyArg_PTAK
>code.  I suspect it's in that region.

Don't bother - this code has been completely re-written. I've been
watching the refcounts carefully as I wrote the code, and the new code
seems to be doing the right thing.

Sorry to waste your time tracking this one... 8-(

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Fri Feb  7 01:40:17 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Feb 2003 11:40:17 +1100
Subject: [Csv] bye bye 2.1? 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15938.64374.523505.231835@montanaro.dyndns.org> 
References: <15938.64374.523505.231835@montanaro.dyndns.org> 
Message-ID: <20030207004017.23F0C3CA92@coffee.object-craft.com.au>

>    _csv.c:654: `METH_NOARGS' undeclared here (not in a function)
>    _csv.c:656: `METH_O' undeclared here (not in a function)

These two are relatively easy to fix (and the others might simply be
side-effects of these errors). I'll have to have a think whether we
should bother. Python-2.2 is a decent goal - I haven't even tested
against it yet.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Fri Feb  7 05:11:05 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 6 Feb 2003 22:11:05 -0600
Subject: [Csv] Passing along a comment from Tim Peters
Message-ID: <15939.12761.653479.207742@montanaro.dyndns.org>

Andrew, et al,

Here's a comment from Tim Peters regarding the negative ref count problem I
reported. 

    Comment By: Tim Peters (tim_one)
    Date: 2003-02-06 20:23

    Message:
    Logged In: YES 
    user_id=31435

    I think csv_parser is too clever.  If the PyArg_ParseTuple 
    call fails, it may have already stored a borrowed reference 
    into self->lineterminator, and then it's madness to decref 
    that in Parser_dealloc().

    "The usual way" to allocate a new object is not to 
    materialize self until *after* PyArg_ParseTuple succeeds.  
    Then nothing delicate needs to be done to clean up, since 
    nothing was done at all yet <wink>.

    Good evidence:  adding the pure hack

    Py_XINCREF(self->lineterminator);

    before 

    Py_DECREF(self);

    stops the negative refcount errors in Neal's example.

While the current module doesn't seem to exhibit the bug, Tim's advice might
still be useful.  The full bug report is at

    http://python.org/sf/681902

Skip

From andrewm at object-craft.com.au  Fri Feb  7 07:05:22 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Feb 2003 17:05:22 +1100
Subject: [Csv] Update PEP?
Message-ID: <20030207060522.706163CA93@coffee.object-craft.com.au>

I think the PEP needs updating - the API hasn't changed too much, but
there's a few warts in there. The docstring for the C module is probably
the most accurate reference at the moment. Can someone give me a hand
and go over the PEP?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Fri Feb  7 16:16:30 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 7 Feb 2003 09:16:30 -0600
Subject: [Csv] Update PEP?
In-Reply-To: <20030207060522.706163CA93@coffee.object-craft.com.au>
References: <20030207060522.706163CA93@coffee.object-craft.com.au>
Message-ID: <15939.52686.652879.386851@montanaro.dyndns.org>


    Andrew> I think the PEP needs updating - the API hasn't changed too
    Andrew> much, but there's a few warts in there. The docstring for the C
    Andrew> module is probably the most accurate reference at the
    Andrew> moment. Can someone give me a hand and go over the PEP?

Sure, I'll try to update it today.  Also, note that libcsv.tex is supposed
to be a section for the library reference manual.  That probably needs
significant attention at this point as well.

Skip

From skip at pobox.com  Fri Feb  7 17:43:24 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 7 Feb 2003 10:43:24 -0600
Subject: [Csv] why push so much code into C?
Message-ID: <15939.57900.514900.606475@montanaro.dyndns.org>


I see that csv.py has dwindled down to next-to-nothing.  Even the dialect
registry stuff is in C.  (Is the Dialect class in csv.py used anymore?  I
see something which looks like a dialect object in the C code.)  It's not
obvious to me that there's any performance gain to be had by having anything
other than the raw parsing and writing code in the C module.  On the other
hand, by pushing code which isn't performance-critical into C it becomes
harder to maintain and extend, and significantly limits the number of people
who can contribute to the code's growth and maturity.

Skip

From andrewm at object-craft.com.au  Sat Feb  8 04:42:29 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Sat, 08 Feb 2003 14:42:29 +1100
Subject: [Csv] why push so much code into C? 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15939.57900.514900.606475@montanaro.dyndns.org> 
References: <15939.57900.514900.606475@montanaro.dyndns.org> 
Message-ID: <20030208034229.DD8623CA92@coffee.object-craft.com.au>

>I see that csv.py has dwindled down to next-to-nothing.  Even the dialect
>registry stuff is in C.  (Is the Dialect class in csv.py used anymore?  I
>see something which looks like a dialect object in the C code.)  It's not
>obvious to me that there's any performance gain to be had by having anything
>other than the raw parsing and writing code in the C module.  On the other
>hand, by pushing code which isn't performance-critical into C it becomes
>harder to maintain and extend, and significantly limits the number of people
>who can contribute to the code's growth and maturity.

The dialect registry went into C so that the reader and writer had access
to it. The code involved was trivial, so it made sense to move it.

The underlying modules accept any instance or class that has appropriate
attributes as a dialect - they don't compare against Dialect. But the
dialect definitions in Python are still used.
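
So, for example, any class carrying the full set of formatting attributes passes for a dialect. A sketch against today's stdlib csv module, whose behaviour here matches (the "semi" dialect is invented for illustration):

```python
import csv
import io

class semi(csv.Dialect):
    # a complete set of formatting attributes is all a dialect needs
    delimiter = ';'
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = '\r\n'
    quoting = csv.QUOTE_MINIMAL

rows = list(csv.reader(io.StringIO('1;2;3\r\n'), dialect=semi))
```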

The code that has moved into C is relatively straightforward - the
really hairy stuff is the parser and the generator (as it has always
been). Limiting the number of people who can modify the *interface*
is not a bad thing... 8-)

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Sat Feb  8 06:06:17 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 7 Feb 2003 23:06:17 -0600
Subject: [Csv] unicode read test checked in
Message-ID: <15940.36937.471963.983008@montanaro.dyndns.org>

I checked in a separate unicode test (test/unicode_test.csv).  It causes a
bus error on my machine, so I figured it was best to keep it separate for
now.

Skip


From skip at pobox.com  Sat Feb  8 06:35:56 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 7 Feb 2003 23:35:56 -0600
Subject: [Csv] This surprised me
Message-ID: <15940.38716.154285.557948@montanaro.dyndns.org>

This code surprised me:

    >>> class foo: pass
    ... 
    >>> csv.register_dialect("excel", foo)
    >>> csv.get_dialect("excel")
    <__main__.foo instance at 0x5309f8>
    >>> import StringIO
    >>> rdr = csv.reader(StringIO.StringIO("1,2,3\r\n"))
    >>> list(rdr)
    [['1', '2', '3']]
    >>> rdr = csv.reader(StringIO.StringIO("1,2,3\r\n"), dialect="excel")
    >>> list(rdr)
    [['1', '2', '3']]
    >>> rdr = csv.reader(StringIO.StringIO("1,2,3\r\n"), dialect=foo)
    >>> list(rdr)
    [['1', '2', '3']]
    >>> rdr = csv.reader(StringIO.StringIO("1,2,3\r\n"), dialect=foo)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.3/site-packages/csv.py", line 27, in __init__
        raise Error, "Dialect did not validate: %s" % ", ".join(errors)
    _csv.Error: Dialect did not validate: delimiter not set, quotechar not set, lineterminator not set, doublequote setting must be True or False, skipinitialspace setting must be True or False

Why didn't it complain anywhere that 'foo' was worthless as a dialect until
the last statement?

Skip

From andrewm at object-craft.com.au  Sat Feb  8 14:08:18 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Sun, 09 Feb 2003 00:08:18 +1100
Subject: [Csv] This surprised me 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15940.38716.154285.557948@montanaro.dyndns.org> 
References: <15940.38716.154285.557948@montanaro.dyndns.org> 
Message-ID: <20030208130818.B61E53CA92@coffee.object-craft.com.au>

>This code surprised me:
>
>    >>> class foo: pass
[...]
>    >>> rdr = csv.reader(StringIO.StringIO("1,2,3\r\n"), dialect=foo)
>    Traceback (most recent call last):
>      File "<stdin>", line 1, in ?
>      File "/usr/local/lib/python2.3/site-packages/csv.py", line 27, in __init__
>        raise Error, "Dialect did not validate: %s" % ", ".join(errors)
>    _csv.Error: Dialect did not validate: delimiter not set, quotechar not set, lineterminator not set, doublequote setting must be True or False, skipinitialspace setting must be True or False
>
>Why didn't it complain anywhere that 'foo' was worthless as a dialect until
>the last statement?

Surely there's more to your example than you quoted in this e-mail? The
exception you mention came from the Python code (specifically the
Dialect class), not the C module, but I can't see where it's referenced
in the quoted code?

The C code will instantiate (and thus call Dialect's _validate) when
register_dialect is called, or when the class is passed to reader
or writer.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Sat Feb  8 16:14:55 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 8 Feb 2003 09:14:55 -0600
Subject: [Csv] This surprised me 
In-Reply-To: <20030208130818.B61E53CA92@coffee.object-craft.com.au>
References: <15940.38716.154285.557948@montanaro.dyndns.org>
        <20030208130818.B61E53CA92@coffee.object-craft.com.au>
Message-ID: <15941.7919.794691.236800@montanaro.dyndns.org>


    >> This code surprised me:
    ...
    Andrew> Surely there's more to your example than you quoted in this
    Andrew> e-mail? The exception you mention came from the Python code
    Andrew> (specifically the Dialect class), not the C module, but I can't
    Andrew> see where it's referenced in the quoted code?

Nope, nothing more.  I guess the point I was trying to make is that if I
pass a dialect object which is not subclassed from csv.Dialect (as you
suggested I should be able to do), it seems to be silently accepted.

    Andrew> The C code will instantiate (and thus call Dialect's _validate)
    Andrew> when register_dialect is called, or when the class is passed to
    Andrew> reader or writer.

Correct.  But you indicated that was no longer necessary.  I was wondering
where the error checking went to.

Skip


From skip at pobox.com  Sat Feb  8 19:41:00 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 8 Feb 2003 12:41:00 -0600
Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv/test
 unicode_test.py,NONE,1.1 (fwd)
Message-ID: <15941.20284.334750.638247@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: "M.-A. Lemburg" <mal at lemburg.com>
Subject: Re: [Python-checkins] python/nondist/sandbox/csv/test
        unicode_test.py,NONE,1.1
Date: Sat, 08 Feb 2003 18:24:22 +0100
Size: 5008
Url: http://mail.python.org/pipermail/csv/attachments/20030208/2eb0c32f/attachment.mht 

From skip at pobox.com  Sat Feb  8 19:48:17 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 8 Feb 2003 12:48:17 -0600
Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv/test
 unicode_test.py,NONE,1.1
In-Reply-To: <3E453D46.7080307@lemburg.com>
References: <15941.8582.901618.823053@montanaro.dyndns.org>
        <3E453D46.7080307@lemburg.com>
Message-ID: <15941.20721.268117.216891@montanaro.dyndns.org>


(redirecting to the csv mailing list so this stuff gets archived.)

    >> http://mail.python.org/pipermail/python-list/2003-February/145151.html

    mal> Why not convert the input data to UTF-8 and take it from there ?

Good suggestion, thanks.  The only issue is the variable width nature of
utf-8.  I think if we are going to convert to a concrete encoding it would
be easier to convert to something which has constant-width characters,
wouldn't it?  Of course, if I can convince the guys in Australia writing the
actual code to deal with a variable-width encoding, it can't be far from
there to allowing multi-character delimiters. ;-)

    mal> Are you sure that Unicode objects will be slower in processing?

Operating on Python string or unicode objects without converting them to
some sort of C string will almost certainly be slower than the current code
which is a relatively modest finite state machine operating on individual
bytes.

    mal> (Is there a standard for encodings in CSV files ?)

No, there is none, hence the use of codecs.EncodedFile to allow the
programmer to specify the encoding.  Excel can export to two formats it
calls "Unicode CSV" and "Unicode Text".  Exporting a spreadsheet containing
nothing but ASCII as Unicode CSV produced exactly the same comma-separated
file as would have been dumped using the usual CSV export format.  Exporting
the same spreadsheet as Unicode Text produced a tab-separated file which I
guessed to be utf-16.  It started with a little-endian utf-16 BOM and all
the characters were two bytes wide with one byte being an ASCII NUL.
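
The utf-16 guess is cheap to verify: decoding with the BOM-aware codec and splitting on tabs should recover the fields. A modern-Python sketch (the sample bytes are fabricated to match the description above):

```python
import codecs

# fabricated bytes shaped like the described export: little-endian UTF-16
# with a BOM, tab-separated fields
data = codecs.BOM_UTF16_LE + '842\t6Feb2003\t16:22:42\r\n'.encode('utf-16-le')
text = data.decode('utf-16')   # the utf-16 codec honours and strips the BOM
fields = text.strip().split('\t')
```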

Thanks for the feedback,

Skip

From skip at pobox.com  Sat Feb  8 21:13:01 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 8 Feb 2003 14:13:01 -0600
Subject: [Csv] confused about wrapping readers and writers
Message-ID: <15941.25805.234008.663342@montanaro.dyndns.org>


I still want to be able to read from and write to dictionaries. ;-) I would
like to add a pair of classes to csv.py which implement this, but I don't
quite know what's required, never having written any iterators before.  If I
create a reader:

    >>> rdr = csv.reader(["a,b,c\r\n"])

and ask for its attributes, all I get back are the data attributes:

    >>> dir(rdr)
    ['delimiter', 'doublequote', 'escapechar', 'lineterminator',
    'quotechar', 'quoting', 'skipinitialspace', 'strict'] 

Does the underlying reader object need to expose its Reader_iternext
function as a next() method?  Based upon

    http://www.python.org/doc/current/lib/typeiter.html

I sort of suspect it does.  It looks like it also needs an __iter__() method
which just returns self.

I thought a DictReader would look something like

    class DictReader:
        def __init__(self, f, fieldnames, rest=None, dialect="excel", *args):
            self.fieldnames = fieldnames    # list of keys for the dict
            self.rest = rest                # key to catch long rows
            self.reader = reader(f, dialect, *args)

        def next(self):
            row = self.reader.next()
            d = dict(zip(self.fieldnames, row))
            if len(self.fieldnames) < len(row):
                d[self.rest] = row[len(self.fieldnames):]
            return d

Is all that's missing a next() method for reader objects?
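
For what it's worth, here is how the pieces fit together once the reader exposes the iterator protocol, sketched against the modern stdlib csv reader (which already has __iter__ and next); the rest-field handling follows the class above:

```python
import csv
import io

class DictReader:
    """Sketch of the wrapper above, against the modern stdlib csv reader."""
    def __init__(self, f, fieldnames, rest=None, dialect="excel"):
        self.fieldnames = fieldnames    # list of keys for the dict
        self.rest = rest                # key to catch long rows
        self.reader = csv.reader(f, dialect)

    def __iter__(self):
        # an iterator's __iter__ just returns the iterator itself
        return self

    def __next__(self):                 # spelled next() in the 2.x protocol
        row = next(self.reader)
        d = dict(zip(self.fieldnames, row))
        if len(self.fieldnames) < len(row):
            d[self.rest] = row[len(self.fieldnames):]
        return d

rows = list(DictReader(io.StringIO("1,2,3\r\n"), ["a", "b"], rest="_rest"))
```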

Thx,

Skip


From skip at pobox.com  Sat Feb  8 21:38:17 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 8 Feb 2003 14:38:17 -0600
Subject: [Csv] Re: How best to handle Unicode where only 8-bit chars are now?
In-Reply-To: <b235oa$fm2$1@main.gmane.org>
References: <15940.37847.263991.794301@montanaro.dyndns.org>
        <b235oa$fm2$1@main.gmane.org>
Message-ID: <15941.27321.855319.916966@montanaro.dyndns.org>


    >> Option 3 seems the cleanest, but would slow everything down
    >> significantly because character extraction and comparison would
    >> require a function call instead of an array index operation or a
    >> simple comparison.

    Fredrik> what makes you think 8-bit == fast and unicode == slow?

Nothing, just unfamiliarity.  That's why I was asking.

    Fredrik> have you looked at SRE?  it compiles portions of itself twice,
    Fredrik> to get 8-bit and unicode versions of the core engine.  on
    Fredrik> modern machines, the unicode version often runs *faster* than
    Fredrik> the corresponding 8-bit code.

I'll refer the csv authors to this.

Thx,

Skip

From skip at pobox.com  Sat Feb  8 21:38:25 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 8 Feb 2003 14:38:25 -0600
Subject: [Csv] Re: How best to handle Unicode where only 8-bit chars are now?
	(fwd)
Message-ID: <15941.27329.404107.28526@montanaro.dyndns.org>

archive
-------------- next part --------------
An embedded message was scrubbed...
From: "Fredrik Lundh" <fredrik at pythonware.com>
Subject: Re: How best to handle Unicode where only 8-bit chars are now?
Date: Sat, 8 Feb 2003 15:54:41 +0100
Size: 6728
Url: http://mail.python.org/pipermail/csv/attachments/20030208/eb9895ea/attachment.mht 

From djc at object-craft.com.au  Sun Feb  9 01:56:46 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 09 Feb 2003 11:56:46 +1100
Subject: [Csv] multi-character delimiters, take two
In-Reply-To: <20030207003236.5D2DE3CA92@coffee.object-craft.com.au>
References: <15938.58391.603139.209913@montanaro.dyndns.org>
	<20030206225259.0E0EB3CA92@coffee.object-craft.com.au>
	<15938.60054.445005.692672@montanaro.dyndns.org>
	<15938.63211.957812.39094@montanaro.dyndns.org>
	<20030207003236.5D2DE3CA92@coffee.object-craft.com.au>
Message-ID: <m3of5mz2y9.fsf@ferret.object-craft.com.au>

>>>>> "Andrew" == Andrew McNamara <andrewm at object-craft.com.au> writes:

>> Here's the result.  Inputs look like this:
>> 
>> "842" "6Feb2003" "16:22:42" "ce0" "log" "drop" "1433"
>> "pD955C67D.dip.t-dialin.net" "stonewall" "2" "" "843" "6Feb2003"
>> "16:25:21" "ce0" "log" "drop" "325" "powered.by.bgames.be"
>> "129.105.117.83" "" " th_flags 14 message_info TCP packet out of
>> state" "844" "6Feb2003" "16:28:13" "ce0" "log" "drop" "nbname"
>> "200.212.86.130" "stonewall" "2" ""

Andrew> Everything is quoted? Then this will work like a charm:

Andrew>     line[1:-1].split('" "')

>> It didn't actually skip the space, but the data is fairly regular,
>> so I can live with it.

Andrew> Okay - looks like the skipinitialspace stuff needs more
Andrew> testing - I doubt Dave coded it with delimiter=' ' in mind -
Andrew> it's a pretty pathological case... 8-)

It might be as simple as swapping the following tests:

        case START_FIELD:
                :
                :
                else if (c == self->dialect.delimiter) {
                        /* save empty field */
                        parse_save_field(self);
                }
                else if (c == ' ' && self->dialect.skipinitialspace)
                        /* ignore space at start of field */
                        ;

The state machine for handling multi-character delimiters is not
necessarily much more complicated.  Instead of switching to a new state
on the basis of a single character, the state machine would have to
introduce transitional states which iterate over the multi-character
delimiter before going to the destination state.

There would have to be some very basic backtracking which allowed the
parser state machine to indicate a false match of delimiter in the
transitional state.  This would rewind the input stream (careful about
infinite loops).

Looking at the state machine code which reacts to the delimiter, we
would need the following transitional states:

 DELIMITER_START_FIELD
 DELIMITER_ESCAPED_CHAR
 DELIMITER_IN_FIELD
 DELIMITER_ESCAPE_IN_QUOTED_FIELD
 DELIMITER_QUOTE_IN_QUOTED_FIELD

Mind you all of this code falls over once you decide to allow multiple
characters in the quotechar as well.  What happens when
delimiter = 'DD' and quotechar = 'DQ' (where D and Q are some
arbitrary character)?  You start building a partial regex engine.

- Dave

-- 
http://www.object-craft.com.au


From mal at lemburg.com  Sun Feb  9 12:38:03 2003
From: mal at lemburg.com (M.-A. Lemburg)
Date: Sun, 09 Feb 2003 12:38:03 +0100
Subject: [Csv] Re: [Python-checkins] python/nondist/sandbox/csv/test
 unicode_test.py,NONE,1.1
In-Reply-To: <15941.20721.268117.216891@montanaro.dyndns.org>
References: <15941.8582.901618.823053@montanaro.dyndns.org>
	<3E453D46.7080307@lemburg.com>
	<15941.20721.268117.216891@montanaro.dyndns.org>
Message-ID: <3E463D9B.40007@lemburg.com>

Skip Montanaro wrote:
> (redirecting to the csv mailing list so this stuff gets archived.)
> 
>     >> http://mail.python.org/pipermail/python-list/2003-February/145151.html
> 
>     mal> Why not convert the input data to UTF-8 and take it from there ?
> 
> Good suggestion, thanks.  The only issue is the variable width nature of
> utf-8.  I think if we are going to convert to a concrete encoding it would
> be easier to convert to something which has constant-width characters
> wouldn't it?  Of course, if I can convince the guys in Australia writing the
> actual code to deal with a variable-width encoding, it can't be far from
> there to allowing multi-character delimiters. ;-)

We chose UTF-8 in the Python tokenizer/compiler to turn a previously
byte based program part into a Unicode capable one. Many other tools
have used the same approach. Variable length encodings have problems
with slicing and indexing, but unless you need these, I don't see
much of a problem.
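The slicing hazard Marc-Andre alludes to is easy to demonstrate (written
here in modern Python syntax for illustration; the Python 2.x of this
thread would spell the literal u"caf\u00e9"):

```python
text = "caf\u00e9"              # four characters
data = text.encode("utf-8")     # five bytes: the e-acute takes two
assert len(text) == 4 and len(data) == 5

# naive byte slicing can split a multi-byte character:
truncated = False
try:
    data[:4].decode("utf-8")    # b'caf\xc3' ends mid-sequence
except UnicodeDecodeError:
    truncated = True
assert truncated
```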

>     mal> Are you sure that Unicode objects will be lower in processing ?
> 
> Operating on Python string or unicode objects without converting them to
> some sort of C string will almost certainly be slower than the current code
> which is a relatively modest finite state machine operating on individual
> bytes.

You could use a hybrid approach similar to sre or mxTextTools for
dealing with both types of base types (char vs. Py_UNICODE).

>     mal> (Is there a standard for encodings in CSV files ?)
> 
> No, there is none, hence the use of codecs.EncodedFile to allow the
> programmer to specify the encoding.  Excel can export to two formats it
> calls "Unicode CSV" and "Unicode Text".  Exporting a spreadsheet containing
> nothing but ASCII as Unicode CSV produced exactly the same comma-separated
> file as would have been dumped using the usual CSV export format.  Exporting
> the same spreadsheet as Unicode Text produced a tab-separated file which I
> guessed to be utf-16.  It started with a little-endian utf-16 BOM and all
> the characters were two bytes wide with one byte being an ASCII NUL.

The BOM mark is what MS uses to indicate Unicode in text files.
It's a rather practical approach to the problem, but it works :-)
Perhaps you could add some magic to detect these BOM marks
and then default to UTF-16 input ?!

There's also the possibility of using UTF-8 BOMs, BTW. See codecs.py
for a list of possible BOM marks.
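A minimal sketch of the BOM magic Marc-Andre suggests, using the BOM
constants that the codecs module provides (the function name and the
ASCII fallback are illustrative, not part of any csv API):

```python
import codecs

def sniff_encoding(data, default="ascii"):
    """Guess an encoding from a leading BOM; fall back to `default`
    when no BOM is present."""
    boms = [
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return default
```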

> Thanks for the feedback,

You're welcome,
-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/


From skip at pobox.com  Sun Feb  9 17:47:05 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 9 Feb 2003 10:47:05 -0600
Subject: [Csv] multi-character delimiters, take two
In-Reply-To: <m3of5mz2y9.fsf@ferret.object-craft.com.au>
References: <15938.58391.603139.209913@montanaro.dyndns.org>
        <20030206225259.0E0EB3CA92@coffee.object-craft.com.au>
        <15938.60054.445005.692672@montanaro.dyndns.org>
        <15938.63211.957812.39094@montanaro.dyndns.org>
        <20030207003236.5D2DE3CA92@coffee.object-craft.com.au>
        <m3of5mz2y9.fsf@ferret.object-craft.com.au>
Message-ID: <15942.34313.543359.488243@montanaro.dyndns.org>


    Dave> Mind you all of this code falls over once you decide to allow
    Dave> multiple characters in the quotechar as well.  What happens when
    Dave> delimiter = 'DD' and quotechar = 'DQ' (where D and Q are some
    Dave> arbitrary character)?  You start building a partial regex engine.

Would it work to simply use regular expressions to recognize delimiters and
quotes?  (I'll let you do the math. ;-)

Skip


From andrewm at object-craft.com.au  Mon Feb 10 00:09:05 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 10 Feb 2003 10:09:05 +1100
Subject: [Csv] confused about wrapping readers and writers 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15941.25805.234008.663342@montanaro.dyndns.org> 
References: <15941.25805.234008.663342@montanaro.dyndns.org> 
Message-ID: <20030209230905.816893CA92@coffee.object-craft.com.au>

>I still want to be able to read from and write to dictionaries. ;-) I would
>like to add a pair of classes to csv.py which implement this, but I don't
>quite know what's required, never having written any iterators before.

An object that supports iteration needs an __iter__() method. When called,
this method returns an object that supports iteration (in other words,
has a next() method). __iter__() can return self (in which case, self
needs a next() method).
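A minimal illustration of that protocol (written for modern Python,
where the next() method of this thread's era is spelled __next__):

```python
class CountDown:
    """Minimal iterator: __iter__ returns self, and self has a next
    method (__next__ in Python 3) that raises StopIteration when done."""
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return self
    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1
```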

>If I create a reader:
>
>    >>> rdr = csv.reader(["a,b,c\r\n"])
>
>and ask for its attributes, all I get back are the data attributes:
>
>    >>> dir(rdr)
>    ['delimiter', 'doublequote', 'escapechar', 'lineterminator',
>    'quotechar', 'quoting', 'skipinitialspace', 'strict'] 

For reasons that I haven't looked into, dir() is not finding methods
on the objects we're creating - I suspect this is a hang-over from the
type/class unification (i.e., we need to exercise an extended API to
get our methods exposed).

>Does the underlying reader object need to expose its Reader_iternext
>function as a next() method?  Based upon
>
>    http://www.python.org/doc/current/lib/typeiter.html
>
>I sort of suspect it does.  It looks like it also needs an __iter__() method
>which just returns self.

Hmmm - the C parts of Python are obviously finding them:

>>> import csv
>>> r=csv.reader([])
>>> iter(r)
<_csv.reader object at 0x40188810>

>Is all that's missing a next() method for reader objects?

I suspect so... will let you know.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Mon Feb 10 09:47:24 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 10 Feb 2003 19:47:24 +1100
Subject: [Csv] confused about wrapping readers and writers 
In-Reply-To: Message from Andrew McNamara <andrewm@object-craft.com.au> 
	<20030209230905.816893CA92@coffee.object-craft.com.au> 
References: <15941.25805.234008.663342@montanaro.dyndns.org>
	<20030209230905.816893CA92@coffee.object-craft.com.au> 
Message-ID: <20030210084724.8D6063CA89@coffee.object-craft.com.au>

>>    >>> dir(rdr)
>>    ['delimiter', 'doublequote', 'escapechar', 'lineterminator',
>>    'quotechar', 'quoting', 'skipinitialspace', 'strict'] 
>
>For reasons that I haven't looked into, dir() is not finding methods
>on the objects we're creating - I suspect this is a hang-over from the
>type/class unification (i.e., we need to exercise an extended API to
>get our methods exposed).

We were using "old-style" getattr/setattr - a day of pawing over python
internals showed how to use the new interfaces, so __iter__ and next
are now exposed the way they should be, and dir(...) lists all methods.

I also changed the Dialect structure into a fully fledged python type -
this was something I'd been considering for a while, but had assumed
there would be too much of a performance impact. Turns out there wasn't,
and it's made the code cleaner.

Note that the reader and writer objects no longer have attributes
corresponding to the individual settings - instead, they have a "dialect"
attribute, which contains the settings. It would be a relatively trivial
matter to proxy getattr/setattr requests from reader/writer to the
dialect instance - I can do this if people think it's worthwhile.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From djc at object-craft.com.au  Mon Feb 10 10:44:28 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 10 Feb 2003 20:44:28 +1100
Subject: [Csv] confused about wrapping readers and writers
In-Reply-To: <20030210084724.8D6063CA89@coffee.object-craft.com.au>
References: <15941.25805.234008.663342@montanaro.dyndns.org>
	<20030209230905.816893CA92@coffee.object-craft.com.au>
	<20030210084724.8D6063CA89@coffee.object-craft.com.au>
Message-ID: <m3of5kv5ab.fsf@ferret.object-craft.com.au>

>>>>> "Andrew" == Andrew McNamara <andrewm at object-craft.com.au> writes:

>>> >>> dir(rdr) ['delimiter', 'doublequote', 'escapechar',
>>> 'lineterminator', 'quotechar', 'quoting', 'skipinitialspace',
>>> 'strict']
>>  For reasons that I haven't looked into, dir() is not finding
>> methods on the objects we're creating - I suspect this is a
>> hang-over from the type/class unification (i.e., we need to
>> exercise an extended API to get our methods exposed.

Andrew> We were using "old-style" getattr/setattr - a day of pawing
Andrew> over python internals showed how to use the new interfaces, so
Andrew> __iter__ and next are now exposed the way they should be, and
Andrew> dir(...) lists all methods.

Andrew> I also changed the Dialect structure into a fully fledged
Andrew> python type - this was something I'd been considering for a
Andrew> while, but had assumed there would be too much of a
Andrew> performance impact. Turns out there wasn't, and it's made the
Andrew> code cleaner.

Andrew> Note that the reader and writer objects no longer have
Andrew> attributes corresponding to the individual settings - instead,
Andrew> they have a "dialect" attribute, which contains the
Andrew> settings. It would be a relatively trivial matter to proxy
Andrew> getattr/setattr requests from reader/writer to the dialect
Andrew> instance - I can do this if people think it's worthwhile.

I think it is well nigh time to let this code loose on the Python
community.  The only possible addition now would be some kind of
mechanism whereby something like the db_row could be linked in with
the module.

        http://opensource.theopalgroup.com/

Mind you the application might be the best place to do this kind of
linkage.

- Dave

-- 
http://www.object-craft.com.au


From andrewm at object-craft.com.au  Mon Feb 10 11:26:33 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 10 Feb 2003 21:26:33 +1100
Subject: [Csv] confused about wrapping readers and writers 
In-Reply-To: Message from Dave Cole <djc@object-craft.com.au> 
	<m3of5kv5ab.fsf@ferret.object-craft.com.au> 
References: <15941.25805.234008.663342@montanaro.dyndns.org>
	<20030209230905.816893CA92@coffee.object-craft.com.au>
	<20030210084724.8D6063CA89@coffee.object-craft.com.au>
	<m3of5kv5ab.fsf@ferret.object-craft.com.au> 
Message-ID: <20030210102633.DDCD03CA89@coffee.object-craft.com.au>

>I think it is well nigh time to let this code loose on the Python
>community.  

It now works with Python 2.2, which certainly makes this more feasible.
Supporting versions of Python prior to 2.2 is problematic - the type
model is very different, and they don't have iterators (which the C code
uses in some key locations).

>The only possible addition now would be some kind of
>mechanism whereby something like the db_row could be linked in with
>the module.
>
>        http://opensource.theopalgroup.com/
>
>Mind you the application might be the best place to do this kind of
>linkage.

Maybe Skip's dictionary stuff would get us closer?

We haven't made any impression on the csv.utils sub-module yet - things
like the sniffer. We want to watch we don't miss the 2.3 boat - what's
the next step?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Mon Feb 10 15:38:32 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 10 Feb 2003 08:38:32 -0600
Subject: [Csv] confused about wrapping readers and writers 
In-Reply-To: <20030210102633.DDCD03CA89@coffee.object-craft.com.au>
References: <15941.25805.234008.663342@montanaro.dyndns.org>
        <20030209230905.816893CA92@coffee.object-craft.com.au>
        <20030210084724.8D6063CA89@coffee.object-craft.com.au>
        <m3of5kv5ab.fsf@ferret.object-craft.com.au>
        <20030210102633.DDCD03CA89@coffee.object-craft.com.au>
Message-ID: <15943.47464.805209.596214@montanaro.dyndns.org>

    >> The only possible addition now would be some kind of mechanism
    >> whereby something like the db_row could be linked in with the module.
    >> 
    >> http://opensource.theopalgroup.com/
    >> 
    >> Mind you the application might be the best place to do this kind of
    >> linkage.

    Andrew> Maybe Skip's dictionary stuff would get us closer?

Maybe, but there are enough object-relational mappers out there (I gather
that's sort of what db_row is) that we can't possibly make everyone happy.
I say we punt.  I haven't cvs up'd yet this morning.  Hopefully my
DictReader and DictWriter classes still work. ;-)

    Andrew> We haven't made any impression on the csv.utils sub-module yet -
    Andrew> things like the sniffer. We want to watch we don't miss the 2.3
    Andrew> boat - what's the next step?

That's Cliff's expertise, and judging from his recent silence, I suspect
he's still pretty busy with other things.  Cliff, assuming the rest of the
code is pretty much set how are you fixed for time to work on a sniffer?
Should we propose that what's there now be incorporated into 2.3 and then
aim for a separate csv.utils module between 2.3 and 2.4 (to be added in
2.4)?

Skip

From skip at pobox.com  Mon Feb 10 16:41:13 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 10 Feb 2003 09:41:13 -0600
Subject: [Csv] DictReader/DictWriter behavior question
Message-ID: <15943.51225.611672.963525@montanaro.dyndns.org>

I have DictReader set up to handle short or long rows in a reasonable
fashion.  If the number of fields in the input row is more than the length
of the "fieldnames" list passed to the constructor, an extra field, keyed
by the optional "rest" argument gathers the remaining data.  For example:

    >>> rdr = csv.DictReader (["a,b,c,d,e\r\n"], fieldnames="1 2 3".split())
    >>> rdr.next()
    {'1': 'a', None: ['d', 'e'], '3': 'c', '2': 'b'}
    >>> rdr = csv.DictReader (["a,b,c,d,e\r\n"], fieldnames="1 2 3".split(), rest="foo")
    >>> rdr.next()
    {'1': 'a', '3': 'c', '2': 'b', 'foo': ['d', 'e']}

Similarly, if the row is short:

    >>> rdr = csv.DictReader (["a,b,c\r\n"], fieldnames="1 2 3 4 5 6".split(), restval="dflt")
    >>> rdr.next()
    {'1': 'a', '3': 'c', '2': 'b', '5': 'dflt', '4': 'dflt', '6': 'dflt'}

(I'm about to change the "rest" parameter to "restkey".)

My problem is the DictWriter.  It uses a similar mechanism to map dicts to
output rows:

    >>> f = StringIO.StringIO()
    >>> wrtr = csv.DictWriter(f, fieldnames="1 2 3".split())
    >>> wrtr.writerow({"1":30,"2":20,"3":10})
    >>> f.getvalue()
    '30,20,10\r\n'

When writing though, I face the dilemma of what to do if the dictionary
being written has one or more keys which don't appear in the fieldnames
list.  I can silently ignore them (that's the current behavior), I can raise
an exception, or I can give the user control.  There's no way to actually
write that data because you have no obvious way to order those values.  (I
could do something hokey like write out the key and the value somehow.)

What do you think is the best behavior, ignore values or raise an exception?
Or do you have other ideas?
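One sketch of the "give the user control" option (the extrasaction
parameter name here mirrors what a later message says was adopted, but
the helper itself is hypothetical, not csv.py code):

```python
def dict_to_row(rowdict, fieldnames, extrasaction="raise"):
    """Map a dict onto an ordered row.  Keys not in fieldnames either
    raise an error or are silently ignored, per extrasaction."""
    extras = [k for k in rowdict if k not in fieldnames]
    if extras and extrasaction == "raise":
        raise ValueError("dict contains fields not in fieldnames: %r" % extras)
    return [rowdict.get(key, "") for key in fieldnames]
```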

Skip



From LogiplexSoftware at earthlink.net  Mon Feb 10 19:19:19 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 10 Feb 2003 10:19:19 -0800
Subject: [Csv] confused about wrapping readers and writers
In-Reply-To: <15943.47464.805209.596214@montanaro.dyndns.org>
References: <15941.25805.234008.663342@montanaro.dyndns.org>
	 <20030209230905.816893CA92@coffee.object-craft.com.au>
	 <20030210084724.8D6063CA89@coffee.object-craft.com.au>
	 <m3of5kv5ab.fsf@ferret.object-craft.com.au>
	 <20030210102633.DDCD03CA89@coffee.object-craft.com.au>
	 <15943.47464.805209.596214@montanaro.dyndns.org>
Message-ID: <1044901159.1376.74.camel@software1.logiplex.internal>

On Mon, 2003-02-10 at 06:38, Skip Montanaro wrote:
>     >> The only possible addition now would be some kind of mechanism
>     >> whereby something like the db_row could be linked in with the module.
>     >> 
>     >> http://opensource.theopalgroup.com/
>     >> 
>     >> Mind you the application might be the best place to do this kind of
>     >> linkage.
> 
>     Andrew> Maybe Skip's dictionary stuff would get us closer?
> 
> Maybe, but there are enough object-relational mappers out there (I gather
> that's sort of what db_row is) that we can't possibly make everyone happy.
> I say we punt.  I haven't cvs up'd yet this morning.  Hopefully my
> DictReader and DictWriter classes still work. ;-)
> 
>     Andrew> We haven't made any impression on the csv.utils sub-module yet -
>     Andrew> things like the sniffer. We want to watch we don't miss the 2.3
>     Andrew> boat - what's the next step?
> 
> That's Cliff's expertise, and judging from his recent silence, I suspect
> he's still pretty busy with other things.  Cliff, assuming the rest of the
> code is pretty much set how are you fixed for time to work on a sniffer?
> Should we propose that's what's there now be incorporated into 2.3 and then
> aim for a separate csv.utils module between 2.3 and 2.4 (to be added in
> 2.4)?

Hi all,

Sorry about my MIA status.  I've gotten things at work reduced to
smoldering ashes, which is the usual state-of-affairs, so hopefully I
can actually contribute a bit.

I think we need to once again decide what we want/need in csv.utils.
Obvious candidates are:

1. Sniffer for guessing delimiter
2. Sniffer for guessing quotechar
3. Sniffer for guessing whether first row is header

These were easy as they already exist in DSV ;)   We just need to decide
what the API will look like for the algorithms.  Right now the DSV stuff
just returns char, char, bool, respectively for the above functions.  It
would be easy to write a wrapper that calls all three consecutively and
returns a dialect object (I don't think it's necessary to match against
existing dialects, but maybe we should?).

4. Row -> dict converter.  This should be easy as well.  The user can
use the results of the guessHeaders() sniffer or just provide their own
list of names to use as keys.  I haven't looked at Skip's code yet, but
I don't see how this can be anything but trivial.
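A frequency-consistency heuristic in the spirit of sniffer #1 could be
sketched like this (hypothetical code, not the actual DSV algorithm):

```python
def guess_delimiter(sample_lines, candidates=",;\t|"):
    """Prefer the candidate that appears a nonzero, consistent number
    of times on every sample line -- a rough consistency heuristic."""
    best, best_score = ",", -1
    for cand in candidates:
        counts = [line.count(cand) for line in sample_lines]
        if not counts or min(counts) == 0:
            continue
        # reward frequency, penalize per-line disagreement
        score = min(counts) - (max(counts) - min(counts))
        if score > best_score:
            best, best_score = cand, score
    return best
```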

What other things are we looking at? 


-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308


From skip at pobox.com  Mon Feb 10 19:30:44 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 10 Feb 2003 12:30:44 -0600
Subject: [Csv] confused about wrapping readers and writers
In-Reply-To: <1044901159.1376.74.camel@software1.logiplex.internal>
References: <15941.25805.234008.663342@montanaro.dyndns.org>
        <20030209230905.816893CA92@coffee.object-craft.com.au>
        <20030210084724.8D6063CA89@coffee.object-craft.com.au>
        <m3of5kv5ab.fsf@ferret.object-craft.com.au>
        <20030210102633.DDCD03CA89@coffee.object-craft.com.au>
        <15943.47464.805209.596214@montanaro.dyndns.org>
        <1044901159.1376.74.camel@software1.logiplex.internal>
Message-ID: <15943.61396.366088.546062@montanaro.dyndns.org>


    Cliff> 1. Sniffer for guessing delimiter
    Cliff> 2. Sniffer for guessing quotechar
    Cliff> 3. Sniffer for guessing whether first row is header

These all sound fine.

    Cliff> It would be easy to write a wrapper that calls all three
    Cliff> consecutively and returns a dialect object (I don't think it's
    Cliff> necessary to match against existing dialects, but maybe we
    Cliff> should?).

You'd have to assume reasonable defaults for the other parameters.  How
about line terminator and QUOTE_{ALL,MINIMAL,NONE,NONNUMERIC} sniffers?
(Are the QUOTE_* values used by readers?)

    Cliff> 4. Row -> dict converter.

I don't think this will be necessary.  I already added DictReader and
DictWriter classes to csv.py which do the pretty much obvious (to me) thing.

    Cliff> What other things are we looking at? 

Some proofreading/editing of the PEP and the libcsv.tex file?

Skip

From LogiplexSoftware at earthlink.net  Mon Feb 10 21:50:21 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 10 Feb 2003 12:50:21 -0800
Subject: [Csv] confused about wrapping readers and writers
In-Reply-To: <15943.61396.366088.546062@montanaro.dyndns.org>
References: <15941.25805.234008.663342@montanaro.dyndns.org>
	 <20030209230905.816893CA92@coffee.object-craft.com.au>
	 <20030210084724.8D6063CA89@coffee.object-craft.com.au>
	 <m3of5kv5ab.fsf@ferret.object-craft.com.au>
	 <20030210102633.DDCD03CA89@coffee.object-craft.com.au>
	 <15943.47464.805209.596214@montanaro.dyndns.org>
	 <1044901159.1376.74.camel@software1.logiplex.internal>
	 <15943.61396.366088.546062@montanaro.dyndns.org>
Message-ID: <1044910221.2250.6.camel@software1.logiplex.internal>

On Mon, 2003-02-10 at 10:30, Skip Montanaro wrote:
>     Cliff> 1. Sniffer for guessing delimiter
>     Cliff> 2. Sniffer for guessing quotechar
>     Cliff> 3. Sniffer for guessing whether first row is header
> 
> These all sound fine.
> 
>     Cliff> It would be easy to write a wrapper that calls all three
>     Cliff> consecutively and returns a dialect object (I don't think it's
>     Cliff> necessary to match against existing dialects, but maybe we
>     Cliff> should?).
> 
> You'd have to assume reasonable defaults for the other parameters.  How
> about line terminator and QUOTE_{ALL,MINIMAL,NONE,NONNUMERIC} sniffers?
> (Are the QUOTE_* values used by readers?)

Line terminator would seem necessary, QUOTE_* doesn't seem necessary for
import.

>     Cliff> 4. Row -> dict converter.
> 
> I don't think this will be necessary.  I already added DictReader and
> DictWriter classes to csv.py which do the pretty much obvious (to me) thing.

Okay.

>     Cliff> What other things are we looking at? 
> 
> Some proofreading/editing of the PEP and the libcsv.tex file?

Can do.  I've got a Python meeting tonight and I'm helping someone clean
a barn tomorrow night (the life of a programmer, you know) but I might
be able to squeeze a bit of time in to get some of this done at least by
Wednesday.


-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308


From skip at pobox.com  Mon Feb 10 23:47:16 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 10 Feb 2003 16:47:16 -0600
Subject: [Csv] update - csv.py & libcsv.tex
Message-ID: <15944.11252.212318.864943@montanaro.dyndns.org>

Just checked in new versions of csv.py and libcsv.tex.  The former includes
a couple changes from previous notes ("rest" param is now "restkey",
DictWriter objects now have user-configurable "extrasaction" to deal with
case of dicts which have keys not in the known fieldnames).  I added text
regarding the DictReader and DictWriter classes and fixed a number of Latex
errors in libcsv.tex.

Skip

From skip at pobox.com  Tue Feb 11 15:52:21 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 11 Feb 2003 08:52:21 -0600
Subject: [Csv] writerow() leakage?
Message-ID: <15945.3621.91285.415100@montanaro.dyndns.org>

I just checked in some attempts at leakage testing in test/test_csv.py.
Creating readers and writers appears okay, as does reading data.  It appears
that writerow() leaks though.

The new tests will only be run if sys.gettotalrefcount() is available, so
you'll need to run them with a --with-pydebug build.

Skip


From skip at pobox.com  Tue Feb 11 22:49:58 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 11 Feb 2003 15:49:58 -0600
Subject: [Csv] ignore blank lines?
Message-ID: <15945.28678.631121.25754@montanaro.dyndns.org>


Would there be any value in telling the csv module to ignore blank lines?  I
notice that the logfile exporter of Firewall-1 seems to always append three
blank lines to its output.  (BTW, I discovered a command-line logfile export
capability which runs on Solaris, so I can dispense with the hokey two-space
separator the Windows-based log viewer uses.)

Skip

From djc at object-craft.com.au  Wed Feb 12 00:02:22 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 12 Feb 2003 10:02:22 +1100
Subject: [Csv] ignore blank lines?
In-Reply-To: <15945.28678.631121.25754@montanaro.dyndns.org>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
Message-ID: <m31y2e5sld.fsf@ferret.object-craft.com.au>

>>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:

Skip> Would there be any value in telling the csv module to ignore
Skip> blank lines?  I notice that the logfile exporter of Firewall-1
Skip> seems to always append three blank lines to its output.  (BTW, I
Skip> discovered a command-line logfile export capability which runs
Skip> on Solaris, so I can dispense with the hokey two-space separator
Skip> the Windows-based log viewer uses.)

Couldn't the application just ignore records which have zero fields?

- Dave

-- 
http://www.object-craft.com.au


From andrewm at object-craft.com.au  Wed Feb 12 00:05:59 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 12 Feb 2003 10:05:59 +1100
Subject: [Csv] writerow() leakage? 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15945.3621.91285.415100@montanaro.dyndns.org> 
References: <15945.3621.91285.415100@montanaro.dyndns.org> 
Message-ID: <20030211230559.D0A723CB83@coffee.object-craft.com.au>

>I just checked in some attempts at leakage testing in test/test_csv.py.
>Creating readers and writers appears okay, as does reading data.  It appears
>that writerow() leaks though.

It's the implementation of StringIO that's making it look like writerow is
leaking references: StringIO() appends the data you write to a list.
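Andrew's point is visible even at the Python level: the buffer retains
a reference to everything written, so totals climb with each row even
when nothing is actually leaking (io.StringIO shown here; the 2.x
StringIO of the thread behaved the same way):

```python
import io

buf = io.StringIO()
for row in (["a", "b"], ["c", "d"]):
    # each write is retained by the buffer until it is discarded,
    # which inflates total refcounts in a --with-pydebug build
    buf.write(",".join(row) + "\r\n")

assert buf.getvalue() == "a,b\r\nc,d\r\n"
```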

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Wed Feb 12 01:24:56 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 11 Feb 2003 18:24:56 -0600
Subject: [Csv] ignore blank lines?
In-Reply-To: <m31y2e5sld.fsf@ferret.object-craft.com.au>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
        <m31y2e5sld.fsf@ferret.object-craft.com.au>
Message-ID: <15945.37976.592926.369940@montanaro.dyndns.org>


    Skip> Would there be any value in telling the csv module to ignore blank
    Skip> lines?

    Dave> Couldn't the application just ignore records which have zero
    Dave> fields?

I suppose so, but it seems somehow cleaner to me if I know the parser won't
return empty lists. 

Skip


From sjmachin at lexicon.net  Wed Feb 12 01:46:59 2003
From: sjmachin at lexicon.net (John Machin)
Date: Wed, 12 Feb 2003 11:46:59 +1100
Subject: [Csv] ignore blank lines?
In-Reply-To: <15945.37976.592926.369940@montanaro.dyndns.org>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
	<m31y2e5sld.fsf@ferret.object-craft.com.au>
	<15945.37976.592926.369940@montanaro.dyndns.org>
Message-ID: <oprkghsll3m50ryr@localhost>

On Tue, 11 Feb 2003 18:24:56 -0600, Skip Montanaro <skip at pobox.com> wrote:

>
> Skip> Would there be any value in telling the csv module to ignore blank
> Skip> lines?
>
> Dave> Couldn't the application just ignore records which have zero
> Dave> fields?
>
> I suppose so, but it seems somehow cleaner to me if I know the parser 
> won't
> return empty lists.
>

Does Skip mean a blank line "...\n    \n..." or an empty line "...\n\n..." 
???

It might help if first this were discussed (and the answer documented):

When writing, both [] and [""] will produce empty lines.
Dave seems to imply that on reading, an "empty" line will produce [], not
[""]. There is some ground for arguing the latter, on the basis that a
record with only n delimiters and nothing else should produce (n+1) * [""],
and in this case n is zero.

Whatever, I'm with Dave. The caller can handle this. In any case, ignoring 
empty lines (except maybe one or two that have inadvertently appeared at 
the end of the file) seems perilous to me. I've seen lots of dud data in my 
time, but never once a file where I could happily ignore non-terminal empty 
lines.
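The delimiter-counting argument is easy to check against the reader itself: a
record containing only n bare delimiters parses to n+1 empty fields, and the
blank line (n = 0) is the case where the module departs from that rule:

```python
import csv

# A record of n bare delimiters parses to n+1 empty fields; a blank line
# yields [] rather than the [''] the (n+1) rule would suggest.
rows = list(csv.reader([',,\r\n',    # n = 2 delimiters -> 3 empty fields
                        ',\r\n',     # n = 1 delimiter  -> 2 empty fields
                        '\r\n']))    # n = 0 delimiters -> [] (not [''])
```

So the reader already distinguishes "no fields" from "one empty field", which
is the distinction the caller needs in order to handle blank lines itself.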


-- 
 

From djc at object-craft.com.au  Wed Feb 12 02:40:15 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 12 Feb 2003 12:40:15 +1100
Subject: [Csv] ignore blank lines?
In-Reply-To: <oprkghsll3m50ryr@localhost>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
	<m31y2e5sld.fsf@ferret.object-craft.com.au>
	<15945.37976.592926.369940@montanaro.dyndns.org>
	<oprkghsll3m50ryr@localhost>
Message-ID: <m31y2e46ps.fsf@ferret.object-craft.com.au>

>>>>> "John" == John Machin <sjmachin at lexicon.net> writes:

John> On Tue, 11 Feb 2003 18:24:56 -0600, Skip Montanaro
John> <skip at pobox.com> wrote:
>>
Skip> Would there be any value in telling the csv module to ignore
Skip> blank lines?
>>
Dave> Couldn't the application just ignore records which have zero
Dave> fields?
>>  I suppose so, but it seems somehow cleaner to me if I know the
>> parser won't return empty lists.
>> 

John> Does Skip mean a blank line "...\n \n..." or an empty line
John> "...\n\n..." ???

John> It might help if first this were discussed (and the answer
John> documented):

John> When writing, both [] and [""] will produce empty lines.  Dave
John> seems to imply that on reading, an "empty" line will produce [],
John> not [""]. There is some ground for arguing the latter, on the
John> basis that a record with only n delimiters and nothing else
John> should produce (n+1) * [""] and in this case n is zero

John> Whatever, I'm with Dave. The caller can handle this. In any
John> case, ignoring empty lines (except maybe one or two that have
John> inadvertently appeared at the end of the file) seems perilous to
John> me. I've seen lots of dud data in my time, but never once a file
John> where I could happily ignore non-terminal empty lines.

Let's try it out:

>>> import csv
>>> r = csv.reader(['', '""', ' '])
>>> r.next()
[]
>>> r.next()
['']
>>> r.next()
[' ']
>>> class F:
...     def write(self, s):
...         print repr(s)
...
>>> w = csv.writer(F())
>>> w.writerow([])
'\r\n'
>>> w.writerow([''])
'""\r\n'
>>> w.writerow([' '])
' \r\n'

Seems like the module is doing the right (sensible) thing.

- Dave

-- 
http://www.object-craft.com.au


From skip at pobox.com  Wed Feb 12 03:45:10 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 11 Feb 2003 20:45:10 -0600
Subject: [Csv] writerow() leakage? 
In-Reply-To: <20030211230559.D0A723CB83@coffee.object-craft.com.au>
References: <15945.3621.91285.415100@montanaro.dyndns.org>
        <20030211230559.D0A723CB83@coffee.object-craft.com.au>
Message-ID: <15945.46390.240131.505750@montanaro.dyndns.org>


    Andrew> It's the implementation of StringIO that's making it look like
    Andrew> writerow is leaking references: StringIO() appends the data you
    Andrew> write to a list.

Thanks.  Test fixed.

S

From andrewm at object-craft.com.au  Wed Feb 12 05:00:17 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 12 Feb 2003 15:00:17 +1100
Subject: [Csv] csv.writer, file must be binary mode...
Message-ID: <20030212040017.5C84A3CB83@coffee.object-craft.com.au>

A posting by Tim Peters on the Python list reminded me that csv.writer()
is not the only module that requires it be passed a file in binary mode - 
Pickle is a classic example.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb 12 07:19:37 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 12 Feb 2003 17:19:37 +1100
Subject: [Csv] This surprised me 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15941.8540.607571.202309@montanaro.dyndns.org> 
References: <15941.8540.607571.202309@montanaro.dyndns.org> 
Message-ID: <20030212061937.863823CB83@coffee.object-craft.com.au>

>    >> This code surprised me:
>    ...
>    Andrew> Surely there's more to your example than you quoted in this
>    Andrew> e-mail? The exception you mention came from the python code, not
>    Andrew> the C module (specifically the Dialect class), but I can't see
>    Andrew> where it is referenced in the quoted code?
>
>Nope, nothing more.  I guess the point I was trying to make is that if I
>pass a dialect object which is not subclassed from csv.Dialect (as you
>suggested I should be able to do), it seems to be silently accepted.

Uh? If I recall correctly, the exception quoted came from the python
Dialect class, but it wasn't involved in the line that threw the
exception? 8-)

>    Andrew> The C code will instantiate (and thus call Dialect's _validate)
>    Andrew> when register_dialect is called, or when the class is passed to
>    Andrew> reader or writer.
>
>Correct.  But you indicated that was no longer necessary.  I was wondering
>where the error checking went to.

I decided it wasn't necessary - if the instance has the necessary bits
and no more, we can use it as parameters, whether it's a descendant of
Dialect or not.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb 12 07:49:43 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 12 Feb 2003 17:49:43 +1100
Subject: [Csv] Re: Unicode again 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15929.5633.929389.67150@montanaro.dyndns.org> 
References: <15929.5633.929389.67150@montanaro.dyndns.org> 
Message-ID: <20030212064943.6C8C63CB83@coffee.object-craft.com.au>

>I've been thinking a little about the Unicode issue some more.  I really
>think you don't want to dive into picking apart Unicode strings.  If
>nothing else, you'll have to deal with a mixture of wide and narrow
>characters.  How about two paths?  If you know everything's a plain
>string, execute your current code.  If any elements are Unicode strings,
>take the slower, high-level path.

I've had a bit of a chance to look at the C unicode implementation, and
it's pretty clean - essentially you just have a string of unsigned shorts
(or unsigned longs if python was built with wide support) instead of
unsigned chars. Generally you don't have to worry about variable length
data (we'd cover 99.99% of use cases by ignoring the exceptions).

I think I currently favour the approach used in sre, where preprocessor
tricks are used to compile two versions of the core, but I'm sure this
won't be trivial. Probably not something we can deal with before 2.3.
Hopefully this won't preclude integration with 2.3.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Wed Feb 12 15:22:48 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 08:22:48 -0600
Subject: [Csv] csv.writer, file must be binary mode...
In-Reply-To: <20030212040017.5C84A3CB83@coffee.object-craft.com.au>
References: <20030212040017.5C84A3CB83@coffee.object-craft.com.au>
Message-ID: <15946.22712.538271.973935@montanaro.dyndns.org>


    Andrew> A posting by Tim Peters on the Python list reminded me that
    Andrew> csv.writer() is not the only module that requires it be passed a
    Andrew> file in binary mode - Pickle is a classic example.

Thanks for the tip.  I'll mention this in the PEP.

Skip

From skip at pobox.com  Wed Feb 12 15:31:25 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 08:31:25 -0600
Subject: [Csv] This surprised me 
In-Reply-To: <20030212061937.863823CB83@coffee.object-craft.com.au>
References: <15941.8540.607571.202309@montanaro.dyndns.org>
        <20030212061937.863823CB83@coffee.object-craft.com.au>
Message-ID: <15946.23229.379024.7389@montanaro.dyndns.org>


    >> Correct.  But you indicated that was no longer necessary.  I was
    >> wondering where the error checking went to.

    Andrew> I decided it wasn't necessary - if the instance has the
    Andrew> necessary bits and no more, we can use it as parameters, whether
    Andrew> it's a descendant of Dialect or not.

Yeah, but what if it has no necessary bits?  Shouldn't the user be alerted
to that fact?

    >>> import csv
    >>> class foo: pass
    ... 
    >>> rdr = csv.reader(["a,b,c\r\n"], dialect=foo)
    >>> rdr.next()
    ['a', 'b', 'c']

If nothing else, we need to define the specific defaults for the various
parameters.  In the above case, clearly my foo class isn't overriding
anything.
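One way to surface the problem would be an explicit check before the dialect
object reaches the reader. A hypothetical check_dialect helper (not part of
the module - the attribute list is the set of format parameters from the PEP)
could complain when the object overrides nothing at all:

```python
# Hypothetical validation helper -- not part of the csv module.
_DIALECT_ATTRS = ('delimiter', 'quotechar', 'escapechar', 'doublequote',
                  'skipinitialspace', 'lineterminator', 'quoting')

def check_dialect(dialect):
    """Return the dialect attributes `dialect` defines; raise TypeError
    if it defines none of them (i.e. it is almost certainly bogus)."""
    found = [a for a in _DIALECT_ATTRS if hasattr(dialect, a)]
    if not found:
        raise TypeError('%r overrides no dialect attributes' % (dialect,))
    return found

class foo:              # the empty class from the example above
    pass

try:
    check_dialect(foo)
    rejected = False
except TypeError:
    rejected = True     # the bogus dialect is now flagged, not silently accepted
```

This is only a sketch of the idea; the real question is whether that check
belongs in the C reader/writer constructors.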

Skip

From skip at pobox.com  Wed Feb 12 16:02:45 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 09:02:45 -0600
Subject: [Csv] csv.writer, file must be binary mode...
In-Reply-To: <20030212040017.5C84A3CB83@coffee.object-craft.com.au>
References: <20030212040017.5C84A3CB83@coffee.object-craft.com.au>
Message-ID: <15946.25109.366279.925230@montanaro.dyndns.org>


    Andrew> A posting by Tim Peters on the Python list reminded me that
    Andrew> csv.writer() is not the only module that requires it be passed a
    Andrew> file in binary mode - Pickle is a classic example.

On second thought, the only reason Pickle requires binary mode is when the
binary pickle format is selected, right?  Hmmm...  I don't think we really
need to say "Pickle requires binary mode, so we can too."

Skip

From skip at pobox.com  Wed Feb 12 16:11:06 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 09:11:06 -0600
Subject: [Csv] ignore blank lines?
In-Reply-To: <oprkghsll3m50ryr@localhost>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
        <m31y2e5sld.fsf@ferret.object-craft.com.au>
        <15945.37976.592926.369940@montanaro.dyndns.org>
        <oprkghsll3m50ryr@localhost>
Message-ID: <15946.25610.202929.623399@montanaro.dyndns.org>


    John> Does Skip mean a blank line "...\n \n..." or an empty line
    John> "...\n\n..."  ???

I meant a line which consists of just the lineterminator sequence.

Here's my use case.  In the DictReader class, if the underlying reader
object returns an empty list and I don't catch it, I wind up returning a
dictionary all of whose fields are set to the restval (typically None).  The
caller can't simply compare that against {} as the caller of csv.reader()
can compare the returned value against [], so it makes sense for me to elide
that case in the DictReader code.

I modified DictReader.next() to start like:

    def next(self):
        row = self.reader.next()
        while row == []:
            row = self.reader.next()
        ... process row ...

Does that behavior make sense in this case?
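For what it's worth, the end-to-end effect of the skip loop is easy to
demonstrate - with it in place, a blank input line simply vanishes from
DictReader's output instead of becoming a row of restval entries:

```python
import csv

# Assuming DictReader skips blank rows (the while-loop above), the blank
# middle line produces no output row at all.
rows = list(csv.DictReader(['a,b\r\n', '\r\n', '1,2\r\n']))
```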

Skip

From skip at pobox.com  Wed Feb 12 18:39:16 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 11:39:16 -0600
Subject: [Csv] Ready for another announcement?
Message-ID: <15946.34500.633614.47737@montanaro.dyndns.org>

Are we ready to make another announcement?  It seems most of the PEP 308
furor has died down, so perhaps an announcement will actually be seen.

Skip


From sjmachin at lexicon.net  Wed Feb 12 21:29:11 2003
From: sjmachin at lexicon.net (John Machin)
Date: Thu, 13 Feb 2003 07:29:11 +1100
Subject: [Csv] ignore blank lines?
In-Reply-To: <15946.25610.202929.623399@montanaro.dyndns.org>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
	<m31y2e5sld.fsf@ferret.object-craft.com.au>
	<15945.37976.592926.369940@montanaro.dyndns.org>
	<oprkghsll3m50ryr@localhost> <15946.25610.202929.623399@montanaro.dyndns.org>
Message-ID: <oprkh0ixoom50ryr@localhost>

On Wed, 12 Feb 2003 09:11:06 -0600, Skip Montanaro <skip at pobox.com> wrote:

>
> John> Does Skip mean a blank line "...\n \n..." or an empty line
> John> "...\n\n..."  ???
>
> I meant a line which consists of just the lineterminator sequence.
>
> Here's my use case.  In the DictReader class, if the underlying reader
> object returns an empty list and I don't catch it, I wind up returning a
> dictionary all of whose fields are set to the restval (typically None).
> The caller can't simply compare that against {} as the caller of
> csv.reader() can compare the returned value against [], so it makes
> sense for me to elide that case in the DictReader code.
>
> I modified DictReader.next() to start like:
>
>     def next(self):
>         row = self.reader.next()
>         while row == []:
>             row = self.reader.next()
>         ... process row ...
>
> Does that behavior make sense in this case?

I am +0 on suppressing empty lines at the end of the input stream, but -1 
on suppressing these (especially with neither notice nor option for non- 
suppression) if they appear between non-empty data rows. Rather than 
petition Dave & Andrew for yet another toggle, I would say make it easier 
for the caller to detect this situation ...

if row == []: return {}
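A sketch of that alternative - a hypothetical DictReader variant that hands a
blank row back as {} instead of swallowing it (the class name and the use of
Python 3's __next__ spelling are mine; under Python 2 the method would be
next):

```python
import csv

class RawDictReader(csv.DictReader):
    """Hypothetical variant: report a blank input line as {} so the
    caller can decide what it means, rather than skipping it."""
    def __next__(self):
        self.fieldnames                      # ensure header row is consumed
        row = next(self.reader)
        self.line_num = self.reader.line_num
        if row == []:
            return {}                        # easy for the caller to detect
        return dict(zip(self.fieldnames, row))

rows = list(RawDictReader(['a,b\r\n', '\r\n', '1,2\r\n']))
```

The caller then tests "if row == {}" wherever blank lines matter, and ignores
them where they don't.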

Cheers,
John





-- 
 

From andrewm at object-craft.com.au  Wed Feb 12 23:23:15 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 13 Feb 2003 09:23:15 +1100
Subject: [Csv] csv.writer, file must be binary mode... 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15946.25109.366279.925230@montanaro.dyndns.org> 
References: <20030212040017.5C84A3CB83@coffee.object-craft.com.au>
	<15946.25109.366279.925230@montanaro.dyndns.org> 
Message-ID: <20030212222315.722AA3CB83@coffee.object-craft.com.au>

>    Andrew> A posting by Tim Peters on the Python list reminded me that
>    Andrew> csv.writer() is not the only module that requires it be passed a
>    Andrew> file in binary mode - Pickle is a classic example.
>
>On second thought, the only reason Pickle requires binary mode is when the
>binary pickle format is selected, right?  Hmmm...  I don't think we really
>need to say "Pickle requires binary mode, so we can too."

That wasn't really what I was thinking - it was more like "requiring
binary mode is going to confuse people and be an endless source of bugs",
but then I saw the Pickle stuff and now I'm not so worried.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Wed Feb 12 23:57:09 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 16:57:09 -0600
Subject: [Csv] csv.writer, file must be binary mode... 
In-Reply-To: <20030212222315.722AA3CB83@coffee.object-craft.com.au>
References: <20030212040017.5C84A3CB83@coffee.object-craft.com.au>
        <15946.25109.366279.925230@montanaro.dyndns.org>
        <20030212222315.722AA3CB83@coffee.object-craft.com.au>
Message-ID: <15946.53573.89825.118609@montanaro.dyndns.org>


    Andrew> That wasn't really what I was thinking - it was more like
    Andrew> "requiring binary mode is going to confuse people and be an
    Andrew> endless source of bugs", but then I saw the Pickle stuff and now
    Andrew> I'm not so worried.

Ah, okay.  It's good I didn't do anything then. ;-)

Skip

From skip at pobox.com  Thu Feb 13 02:16:56 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Feb 2003 19:16:56 -0600
Subject: [Csv] ini file fumbling broke
Message-ID: <15946.61960.251535.643909@montanaro.dyndns.org>


Someone recently decreed that all files mentioned in BAYESCUSTOMIZE must end
in ".ini" and modified Options.py (I named my customize file ~/hammie.opt).
Was this related to the embedded-spaces-in-paths problem?  Sumthin's gotta
give I think.  If spaces are common in filenames, we need to pick a better
separator.  (Or allow the separator to be platform-specific.)  On Unix
systems, ":" is a good path separator (but would be bad on MacOS < X
systems).  I think ";" is more common on Windows.  I don't think forcing
customize files to end in ".ini" is right.  Even one of the default files
searched for in Options.py is "~/.spambayesrc".

Thoughts?

Skip

From sjmachin at lexicon.net  Thu Feb 13 12:13:18 2003
From: sjmachin at lexicon.net (John Machin)
Date: Thu, 13 Feb 2003 22:13:18 +1100
Subject: [Csv] non-portable initialisation of types in _csv.c
Message-ID: <oprki5ggtbm50ryr@localhost>

static PyTypeObject Dialect_Type = {
   /* aarrgghh PyObject_HEAD_INIT(&PyType_Type) */
   PyObject_HEAD_INIT(NULL)
   0,                                      /* ob_size */

-- 
 

From skip at pobox.com  Thu Feb 13 15:58:18 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 13 Feb 2003 08:58:18 -0600
Subject: [Csv] non-portable initialisation of types in _csv.c
In-Reply-To: <oprki5ggtbm50ryr@localhost>
References: <oprki5ggtbm50ryr@localhost>
Message-ID: <15947.45706.139024.281581@montanaro.dyndns.org>


    John> static PyTypeObject Dialect_Type = {
    John>    /* aarrgghh PyObject_HEAD_INIT(&PyType_Type) */
    John>    PyObject_HEAD_INIT(NULL)
    John>    0,                                      /* ob_size */

John,

Thanks, is this a Windows thing?  What about the head initializer for the
Reader_Type and Writer_Type types?

Skip

From skip at pobox.com  Thu Feb 13 20:00:36 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 13 Feb 2003 13:00:36 -0600
Subject: [Csv] trial zip/tar packages of csv module available
Message-ID: <15947.60244.162082.486394@montanaro.dyndns.org>


If you are interested in reading or writing CSV files from Python and you
have Python 2.2 or 2.3 available, please take a moment to download, extract
and install either or both of the following URLs:

    http://manatee.mojam.com/~skip/csv.tar.gz
    http://manatee.mojam.com/~skip/csv.zip

If you'd prefer, you can grab the files from the Python CVS sandbox:

    http://sourceforge.net/cvs/?group_id=5470
    http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/

Not included in the above zip/tgz files is the latest version of PEP 305.
You can view it here:

    http://www.python.org/peps/pep-0305.html

The goal is to get this package into Python 2.3, though we've tried to keep
it working under 2.2.  It uses iterators, so I don't know if it will work
with anything before 2.2.  The package has been built on Linux and Mac OS X
at this point.  I think it's been built on Windows though I'm not positive.
There shouldn't be anything terribly platform-dependent there.

To build and install, just do the usual distutils dance:

    python setup.py install

If you cd to the test subdirectory, you can run the 60 or so unit tests:

    cd test
    python test_csv.py

If your Python interpreter was configured using --with-pydebug it will run a
few memory leak tests.  If not it will let you know they are being skipped.
(If you try it both ways, make sure to delete the build subdirectory between
builds, otherwise you'll get link errors.)

Feedback is welcomed on both the package and the PEP, but please remember to
include csv at mail.mojam.com in your mail.

Thanks,

Skip

From sjmachin at lexicon.net  Thu Feb 13 23:16:51 2003
From: sjmachin at lexicon.net (John Machin)
Date: Fri, 14 Feb 2003 09:16:51 +1100
Subject: [Csv] trial zip/tar packages of csv module available
In-Reply-To: <15947.60244.162082.486394@montanaro.dyndns.org>
References: <15947.60244.162082.486394@montanaro.dyndns.org>
Message-ID: <oprkjz6df5m50ryr@localhost>

On Thu, 13 Feb 2003 13:00:36 -0600, Skip Montanaro <skip at pobox.com> wrote:

>
> If you are interested in reading or writing CSV files from Python and
> you have Python 2.2 or 2.3 available, please take a moment to download,
> extract and install either or both of the following URLs:
>
> http://manatee.mojam.com/~skip/csv.tar.gz
> http://manatee.mojam.com/~skip/csv.zip

> The goal is to get this package into Python 2.3, though we've tried to
> keep it working under 2.2.  It uses iterators, so I don't know if it
> will work with anything before 2.2.  The package has been built on
> Linux and Mac OS X at this point.  I think it's been built on Windows
> though I'm not positive.  There shouldn't be anything terribly
> platform-dependent there.
>

Good news first, whinges at the end of the message :-)

===
Compiles & installs OK out-of-the-box with Python 2.2, Windows 2000, BCC32 
(Borland 5.5 freebie command-line compiler) -- thanks to revision 1.30 :-)
===
C:\csv\test>python test_csv.py
*** skipping leakage tests ***
........................................................
----------------------------------------------------------------------
Ran 56 tests in 0.030s

OK
===
Slurped through a 150Mb CSV file at a reasonable speed without any memory 
leak that could be detected by the primitive method of watching the Task 
Manager memory graph.
===

Doco:

"""0.1.1 Module Contents
The csv module defines the following functions.
reader(iterable[, dialect='excel'][, fmtparam])
Return a reader object which will iterate over lines in the given 
csvfile."""

Huh? What "given csvfile"?
Need to define carefully what iterable.next() is expected to deliver; a 
line, with or without a trailing newline? a string of 1 or more bytes which 
may contain embedded line separators, either as true separators or as 
(quoted) data? [e.g. iterable could be a generator which uses say 
read(16384)]. I have noticed in the csv mailing list some muttering along 
the lines of "the iterable's underlying file must have been opened in 
binary mode"!? Que?

This might necessitate a FAQ entry:
>>> cr = csv.reader("iterable is string!")
>>> [x for x in cr]
[['i'], ['t'], ['e'], ['r'], ['a'], ['b'], ['l'], ['e'], [' '], ['i'],
['s'], [' '], ['s'], ['t'], ['r'], ['i'], ['n'], ['g'], ['!']]
>>>

===

Does the reader detect any errors at all? E.g. I expected some complaint 
here, instead of silently doing nothing:
>>> import csv
>>> cr = csv.reader(['f1,"unterminated quoted field,f3'])
>>> for x in cr: print x
...
>>> cr = csv.reader(['f1,"terminated quoted field",f3'])
>>> for x in cr: print x
...
['f1', 'terminated quoted field', 'f3']
>>> cr = csv.reader(['f1,"unterminated quoted field,f3\n'])
>>> for x in cr: print x
...
>>>
===

Judging by the fact that in _csv.c '\0' is passed around as a line-ending 
signal, it's not 8-bit-clean. This fact should be at least documented, if 
not fixed (which looks like a bit of a rewrite). Strange behaviour on 
embedded '\0' may worry not only pedants but also folk who are recipients 
of data files created by J. Random Boofhead III and friends.

===
Cheers,
John

From sjmachin at lexicon.net  Thu Feb 13 23:33:22 2003
From: sjmachin at lexicon.net (John Machin)
Date: Fri, 14 Feb 2003 09:33:22 +1100
Subject: [Csv] non-portable initialisation of types in _csv.c
In-Reply-To: <15947.45706.139024.281581@montanaro.dyndns.org>
References: <oprki5ggtbm50ryr@localhost>
	<15947.45706.139024.281581@montanaro.dyndns.org>
Message-ID: <oprkj0xwrkm50ryr@localhost>

On Thu, 13 Feb 2003 08:58:18 -0600, Skip Montanaro <skip at pobox.com> wrote:

>
> John> static PyTypeObject Dialect_Type = {
> John>    /* aarrgghh PyObject_HEAD_INIT(&PyType_Type) */
> John>    PyObject_HEAD_INIT(NULL)
> John>    0,                                      /* ob_size */
>
> John,
>
> Thanks, is this a Windows thing?
> What about the head initializer for the
> Reader_Type and Writer_Type types?

My understanding is this:

The offending code is strictly not correct C -- the initialiser is not a 
constant; it's the address of a gadget not declared in the current source 
file. However some compiler/linker combinations can nut it out. Some 
compilers take advantage of this; some can't or won't; Windows compilers 
seem to be in the can't or won't category.

> What about the head initializer for the
> Reader_Type and Writer_Type types?

Skip, what's sauce for the first goose is also sauce for the second and 
subsequent geese. You seem to have sauced all 3 birds in rev 1.30.

I notice that it seems to work without the "FooType.ob_type = 
&PyType_Type;" incantation in the module initialisation. Perhaps 
PyType_Ready() fixes this up.

Cheers,
John

-- 
 

From skip at pobox.com  Thu Feb 13 23:53:41 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 13 Feb 2003 16:53:41 -0600
Subject: [Csv] non-portable initialisation of types in _csv.c
In-Reply-To: <oprkj0xwrkm50ryr@localhost>
References: <oprki5ggtbm50ryr@localhost>
        <15947.45706.139024.281581@montanaro.dyndns.org>
        <oprkj0xwrkm50ryr@localhost>
Message-ID: <15948.8693.867705.309628@montanaro.dyndns.org>


    John> I notice that it seems to work without the "FooType.ob_type =
    John> &PyType_Type;" incantation in the module initialisation. Perhaps
    John> PyType_Ready() fixes this up.

Yes, that's one of the things it does.  Perhaps it would have been better
named as "PyType_MakeReady".

Thanks for the other feedback as well.  I'll let Dave and Andrew mull over
that stuff.  The issue of 8-bit next-to-godliness will probably have to be
addressed once Unicode is tackled.

Skip

From andrewm at object-craft.com.au  Fri Feb 14 07:03:18 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 14 Feb 2003 17:03:18 +1100
Subject: [Csv] non-portable initialisation of types in _csv.c 
In-Reply-To: Message from John Machin <sjmachin@lexicon.net> 
   of "Fri, 14 Feb 2003 09:33:22 +1100." <oprkj0xwrkm50ryr@localhost> 
References: <oprki5ggtbm50ryr@localhost>
	<15947.45706.139024.281581@montanaro.dyndns.org>  <oprkj0xwrkm50ryr@localhost>
	
Message-ID: <20030214060319.F0D933CC5D@coffee.object-craft.com.au>

>>> John>    /* aarrgghh PyObject_HEAD_INIT(&PyType_Type) */
[...]
>>The offending code is strictly not correct C -- the initialiser is not a 
>>constant; it's the address of a gadget not declared in the current source 
>>file. However some compiler/linker combinations can nut it out. Some 
>>compilers take advantage of this; some can't or won't; Windows compilers 
>>seem to be in the can't or won't category.

Indeed - thanks for picking this up. I can only assume it was a
cut-n-paste accident, because it was originally PyObject_HEAD_INIT(0).

>>Skip, what's sauce for the first goose is also sauce for the second and 
>>subsequent geese. You seem to have sauced all 3 birds in rev 1.30.

Thanks Skip - all three needed doing.

>>I notice that it seems to work without the "FooType.ob_type = 
>>&PyType_Type;" incantation in the module initialisation. Perhaps 
>>PyType_Ready() fixes this up.
>
>Yes, that's one of the things it does.  Perhaps it would have been better
>named as "PyType_MakeReady".

And the consequences of *not* calling PyType_Ready() are particularly
obscure. There's enough information to allow the Python core to assert
if a type hasn't been finalised - I wonder why it doesn't?

>The issue of 8-bit next-to-godliness will probably have to be
>addressed once Unicode is tackled.

Definitely not this go around, anyway. I doubt its lack is a big deal
(lack of unicode is a bigger deal) - since CSV is a text format, finding
a null in the input would be very unusual (and I wouldn't be surprised
if excel choked too... 8-).

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Fri Feb 14 07:11:30 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 14 Feb 2003 17:11:30 +1100
Subject: [Csv] trial zip/tar packages of csv module available 
In-Reply-To: Message from John Machin <sjmachin@lexicon.net> 
   of "Fri, 14 Feb 2003 09:16:51 +1100." <oprkjz6df5m50ryr@localhost> 
References: <15947.60244.162082.486394@montanaro.dyndns.org>
	<oprkjz6df5m50ryr@localhost> 
Message-ID: <20030214061130.163773CC5D@coffee.object-craft.com.au>

>Slurped through a 150Mb CSV file at a reasonable speed without any memory 
>leak that could be detected by the primitive method of watching the Task 
>Manager memory graph.

I've been using a --enable-pydebug version of python while working on the
_csv module, and have been watching the reference counts fairly carefully.
While it's still possible there are reference leaks, I'd expect them to
be in code off the main path (exception handling, etc, although I watched
these carefully too).

>"""0.1.1 Module Contents
>The csv module defines the following functions.
>reader(iterable[, dialect="excel" ] [, fmtparam])
>Return a reader object which will iterate over lines in the given 
>csvfile."""
>
>Huh? What "given csvfile"?
>Need to define carefully what iterable.next() is expected to deliver; a 
>line, with or without a trailing newline? 

In the docstring, I changed this to:

    The "iterable" argument can be any object that returns a line
    of input for each iteration, such as a file object or a list.  The
    optional "dialect" parameter is discussed below.  The function
    also accepts optional keyword arguments which override settings
    provided by the dialect.
    
    The returned object is an iterator.  Each iteration returns a row
    of the CSV file (which can span multiple input lines):

Do you think this is clearer?

The reader will cope with a file opened binary or not - it *should*
do the right thing in either case.

>This might necessitate a FAQ entry:
>>>> cr = csv.reader("iterable is string!")
>>>> [x for x in cr]
>[['i'], ['t'], ['e'], ['r'], ['a'], ['b'], ['l'], ['e'], [' '], ['i'], 
>['s'], [' '], ['s'], ['t'], ['r'], ['i'], ['n'], ['g'], ['!']
>]

I don't think there is ever a case where you would want the input
iterable to be a string - I could probably just raise an exception if
it is?
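Raising on a bare string could be as simple as a type check in a thin
wrapper - a sketch of the idea only, not the module's actual behaviour (and
under Python 2 the check would be against basestring rather than str):

```python
import csv

def safe_reader(iterable, *args, **kwargs):
    """Hypothetical wrapper: reject a bare string, which the reader would
    otherwise iterate character by character."""
    if isinstance(iterable, str):
        raise TypeError('expected an iterable of lines, got a string')
    return csv.reader(iterable, *args, **kwargs)

try:
    safe_reader('iterable is string!')
    rejected = False
except TypeError:
    rejected = True     # the surprising one-character rows never happen
```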

>Does the reader detect any errors at all? E.g. I expected some complaint 
>here, instead of silently doing nothing:
>>>> import csv
>>>> cr = csv.reader(['f1,"unterminated quoted field,f3'])
>>>> for x in cr: print x
>...
>>>> cr = csv.reader(['f1,"terminated quoted field",f3'])
>>>> for x in cr: print x
>...
>['f1', 'terminated quoted field', 'f3']
>>>> cr = csv.reader(['f1,"unterminated quoted field,f3\n'])
>>>> for x in cr: print x
>...

That's a hang-over from the old Object Craft csv module (where it was
the user's problem), and you are right - it needs to be fixed. I'll look
into it shortly. Thanks for picking it up.

>Judging by the fact that in _csv.c '\0' is passed around as a line-ending 
>signal, it's not 8-bit-clean. This fact should be at least documented, if 
>not fixed (which looks like a bit of a rewrite). Strange behaviour on 
>embedded '\0' may worry not only pedants but also folk who are recipients 
>of data files created by J. Random Boofhead III and friends.

Yep - Skip - can you doco the fact that the input should not contain null
characters or be unicode strings?

Null characters in the input will be treated as newlines, if I remember
correctly.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Fri Feb 14 07:14:58 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 14 Feb 2003 00:14:58 -0600
Subject: [Csv] Out of town, BDFL pronouncement, incorporation, Unicode
Message-ID: <15948.35170.966135.741531@montanaro.dyndns.org>

Folks,

I'll be at work Friday, but will be leaving Saturday for warm, sunny Mexico
for a week of r&r away from Chicago's chilly climate.  The latest version of
the code and PEP are "out there", hopefully getting poked and prodded a bit.

Assuming nothing earth-shattering develops by mid-week, would one of you
like to propose on python-dev that Guido pronounce on the PEP and give a
thumbs-up or -down on the module?  I can take care of merging it into the
Python distribution (stitch it into setup.py, the test directory and the
libref manual) when I return.

Any thoughts from Dave and Andrew about Unicode?  Marc André Lemburg (or was
it Martin von Löwis?) suggested just encoding Unicode as utf-8.  Someone
else (Fredrik Lundh I believe) suggested a double-compilation scheme such as
Modules/_sre.c uses.  One pass gets you 8-bit characters, the other wide
characters.  Presumably, the correct state machine to execute would be
chosen based upon the input data types.
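The utf-8 suggestion amounts to encoding every field on the way in and
decoding on the way out, so the parser only ever sees byte strings. A rough
sketch of the round trip (helper names are mine, purely illustrative):

```python
def encode_row(row, encoding='utf-8'):
    """Encode each unicode field to a byte string before parsing/writing."""
    return [field.encode(encoding) for field in row]

def decode_row(row, encoding='utf-8'):
    """Decode the parser's byte-string fields back to unicode."""
    return [field.decode(encoding) for field in row]

row = ['caf\xe9', 'na\xefve']           # fields with non-ASCII characters
round_tripped = decode_row(encode_row(row))
```

Since utf-8 never produces the ASCII bytes for the delimiter, quote or
newline characters inside a multi-byte sequence, the 8-bit state machine can
parse the encoded data unchanged; the cost is the extra encode/decode pass.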

Skip

From skip at pobox.com  Fri Feb 14 07:17:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 14 Feb 2003 00:17:27 -0600
Subject: [Csv] non-portable initialisation of types in _csv.c 
In-Reply-To: <20030214060319.F0D933CC5D@coffee.object-craft.com.au>
References: <oprki5ggtbm50ryr@localhost>
        <15947.45706.139024.281581@montanaro.dyndns.org>
        <oprkj0xwrkm50ryr@localhost>
        <20030214060319.F0D933CC5D@coffee.object-craft.com.au>
Message-ID: <15948.35319.353684.464773@montanaro.dyndns.org>


    >> The issue of 8-bit next-to-godliness will probably have to be
    >> addressed once Unicode is tackled.

    Andrew> Definitely not this go around, anyway. I doubt its lack is a
    Andrew> big deal (lack of unicode is a bigger deal) - since CSV is a
    Andrew> text format, finding a null in the input would be very unusual
    Andrew> (and I wouldn't be surprised if excel choked too... 8-).

Don't forget that Excel's "Unicode Text" format seems to dump into utf-16,
which is littered with NUL characters (roughly every other character in the
common case where all your text is representable as ascii).  Moral of the
story: If Unicode is important, NUL characters will be important.
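To make the point concrete, a quick sketch (using Python's utf-16-le codec; purely illustrative, not code from the thread):

```python
# For ASCII-range text, UTF-16 output is roughly half NUL bytes:
data = "abc".encode("utf-16-le")
nul_count = data.count(0)  # one NUL byte per ASCII character
```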

Skip

From skip at pobox.com  Fri Feb 14 07:19:22 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 14 Feb 2003 00:19:22 -0600
Subject: [Csv] trial zip/tar packages of csv module available 
In-Reply-To: <20030214061130.163773CC5D@coffee.object-craft.com.au>
References: <15947.60244.162082.486394@montanaro.dyndns.org>
        <oprkjz6df5m50ryr@localhost>
        <20030214061130.163773CC5D@coffee.object-craft.com.au>
Message-ID: <15948.35434.750651.855382@montanaro.dyndns.org>


    Andrew> Yep - Skip - can you doco the fact that the input should not
    Andrew> contain null characters or be unicode strings?

Will do.

Skip

From andrewm at object-craft.com.au  Fri Feb 14 07:34:45 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 14 Feb 2003 17:34:45 +1100
Subject: [Csv] Out of town, BDFL pronouncement, incorporation, Unicode 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15948.35170.966135.741531@montanaro.dyndns.org> 
References: <15948.35170.966135.741531@montanaro.dyndns.org> 
Message-ID: <20030214063445.3A0A73CC5D@coffee.object-craft.com.au>

>Assuming nothing earth-shattering develops by mid-week, would one of you
>like to propose on python-dev that Guido pronounce on the PEP and give a
>thumbs-up or -down on the module?  I can take care of merging it into the
>Python distribution (stitch it into setup.py, the test directory and the
>libref manual) when I return.

Okay.

>Any thoughts from Dave and Andrew about Unicode?  Marc André Lemburg (or was
>it Martin von Löwis?) suggested just encoding Unicode as utf-8.  Someone
>else (Fredrik Lundh I believe) suggested a double-compilation scheme such as
>Modules/_sre.c uses.  One pass gets you 8-bit characters, the other wide
>characters.  Presumably, the correct state machine to execute would be
>chosen based upon the input data types.

What little I know about utf-8 suggests that the current module should be
safe - nulls won't appear, and subsequent bytes in multi-byte characters
all have their high bit set. None of the special characters can be a
unicode character, of course. The user could do something like:

    csv.reader([line.encode('utf-8') for line in lines])

I think the unicode files emitted by Excel are actually utf-8 encoded,
so this won't even be necessary - the user will just have to decode each
field with the utf-8 codec.

Proper unicode support is something we probably should do (the user
might have a UCS-2 encoded file, etc), but it won't happen in the next
week or so.
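A sketch of that round trip (illustrative only; it assumes a reader that accepts decoded text, which is how the module eventually behaved):

```python
import csv

# UTF-8 is safe to pass through an 8-bit parser: it contains no NULs,
# and no delimiter or quote character can occur inside a multi-byte
# sequence, so the fields come back intact after decoding.
raw = "José,Zürich\r\n".encode("utf-8")         # bytes as stored on disk
rows = list(csv.reader([raw.decode("utf-8")]))  # decode, then parse
```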

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Fri Feb 14 07:37:07 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 14 Feb 2003 17:37:07 +1100
Subject: [Csv] non-portable initialisation of types in _csv.c 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15948.35319.353684.464773@montanaro.dyndns.org> 
References: <oprki5ggtbm50ryr@localhost>
	<15947.45706.139024.281581@montanaro.dyndns.org> <oprkj0xwrkm50ryr@localhost>
	<20030214060319.F0D933CC5D@coffee.object-craft.com.au>
	<15948.35319.353684.464773@montanaro.dyndns.org> 
Message-ID: <20030214063707.D883A3CC5D@coffee.object-craft.com.au>

>    >> The issue of 8-bit next-to-godliness will probably have to be
>    >> addressed once Unicode is tackled.
>
>    Andrew> Definitely not this go around, anyway. I doubt its lack is a
>    Andrew> big deal (lack of unicode is a bigger deal) - since CSV is a
>    Andrew> text format, finding a null in the input would be very unusual
>    Andrew> (and I wouldn't be surprised if excel choked too... 8-).
>
>Don't forget that Excel's "Unicode Text" format seems to dump into utf-16,
>which is littered with NUL characters (roughly every other character in the
>common case where all your text is representable as ascii).  Moral of the
>story: If Unicode is important, NUL characters will be important.

If that's so, you'd have to convert the input to utf-8 first - even
without the null issue, there would be plenty of other issues feeding
16 bit input to an 8 bit parser... 8-)

Once the internals have been modified to support python's internal
unicode representation, the current null handling could even stay... 8-)

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Fri Feb 14 08:06:10 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 14 Feb 2003 01:06:10 -0600
Subject: [Csv] ignore blank lines?
In-Reply-To: <oprkh0ixoom50ryr@localhost>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
        <m31y2e5sld.fsf@ferret.object-craft.com.au>
        <15945.37976.592926.369940@montanaro.dyndns.org>
        <oprkghsll3m50ryr@localhost>
        <15946.25610.202929.623399@montanaro.dyndns.org>
        <oprkh0ixoom50ryr@localhost>
Message-ID: <15948.38242.974158.425677@montanaro.dyndns.org>

    John> ... I would say make it easier for the caller to detect this
    John> situation ...

    John> if row == []: return {}

Except the way the DictReader works (and the way I intended for it to work)
is that you specify a default value when creating a reader.  When you
encounter a short row, the missing keys are all added to the dictionary,
each associated with the default.  An empty dict should never be returned.
You're really trying to treat the CSV file as a table where each row has a
constant number of columns.

This makes sense if you think of this as analogous to using the DB API to
fetch rows from a database table as dictionaries (e.g., c.dictfetchall()
with psycopg or using cursorclass=DictCursor with MySQLdb).  You would
never get an empty dictionary (or sequence, for that matter) corresponding
to an individual row of results.  Either it's there with content, or it's
not there at all.  It's never there and empty.
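For reference, the shipped DictReader exposes this default as its restval parameter; a minimal sketch:

```python
import csv

# Missing keys in a short row are filled with restval; the reader never
# returns an empty dict for a short row.
reader = csv.DictReader(["x,y,z", "1,2,3", "4,5"], restval="MISSING")
rows = list(reader)  # the second data row gets z filled in
```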

I don't have Excel handy at the moment, but I just tried a little experiment
with gnumeric.  I entered "abc", "def", and "ghi" in the first three cells
of row 1, jumped down to row 3 and entered "123", "456" and "789" in the
first three cells of that row.  I then dumped it as CSV.  Here's the result:

    abc,def,ghi
    ,,
    123,456,789

Can someone try this with Excel or some other spreadsheet (I'll try
Appleworks in the morning if it occurs to me before I rush out the door)?
Does it produce truly blank lines or does it prevent that by inserting one
or more field separators?

Skip

From sjmachin at lexicon.net  Fri Feb 14 11:30:31 2003
From: sjmachin at lexicon.net (John Machin)
Date: Fri, 14 Feb 2003 21:30:31 +1100
Subject: [Csv] trial zip/tar packages of csv module available 
In-Reply-To: <20030214061130.163773CC5D@coffee.object-craft.com.au>
References: <15947.60244.162082.486394@montanaro.dyndns.org>
	<oprkjz6df5m50ryr@localhost>
	<20030214061130.163773CC5D@coffee.object-craft.com.au>
Message-ID: <oprkkx45gmm50ryr@localhost>

On Fri, 14 Feb 2003 17:11:30 +1100, Andrew McNamara <andrewm at object-craft.com.au> wrote:

>> Slurped through a 150Mb CSV file at a reasonable speed without any 
>> memory leak that could be detected by the primitive method of watching 
>> the Task Manager memory graph.
>
> I've been using a --enable-pydebug version of python while working on the
> _csv module, and have been watching the reference counts fairly 
> carefully.

Yes, I'd gathered that from various asides in messages on this list. I was 
just being a little ironical about my own primitive way of checking.

>
>> """0.1.1 Module Contents
>> The csv module defines the following functions.
>> reader(iterable[, dialect="excel" ] [, fmtparam])
>> Return a reader object which will iterate over lines in the given 
>> csvfile."""
>>
>> Huh? What "given csvfile"?
>> Need to define carefully what iterable.next() is expected to deliver; a 
>> line, with or without a trailing newline?
>
> In the docstring, I changed this to:
>
> The "iterable" argument can be any object that returns a line
> of input for each iteration, such as a file object or a list.  The
> optional "dialect" parameter is discussed below.  The function
> also accepts optional keyword arguments which override settings
> provided by the dialect.
> The returned object is an iterator.  Each iteration returns a row
> of the CSV file (which can span multiple input lines):

There is not necessarily a file involved --- say "returns a row of CSV 
data"

>
> Do you think this is clearer?

Frankly, no. You've dropped the "given csvfile" (almost), but you haven't 
said whether a "line" is expected to be terminated, and if so with what: 
(a) \n irrespective of platform (b) platform's native terminator (c) \r or 
\r\n or \n (don't care which).

My guess is that if the "line" is terminated by \r or \r\n or \n, you'll 
ignore the terminator, and if it's not terminated at all, then there's 
nothing to ignore, and happiness prevails. Am I correct?

>
> The reader will cope with a file opened binary or not - it *should*
> do the right thing in either case.

The reader doesn't know what the iterable is iterating over. The behaviour 
should be defined in terms of what the reader expects iterable.next() to 
deliver.

>
>> This might necessitate a FAQ entry:
>>>>> cr = csv.reader("iterable is string!")
>>>>> [x for x in cr]
>> [['i'], ['t'], ['e'], ['r'], ['a'], ['b'], ['l'], ['e'], [' '], ['i'], 
>> ['s'], [' '], ['s'], ['t'], ['r'], ['i'], ['n'], ['g'], ['!']
>> ]
>
> I don't think there is ever a case where you would want the input
> iterable to be a string - I could probably just raise an exception if
> it is?

You certainly wouldn't want the behaviour demonstrated above. However the 
punter may get confused and go cr = csv.reader(file("raboof.csv").read())
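A hypothetical guard of the kind being suggested (safe_reader is an invented name, not part of the module):

```python
import csv

def safe_reader(iterable, **kwds):
    # Reject a bare string so the one-character-per-row surprise shown
    # earlier cannot happen silently.
    if isinstance(iterable, str):
        raise TypeError("expected an iterable of lines, not a string")
    return csv.reader(iterable, **kwds)
```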

>
>> Judging by the fact that in _csv.c '\0' is passed around as a line- 
>> ending signal, it's not 8-bit-clean. This fact should be at least 
>> documented, if not fixed (which looks like a bit of a rewrite). Strange 
>> behaviour on embedded '\0' may worry not only pedants but also folk who 
>> are recipients of data files created by J. Random Boofhead III and 
>> friends.
>
> Yep - Skip - can you doco the fact that the input should not contain null
> characters or be unicode strings?
>
> Null characters in the input will be treated as newlines, if I remember
> correctly.

Docoing that would be useful as well.

Cheers,
John


-- 
 

From djc at object-craft.com.au  Fri Feb 14 14:39:28 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 15 Feb 2003 00:39:28 +1100
Subject: [Csv] trial zip/tar packages of csv module available
In-Reply-To: <oprkkx45gmm50ryr@localhost>
References: <15947.60244.162082.486394@montanaro.dyndns.org>
	<oprkjz6df5m50ryr@localhost>
	<20030214061130.163773CC5D@coffee.object-craft.com.au>
	<oprkkx45gmm50ryr@localhost>
Message-ID: <m3n0kzf0bz.fsf@ferret.object-craft.com.au>

>>>>> "John" == John Machin <sjmachin at lexicon.net> writes:

John> Frankly, no. You've dropped the "given csvfile" (almost), but
John> you haven't said whether a "line" is expected to be terminated,
John> and if so with what: (a) \n irrespective of platform (b)
John> platform's native terminator (c) \r or \r\n or \n (don't care
John> which).

John> My guess is that if the "line" is terminated by \r or \r\n or
John> \n, you'll ignore the terminator, and if it's not terminated at
John> all, then there's nothing to ignore, and happiness prevails. Am
John> I correct?

Almost.

Since the parser expects you to deliver a sequence of lines via an
iterable, it requires that line termination be at the end of any
string supplied.  The parser will raise an exception if any characters
follow a line terminator on any individual line string.

- Dave

-- 
http://www.object-craft.com.au


From sjmachin at lexicon.net  Fri Feb 14 20:44:00 2003
From: sjmachin at lexicon.net (John Machin)
Date: Sat, 15 Feb 2003 06:44:00 +1100
Subject: [Csv] ignore blank lines?
In-Reply-To: <15949.16699.749757.280021@montanaro.dyndns.org>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
	<m31y2e5sld.fsf@ferret.object-craft.com.au>
	<15945.37976.592926.369940@montanaro.dyndns.org>
	<oprkghsll3m50ryr@localhost>
	<15946.25610.202929.623399@montanaro.dyndns.org>
	<oprkh0ixoom50ryr@localhost>
	<15948.38242.974158.425677@montanaro.dyndns.org>
	<oprkkuuarym50ryr@localhost> <15949.16699.749757.280021@montanaro.dyndns.org>
Message-ID: <oprklnrmdgm50ryr@localhost>

On Fri, 14 Feb 2003 13:19:23 -0600, Skip Montanaro <skip at pobox.com> wrote:

>
> >> Except the way the DictReader works (and the way I intended for it to
> >> work) is that you specify a default value when creating a reader.
> >> When you encounter a short row, the missing keys are all added to the
> >> dictionary, each associated with the default.
>
> John> What software have you met that actually outputs physically short
> John> rows in an environment where you are expecting a constant number
> John> of columns?
>
> Aside from Firewall-1's logfile exporter, none.  It generates a bunch of
> rows with a constant number of fields, then mysteriously appends three 
> blank
> lines (no commas) to the end.  The change I implemented to DictReader was
> precisely because of this (broken, in my opinion) behavior.  I doubt a
> database worth its salt would do anything like that.
>
> John> This is in fact in agreement with the point that I was trying to
> John> make: a completely empty line (as well as a line containing
> John> ",,,,,,,,") is unexpected and/or meaningless in your
> John> dictionary/database paradigm. We just need to agree on whether
> John> such a line should be silently jettisoned, or an easy-to-detect
> John> value should be returned to the caller.
>
> Well, I would argue that a row of commas just means a row of empty 
> strings.

It can mean that the database has a row with all values NULL, or some other
equally disturbing circumstance.

> Other than that, I agree, I wouldn't expect blank lines or lines with too
> few columns from properly functioning programs which are supposed to dump
> rows with constant numbers of columns.

Exactly. Which makes me wonder why you have implemented defaults for short 
rows.

>
> I guess my Python aphorism for the day is "Practicality beats purity."

I don't understand this comment. You are advocating (in fact have 
implemented) hiding disturbing circumstances from the callers. Do you 
classify this as practical or pure?

From sjmachin at lexicon.net  Fri Feb 14 23:48:33 2003
From: sjmachin at lexicon.net (John Machin)
Date: Sat, 15 Feb 2003 09:48:33 +1100
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar packages
	of csv module available)
In-Reply-To: <oprkkx45gmm50ryr@localhost>
References: <15947.60244.162082.486394@montanaro.dyndns.org>
	<oprkjz6df5m50ryr@localhost>
	<20030214061130.163773CC5D@coffee.object-craft.com.au>
	<oprkkx45gmm50ryr@localhost>
Message-ID: <oprklwa7j9m50ryr@localhost>

[John Machin]
>>> Judging by the fact that in _csv.c '\0' is passed around as a line- 
>>> ending signal, it's not 8-bit-clean. This fact should be at least 
>>> documented, if not fixed (which looks like a bit of a rewrite). Strange 
>>> behaviour on embedded '\0' may worry not only pedants but also folk who 
>>> are recipients of data files created by J. Random Boofhead III and 
>>> friends.

[Andrew McNamara]
>> Yep - Skip - can you doco the fact that the input should not contain 
>> null
>> characters or be unicode strings?
>>
>> Null characters in the input will be treated as newlines, if I remember
>> correctly.
>

[John Machin]
> Docoing that would be useful as well.

[and it's me again:]

Actually it doesn't quite treat a NUL exactly like a newline; it throws 
data away without any warning; see below.

>>> import csv
>>> guff = ["aaa\0bbb", "x\0\0y"]
>>> [x for x in csv.reader(guff)]
[['aaa'], ['x']]
>>> guff2 = ["aaa\nbbb", "x\n\ny"]
>>> [x for x in csv.reader(guff2)]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
_csv.Error: newline inside string
>>>
 

From skip at pobox.com  Sat Feb 15 01:54:08 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 14 Feb 2003 18:54:08 -0600
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar
	packages        of csv module available)
In-Reply-To: <oprklwa7j9m50ryr@localhost>
References: <15947.60244.162082.486394@montanaro.dyndns.org>
        <oprkjz6df5m50ryr@localhost>
        <20030214061130.163773CC5D@coffee.object-craft.com.au>
        <oprkkx45gmm50ryr@localhost>
        <oprklwa7j9m50ryr@localhost>
Message-ID: <15949.36784.151916.149873@montanaro.dyndns.org>


    John> Actually it doesn't quite treat a NUL exactly like a newline; it
    John> throws data away without any warning; see below.

This is to be expected I think, considering C strings are being manipulated
at the low level.  I just added a check to _csv.c and an extra test.  It now
raises csv.Error if the file being read contains NUL bytes.  (Should an
exception be raised on output as well?)

Skip

From sjmachin at lexicon.net  Sat Feb 15 05:31:31 2003
From: sjmachin at lexicon.net (John Machin)
Date: Sat, 15 Feb 2003 15:31:31 +1100
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar
	packages of csv module available)
In-Reply-To: <15949.36784.151916.149873@montanaro.dyndns.org>
References: <15947.60244.162082.486394@montanaro.dyndns.org>
	<oprkjz6df5m50ryr@localhost>
	<20030214061130.163773CC5D@coffee.object-craft.com.au>
	<oprkkx45gmm50ryr@localhost>        <oprklwa7j9m50ryr@localhost>
	<15949.36784.151916.149873@montanaro.dyndns.org>
Message-ID: <oprkmb6tgmm50ryr@localhost>

On Fri, 14 Feb 2003 18:54:08 -0600, Skip Montanaro <skip at pobox.com> wrote:

>
> John> Actually it doesn't quite treat a NUL exactly like a newline; it
> John> throws data away without any warning; see below.
>
> This is to be expected I think, considering C strings are being 
> manipulated
> at the low level.  I just added a check to _csv.c and an extra test.  It 
> now
> raises csv.Error if the file being read contains NUL bytes.  (Should an
> exception be raised on output as well?)

Yes, but conditionally -- IMHO the caller should be able to specify 
(strictwriting=True) that an exception should be raised on *any* attempt to 
write data that could not be read back "sensibly" using the same dialect 
etc. Getting exceptions or a different number of rows or columns when the 
data are read back is certainly not "sensible". This general regime would 
allow someone who must produce (say) a non-quoted "|"-delimited file format 
to verify that there were no "|" in the data. OTOH the caller can specify 
strictwriting=False if it's a "you asked for it, you got it" situation.

Cheers,
John

From adalke at mindspring.com  Sat Feb 15 09:24:45 2003
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat, 15 Feb 2003 01:24:45 -0700
Subject: [Csv] csv
Message-ID: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>

Hi,

  I tried out the csv module Skip recently made reference to on c.l.py.
I'm afraid I didn't read the docs too clearly -- wanted to see if I could
figure out how to use the module without documentation ;)

  Anyway, my file formats are either space delimited (no quotes --
the following works: infile.readline().split(' ')) or tab delimited.  (Note,
btw, that that is not split() and two adjacent spaces means there is
an empty field.)

I wanted to make a "space" dialect.  I thought the following would
work, but it didn't.

>>> class Space(csv.Dialect):
...     delimiter = " "
...     quotechar = False
...     escapechar = False
...     doublequote = False
...     skipinitialspace = False
...     lineterminator = "\n"
...     quoting = csv.QUOTE_NONE
... 
>>> Space()
<__main__.Space instance at 0x162ff8>
>>> csv.register_dialect("space", Space)
>>> csv.reader(open("/home/mug/test.smi"))
<_csv.reader object at 0x1df9c0>
>>> q=_
>>> for a in q:
...     pass
... 
>>> a
['c1ccccc1 benzene']
>>> len(a)
1
>>> print open("/home/mug/test.smi").read()
c1ccccc1 benzene

>>> 

Also, suppose for my own project I have a "SpaceDialect".
The current API requires a global registry for that dialect.
I don't like the chance of clobbering, though I know it to be
rare.  Would the ability to pass

   dialect = SpaceDialect

(that is, a Dialect subclass) rather than the name be
an appropriate addition to the API?

My apologies for not spending much time on this.
I need to catch a plane in a couple of hours. :(

                    Andrew
                    dalke at dalkescientific.com


From skip at pobox.com  Sat Feb 15 12:07:18 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 15 Feb 2003 05:07:18 -0600
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar
        packages of csv module available)
In-Reply-To: <oprkmb6tgmm50ryr@localhost>
References: <15947.60244.162082.486394@montanaro.dyndns.org>
        <oprkjz6df5m50ryr@localhost>
        <20030214061130.163773CC5D@coffee.object-craft.com.au>
        <oprkkx45gmm50ryr@localhost>
        <oprklwa7j9m50ryr@localhost>
        <15949.36784.151916.149873@montanaro.dyndns.org>
        <oprkmb6tgmm50ryr@localhost>
Message-ID: <15950.8038.440199.244791@montanaro.dyndns.org>

(Last message before leaving for the plane...)

    >> (Should an exception be raised on output as well?)

    John> Yes, but conditionally -- IMHO the caller should be able to
    John> specify (strictwriting=True) that an exception should be raised on
    John> *any* attempt to write data that could not be read back
    John> "sensibly"...

I believe the issue of reading/writing NUL bytes is just a temporary
limitation of the current implementation.  It will be fixed in the future
(it has to, because some Unicode encodings will read or write NULs in the
data stream), so we don't need to get very elaborate with our handling of
NULs.  For now, simply raising an exception should suffice.

Skip

From sjmachin at lexicon.net  Sat Feb 15 19:14:04 2003
From: sjmachin at lexicon.net (John Machin)
Date: Sun, 16 Feb 2003 05:14:04 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
Message-ID: <oprknd9q0tm50ryr@localhost>

On Sat, 15 Feb 2003 01:24:45 -0700, Andrew Dalke <adalke at mindspring.com> 
wrote:

> Anyway, my file formats are either space delimited (no quotes --
> the following works: infile.readline().split(' ')) or tab delimited.  
> (Note,
> btw, that that is not split() and two adjacent spaces means there is
> an empty field.)
>
> I wanted to make a "space" dialect.  I thought the following would
> work, but it didn't.
>
>>>> class Space(csv.Dialect):
> ...     delimiter = " "
> ...     quotechar = False
> ...     escapechar = False

These should be one-byte strings, not booleans.

> ...     doublequote = False
> ...     skipinitialspace = False
> ...     lineterminator = "\n"
> ...     quoting = csv.QUOTE_NONE
> ...
>>>> Space()
> <__main__.Space instance at 0x162ff8>
>>>> csv.register_dialect("space", Space)
>>>> csv.reader(open("/home/mug/test.smi"))

You need to tell the reader factory which dialect to use, if you don't want 
the default ("excel").
csv.reader(open("/home/mug/test.smi"), dialect="space")
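For completeness, a corrected sketch of the dialect: the character options must be one-character strings or None, never booleans, and the reader call must name the dialect.

```python
import csv

class Space(csv.Dialect):
    delimiter = " "
    quotechar = '"'      # a real character, not False
    escapechar = None    # None, not False
    doublequote = False
    skipinitialspace = False
    lineterminator = "\n"
    quoting = csv.QUOTE_NONE

csv.register_dialect("space", Space)
rows = list(csv.reader(["c1ccccc1 benzene"], dialect="space"))
```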

>
> Also, suppose for my own project I have a "SpaceDialect".
> The current API requires a global registry for that dialect.
> I don't like the chance of clobbering, though I know it to be
> rare.  Would the ability to pass
>
> dialect = SpaceDialect
>
> (that is, a Dialect subclass) rather than the name be
> an appropriate addition to the API?
>

Registration is not persistent. What is the use case for registering a 
dialect in one module and using it in a csv.reader() or writer() call in 
another module? If no use case, then registration is pointless, and the 
class could be passed as the dialect argument.

There are various problems brought out by Andrew's example; see attached 
file dalke.py

These are
(1) very obscure error message
   "TypeError: bad argument type for built-in operation"
caused by using quotechar = False instead of quotechar = None
Also this appears out of the reader() call, not the register_dialect() 
call!!!
*IF* there is a valid use case for registration, then the dialect should be 
validated then, not when used.
(2) says it needs quotechar != None even when quoting=QUOTE_NONE
(3) The "quoting" argument is honoured only by writers, not by readers -- 
i.e. in general you can't reliably read back a file that you've created and 
in particular to read Andrew D's files you need to set quotechar to some 
char that you hope is not in the input -- maybe '\0'.
(4) Maybe the whole dialect thing is a bit too baroque and Byzantine -- see 
example 5 in dalke.py. The **dict_of_arguments gadget offers the "don't 
need to type long list of arguments" advantage claimed for dialect classes, 
and you get the same obscure error message if you stuff up the type of an 
argument (see example 6) -- all of this without writing all that 
register/validate/etc code.

Maybe if we jump in quickly we could get an improved error message in the 
Python core for 2.3: at least identify which arg has the problem, and if 
lucky get it to say e.g. "expected <type x> given <type y>" and hey let's 
go for broke, how about which function is being called and even stop 
confusing the punters by calling functions in extension modules "built-in". 
This would benefit all Python users, not just csv users.

Cheers,
John
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dalke.py
Type: application/octet-stream
Size: 2742 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20030216/deb84579/attachment.obj 

From djc at object-craft.com.au  Sun Feb 16 11:59:21 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 16 Feb 2003 21:59:21 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: <oprknd9q0tm50ryr@localhost>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost>
Message-ID: <m3of5cqynq.fsf@ferret.object-craft.com.au>


> There are various problems brought out by Andrew's example; see
> attached file dalke.py
> 
> These are
> (1) very obscure error message "TypeError: bad argument type for
> built-in operation" caused by using quotechar = False instead of
> quotechar = None Also this appears out of the reader() call, not the
> register_dialect() call!!!  *IF* there is a valid use case for
> registration, then the dialect should be validated then, not when
> used.

+1 on that one.  I scratched my head for a while when seeing that
error too.  It wasn't until I read through the C code that the penny
dropped.

> (2) says it needs quotechar != None even when quoting=QUOTE_NONE

+1

> (3) The "quoting" argument is honoured only by writers, not by
> readers -- i.e. in general you can't reliably read back a file that
> you've created and in particular to read Andrew D's files you need
> to set quotechar to some char that you hope is not in the input --
> maybe '\0'.

Aside from the quote of '\0', I am not sure I follow what you mean.
If you set quoting so that it produces ambiguous output, that is hardly
the fault of the writer.

> (4) Maybe the whole dialect thing is a bit too baroque and Byzantine
> -- see example 5 in dalke.py. The **dict_of_arguments gadget offers
> the "don't need to type long list of arguments" advantage claimed
> for dialect classes, and you get the same obscure error message if
> you stuff up the type of an argument (see example 6) -- all of this
> without writing all that register/validate/etc code.

How much clearer would things be if the validation of dialects were
pulled up into the Python?  Being able to see the Python code which
raised the exception would be a huge help to the user.

- Dave

-- 
http://www.object-craft.com.au


From djc at object-craft.com.au  Sun Feb 16 12:04:59 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 16 Feb 2003 22:04:59 +1100
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar
	packages of csv module available)
In-Reply-To: <15950.8038.440199.244791@montanaro.dyndns.org>
References: <15947.60244.162082.486394@montanaro.dyndns.org>
	<oprkjz6df5m50ryr@localhost>
	<20030214061130.163773CC5D@coffee.object-craft.com.au>
	<oprkkx45gmm50ryr@localhost> <oprklwa7j9m50ryr@localhost>
	<15949.36784.151916.149873@montanaro.dyndns.org>
	<oprkmb6tgmm50ryr@localhost>
	<15950.8038.440199.244791@montanaro.dyndns.org>
Message-ID: <m3k7g0qyec.fsf@ferret.object-craft.com.au>

>>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:

Skip> (Last message before leaving for the plane...)
>>> (Should an exception be raised on output as well?)

John> Yes, but conditionally -- IMHO the caller should be able to
John> specify (strictwriting=True) that an exception should be raised
John> on *any* attempt to write data that could not be read back
John> "sensibly"...

Skip> I believe the issue of reading/writing NUL bytes is just a
Skip> temporary limitation of the current implementation.  It will be
Skip> fixed in the future (it has to, because some Unicode
Skip> encodings will read or write NULs in the data stream), so we
Skip> don't need to get very elaborate with our handling of NULs.  For
Skip> now, simply raising an exception should suffice.

The '\0' to indicate line termination is a hangover from my original
code.  There is no reason why the code could not just use '\n' to
signal end of line (like everyone else on the planet).

- Dave

-- 
http://www.object-craft.com.au


From sjmachin at lexicon.net  Sun Feb 16 12:11:46 2003
From: sjmachin at lexicon.net (John Machin)
Date: Sun, 16 Feb 2003 22:11:46 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: <oprknd9q0tm50ryr@localhost>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost>
Message-ID: <oprkopdwz4m50ryr@localhost>

On Sun, 16 Feb 2003 05:14:04 +1100, John Machin <sjmachin at lexicon.net> 
wrote:

> (4) Maybe the whole dialect thing is a bit too baroque and Byzantine -- 
> see example 5 in dalke.py. The **dict_of_arguments gadget offers the 
> "don't need to type long list of arguments" advantage claimed for dialect 
> classes, and you get the same obscure error message if you stuff up the 
> type of an argument (see example 6) -- all of this without writing all 
> that register/validate/etc code.

I was wrong; I guessed that _csv.c used PyArg_ParseTupleAndKeywords (PyArg_PTAK), but it doesn't -- it 
rams the Dialect (Python level) instance's attributes plus the csv.reader 
keyword arguments into a DialectType (C level) instance, the setattr being 
eventually done either by PyMember_SetOne, or in _csv.c itself -- in both 
cases, a type mismatch means a call to PyErr_BadArgument() which issues the 
obscure message "bad argument type for built-in operation".

PyArg_PTAK gives a more meaningful message if the required type is a single 
char, for example "argument 2 must be char, not int". However where the 
required type is int, you get "an integer is required" ... looks like a 
patch wouldn't go astray.

Cheers,
John



From sjmachin at lexicon.net  Sun Feb 16 20:43:27 2003
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 17 Feb 2003 06:43:27 +1100
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar
	packages of csv module available)
In-Reply-To: <m3k7g0qyec.fsf@ferret.object-craft.com.au>
References: <15947.60244.162082.486394@montanaro.dyndns.org>
	<oprkjz6df5m50ryr@localhost>
	<20030214061130.163773CC5D@coffee.object-craft.com.au>
	<oprkkx45gmm50ryr@localhost> <oprklwa7j9m50ryr@localhost>
	<15949.36784.151916.149873@montanaro.dyndns.org> <oprkmb6tgmm50ryr@localhost>
	<15950.8038.440199.244791@montanaro.dyndns.org>
	<m3k7g0qyec.fsf@ferret.object-craft.com.au>
Message-ID: <oprkpc2ppam50ryr@localhost>

On 16 Feb 2003 22:04:59 +1100, Dave Cole <djc at object-craft.com.au> wrote:

>>>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:
>
> Skip> (Last message before leaving for the plane...)
>>>> (Should an exception be raised on output as well?)
>
> John> Yes, but conditionally -- IMHO the caller should be able to
> John> specify (strictwriting=True) that an exception should be raised
> John> on *any* attempt to write data that could not be read back
> John> "sensibly"...
>
> Skip> I believe the issue of reading/writing NUL bytes is just a
> Skip> temporary limitation of the current implementation.  It will be
> Skip> fixed in the future (it has to be, because some Unicode
> Skip> encodings will read or write NULs in the data stream), so we
> Skip> don't need to get very elaborate with our handling of NULs.  For
> Skip> now, simply raising an exception should suffice.
>
Dave> The '\0' to indicate line termination is a hangover from my original
Dave> code.  There is no reason why the code could not just use '\n' to
Dave> signal end of line (like everyone else on the planet).

Are you sure? I had the impression that it was used as an out-of-band 
signal -- something that didn't appear in the data (you hope!) -- so that 
you could take exception to newlines that weren't at the end of line.

 

From sjmachin at lexicon.net  Sun Feb 16 23:32:09 2003
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 17 Feb 2003 09:32:09 +1100
Subject: [Csv] escapechar confusion
Message-ID: <oprkpkvvqwm50ryr@localhost>

Docstring:

"        csv.QUOTE_NONE means that quotes are never placed around 
fields.\n"
"    * escapechar - specifies a one-character string used to escape \n"
"        the delimiter when quoting is set to QUOTE_NONE.\n"
===
libcsv.tex [note especially the alleged treatment of escapechar when 
doublequote == False]:

\begin{memberdesc}[boolean]{doublequote}
Controls how instances of \var{quotechar} appearing inside a field should
themselves be quoted.  When \constant{True}, the character is doubled.
When \constant{False}, the \var{escapechar} must be a one-character string
which is used as a prefix to the \var{quotechar}.  It defaults to
\constant{True}.
\end{memberdesc}

\begin{memberdesc}{escapechar}
A one-character string used to escape the \var{delimiter} if \var{quoting}
is set to \constant{QUOTE_NONE}.  It defaults to \constant{None}.
\end{memberdesc}
===
My attempt at clarifying the requirements for fiddling the contents of each
field being written [in the examples, escapechar = '~' (to avoid
backslashorrhea), assuming delimiter = ',' and quotechar = '"']:

if quoting == QUOTE_NONE and escapechar is not None:
   escape the delimiter, lineterminator(s), and the escapechar itself
      Level 3, Macackie Mansions -> Level 3~, Macackie Mansions
      Level 3, "Macackie Mansions" -> Level 3~, "Macackie Mansions"
      Can~on Grando -> Can~~on Grando
   # This scheme is plausible, unambiguous and in fact more efficient
   # than the "standard" doubling-of-quotes scheme.
elif quoting != QUOTE_NONE and not doublequote:
   if escapechar is None:
      raise "..."
   escape the quotechar and the escapechar itself
   Note: there is no *need* to escape the delimiter or line terminators,
   as they are "covered" by the quoting.
      Level 3, Macackie Mansions -> "Level 3, Macackie Mansions"
      Level 3, "Macackie Mansions" -> "Level 3, ~"Macackie Mansions~""
      Can~on Grando -> "Can~~on Grando"
   # This scheme is bizarre (like some other CSV mutants) but at least
   # it doesn't cause ambiguity on input.
   # What software does this? Who sponsored its inclusion?
   # Does it need option(s) to cater for (redundantly) escaping
   # (a) the delimiter (b) line terminator(s)?
   # And it hasn't been implemented on output -- see below.
else:
   escapechar is not used
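The rules above can be turned into a small runnable sketch; escape_field
and its defaults are hypothetical names of mine, not part of the csv module:

```python
import csv

# Hypothetical sketch of the escaping rules described above -- not the
# actual _csv.c behaviour.  Defaults: escapechar '~', delimiter ',',
# quotechar '"'.
def escape_field(field, quoting, doublequote=True, escapechar='~',
                 delimiter=',', quotechar='"', lineterminators='\r\n'):
    if quoting == csv.QUOTE_NONE and escapechar is not None:
        # Escape the delimiter, line terminator(s) and the escapechar itself.
        specials = set(delimiter + lineterminators + escapechar)
        return ''.join(escapechar + c if c in specials else c for c in field)
    elif quoting != csv.QUOTE_NONE and not doublequote:
        if escapechar is None:
            raise ValueError("escapechar needed when doublequote is False")
        # Escape only the quotechar and the escapechar; the surrounding
        # quotes "cover" the delimiter and line terminators.
        specials = set(quotechar + escapechar)
        body = ''.join(escapechar + c if c in specials else c for c in field)
        return quotechar + body + quotechar
    else:
        return field  # escapechar is not used
```

For example, escape_field('Can~on Grando', csv.QUOTE_NONE) gives
'Can~~on Grando', matching the third example above.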
===
What _csv.c does on output:

>>> source = [123456, 'aaa,bbb', 'ccc,"ddd"', '"eee",fff', 9876.5]
>>> csv.writer(sys.stdout, escapechar="~", quoting=csv.QUOTE_NONE,
...            doublequote=False).writerow(source)
123456,aaa~,bbb,ccc~,"ddd","eee"~,fff,9876.5
# as expected
>>> csv.writer(sys.stdout, escapechar="~", quoting=csv.QUOTE_MINIMAL,
...            doublequote=False).writerow(source)
123456,"aaa,bbb","ccc,"ddd"",""eee",fff",9876.5
# No escaping done
===
What _csv.c does on input:

Firstly, the simple escape scheme:

>>> indata1 = ['123456,aaa~,bbb,ccc~,"ddd","eee"~,fff,9876.5']

>>> [x for x in csv.reader(indata1, escapechar="~", quoting=csv.QUOTE_NONE,
...                        doublequote=True)]
[['123456', 'aaa,bbb', 'ccc,"ddd"', 'eee~', 'fff', '9876.5']]
# wrong or confusing, QUOTE_NONE but still testing for quotechar at start of field

>>> [x for x in csv.reader(indata1, escapechar="~", quoting=csv.QUOTE_NONE,
...                        doublequote=False)]
[['123456', 'aaa,bbb', 'ccc,"ddd"', 'eee,fff', '9876.5']]
# wrong or confusing, QUOTE_NONE but still testing for quotechar at start of field

>>> [x for x in csv.reader(indata1, escapechar="~", quoting=csv.QUOTE_NONE,
...                        doublequote=False, quotechar=None)]
TypeError: bad argument type for built-in operation
# already grumbled about this

>>> [x for x in csv.reader(indata1, escapechar="~", quoting=csv.QUOTE_NONE,
...                        doublequote=False, quotechar="!")]
[['123456', 'aaa,bbb', 'ccc,"ddd"', '"eee",fff', '9876.5']]
# actual == expected

Secondly, the bizarre scheme (escaping the quotechar):

>>> indata2 = ['123456,aaa~,bbb,ccc~,"ddd","eee"~,fff,"ggg,~"hhh~"",iii-~"jjj~",9876.5']

>>> [x for x in csv.reader(indata2, escapechar="~", quoting=csv.QUOTE_MINIMAL,
...                        doublequote=False, quotechar='"')]
[['123456', 'aaa,bbb', 'ccc,"ddd"', 'eee,fff', 'ggg,"hhh"', 'iii-"jjj"', '9876.5']]
# bizarre + options; this is assuming that the writer was escaping delimiters
-- 
 

From djc at object-craft.com.au  Sun Feb 16 23:52:51 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 17 Feb 2003 09:52:51 +1100
Subject: mishandling of embedded NULs (was: Re: [Csv] trial zip/tar
	packages of csv module available)
In-Reply-To: <oprkpc2ppam50ryr@localhost>
References: <15947.60244.162082.486394@montanaro.dyndns.org>
	<oprkjz6df5m50ryr@localhost>
	<20030214061130.163773CC5D@coffee.object-craft.com.au>
	<oprkkx45gmm50ryr@localhost> <oprklwa7j9m50ryr@localhost>
	<15949.36784.151916.149873@montanaro.dyndns.org>
	<oprkmb6tgmm50ryr@localhost>
	<15950.8038.440199.244791@montanaro.dyndns.org>
	<m3k7g0qyec.fsf@ferret.object-craft.com.au>
	<oprkpc2ppam50ryr@localhost>
Message-ID: <m3d6lrkfcs.fsf@ferret.object-craft.com.au>

>>>>> "John" == John Machin <sjmachin at lexicon.net> writes:

Dave> The '\0' to indicate line termination is a hang over from my
Dave> original code.  There is no reason why the code could not just
Dave> use '\n' to signal end of line (like every one else on the
Dave> planet).

John> Are you sure? I had the impression that it was used as an
John> out-of-band signal -- something that didn't appear in the data
John> (you hope!) -- so that you could take exception to newlines that
John> weren't at the end of line.

The outer loop of the parser detected the end-of-line variations
'\n', '\r\n', and '\r' and checked for following characters.  If
characters were discovered, an exception was raised.  If no characters
followed, the inner parsing code was passed '\0' to indicate end
of line.

Since there was no way for the inner code to ever receive a '\n' as
data, I changed the '\0' special value to '\n'.

- Dave

-- 
http://www.object-craft.com.au


From sjmachin at lexicon.net  Mon Feb 17 00:00:23 2003
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 17 Feb 2003 10:00:23 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: <m3of5cqynq.fsf@ferret.object-craft.com.au>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost> <m3of5cqynq.fsf@ferret.object-craft.com.au>
Message-ID: <oprkpl6xqdm50ryr@localhost>

On 16 Feb 2003 21:59:21 +1100, Dave Cole <djc at object-craft.com.au> wrote:

[John Machin]
>> There are various problems brought out by Andrew's example; see
>> attached file dalke.py

>> (3) The "quoting" argument is honoured only by writers, not by
>> readers -- i.e. in general you can't reliably read back a file that
>> you've created and in particular to read Andrew D's files you need
>> to set quotechar to some char that you hope is not in the input --
>> maybe '\0'.

[Dave Cole]
> Aside from the quote of '\0', I am not sure I follow what you mean.
> If you set quoting so that it produces ambiguous output that is hardly
> the fault of the writer.

Of course not. What I was getting at was that the ability to write various 
schemes (some ambiguous, some not) is provided, but it is not possible to 
read back all unambiguous schemes; there is little if any support for 
checking that the data corresponds to the scheme the caller thinks was used 
to write it; and there are no options to drive what to do on input if the 
writing scheme was ambiguous.

[John Machin]
>> (4) Maybe the whole dialect thing is a bit too baroque and Byzantine
>> -- see example 5 in dalke.py. The **dict_of_arguments gadget offers
>> the "don't need to type long list of arguments" advantage claimed
>> for dialect classes, and you get the same obscure error message if
>> you stuff up the type of an argument (see example 6) -- all of this
>> without writing all that register/validate/etc code.
>
[Dave Cole]
> How much clearer would things be if the validation of dialects were
> pulled up into the Python?  Being able to see the Python code which
> raised the exception would be a huge help to the user.

How much clearer would things be if the error message said "quotechar must 
be char, not int"?

The clarity should arise from the error message, not from its source. I 
think it a reasonable goal that a developer should have to inspect the 
callee's source (if available!) only in desperation. The one line of source 
that is shown in the traceback from Python modules is sometimes not very 
helpful e.g. the above reasonably helpful error message could have been 
produced by something like this:

   raise NastyError, "%s must be %s, not %s" % (
       self.attr_name[k], self.attr_type_abbr[k], show_type(input_value))


No comments on the possibility of throwing the whole dialect-via-classes 
idea away???


-- 
 

From andrewm at object-craft.com.au  Mon Feb 17 00:17:23 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 17 Feb 2003 10:17:23 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv) 
In-Reply-To: Message from John Machin <sjmachin@lexicon.net> 
   of "Sun, 16 Feb 2003 22:11:46 +1100." <oprkopdwz4m50ryr@localhost> 
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost>  <oprkopdwz4m50ryr@localhost> 
Message-ID: <20030216231723.202913CC5C@coffee.object-craft.com.au>

>PyArg_PTAK gives a more meaningful message if the required type is a single 
>char, for example "argument 2 must be char, not int". However where the 
>required type is int, you get "an integer is required" ... looks like a 
>patch wouldn't go astray.

PyArg_PTAK was originally used, but really isn't well suited to what we're
trying to do, and ends up raising obscure errors of it's own (or, more to
the point, goes subtly wrong without warning the user).

Giving the C DialectType a setattr which does the input validation is
probably the better answer.
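
In Python terms, that setattr-based validation might look like the sketch
below -- a hypothetical pure-Python stand-in for the C DialectType, with a
made-up type table, not the module's actual code:

```python
class DialectType:
    # Expected type per attribute; 'char' means a one-character string.
    _types = {'delimiter': 'char', 'quotechar': 'char',
              'doublequote': bool, 'skipinitialspace': bool,
              'quoting': int}

    def __setattr__(self, name, value):
        # Validate each attribute as it is set, so the error names the
        # attribute and both the required and the actual type.
        expected = self._types.get(name)
        if expected == 'char':
            if not (isinstance(value, str) and len(value) == 1):
                raise TypeError("%s must be char, not %s"
                                % (name, type(value).__name__))
        elif expected is not None and not isinstance(value, expected):
            raise TypeError("%s must be %s, not %s"
                            % (name, expected.__name__, type(value).__name__))
        object.__setattr__(self, name, value)
```

Setting d.quotechar = 1 on an instance then raises "quotechar must be
char, not int" -- the kind of message John is asking for.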

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From djc at object-craft.com.au  Mon Feb 17 00:30:47 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 17 Feb 2003 10:30:47 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: <oprkpl6xqdm50ryr@localhost>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost>
	<m3of5cqynq.fsf@ferret.object-craft.com.au>
	<oprkpl6xqdm50ryr@localhost>
Message-ID: <m3y94fiz14.fsf@ferret.object-craft.com.au>

>>>>> "John" == John Machin <sjmachin at lexicon.net> writes:

John> [Dave Cole]
>> Aside from the quote of '\0', I am not sure I follow what you mean.
>> If you set quoting so that it produces ambiguous output that is
>> hardly the fault of the writer.

John> Of course not. What I was getting at was that the ability to
John> write various schemes (some ambiguous, some not) is provided,
John> but it is not possible to read back all unambiguous schemes, and
John> there is little if any support for checking that the data
John> corresponds to the scheme the caller thinks was used to write
John> it, and there are no options to drive what to do on input if the
John> writing scheme was ambiguous.

I must be a bit thick or something...  I have the feeling you are
correct, but I just can't see it.  Can you provide some (simple)
examples and suggest where the code could be improved?

John> [Dave Cole]
>> How much clearer would things be if the validation of dialects were
>> pulled up into the Python?  Being able to see the Python code which
>> raised the exception would be a huge help to the user.

John> How much clearer would things be if the error message said
John> "quotechar must be char, not int"?

Probably only 7 squillion percent.

John> The clarity should arise from the error message, not from its
John> source. I think it a reasonable goal that a developer should
John> have to inspect the callee's source (if available!) only in
John> desperation. The one line of source that is shown in the
John> traceback from Python modules is sometimes not very helpful
John> e.g. the above reasonably helpful error message could have been
John> produced by something like this:

John>    raise NastyError, "%s must be %s, not %s" %
John> (self.attr_name[k], self.attr_type_abbr[k],
John> show_type(input_value))


John> No comments on the possibility of throwing the whole
John> dialect-via-classes idea away???

The dialect should validate when you instantiate it.  This probably
means that we should require a csv.Dialect instance rather than a
class as the parameter to csv.reader() and csv.writer().

>>> class Space(csv.Dialect):
...     delimiter = " "
...     quotechar = False
...     escapechar = False
...     doublequote = False
...     skipinitialspace = False
...     lineterminator = "\n"
...     quoting = csv.QUOTE_NONE
... 
>>> Space()
<__main__.Space instance at 0x401f3dcc>

Is it possible for the csv.Dialect to raise an exception when Space is
instantiated?  I don't know enough about the new style classes.
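
Nothing new-style-specific seems to be needed: a plain __init__ on the base
class can validate the subclass's class attributes at instantiation time.
A hypothetical sketch (the validation rules here are mine, not the
module's):

```python
import csv

class Dialect:
    delimiter = ','
    quotechar = '"'
    escapechar = None
    doublequote = True
    skipinitialspace = False
    lineterminator = '\r\n'
    quoting = csv.QUOTE_MINIMAL

    def __init__(self):
        # Validate on instantiation, so a bad subclass fails loudly.
        for name in ('delimiter', 'quotechar', 'escapechar'):
            value = getattr(self, name)
            if value is not None and not (isinstance(value, str)
                                          and len(value) == 1):
                raise TypeError("%s must be a one-character string or None,"
                                " not %r" % (name, value))

class Space(Dialect):
    delimiter = ' '
    quotechar = False      # rejected: should be None, not False
    escapechar = False
    doublequote = False
    lineterminator = '\n'
    quoting = csv.QUOTE_NONE
```

With this, Space() raises TypeError instead of silently succeeding.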

- Dave

-- 
http://www.object-craft.com.au


From djc at object-craft.com.au  Mon Feb 17 00:35:11 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 17 Feb 2003 10:35:11 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: <20030216231723.202913CC5C@coffee.object-craft.com.au>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost> <oprkopdwz4m50ryr@localhost>
	<20030216231723.202913CC5C@coffee.object-craft.com.au>
Message-ID: <m3u1f3iyts.fsf@ferret.object-craft.com.au>

>>>>> "Andrew" == Andrew McNamara <andrewm at object-craft.com.au> writes:

>> PyArg_PTAK gives a more meaningful message if the required type is
>> a single char, for example "argument 2 must be char, not
>> int". However where the required type is int, you get "an integer
>> is required" ... looks like a patch wouldn't go astray.

Andrew> PyArg_PTAK was originally used, but really isn't well suited
Andrew> to what we're trying to do, and ends up raising obscure errors
Andrew> of it's own (or, more to the point, goes subtly wrong without
Andrew> warning the user).

Andrew> Giving the C DialectType a setattr which does the input
Andrew> validation is probably the better answer.

Does that mean that the validation is only on individual attributes,
not on the set of attributes?

- Dave

-- 
http://www.object-craft.com.au


From andrewm at object-craft.com.au  Mon Feb 17 00:44:22 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 17 Feb 2003 10:44:22 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv) 
In-Reply-To: Message from Dave Cole <djc@object-craft.com.au> 
	<m3u1f3iyts.fsf@ferret.object-craft.com.au> 
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost> <oprkopdwz4m50ryr@localhost>
	<20030216231723.202913CC5C@coffee.object-craft.com.au>
	<m3u1f3iyts.fsf@ferret.object-craft.com.au> 
Message-ID: <20030216234422.6E7AD3CC5C@coffee.object-craft.com.au>

>>> PyArg_PTAK gives a more meaningful message if the required type is
>>> a single char, for example "argument 2 must be char, not
>>> int". However where the required type is int, you get "an integer
>>> is required" ... looks like a patch wouldn't go astray.
>
>Andrew> PyArg_PTAK was originally used, but really isn't well suited
>Andrew> to what we're trying to do, and ends up raising obscure errors
>Andrew> of it's own (or, more to the point, goes subtly wrong without
>Andrew> warning the user).
>
>Andrew> Giving the C DialectType a setattr which does the input
>Andrew> validation is probably the better answer.
>
>Does that mean that the validation is only on individual attributes,
>not on the set of attributes?

Yep - at the moment there are no inter-attribute checks. 

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From sjmachin at lexicon.net  Mon Feb 17 00:47:25 2003
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 17 Feb 2003 10:47:25 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv) 
In-Reply-To: <20030216231723.202913CC5C@coffee.object-craft.com.au>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost>  <oprkopdwz4m50ryr@localhost>
	<20030216231723.202913CC5C@coffee.object-craft.com.au>
Message-ID: <oprkpodbh6m50ryr@localhost>

On Mon, 17 Feb 2003 10:17:23 +1100, Andrew McNamara
<andrewm at object-craft.com.au> wrote:

>> PyArg_PTAK gives a more meaningful message if the required type is a 
>> single char, for example "argument 2 must be char, not int". However 
>> where the required type is int, you get "an integer is required" ... 
>> looks like a patch wouldn't go astray.
>
> PyArg_PTAK was originally used, but really isn't well suited to what 
> we're
> trying to do,

Hmmm ... nobody seems to want to discuss my point that what you're trying 
to do (the whole dialect thing) is a bit over the top.

> and ends up raising obscure errors of it's own (or, more to
> the point, goes subtly wrong without warning the user).

Can you give an example of "goes subtly wrong without warning"? Have you 
reported these problems?
I recall noticing a while back that it would silently truncate a supplied 
float to fit a desired int w/o any complaint [rationale is evidently : 
"floats have an int() method, don't they?"] -- is that the sort of thing 
you mean?



-- 
 

From djc at object-craft.com.au  Mon Feb 17 00:59:27 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 17 Feb 2003 10:59:27 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: <oprkpodbh6m50ryr@localhost>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost> <oprkopdwz4m50ryr@localhost>
	<20030216231723.202913CC5C@coffee.object-craft.com.au>
	<oprkpodbh6m50ryr@localhost>
Message-ID: <m3k7fzixpc.fsf@ferret.object-craft.com.au>

>>>>> "John" == John Machin <sjmachin at lexicon.net> writes:

John> Hmmm ... nobody seems to want to discuss my point that what
John> you're trying to do (the whole dialect thing) is a bit over the
John> top.

I think that rationale is more along the lines of "validation of
"random" objects created in Python is harder than validation of
objects created by code we control", but I could be wrong.

>> and ends up raising obscure errors of it's own (or, more to the
>> point, goes subtly wrong without warning the user).

John> Can you give an example of "goes subtly wrong without warning"?
John> Have you reported these problems?  I recall noticing a while
John> back that it would silently truncate a supplied float to fit a
John> desired int w/o any complaint [rationale is evidently : "floats
John> have an int() method, don't they?"] -- is that the sort of thing
John> you mean?

When/where did it silently truncate a float?  Can you provide an
example?

- Dave

-- 
http://www.object-craft.com.au


From andrewm at object-craft.com.au  Mon Feb 17 01:06:13 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 17 Feb 2003 11:06:13 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv) 
In-Reply-To: Message from John Machin <sjmachin@lexicon.net> 
   of "Mon, 17 Feb 2003 10:47:25 +1100." <oprkpodbh6m50ryr@localhost> 
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost> <oprkopdwz4m50ryr@localhost>
	<20030216231723.202913CC5C@coffee.object-craft.com.au>
	<oprkpodbh6m50ryr@localhost> 
Message-ID: <20030217000613.CFE2D3CC5C@coffee.object-craft.com.au>

>> PyArg_PTAK was originally used, but really isn't well suited to what
>> we're trying to do,
>
>Hmmm ... nobody seems to want to discuss my point that what you're trying 
>to do (the whole dialect thing) is a bit over the top.

Yep - we had this discussion early on - the list archives should have
details:

    http://manatee.mojam.com/pipermail/csv/

Note that the registry stuff is entirely optional. You can pass a class
or instance as the dialect, and it will work as expected. The doco should
probably be updated to mention this.

>> and ends up raising obscure errors of it's own (or, more to
>> the point, goes subtly wrong without warning the user).
>
>Can you give an example of "goes subtly wrong without warning"? Have you 
>reported these problems?

What we're trying to do is not what PyArg_PTAK does well - it's not
PyArg_PTAK's fault that it doesn't do what we want...

One problem was that PyArg_PTAK tries to hide the distinction between
positional and keyword arguments - every keyword argument is given a
position. This was more of a problem in the old days (the parameters
were originally all positional).

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From sjmachin at lexicon.net  Mon Feb 17 02:13:17 2003
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 17 Feb 2003 12:13:17 +1100
Subject: reading back what you wrote (was Re: Andrew Dalke's space example
	(was Re: [Csv] csv))
In-Reply-To: <m3y94fiz14.fsf@ferret.object-craft.com.au>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost> <m3of5cqynq.fsf@ferret.object-craft.com.au>
	<oprkpl6xqdm50ryr@localhost> <m3y94fiz14.fsf@ferret.object-craft.com.au>
Message-ID: <oprkpscfrom50ryr@localhost>

On 17 Feb 2003 10:30:47 +1100, Dave Cole <djc at object-craft.com.au> wrote:

>>>>>> "John" == John Machin <sjmachin at lexicon.net> writes:
>
> John> [Dave Cole]
>>> Aside from the quote of '\0', I am not sure I follow what you mean.
>>> If you set quoting so that it produces ambiguous output that is
>>> hardly the fault of the writer.
>
> John> Of course not. What I was getting at was that the ability to
> John> write various schemes (some ambiguous, some not) is provided,
> John> but it is not possible to read back all unambiguous schemes, and
> John> there is little if any support for checking that the data
> John> corresponds to the scheme the caller thinks was used to write
> John> it, and there are no options to drive what to do on input if the
> John> writing scheme was ambiguous.
>
> I must be a bit thick or something...  I have the feeling you are
> correct, but I just can't see it.  Can you provide some (simple)
> examples and suggest where the code could be improved?
>

Here is my approach:

(1) Define not only a scheme for writing "standard" CSV but schemes for 
writing the various mutations that I have come across

(2) Have a strict_output option to govern behaviour when the input is such 
that output cannot be reversed (exception immediately, exception at end if 
error count is not zero, no exception)

Examples: (a) someone wants to write using a no-quoting scheme but has a 
delimiter inside a field; (b) a doublequote=False, escapechar=None scheme, 
but there is a quotechar in the data

(3) On input, require the caller to specify exactly what scheme they think 
was used to create the data. Check carefully that the incoming data 
corresponds to the alleged scheme. Again, have a strict_input option.
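
A minimal sketch of the strict-output idea for cases (a) and (b);
strict_writerow and its checks are hypothetical, layered on the standard
csv writer in modern Python:

```python
import csv
import io

def strict_writerow(row, quoting, quotechar='"', doublequote=True,
                    escapechar=None, delimiter=','):
    """Refuse to write a row that could not be read back unambiguously."""
    for field in row:
        text = str(field)
        if (quoting == csv.QUOTE_NONE and escapechar is None
                and delimiter in text):
            # case (a): no quoting, but a delimiter inside a field
            raise ValueError("delimiter in field with no quoting: %r" % text)
        if (quoting != csv.QUOTE_NONE and not doublequote
                and escapechar is None and quotechar in text):
            # case (b): a quotechar in the data, but no way to escape it
            raise ValueError("quotechar in field with no escape: %r" % text)
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, quotechar=quotechar,
               doublequote=doublequote, escapechar=escapechar,
               delimiter=delimiter).writerow(row)
    return buf.getvalue()
```

strict_writerow(['a,b'], csv.QUOTE_NONE) raises immediately, while the same
row under csv.QUOTE_MINIMAL is written (quoted) without complaint.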

Here we have some data that was written by a doublequote=False, 
escapechar=None, quoting=QUOTE_ALL scheme:

>>> badcsv = ['"quotes not doubled"', '"rear of "Fubar Flats""',
...           '""Thistle Do" RMB 123"']

and it is munged w/o warning if read with standard CSV settings:

>>> [x for x in csv.reader(badcsv)]
[['quotes not doubled'], ['rear of Fubar Flats""'], ['Thistle Do" RMB 123"']]

and trying to tell the csv module what to do doesn't help:

>>> [x for x in csv.reader(badcsv, doublequote=False, escapechar=None)]
[['quotes not doubled'], ['rear of Fubar Flats""'], ['Thistle Do" RMB 123"']]

It is possible to recover the data if each field had an even number of 
quotes, but this requires a quite different state machine:

>>> badcsvstr = ('"quotes not doubled"\n"rear of "Fubar Flats""\n'
...              '""Thistle Do" RMB 123"')
# My module requires input iterables only to deliver one or more bytes per
# iteration (i.e. can be more or less than exactly one line); the module does
# the end-of-line detection and, yes, it special-cases the iterable being a
# string, for obvious efficiency reasons.

>>> [x for x in delimited.importer(badcsvstr,
...                                quote_mode=delimited.QUOTE_SINGLE)]
[['quotes not doubled'], ['rear of "Fubar Flats"'], ['"Thistle Do" RMB 123']]
# We've recovered what was most likely to have been in the original data

and will crack it if told that this data is standard CSV:

>>> impo = delimited.importer(badcsvstr)
>>> list(impo)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
delimited.DataError: After rear_quote, expected rear_quote, delimiter or 
newline; found <F> (hex 46)

and just in case you're trying to find the offending line in a 100 Mb file:

>>> impo.input_row_number, impo.input_char_column
(1, 10) # zero-relative

Hope this explains where I'm coming from ...

Cheers,
John


From sjmachin at lexicon.net  Mon Feb 17 02:35:30 2003
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 17 Feb 2003 12:35:30 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv) 
In-Reply-To: <20030217000613.CFE2D3CC5C@coffee.object-craft.com.au>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost> <oprkopdwz4m50ryr@localhost>
	<20030216231723.202913CC5C@coffee.object-craft.com.au>
	<oprkpodbh6m50ryr@localhost>
	<20030217000613.CFE2D3CC5C@coffee.object-craft.com.au>
Message-ID: <oprkptdgmcm50ryr@localhost>

On Mon, 17 Feb 2003 11:06:13 +1100, Andrew McNamara
<andrewm at object-craft.com.au> wrote:


>
> Note that the registry stuff is entirely optional. You can pass a class
> or instance as the dialect, and it will work as expected. The doco should
> probably be updated to mention this.

Yes, it should. What is the use case for the registry, anyway?

 

From andrewm at object-craft.com.au  Mon Feb 17 02:40:37 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 17 Feb 2003 12:40:37 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv) 
In-Reply-To: Message from John Machin <sjmachin@lexicon.net> 
   of "Mon, 17 Feb 2003 12:35:30 +1100." <oprkptdgmcm50ryr@localhost> 
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost> <oprkopdwz4m50ryr@localhost>
	<20030216231723.202913CC5C@coffee.object-craft.com.au>
	<oprkpodbh6m50ryr@localhost>
	<20030217000613.CFE2D3CC5C@coffee.object-craft.com.au>
	<oprkptdgmcm50ryr@localhost> 
Message-ID: <20030217014037.899903CC5C@coffee.object-craft.com.au>

>> Note that the registry stuff is entirely optional. You can pass a class
>> or instance as the dialect, and it will work as expected. The doco should
>> probably be updated to mention this.
>
>Yes, it should. What is the use case for the registry, anyway?

Actually, it started out being an internal implementation detail. You're
supposed to be able to specify common dialects via a string (for example
"excel"), and obviously the module needed some way of recording these.

You can, in fact, pretend the dialect classes don't exist. This works fine:

        r = csv.reader(input_file, delimiter = '\t')

The module supplies default values for all the parameters - the defaults
correspond to the way Excel parses csv files.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Mon Feb 17 03:09:34 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 17 Feb 2003 13:09:34 +1100
Subject: Andrew Dalke's space example (was Re: [Csv] csv) 
In-Reply-To: Message from John Machin <sjmachin@lexicon.net> 
   of "Mon, 17 Feb 2003 12:59:30 +1100." <oprkpuhgxpm50ryr@localhost> 
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
	<oprknd9q0tm50ryr@localhost> <oprkopdwz4m50ryr@localhost>
	<20030216231723.202913CC5C@coffee.object-craft.com.au>
	<oprkpodbh6m50ryr@localhost>
	<20030217000613.CFE2D3CC5C@coffee.object-craft.com.au>
	<oprkptdgmcm50ryr@localhost>
	<20030217014037.899903CC5C@coffee.object-craft.com.au>
	<oprkpuhgxpm50ryr@localhost> 
Message-ID: <20030217020934.0D5773CC5C@coffee.object-craft.com.au>

>The "supposed to be able to specify common dialects via a string" 
>requirement seems rather superfluous when you can pass in a class or an 
>instance. Thus the registry caper seems also superfluous.

I think the requirement came from the GUI camp - they want to be able to
provide their users with a pulldown with a list of supported file formats.

>> You can, in fact, pretend the dialect classes don't exist.
>
>Yes, I'd noticed. This just means that you then need extra code (in C!) to 
>validate the keyword arguments and cram them into the Dialect instance.
>
>What do you lose, apart from a maintenance headache, if you throw away the 
>whole Dialect notion and just stick to key-word arguments (with appropriate 
>defaults, of course)?

Well, then you have Object Craft's csv module (on which the current
implementation was based)... 8-)

But you still need to do a whole heap of validation whichever way you
do it. The current validation could certainly do with more work.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Feb 19 01:01:44 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 19 Feb 2003 11:01:44 +1100
Subject: [Csv] Out of town, BDFL pronouncement, incorporation, Unicode 
In-Reply-To: Message from Andrew McNamara <andrewm@object-craft.com.au> 
	<20030214063445.3A0A73CC5D@coffee.object-craft.com.au> 
References: <15948.35170.966135.741531@montanaro.dyndns.org>
	<20030214063445.3A0A73CC5D@coffee.object-craft.com.au> 
Message-ID: <20030219000144.AF27C3CC5E@coffee.object-craft.com.au>

>>Assuming nothing earth-shattering develops by mid-week, would one of you
>>like to propose on python-dev that Guido pronounce on the PEP and give a
>>thumbs-up or -down on the module?  I can take care of merging it into the
>>Python distribution (stitch it into setup.py, the test directory and the
>>libref manual) when I return.
>
>Okay.

Guido's doing the 2.3a2 release today - we're not going to get into a2,
so I'm going to wait until he's finished with a2 before posting. I also
think we have a few doco and other issues that have been discussed in
the last week that need to be tidied up. I'm rather short of time at
the moment - any help others can give (going back through the archive
and making a TODO list would be valuable) would be appreciated.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Mon Feb 24 02:15:44 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 23 Feb 2003 19:15:44 -0600
Subject: Andrew Dalke's space example (was Re: [Csv] csv) 
In-Reply-To: <oprkptdgmcm50ryr@localhost>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
        <oprknd9q0tm50ryr@localhost>
        <oprkopdwz4m50ryr@localhost>
        <20030216231723.202913CC5C@coffee.object-craft.com.au>
        <oprkpodbh6m50ryr@localhost>
        <20030217000613.CFE2D3CC5C@coffee.object-craft.com.au>
        <oprkptdgmcm50ryr@localhost>
Message-ID: <15961.29248.867924.249023@montanaro.dyndns.org>


    John> Yes, it should. What is the use case for the registry, anyway?

My original thought was that the module itself would grow new dialects over
time and that it would be easier for programmers and users to remember and
recognize strings like "excel" or "gnumeric" or "appleworks".  The biggest
use for a registry is probably within GUI apps that need to read/write CSV
files.  The strings make nice pop-up menu items, then are internally used as
keys in the "registry", which is nothing more than a dict.
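The registry idea described above might be sketched like this (a minimal sketch; the names `register_dialect`, `get_dialect` and the `_dialects` dict are illustrative here, not necessarily the final module API):

```python
# The "registry" is nothing more than a dict mapping user-friendly
# strings to dialect classes.  A GUI can build a pull-down menu
# directly from its keys.

class Dialect:
    """Base class holding CSV formatting parameters."""
    delimiter = ','
    quotechar = '"'

class excel(Dialect):
    """Dialect matching Excel-generated CSV files."""
    pass

_dialects = {}

def register_dialect(name, dialect):
    _dialects[name] = dialect

def get_dialect(name):
    return _dialects[name]

register_dialect("excel", excel)

# Pull-down menu items for a GUI:
menu_items = sorted(_dialects)   # -> ['excel']
```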

Skip


From skip at pobox.com  Wed Feb 26 16:09:50 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 09:09:50 -0600
Subject: [Csv] What's our status?
Message-ID: <15964.55486.242989.782539@montanaro.dyndns.org>


Guys,

Are we ready to go?  As you can see from the attached PEP 283 checkin
message, Guido is hopeful.

Skip

-------------- next part --------------
An embedded message was scrubbed...
From: gvanrossum at users.sourceforge.net
Subject: [Python-checkins] python/nondist/peps pep-0283.txt,1.31,1.32
Date: Wed, 26 Feb 2003 06:58:15 -0800
Size: 6093
Url: http://mail.python.org/pipermail/csv/attachments/20030226/f98d19f0/attachment.mht 

From LogiplexSoftware at earthlink.net  Wed Feb 26 18:32:01 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 26 Feb 2003 09:32:01 -0800
Subject: [Csv] What's our status?
In-Reply-To: <15964.55486.242989.782539@montanaro.dyndns.org>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
Message-ID: <1046280720.27223.9.camel@software1.logiplex.internal>

On Wed, 2003-02-26 at 07:09, Skip Montanaro wrote:
> Guys,
> 
> Are we ready to go?  As you can see from the attached PEP 283 checkin
> message, Guido is hopeful.
> 
> Skip
> 

I'm fairly happy with the state of the csv parser and the PEP.

I'm working on csvutils.py right now.  The guessDelimiter() function
from DSV isn't really the best for our purposes as it expects a fairly
fixed number of columns and we're allowing for variable columns per row.
Also, allowing spaces around delimiters is going to throw
guessQuoteChar().  I've got some ideas for fixing guessQuoteChar() but
guessDelimiter is going to need an entirely new approach (which I think
I have an idea for =)

Sorry for being a slug.

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308


From skip at pobox.com  Wed Feb 26 18:42:15 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 11:42:15 -0600
Subject: [Csv] What's our status?
In-Reply-To: <1046280720.27223.9.camel@software1.logiplex.internal>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
        <1046280720.27223.9.camel@software1.logiplex.internal>
Message-ID: <15964.64631.161113.441183@montanaro.dyndns.org>


    Cliff> I'm fairly happy with the state of the csv parser and the PEP.

    Cliff> I'm working on csvutils.py right now.

Let's not wait terribly long to get things in the mill.  If I remember
correctly, 2.3b1 will be out around mid-March.  I'd like to ask Guido to
pronounce on the PEP and code in the next few days if possible.

I will post a note to python-dev asking people to take a look at the code
and the PEP.

Skip

From skip at pobox.com  Wed Feb 26 19:03:20 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 12:03:20 -0600
Subject: [Csv] PEP 305 - CSV File API - please have a look
Message-ID: <15965.360.659692.788321@montanaro.dyndns.org>


Folks,

In advance of asking Guido to review and pronounce on PEP 305 and its
related code, I'd like to ask you to take a few minutes to review what we've
produced.  There is the PEP, of course:

    http://www.python.org/peps/pep-0305.html

but there is also source code, a large number of test cases and a libref
section available in the CVS sandbox.  Cliff Wells is working on a csvutils
module which will contain adaptations of the "sniffing" routines from his
DSV package.

Just do a "csv up -dP ." in your nondist/sandbox directory to get the latest
version of everything.  Feel free to review and/or comment on any or all of
it, but please please post your comments to the csv at mail.mojam.com mailing
list.  You can review our rather active correspondence at

    http://manatee.mojam.com/pipermail/csv/

or if you're really excited about CSV files, you can subscribe at

    http://manatee.mojam.com/mailman/listinfo/csv

Thx,

Skip

From guido at python.org  Wed Feb 26 19:10:47 2003
From: guido at python.org (Guido van Rossum)
Date: Wed, 26 Feb 2003 13:10:47 -0500
Subject: [Csv] Re: [Python-Dev] PEP 305 - CSV File API - please have a look
In-Reply-To: Your message of "Wed, 26 Feb 2003 12:03:20 CST."
             <15965.360.659692.788321@montanaro.dyndns.org> 
References: <15965.360.659692.788321@montanaro.dyndns.org> 
Message-ID: <200302261810.h1QIAmT20744@odiug.zope.com>

> Just do a "csv up -dP ." in your nondist/sandbox directory to get the latest

You've been typing csv too much. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From skip at pobox.com  Wed Feb 26 21:27:44 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 14:27:44 -0600
Subject: [Csv] Dialect.validate()
Message-ID: <15965.9024.863156.732714@montanaro.dyndns.org>

If you have a moment, please take a look at the simple-minded validate()
method in the Dialect class.  I'm sure it can be strengthened quite a bit.

Thx,

Skip

From skip at pobox.com  Wed Feb 26 21:37:32 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 14:37:32 -0600
Subject: [Csv] Dialect validation errors
Message-ID: <15965.9612.508559.964220@montanaro.dyndns.org>

Any thoughts on the way I'm generating Dialect validation errors as a list
of strings?  I'm starting to write test cases for that stuff and it occurs
to me that checking for specific strings in the validation output is going
to be fragile.

Skip

From skip at pobox.com  Wed Feb 26 22:42:49 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 15:42:49 -0600
Subject: [Csv] ignore blank lines?
In-Reply-To: <oprklnrmdgm50ryr@localhost>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
        <m31y2e5sld.fsf@ferret.object-craft.com.au>
        <15945.37976.592926.369940@montanaro.dyndns.org>
        <oprkghsll3m50ryr@localhost>
        <15946.25610.202929.623399@montanaro.dyndns.org>
        <oprkh0ixoom50ryr@localhost>
        <15948.38242.974158.425677@montanaro.dyndns.org>
        <oprkkuuarym50ryr@localhost>
        <15949.16699.749757.280021@montanaro.dyndns.org>
        <oprklnrmdgm50ryr@localhost>
Message-ID: <15965.13529.765871.804925@montanaro.dyndns.org>


Returning to an old pre-vacation topic...

    >> Well, I would argue that a row of commas just means a row of empty
    >> strings.

    John> It can mean that the database has a row with all values NULL, or
    John> some other equally disturbing circumstance.

We've already established that there is no way to store NULL/None values in
a CSV file and have them be reliably reconstituted when the file is read
back in.

    >> Other than that, I agree, I wouldn't expect blank lines or lines with
    >> too few columns from properly functioning programs which are supposed to
    >> dump rows with constant numbers of columns.

    John> Exactly. Which makes me wonder why you have implemented defaults
    John> for short rows.

Perhaps it's just overkill.

    >> I guess my Python aphorism for the day is "Practicality beats
    >> purity."

    John> I don't understand this comment. You are advocating (in fact have
    John> implemented) hiding disturbing circumstances from the callers. Do
    John> you classify this as practical or pure?

If, for some reason, a row in a CSV file has a short line or a blank line I
don't want the processing to barf.  Most CSV files are program-generated,
and in my opinion the likelihood of a user introducing more problems into
the file by hand editing it is too high.  I'd rather worm around problems
in the files.

On output, I think it would be convenient to not require dictionaries being
dumped to the file to have a full complement of key-value pairs.  Not all
such data will be generated by a database which fully populates all fields.
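What I have in mind is something like this (a sketch only -- the field names and the empty-string default are made up for illustration):

```python
# Writing dicts that may be missing some keys: absent fields are
# filled with a default value rather than raising an error.

fieldnames = ["name", "city", "zip"]

def dict_to_row(d, fieldnames, restval=""):
    """Flatten a dict to a row, substituting restval for missing keys."""
    return [d.get(f, restval) for f in fieldnames]

dict_to_row({"name": "Joe"}, fieldnames)  # -> ['Joe', '', '']
```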

Skip

From skip at pobox.com  Wed Feb 26 22:58:17 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 15:58:17 -0600
Subject: Andrew Dalke's space example (was Re: [Csv] csv)
In-Reply-To: <m3of5cqynq.fsf@ferret.object-craft.com.au>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
        <oprknd9q0tm50ryr@localhost>
        <m3of5cqynq.fsf@ferret.object-craft.com.au>
Message-ID: <15965.14457.888418.349998@montanaro.dyndns.org>


    >> (4) Maybe the whole dialect thing is a bit too baroque and Byzantine
    >> -- see example 5 in dalke.py. The **dict_of_arguments gadget offers
    >> the "don't need to type long list of arguments" advantage claimed for
    >> dialect classes, and you get the same obscure error message if you
    >> stuff up the type of an argument (see example 6) -- all of this
    >> without writing all that register/validate/etc code.

    Dave> How much clearer would things be if the validation of dialects
    Dave> were pulled up into the Python?

That was my intention all along.  The problem I see is that someone might
pass an instance as the dialect parameter to csv.reader() or csv.writer()
which is not an instance of csv.Dialect.  If we can get the
Dialect._validate() method right, all the C code would have to do is make
sure the object passed to the factory functions as the dialect parameter is
an instance of csv.Dialect or that the class passed to register_dialect is a
subclass of csv.Dialect.

Skip


From LogiplexSoftware at earthlink.net  Wed Feb 26 23:10:48 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 26 Feb 2003 14:10:48 -0800
Subject: [Csv] What's our status?
In-Reply-To: <1046280720.27223.9.camel@software1.logiplex.internal>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
	 <1046280720.27223.9.camel@software1.logiplex.internal>
Message-ID: <1046297447.27223.19.camel@software1.logiplex.internal>

On Wed, 2003-02-26 at 09:32, Cliff Wells wrote:

> I'm working on csvutils.py right now.  The guessDelimiter() function
> from DSV isn't really the best for our purposes as it expects a fairly
> fixed number of columns and we're allowing for variable columns per row.
> Also, allowing spaces around delimiters is going to throw
> guessQuoteChar().  I've got some ideas for fixing guessQuoteChar() but
> guessDelimiter is going to need an entirely new approach (which I think
> I have an idea for =)

Okay, here's my status:

1) I can sniff the quotechar.
2) I can sniff the delimiter IF:
    a) there is a quotechar [determine delimiter based on relation to 
       quotechar].
       or
    b) the data is regular, that is, the number of columns doesn't vary
       a lot from record to record [based upon number of occurrences of 
       delimiter in each record, to grossly simplify things].  This is  
       the method DSV uses.
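Method (a) might be sketched like so (purely illustrative, not the actual
csvutils code): once the quotechar is known, the delimiter is whatever
character most often appears immediately after a quote.

```python
import re
from collections import Counter

def delimiter_from_quotes(sample, quotechar='"'):
    """Guess the delimiter from its position relative to quotes."""
    # Collect the character that directly follows each quote.
    after = re.findall(re.escape(quotechar) + r'(.)', sample)
    counts = Counter(c for c in after if c not in quotechar + '\n')
    return counts.most_common(1)[0][0] if counts else None
```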

However, for the following I am so far unable to come up with a way to
determine the delimiter:

all,work,and,no,play,makes,jack,a,dull,boy
all,work,and,no,play,makes,jack,a,dull
boy
all,work,and,no,play,makes,jack,a
dull,boy
all,work,and,no,play,makes,jack
a,dull,boy
all,work,and,no,play,makes
jack,a,dull,boy
all,work,and,no,play
makes,jack,a,dull,boy
all,work,and,no
play,makes,jack,a,dull,boy
all,work,and
no,play,makes,jack,a,dull,boy

Anyone have a suggestion?  All work and no play makes jack a dull boy.


-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308


From skip at pobox.com  Wed Feb 26 23:12:47 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 16:12:47 -0600
Subject: Andrew Dalke's space example (was Re: [Csv] csv) 
In-Reply-To: <20030217000613.CFE2D3CC5C@coffee.object-craft.com.au>
References: <1ddc01c2d4cb$b2bd1160$67c8a8c0@jethro>
        <oprknd9q0tm50ryr@localhost>
        <oprkopdwz4m50ryr@localhost>
        <20030216231723.202913CC5C@coffee.object-craft.com.au>
        <oprkpodbh6m50ryr@localhost>
        <20030217000613.CFE2D3CC5C@coffee.object-craft.com.au>
Message-ID: <15965.15327.198910.214585@montanaro.dyndns.org>


    Andrew> Note that the registry stuff is entirely optional. You can pass
    Andrew> a class or instance as the dialect, and it will work as
    Andrew> expected. The doco should probably be updated to mention this.

So noted in the docs.

Skip

From skip at pobox.com  Wed Feb 26 23:17:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 16:17:27 -0600
Subject: [Csv] What's our status?
In-Reply-To: <1046297447.27223.19.camel@software1.logiplex.internal>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
        <1046280720.27223.9.camel@software1.logiplex.internal>
        <1046297447.27223.19.camel@software1.logiplex.internal>
Message-ID: <15965.15607.38055.873934@montanaro.dyndns.org>


    Cliff> Okay, here's my status:

    Cliff> 1) I can sniff the quotechar.
    Cliff> 2) I can sniff the delimiter IF:
    ...
    Cliff> However, for the following I am so far unable to come up with a
    Cliff> way to determine the delimiter:
    ...

Can you check in what you have so we can poke it a bit?

Also, I suspect this whole thing should be a package.  That is, the csv
utils module should be csv.utils not csvutils.  Comments?

Skip

From djc at object-craft.com.au  Thu Feb 27 00:03:35 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 27 Feb 2003 10:03:35 +1100
Subject: [Csv] ignore blank lines?
In-Reply-To: <15965.13529.765871.804925@montanaro.dyndns.org>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
	<m31y2e5sld.fsf@ferret.object-craft.com.au>
	<15945.37976.592926.369940@montanaro.dyndns.org>
	<oprkghsll3m50ryr@localhost>
	<15946.25610.202929.623399@montanaro.dyndns.org>
	<oprkh0ixoom50ryr@localhost>
	<15948.38242.974158.425677@montanaro.dyndns.org>
	<oprkkuuarym50ryr@localhost>
	<15949.16699.749757.280021@montanaro.dyndns.org>
	<oprklnrmdgm50ryr@localhost>
	<15965.13529.765871.804925@montanaro.dyndns.org>
Message-ID: <m3r89ur6ew.fsf@ferret.object-craft.com.au>

>>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:

Skip> Returning to an old pre-vacation topic...

>>> Well, I would argue that a row of commas just means a row of empty
>>> strings.

John> It can mean that the database has a row with all values NULL, or
John> some other equally disturbing circumstance.

Skip> We've already established that there is no way to store
Skip> NULL/None values in a CSV file and have them be reliably
Skip> reconstituted when the file is read back in.

That is not strictly true.  We could come up with a dialect parameter
which is unique to the Python csv module which does this:

        abc,null,def   <-> ['abc', None, 'def']
        abc,"null",def <-> ['abc', 'null', 'def']
        abc,,def       <-> ['abc', '', 'def']

This would allow us to provide a format which was even more useful for
DB-API users.
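The mapping would be something like this (a sketch of the proposed
parameter only -- this is not an existing csv module feature, and the
helper names are made up; real code would also handle quoting of
delimiters):

```python
# Proposed round-trip: a bare (unquoted) `null` token maps to None,
# while a quoted "null" stays the literal string 'null'.

def decode_field(raw, was_quoted):
    """Convert one raw CSV field to its Python value."""
    if not was_quoted and raw == 'null':
        return None
    return raw

def encode_field(value):
    """Convert one Python value back to its CSV representation."""
    if value is None:
        return 'null'
    if value == 'null':
        return '"null"'   # quote to distinguish from None
    return value
```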

- Dave

-- 
http://www.object-craft.com.au


From skip at pobox.com  Thu Feb 27 00:24:22 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 17:24:22 -0600
Subject: [Csv] ignore blank lines?
In-Reply-To: <m3r89ur6ew.fsf@ferret.object-craft.com.au>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
        <m31y2e5sld.fsf@ferret.object-craft.com.au>
        <15945.37976.592926.369940@montanaro.dyndns.org>
        <oprkghsll3m50ryr@localhost>
        <15946.25610.202929.623399@montanaro.dyndns.org>
        <oprkh0ixoom50ryr@localhost>
        <15948.38242.974158.425677@montanaro.dyndns.org>
        <oprkkuuarym50ryr@localhost>
        <15949.16699.749757.280021@montanaro.dyndns.org>
        <oprklnrmdgm50ryr@localhost>
        <15965.13529.765871.804925@montanaro.dyndns.org>
        <m3r89ur6ew.fsf@ferret.object-craft.com.au>
Message-ID: <15965.19622.423269.684185@montanaro.dyndns.org>


    Dave> That is not strictly true.  We could come up with a dialect
    Dave> parameter which is unique to the Python csv module which does
    Dave> this:

    Dave>         abc,null,def   <-> ['abc', None, 'def']
    Dave>         abc,"null",def <-> ['abc', 'null', 'def']
    Dave>         abc,,def       <-> ['abc', '', 'def']

-1.  Too much chance for confusion and mistakes.  Quotes are for quoting,
not for data typing.

Skip

From LogiplexSoftware at earthlink.net  Thu Feb 27 02:10:24 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 26 Feb 2003 17:10:24 -0800
Subject: [Csv] What's our status?
In-Reply-To: <1046297447.27223.19.camel@software1.logiplex.internal>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
	 <1046280720.27223.9.camel@software1.logiplex.internal>
	 <1046297447.27223.19.camel@software1.logiplex.internal>
Message-ID: <1046308224.27222.68.camel@software1.logiplex.internal>

On Wed, 2003-02-26 at 14:10, Cliff Wells wrote:
> On Wed, 2003-02-26 at 09:32, Cliff Wells wrote:
> 
> > I'm working on csvutils.py right now.  The guessDelimiter() function
> > from DSV isn't really the best for our purposes as it expects a fairly
> > fixed number of columns and we're allowing for variable columns per row.
> > Also, allowing spaces around delimiters is going to throw
> > guessQuoteChar().  I've got some ideas for fixing guessQuoteChar() but
> > guessDelimiter is going to need an entirely new approach (which I think
> > I have an idea for =)
> 
> Okay, here's my status:
> 
> 1) I can sniff the quotechar.
> 2) I can sniff the delimiter IF:
>     a) there is a quotechar [determine delimiter based on relation to 
>        quotechar].
>        or
>     b) the data is regular, that is, the number of columns doesn't vary
>        a lot from record to record [based upon number of occurrences of 
>        delimiter in each record, to grossly simplify things].  This is  
>        the method DSV uses.
> 
> However, for the following I am so far unable to come up with a way to
> determine the delimiter:
> 
> all,work,and,no,play,makes,jack,a,dull,boy
> all,work,and,no,play,makes,jack,a,dull
> boy
> all,work,and,no,play,makes,jack,a
> dull,boy
> all,work,and,no,play,makes,jack
> a,dull,boy
> all,work,and,no,play,makes
> jack,a,dull,boy
> all,work,and,no,play
> makes,jack,a,dull,boy
> all,work,and,no
> play,makes,jack,a,dull,boy
> all,work,and
> no,play,makes,jack,a,dull,boy

Okay, banging my head against a wall here.  Consider this "CSV" file:

all
work
and
no
play
makes
jack
a
dull
boy

I don't see why this wouldn't be considered valid CSV, yet there is
clearly no delimiter (assuming there would have been one had each row
contained more than one column).  It seems we could just pass ',' as the
delimiter since it won't be used anyway until we encounter:

redrum
redrum
redrum
re,drum

Where "," is actually part of the data (assume for a moment that \t was
the delimiter).

Further, consider that any of the characters ('r', 'e', 'd', 'u', 'm')
could possibly be considered a delimiter (not likely though, and I'd be
willing to limit possibilities to string.punctuation + string.whitespace
for these situations if I thought it would really help).

It's becoming clear to me that without the constraints I mentioned
earlier (a valid quotechar, or a mostly fixed number of columns)
there is no good way to sniff the format. This seems unfortunate because
the formats that are unsniffable are the simplest possible cases.

Sigh.  Will think about it more but I'm becoming more pessimistic the
longer I look at it.

OTOH, I personally don't have a big problem with the constraints [just a
small one].  The DSV sniffers have been used by a lot of people without
complaint and they required fixed column widths regardless of whether
there was a quotechar or not and we're actually doing a bit better than
that right now.


-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308


From skip at pobox.com  Thu Feb 27 02:15:07 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 26 Feb 2003 19:15:07 -0600
Subject: [Csv] What's our status?
In-Reply-To: <1046308224.27222.68.camel@software1.logiplex.internal>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
        <1046280720.27223.9.camel@software1.logiplex.internal>
        <1046297447.27223.19.camel@software1.logiplex.internal>
        <1046308224.27222.68.camel@software1.logiplex.internal>
Message-ID: <15965.26267.85191.93543@montanaro.dyndns.org>

    Cliff> Okay, banging my head against a wall here.  Consider this "CSV"
    Cliff> file:

    Cliff> all
    Cliff> work
    Cliff> and
    Cliff> no
    Cliff> play
    Cliff> makes
    Cliff> jack
    Cliff> a
    Cliff> dull
    Cliff> boy

Is there something that suggests a sniffer can't fail to decide/guess?

    Cliff> OTOH, I personally don't have a big problem with the constraints
    Cliff> [just a small one].  The DSV sniffers have been used by a lot of
    Cliff> people without complaint and they required fixed column widths
    Cliff> regardless of whether there was a quotechar or not and we're
    Cliff> actually doing a bit better than that right now.

So maybe we make constant number of columns a constraint?

Skip

From andrewm at object-craft.com.au  Thu Feb 27 02:19:16 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 27 Feb 2003 12:19:16 +1100
Subject: [Csv] What's our status? 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15965.26267.85191.93543@montanaro.dyndns.org> 
References: <15964.55486.242989.782539@montanaro.dyndns.org>
	<1046280720.27223.9.camel@software1.logiplex.internal>
	<1046297447.27223.19.camel@software1.logiplex.internal>
	<1046308224.27222.68.camel@software1.logiplex.internal>
	<15965.26267.85191.93543@montanaro.dyndns.org> 
Message-ID: <20030227011916.5E0453CC5C@coffee.object-craft.com.au>

>Is there something that suggests a sniffer can't fail to decide/guess?

That would be better than guessing wrong, I think.

>    Cliff> OTOH, I personally don't have a big problem with the constraints
>    Cliff> [just a small one].  The DSV sniffers have been used by a lot of
>    Cliff> people without complaint and they required fixed column widths
>    Cliff> regardless of whether there was a quotechar or not and we're
>    Cliff> actually doing a bit better than that right now.
>
>So maybe we make constant number of columns a constraint?

Or even a hint. Maybe the user of the module can provide some "educated
guesses" as to the nature of the file.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From LogiplexSoftware at earthlink.net  Thu Feb 27 02:46:36 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 26 Feb 2003 17:46:36 -0800
Subject: [Csv] What's our status?
In-Reply-To: <15965.26267.85191.93543@montanaro.dyndns.org>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
	 <1046280720.27223.9.camel@software1.logiplex.internal>
	 <1046297447.27223.19.camel@software1.logiplex.internal>
	 <1046308224.27222.68.camel@software1.logiplex.internal>
	 <15965.26267.85191.93543@montanaro.dyndns.org>
Message-ID: <1046310395.27223.91.camel@software1.logiplex.internal>

On Wed, 2003-02-26 at 17:15, Skip Montanaro wrote:
>     Cliff> Okay, banging my head against a wall here.  Consider this "CSV"
>     Cliff> file:
> 
>     Cliff> all
>     Cliff> work
>     Cliff> and
>     Cliff> no
>     Cliff> play
>     Cliff> makes
>     Cliff> jack
>     Cliff> a
>     Cliff> dull
>     Cliff> boy
> 
> Is there something that suggests a sniffer can't fail to decide/guess?

No, it's just unfortunate that what appear to be the simplest cases are
where it fails.

> 
>     Cliff> OTOH, I personally don't have a big problem with the constraints
>     Cliff> [just a small one].  The DSV sniffers have been used by a lot of
>     Cliff> people without complaint and they required fixed column widths
>     Cliff> regardless of whether there was a quotechar or not and we're
>     Cliff> actually doing a bit better than that right now.
> 
> So maybe we make constant number of columns a constraint?

Number of columns or quoted.  But perhaps that's confusing?

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308


From djc at object-craft.com.au  Thu Feb 27 09:33:14 2003
From: djc at object-craft.com.au (Dave Cole)
Date: 27 Feb 2003 19:33:14 +1100
Subject: [Csv] ignore blank lines?
In-Reply-To: <15965.19622.423269.684185@montanaro.dyndns.org>
References: <15945.28678.631121.25754@montanaro.dyndns.org>
	<m31y2e5sld.fsf@ferret.object-craft.com.au>
	<15945.37976.592926.369940@montanaro.dyndns.org>
	<oprkghsll3m50ryr@localhost>
	<15946.25610.202929.623399@montanaro.dyndns.org>
	<oprkh0ixoom50ryr@localhost>
	<15948.38242.974158.425677@montanaro.dyndns.org>
	<oprkkuuarym50ryr@localhost>
	<15949.16699.749757.280021@montanaro.dyndns.org>
	<oprklnrmdgm50ryr@localhost>
	<15965.13529.765871.804925@montanaro.dyndns.org>
	<m3r89ur6ew.fsf@ferret.object-craft.com.au>
	<15965.19622.423269.684185@montanaro.dyndns.org>
Message-ID: <m365r6nmwl.fsf@ferret.object-craft.com.au>


    Dave> That is not strictly true.  We could come up with a dialect
    Dave> parameter which is unique to the Python csv module which does
    Dave> this:

    Dave>         abc,null,def   <-> ['abc', None, 'def']
    Dave>         abc,"null",def <-> ['abc', 'null', 'def']
    Dave>         abc,,def       <-> ['abc', '', 'def']

Skip> -1.  Too much chance for confusion and mistakes.  Quotes are for
Skip> quoting, not for data typing.

The point is to provide a round-trip for the DB-API.  I think you
would have rocks in your head if you tried to use or create this data
with anything other than the CSV module and the DB-API.

Anyway, it was just a thought.

- Dave

-- 
http://www.object-craft.com.au


From sjmachin at lexicon.net  Thu Feb 27 13:12:16 2003
From: sjmachin at lexicon.net (John Machin)
Date: Thu, 27 Feb 2003 23:12:16 +1100
Subject: [Csv] What's our status?
In-Reply-To: <1046297447.27223.19.camel@software1.logiplex.internal>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
	<1046280720.27223.9.camel@software1.logiplex.internal>
	<1046297447.27223.19.camel@software1.logiplex.internal>
Message-ID: <oprk85iqu3m50ryr@localhost>

On 26 Feb 2003 14:10:48 -0800, Cliff Wells <LogiplexSoftware at earthlink.net> 
wrote:

>
> However, for the following I am so far unable to come up with a way to
> determine the delimiter:
>
> all,work,and,no,play,makes,jack,a,dull,boy
> all,work,and,no,play,makes,jack,a,dull
> boy
> all,work,and,no,play,makes,jack,a
[snip]

>
> Anyone have a suggestion?  All work and no play makes jack a dull boy.

[Warning: late at night, OTTOMH, may contain babblings]

Errrmmm, maybe I've missed the plot or lost the point or whatever, but a 
good start would be assuming that only in pathological cases would the 
delimiter or the quote be an alphanumeric character i.e. the file has been 
produced by an ordinary user, not a red-team tester.

Try the most frequent two non-alphanumeric characters as the candidates for 
the delimiter and the quotechar? If there's only 1 non-alphanumeric 
character, then it's the delimiter.
If there aren't any non-AN chars [an example in one of your messages], then 
there's only one field per record.

Where there are two or more candidates for the delimiter and quotechar, you 
could use some plausibility  heuristics e.g. " and ' are more likely to be 
quotes than delimiters however tab, comma, semicolon, colon, vertical bar, 
and tilde are plausible delimiters.
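That frequency heuristic might be sketched as (illustrative only, function name made up):

```python
from collections import Counter

def candidate_delimiters(sample, max_candidates=2):
    """Return the most frequent non-alphanumeric characters in the
    sample as delimiter/quotechar candidates."""
    counts = Counter(c for c in sample
                     if not c.isalnum() and c not in '\r\n')
    return [c for c, _ in counts.most_common(max_candidates)]
```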

Some cautions:

(1) "Warning -- Europeans here";1,234;5,678

(2) Joe Blow~'The Vaults',456 Main 
St,Snowtown,SA,5999~31/12/1999~01/04/2000
# delimiter (tilde) occurs 3 times, no quotechar at all, data characters 
comma and slash occur 4 times each (more than delimiter).

In any case, it appears to me that you can't pronounce on the result until 
you've parsed a large chunk of the file with each plausible hypothesis, 
especially if the hypothesis admits (quoted) newlines inside the data. Some 
possible decision criteria are (1) percentage of syntax errors (2) standard 
deviation of number of columns ...
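Criterion (2) might be sketched like this (a rough illustration, not
working sniffer code -- it naively counts delimiters per line and
ignores quoted fields and embedded newlines):

```python
import statistics

def column_count_stddev(sample, delim):
    """Standard deviation of the per-line column count under a
    hypothetical delimiter."""
    counts = [line.count(delim) + 1
              for line in sample.splitlines() if line]
    return statistics.pstdev(counts)

def best_delimiter(sample, candidates):
    """Prefer the candidate giving the most regular column counts."""
    return min(candidates, key=lambda d: column_count_stddev(sample, d))
```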

Hope this helps,
John

From LogiplexSoftware at earthlink.net  Thu Feb 27 18:07:58 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 27 Feb 2003 09:07:58 -0800
Subject: [Csv] What's our status?
In-Reply-To: <oprk85iqu3m50ryr@localhost>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
	 <1046280720.27223.9.camel@software1.logiplex.internal>
	 <1046297447.27223.19.camel@software1.logiplex.internal>
	 <oprk85iqu3m50ryr@localhost>
Message-ID: <1046365677.27223.119.camel@software1.logiplex.internal>

On Thu, 2003-02-27 at 04:12, John Machin wrote:
> On 26 Feb 2003 14:10:48 -0800, Cliff Wells <LogiplexSoftware at earthlink.net> 
> wrote:
> 
> >
> > However, for the following I am so far unable to come up with a way to
> > determine the delimiter:
> >
> > all,work,and,no,play,makes,jack,a,dull,boy
> > all,work,and,no,play,makes,jack,a,dull
> > boy
> > all,work,and,no,play,makes,jack,a
> [snip]
> 
> >
> > Anyone have a suggestion?  All work and no play makes jack a dull boy.
> 
> [Warning: late at night, OTTOMH, may contain babblings]

I started babbling yesterday while working on this.  Luckily the interns
came and gave me my injection.  However, it's difficult to type with
these leather straps on and I can't quite reach the buckles with my
teeth...

> Errrmmm, maybe I've missed the plot or lost the point or whatever, but a 
> good start would be assuming that only in pathological cases would the 
> delimiter or the quote be an alphanumeric character i.e. the file has been 
> produced by an ordinary user, not a red-team tester.

I'm willing to make that assumption for this case, but read on...

> Try the most frequent two non-alphanumeric characters as the candidates for 
> the delimiter and the quotechar? If there's only 1 non-alphanumeric 
> character, then it's the delimiter.

If we have a quotechar, then the problem is solved.  Unfortunately the
situation I expect here is that there will be more than one
non-alphanumeric character per line.  It's quite common to see
dates/timestamps in *every* row of a csv file:

data,2003/02/27,08:51:00
data,2003/02/27,08:52:00
data,2003/02/27,08:53:00
data,2003/02/27,08:54:00

In this case it is difficult to know whether ",", "/" or ":" is the delimiter. 
It's not entirely unreasonable to use a "preferred" list of delimiters
but it's not entirely safe either ;)  In fact, the current
implementation will resort to a preferred list in this example and
return , as the delimiter.  However, given the following:

2003/02/27,08:51:00
data,2003/02/27,08:52:00
08:53:00
data,2003/02/27,08:54:00

It would most likely (without testing) return ":" as the delimiter as it
occurs equally consistently with "/", but is higher in the preferred
list.  This is wrong as the delimiter is clearly ",".  That being said,
I would simply consider this file as being unsniffable as it has no real
pattern.
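The frequency-plus-preferred-list behavior described above can be sketched roughly as follows. This is purely illustrative (the function name and the PREFERRED ordering are made up for this example, not taken from the actual sniffer):

```python
from collections import Counter

# Illustrative sketch: keep non-alphanumeric characters whose per-line
# count is identical on every line, then break ties with a preferred list.
PREFERRED = [',', '\t', ';', ':', '/']

def guess_delimiter(sample):
    lines = [ln for ln in sample.splitlines() if ln]
    counts = [Counter(c for c in ln if not c.isalnum()) for ln in lines]
    # characters appearing the same number of times on every line
    consistent = set(counts[0])
    for cnt in counts[1:]:
        consistent &= {ch for ch in cnt if cnt[ch] == counts[0][ch]}
    for ch in PREFERRED:
        if ch in consistent:
            return ch
    return None

regular = "data,2003/02/27,08:51:00\ndata,2003/02/27,08:52:00\n"
print(guess_delimiter(regular))    # ','  -- all three are consistent; ',' preferred

irregular = ("2003/02/27,08:51:00\n"
             "data,2003/02/27,08:52:00\n"
             "08:53:00\n"
             "data,2003/02/27,08:54:00\n")
print(guess_delimiter(irregular))  # ':'  -- only ':' is consistent here, but ',' was meant
```

The second sample shows exactly the failure mode above: the irregular field counts make ":" look like the most consistent candidate even though "," is clearly the delimiter.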

> If there aren't any non-AN chars [an example in one of your messages], then 
> there's only one field per record.

Hm.  That might actually be useful.

> Where there are two or more candidates for the delimiter and quotechar, you 
> could use some plausibility  heuristics e.g. " and ' are more likely to be 
> quotes than delimiters however tab, comma, semicolon, colon, vertical bar, 
> and tilde are plausible delimiters.

As I mentioned earlier, quotes are already handled.  If quotes are
present, I think the current implementation is good enough to handle
most files.

> Some cautions:
> 
> (1) "Warning -- Europeans here";1,234;5,678

So you see my point =)

> (2) Joe Blow~'The Vaults',456 Main 
> St,Snowtown,SA,5999~31/12/1999~01/04/2000
> # delimiter (tilde) occurs 3 times, no quotechar at all, data characters 
> comma and slash occur 4 times each (more than delimiter).

Yes, I've already decided that frequency by itself isn't a useful
measurement.  This particular example is invalid though, as 

~'The Vaults',456

is an error (IMHO).  'The Vaults' appears quoted but isn't followed by a
delimiter or a space.

> In any case, it appears to me that you can't pronounce on the result until 
> you've parsed a large chunk of the file with each plausible hypothesis, 
> especially if the hypothesis admits (quoted) newlines inside the data. Some 
> possible decision criteria are (1) percentage of syntax errors (2) standard 
> deviation of number of columns ...

Actually, the existing implementation is able to make a pronouncement
after sniffing only a small portion of the file.  I'm going to get it
into the sandbox today so others can take a look at it.  The only real
snag is the exact scenario I mentioned earlier (no quoted data with
varying numbers of fields per row).

BTW, I'm +1 on Skip's suggestion to make the utils a package (csv.utils)
and will check it into CVS as such.  Anyone object?


-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308


From skip at pobox.com  Thu Feb 27 18:15:57 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 27 Feb 2003 11:15:57 -0600
Subject: [Csv] What's our status?
In-Reply-To: <1046365677.27223.119.camel@software1.logiplex.internal>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
        <1046280720.27223.9.camel@software1.logiplex.internal>
        <1046297447.27223.19.camel@software1.logiplex.internal>
        <oprk85iqu3m50ryr@localhost>
        <1046365677.27223.119.camel@software1.logiplex.internal>
Message-ID: <15966.18381.859532.112020@montanaro.dyndns.org>


    Cliff> data,2003/02/27,08:51:00
    Cliff> data,2003/02/27,08:52:00
    Cliff> data,2003/02/27,08:53:00
    Cliff> data,2003/02/27,08:54:00

    Cliff> In this case it is difficult to know whether ",", "/" or ":" is the
    Cliff> delimiter.  It's not entirely unreasonable to use a "preferred"
    Cliff> list of delimiters but it's not entirely safe either ;) In fact,
    Cliff> the current implementation will resort to a preferred list in
    Cliff> this example and return , as the delimiter.  However, given the
    Cliff> following:

    Cliff> 2003/02/27,08:51:00
    Cliff> data,2003/02/27,08:52:00
    Cliff> 08:53:00
    Cliff> data,2003/02/27,08:54:00

    Cliff> It would most likely (without testing) return ":" as the
    Cliff> delimiter as it occurs equally consistently with "/", but is
    Cliff> higher in the preferred list.  This is wrong as the delimiter is
    Cliff> clearly ",".  That being said, I would simply consider this file
    Cliff> as being unsniffable as it has no real pattern.

How about this.  A candidate delimiter is preferred if two occurrences of it
enclose other candidate delimiters.  Conversely, a candidate delimiter in which
two occurrences only surround alphanumeric characters is deemed "less
worthy".
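That enclosure heuristic could be sketched like this (the name `enclosure_score` and the scoring values are invented for illustration; this is not code from the module):

```python
# A candidate gains a point when two of its occurrences enclose another
# candidate character, and loses one when they enclose only alphanumerics.
def enclosure_score(line, candidate, others):
    score = 0
    fields = line.split(candidate)
    for field in fields[1:-1]:   # text bracketed by two occurrences
        if any(ch in field for ch in others):
            score += 1           # encloses other candidates: preferred
        elif field.isalnum():
            score -= 1           # only alphanumerics: "less worthy"
    return score

line = "data,2003/02/27,08:51:00"
cands = {',', '/', ':'}
for c in sorted(cands):
    print(repr(c), enclosure_score(line, c, cands - {c}))
# ',' scores 1 (its occurrences enclose '/'); '/' and ':' each score -1
```

On the timestamp example this picks "," correctly, since only the comma's occurrences bracket the other candidates.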


    Cliff> BTW, I'm +1 on Skip's suggestion to make the utils a package
    Cliff> (csv.utils) and will check it into CVS as such.  Anyone object?

Nope, sorry I didn't get around to checking in the version you posted
yesterday.

Skip


From LogiplexSoftware at earthlink.net  Thu Feb 27 18:41:35 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 27 Feb 2003 09:41:35 -0800
Subject: [Csv] What's our status?
In-Reply-To: <15966.18381.859532.112020@montanaro.dyndns.org>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
	 <1046280720.27223.9.camel@software1.logiplex.internal>
	 <1046297447.27223.19.camel@software1.logiplex.internal>
	 <oprk85iqu3m50ryr@localhost>
	 <1046365677.27223.119.camel@software1.logiplex.internal>
	 <15966.18381.859532.112020@montanaro.dyndns.org>
Message-ID: <1046367694.27222.124.camel@software1.logiplex.internal>

On Thu, 2003-02-27 at 09:15, Skip Montanaro wrote:

> How about this.  A candidate delimiter is preferred if two occurrences of it
> enclose other candidate delimiters.  Conversely, a candidate delimiter in which
> two occurrences only surround alphanumeric characters is deemed "less
> worthy".

Sounds like a possibility.  But what about:

$1,234;Wells,Cliff

where ; is the delimiter?

> 
>     Cliff> BTW, I'm +1 on Skip's suggestion to make the utils a package
>     Cliff> (csv.utils) and will check it into CVS as such.  Anyone object?
> 
> Nope, sorry I didn't get around to checking in the version you posted
> yesterday.

No problem.

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308


From skip at pobox.com  Thu Feb 27 19:00:36 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 27 Feb 2003 12:00:36 -0600
Subject: [Csv] What's our status?
In-Reply-To: <1046367694.27222.124.camel@software1.logiplex.internal>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
        <1046280720.27223.9.camel@software1.logiplex.internal>
        <1046297447.27223.19.camel@software1.logiplex.internal>
        <oprk85iqu3m50ryr@localhost>
        <1046365677.27223.119.camel@software1.logiplex.internal>
        <15966.18381.859532.112020@montanaro.dyndns.org>
        <1046367694.27222.124.camel@software1.logiplex.internal>
Message-ID: <15966.21060.408754.315103@montanaro.dyndns.org>


    Cliff> Sounds like a possibility.  But what about:

    Cliff> $1,234;Wells,Cliff

    Cliff> where ; is the delimiter?

Oh, I'm sure we can always construct perfectly reasonable (that is, not "red
team") examples where any of these heuristics fail.  That's why it's best to
use the sniffers as hints, not the word of God.

How about returning a list of candidate delimiters, ordered from most likely
to least likely?  How about counting the number of cells generated using
different candidate delimiters and returning the candidate which creates the
most cells or average row lengths with the smallest standard deviation?  How
about allowing the user to specify a sample cell value which occurs in the
data (e.g., sample="benzene" in Andrew's example, which allows you to easily
identify SPC as the delimiter)?
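The smallest-standard-deviation idea might look something like this (a rough sketch; `most_consistent` and its candidate list are illustrative, and ties are broken toward more fields and then by candidate order):

```python
import statistics

# Split a sample on each candidate delimiter and prefer the one giving
# the most consistent (lowest-spread) number of fields per row.
def most_consistent(sample, candidates=(',', ';', '\t', ':', '/')):
    lines = [ln for ln in sample.splitlines() if ln]
    def spread(delim):
        counts = [len(ln.split(delim)) for ln in lines]
        # (population std-dev, negated mean): lower spread wins,
        # then more fields per row wins
        return (statistics.pstdev(counts), -statistics.mean(counts))
    return min(candidates, key=spread)

sample = "data,2003/02/27,08:51:00\ndata,2003/02/27,08:52:00\n"
print(most_consistent(sample))  # ','
```

Note that on the timestamp sample ",", "/" and ":" all tie on both criteria, so the result still depends on candidate ordering, which is the same "preferred list" weakness already discussed.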

I've never seen any spreadsheet-like application guess the delimiter without
some user input.  Importing CSV files in Gnumeric is rather fun.  You select
the delimiters and watch it split the input on-the-fly.  It's cool to see it
go from one jumbled column of data to a nicely aligned spreadsheet.

Skip

From LogiplexSoftware at earthlink.net  Fri Feb 28 00:20:17 2003
From: LogiplexSoftware at earthlink.net (Cliff Wells)
Date: 27 Feb 2003 15:20:17 -0800
Subject: [Csv] What's our status?
In-Reply-To: <15966.21060.408754.315103@montanaro.dyndns.org>
References: <15964.55486.242989.782539@montanaro.dyndns.org>
	 <1046280720.27223.9.camel@software1.logiplex.internal>
	 <1046297447.27223.19.camel@software1.logiplex.internal>
	 <oprk85iqu3m50ryr@localhost>
	 <1046365677.27223.119.camel@software1.logiplex.internal>
	 <15966.18381.859532.112020@montanaro.dyndns.org>
	 <1046367694.27222.124.camel@software1.logiplex.internal>
	 <15966.21060.408754.315103@montanaro.dyndns.org>
Message-ID: <1046388017.29491.248.camel@software1.logiplex.internal>

On Thu, 2003-02-27 at 10:00, Skip Montanaro wrote:
>     Cliff> Sounds like a possibility.  But what about:
> 
>     Cliff> $1,234;Wells,Cliff
> 
>     Cliff> where ; is the delimiter?
> 
> Oh, I'm sure we can always construct perfectly reasonable (that is, not "red
> team") examples where any of these heuristics fail.  That's why it's best to
> use the sniffers as hints, not the word of God.

Agreed.  But I'd still like to think of some clever way of resolving the
above.

> How about returning a list of candidate delimiters, ordered from most likely
> to least likely?  How about counting the number of cells generated using
> different candidate delimiters and returning the candidate which creates the
> most cells or average row lengths with the smallest standard deviation?  How

This is basically what it does now.  Except for the most cells bit,
which I consider too unreliable.  As long as the number of cells is
supposed to be fairly consistent, it should work.

> about allowing the user to specify a sample cell value which occurs in the
> data (e.g., sample="benzene" in Andrew's example, which allows you to easily
> identify SPC as the delimiter)?

Returning a list is a possibility.  I considered it when developing DSV
but couldn't think of a good use for it since the user was going to
confirm the selections anyway via the dialog.

> I've never seen any spreadsheet-like application guess the delimiter without
> some user input.  Importing CSV files in Gnumeric is rather fun.  You select
> the delimiters and watch it split the input on-the-fly.  It's cool to see it
> go from one jumbled column of data to a nicely aligned spreadsheet.

Hmph.  And DSV gets no credit for doing the same? <wink>  Actually,
Excel (and DSV) make a pretty good stab at the delimiter and then let
you modify their guesses via a preview dialog.  That's pretty much how I
always intended the sniffer to be used, so I suppose maybe I shouldn't
worry about it too much. Can't seem to help it though ;)


BTW, as far as making utils a sub-package of csv, do you intend this:

csv.utils (contains all utils in csv/utils.py)

or do you mean:

csv.utils.sniffer (csv/utils/sniffer.py, etc)

I personally prefer the latter as I can see utils encompassing a lot of
stuff, perhaps not all of it directly related and a utils.py file would
become rather large.  However, my packaging skills aren't the greatest,
so I'm a bit confused as to what __init__.py should contain so that we
aren't required to type "from csv import csv" instead of just "import
csv".
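For what it's worth, the second layout might look like the following sketch (file and module names are illustrative, not decided):

```python
# Hypothetical layout for the csv.utils sub-package:
#
#   csv/
#       __init__.py          # defines (or re-exports) the core csv API
#       utils/
#           __init__.py
#           sniffer.py
#
# As long as csv/__init__.py itself provides the public names (reader,
# writer, and friends), plain "import csv" keeps working and nobody has
# to type "from csv import csv".  The sub-modules are then reached with
# "from csv.utils import sniffer", and csv/utils/__init__.py can be
# empty, or re-export its submodules, e.g.:
#
#   from csv.utils import sniffer
```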



-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308