From andrewm at object-craft.com.au  Wed Jan  5 08:06:43 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Jan 2005 18:06:43 +1100
Subject: [Csv] csv module TODO list
Message-ID: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>

There's a bunch of jobs we (CSV module maintainers) have been putting
off - attached is a list (in no particular order): 

* unicode support (this will probably uglify the code considerably).

* 8 bit transparency (specifically, allow \0 characters in source string
  and as delimiters, etc).

* Reader and universal newlines don't interact well, and the reader doesn't
  honour the Dialect's lineterminator setting. All outstanding bug ids
  (789519, 944890, 967934 and 1072404) are related to this - it's 
  a difficult problem and further discussion is needed.

* compare PEP-305 and library reference manual to the module as implemented
  and either document the differences or correct them.

* Address or document Francis Avila's issues as mentioned in this posting:

    http://www.google.com.au/groups?selm=vsb89q1d3n5qb1%40corp.supernews.com

* Several blogs complain that the CSV module is no good for parsing
  strings. Suggest making it clearer in the documentation that the reader
  accepts an iterable, rather than a file, and document why an iterable
  (as opposed to a string) is necessary (multi-line records with embedded
  newlines). We could also provide an interface that parses a single
  string (or the old Object Craft interface) for those that really feel
  the need (a small sketch follows this list). See:

    http://radio.weblogs.com/0124960/2003/09/12.html
    http://zephyrfalcon.org/weblog/arch_d7_2003_09_06.html#e335

* Compatibility API for old Object Craft CSV module?

    http://mechanicalcat.net/cgi-bin/log/2003/08/18

  For example: "from csv.legacy import reader" or something.

* Pure python implementation? 

* Some CSV-like formats consider a quoted field a string, and an unquoted
  field a number - consider supporting this in the Reader and Writer. See:

    http://radio.weblogs.com/0124960/2004/04/23.html

* Add line number and record number counters to reader object?

* it's possible to get the csv parser to suck the whole source file
  into memory with an unmatched quote character. Need to limit size of
  internal buffer.
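
(A small sketch of the string-parsing point above - not a proposed API, just
the existing reader fed something other than a file; the sample data is
made up:)

    import csv
    from StringIO import StringIO

    # A single physical line can simply be wrapped in a list...
    row = csv.reader(['spam,"eggs, ham",42']).next()

    # ...but a record with an embedded newline needs a real line iterator,
    # which is why the reader wants an iterable rather than a string.
    data = 'a,"multi\nline field",c\r\n'
    rows = list(csv.reader(StringIO(data)))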

Also, review comments from Neal Norwitz, 22 Mar 2003 (some of these should
already have been addressed):

* remove TODO comment at top of file--it's empty
* is CSV going to be maintained outside the python tree?
  If not, remove the 2.2 compatibility macros for:
         PyDoc_STR, PyDoc_STRVAR, PyMODINIT_FUNC, etc.
* inline the following functions since they are used only in one place
        get_string, set_string, get_nullchar_as_None, set_nullchar_as_None,
        join_reset (maybe)
* rather than use PyErr_BadArgument, should you use assert?
        (first example, Dialect_set_quoting, line 218)
* is it necessary to have Dialect_methods, can you use 0 for tp_methods?
* remove commented out code (PyMem_DEL) on line 261
        Have you used valgrind on the test to find memory overwrites/leaks?
* PyString_AsString()[0] on line 331 could return NULL in which case
        you are dereferencing a NULL pointer
* not sure why there are casts on 0 pointers
        lines 383-393, 733-743, 1144-1154, 1164-1165
* Reader_getiter() can be removed and use PyObject_SelfIter()
* I think you need PyErr_NoMemory() before returning on line 768, 1178
* is PyString_AsString(self->dialect->lineterminator) on line 994
        guaranteed not to return NULL?  If not, it could crash by
        passing to memmove.
* PyString_AsString() can return NULL on line 1048 and 1063, 
        the result is passed to join_append()
* iteratable should be iterable?  (line 1088)
* why doesn't csv_writerows() have a docstring?  csv_writerow does
* any PyUnicode_* methods should be protected with #ifdef Py_USING_UNICODE
* csv_unregister_dialect, csv_get_dialect could use METH_O 
        so you don't need to use PyArg_ParseTuple
* in init_csv, recommend using 
        PyModule_AddIntConstant and PyModule_AddStringConstant
        where appropriate

Also, review comments from Jeremy Hylton, 10 Apr 2003:

    I've been reviewing extension modules looking for C types that should
    participate in garbage collection.  I think the csv ReaderObj and
    WriterObj should participate.  The ReaderObj contains a reference to
    input_iter that could be an arbitrary Python object.  The iterator
    object could well participate in a cycle that refers to the ReaderObj.
    The WriterObj has a reference to a writeline callable, which could well
    be a method of an object that also points to the WriterObj.

    The Dialect object appears to be safe, because the only PyObject * it
    refers to should be a string.  Safe until someone creates an insane string
    subclass <0.4 wink>.

    Also, an unrelated comment about the code: the lineterminator of the
    Dialect is managed by a collection of little helper functions like
    get_string, set_string, etc.  This code appears to be excessively
    general; since they're called only once, it seems clearer to inline the
    logic directly in the get/set methods for the lineterminator.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Wed Jan  5 08:33:04 2005
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 5 Jan 2005 01:33:04 -0600
Subject: [Csv] csv module TODO list
In-Reply-To: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
Message-ID: <16859.38960.9935.682429@montanaro.dyndns.org>


    Andrew> There's a bunch of jobs we (CSV module maintainers) have been
    Andrew> putting off - attached is a list (in no particular order):

    ...

In addition, it occurred to me this evening that there's functionality in
the csv module I don't think anybody uses.  For example, you can register
CSV dialects by name, then pass in the string name instead of the dialect
class.  I'd be in favor of scrapping list_dialects, register_dialect and
unregister_dialect altogether.  While they are probably trivial little
functions, I don't think they add much, if anything, to the implementation,
and just complicate the _csv extension module slightly.  I'm also not aware that
anyone really uses the Sniffer class, though it does provide some useful
functionality should you need to analyze random CSV files.
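
(For reference, the name-based registration being talked about looks roughly
like this - the dialect name and file name are made up:)

    import csv

    class pipes(csv.Dialect):
        delimiter = '|'
        quotechar = '"'
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = csv.QUOTE_MINIMAL

    # Register under a name, then refer to the dialect by that string.
    csv.register_dialect('pipes', pipes)
    reader = csv.reader(open('data.txt', 'rb'), dialect='pipes')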

Skip

From andrewm at object-craft.com.au  Wed Jan  5 08:55:06 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Jan 2005 18:55:06 +1100
Subject: [Python-Dev] Re: [Csv] csv module TODO list 
In-Reply-To: <16859.38960.9935.682429@montanaro.dyndns.org> 
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<16859.38960.9935.682429@montanaro.dyndns.org>
Message-ID: <20050105075506.314C93C8E5@coffee.object-craft.com.au>

>    Andrew> There's a bunch of jobs we (CSV module maintainers) have been
>    Andrew> putting off - attached is a list (in no particular order):
>    ...
>
>In addition, it occurred to me this evening that there's functionality in
>the csv module I don't think anybody uses.  

It's very difficult to say for sure that nobody is using it once it's
released to the world.

>For example, you can register CSV dialects by name, then pass in the
>string name instead of the dialect class.  I'd be in favor of scrapping
>list_dialects, register_dialect and unregister_dialect altogether.  While
>they are probably trivial little functions I don't think they add much if
>anything to the implementation and just complicate the _csv extension
>module slightly.  

Yes, in hindsight, they're not really necessary, although I'm sure we
had some motivation for them initially. That said, they're there now,
and they shouldn't require much maintenance.

>I'm also not aware that anyone really uses the Sniffer class, though it
>does provide some useful functionality should you need to analyze random
>CSV files.

The comment I get repeatedly is that they don't use it because it's
"too magic/scary". That's as it should be. But if it didn't exist,
then someone would be requesting we add it... 8-)

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Jan  5 10:34:14 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Jan 2005 20:34:14 +1100
Subject: [Csv] Re: [Python-Dev] csv module TODO list 
In-Reply-To: <41DBAF06.6020401@egenix.com> 
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com>
Message-ID: <20050105093414.00DFF3C8E5@coffee.object-craft.com.au>

>> Andrew McNamara wrote:
>>> There's a bunch of jobs we (CSV module maintainers) have been putting
>>> off - attached is a list (in no particular order):
>>> * unicode support (this will probably uglify the code considerably).
>> 
>Martin v. Löwis wrote:
>> Can you please elaborate on that? What needs to be done, and how is
>> that going to be done? It might be possible to avoid considerable
>> uglification.

I'm not altogether sure there. The parsing state machine is all written in
C, and deals with signed chars - I expect we'll need two versions of that
(or one version that's compiled twice using pre-processor macros). Quite
a large job. Suggestions gratefully received.

M.-A. Lemburg wrote:
>Indeed. The trick is to convert to Unicode early and to use Unicode
>literals instead of string literals in the code.

Yes, although it would be nice to also retain the 8-bit versions as well.

>Note that the only real-life Unicode format in use is UTF-16
>(with BOM mark) written by Excel. Note that there's no standard
>for specifying the encoding in CSV files, so this is also the only
>feasible format.

Yes - that's part of the problem I hadn't really thought about yet - the
csv module currently interacts directly with files as iterators, but it's 
clear that we'll need to decode as we go.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From martin at v.loewis.de  Wed Jan  5 09:39:44 2005
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 05 Jan 2005 09:39:44 +0100
Subject: [Csv] Re: [Python-Dev] csv module TODO list
In-Reply-To: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
Message-ID: <41DBA7D0.80101@v.loewis.de>

Andrew McNamara wrote:
> There's a bunch of jobs we (CSV module maintainers) have been putting
> off - attached is a list (in no particular order): 
> 
> * unicode support (this will probably uglify the code considerably).

Can you please elaborate on that? What needs to be done, and how is
that going to be done? It might be possible to avoid considerable
uglification.

Regards,
Martin

From mal at egenix.com  Wed Jan  5 10:10:30 2005
From: mal at egenix.com (M.-A. Lemburg)
Date: Wed, 05 Jan 2005 10:10:30 +0100
Subject: [Csv] Re: [Python-Dev] csv module TODO list
In-Reply-To: <41DBA7D0.80101@v.loewis.de>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<41DBA7D0.80101@v.loewis.de>
Message-ID: <41DBAF06.6020401@egenix.com>

Martin v. Löwis wrote:
> Andrew McNamara wrote:
> 
>> There's a bunch of jobs we (CSV module maintainers) have been putting
>> off - attached is a list (in no particular order):
>> * unicode support (this will probably uglify the code considerably).
> 
> 
> Can you please elaborate on that? What needs to be done, and how is
> that going to be done? It might be possible to avoid considerable
> uglification.

Indeed. The trick is to convert to Unicode early and to use Unicode
literals instead of string literals in the code.

Note that the only real-life Unicode format in use is UTF-16
(with BOM mark) written by Excel. Note that there's no standard
for specifying the encoding in CSV files, so this is also the only
feasible format.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 05 2005)
 >>> Python/Zope Consulting and Support ...        http://www.egenix.com/
 >>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
 >>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

From mal at egenix.com  Wed Jan  5 10:44:40 2005
From: mal at egenix.com (M.-A. Lemburg)
Date: Wed, 05 Jan 2005 10:44:40 +0100
Subject: [Csv] Re: [Python-Dev] csv module TODO list
In-Reply-To: <20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com>
	<20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
Message-ID: <41DBB708.5030501@egenix.com>

Andrew McNamara wrote:
>>>Andrew McNamara wrote:
>>>
>>>>There's a bunch of jobs we (CSV module maintainers) have been putting
>>>>off - attached is a list (in no particular order):
>>>>* unicode support (this will probably uglify the code considerably).
>>>
>>Martin v. Löwis wrote:
>>
>>>Can you please elaborate on that? What needs to be done, and how is
>>>that going to be done? It might be possible to avoid considerable
>>>uglification.
> 
> 
> I'm not altogether sure there. The parsing state machine is all written in
> C, and deals with signed chars - I expect we'll need two versions of that
> (or one version that's compiled twice using pre-processor macros). Quite
> a large job. Suggestions gratefully received.
> 
> M.-A. Lemburg wrote:
> 
>>Indeed. The trick is to convert to Unicode early and to use Unicode
>>literals instead of string literals in the code.
> 
> 
> Yes, although it would be nice to also retain the 8-bit versions as well.

You can do so by using latin-1 as default encoding. Works great !
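
(The property being relied on, for anyone following along: latin-1 maps every
byte value straight onto the code point with the same number, so a decode /
encode round trip is lossless for arbitrary 8-bit data. A two-line check:)

    data = ''.join([chr(i) for i in range(256)])
    assert data.decode('latin-1').encode('latin-1') == data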

>>Note that the only real-life Unicode format in use is UTF-16
>>(with BOM mark) written by Excel. Note that there's no standard
>>for specifying the encoding in CSV files, so this is also the only
>>feasible format.
> 
> Yes - that's part of the problem I hadn't really thought about yet - the
> csv module currently interacts directly with files as iterators, but it's 
> clear that we'll need to decode as we go.

Depends on your needs: CSV files tend to be small enough
to do the decoding in one call in memory.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 05 2005)
 >>> Python/Zope Consulting and Support ...        http://www.egenix.com/
 >>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
 >>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

From andrewm at object-craft.com.au  Wed Jan  5 11:03:25 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Jan 2005 21:03:25 +1100
Subject: [Csv] Re: [Python-Dev] csv module TODO list 
In-Reply-To: <41DBB708.5030501@egenix.com> 
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com>
	<20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
	<41DBB708.5030501@egenix.com>
Message-ID: <20050105100325.A220D3C8E5@coffee.object-craft.com.au>

>> Yes, although it would be nice to also retain the 8-bit versions as well.
>
>You can do so by using latin-1 as default encoding. Works great !

Yep, although that means we wear the cost of decoding and encoding for
all 8 bit input.

What does the _sre.c code do?

>Depends on your needs: CSV files tend to be small enough
>to do the decoding in one call in memory.

We are routinely dealing with multi-gigabyte csv files - which is why the
original 2001 vintage csv module was written as a C state machine. 

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From mal at egenix.com  Wed Jan  5 11:16:50 2005
From: mal at egenix.com (M.-A. Lemburg)
Date: Wed, 05 Jan 2005 11:16:50 +0100
Subject: [Csv] Re: [Python-Dev] csv module TODO list
In-Reply-To: <20050105100325.A220D3C8E5@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com>
	<20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
	<41DBB708.5030501@egenix.com>
	<20050105100325.A220D3C8E5@coffee.object-craft.com.au>
Message-ID: <41DBBE92.4070106@egenix.com>

Andrew McNamara wrote:
>>>Yes, although it would be nice to also retain the 8-bit versions as well.
>>
>>You can do so by using latin-1 as default encoding. Works great !
> 
> Yep, although that means we wear the cost of decoding and encoding for
> all 8 bit input.

Right, but it makes the code very clean and straightforward.
Again, it depends on what you need. If performance is critical
then you probably need a C version written using the same trick
as _sre.c...

> What does the _sre.c code do?

It comes in two versions: one for 8-bit the other for Unicode.

>>Depends on your needs: CSV files tend to be small enough
>>to do the decoding in one call in memory.
> 
> We are routinely dealing with multi-gigabyte csv files - which is why the
> original 2001 vintage csv module was written as a C state machine. 

I see, but are you sure that the typical Python user will have
the same requirements to make it worth the effort (and
complexity)?

I've written a few CSV parsers and writers myself over the years
and the requirements were different every time, in terms
of being flexible in the parsing phase, the interfaces and
the performance needs. Haven't yet found a one-fits-all
solution and don't really expect to any more :-)

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 05 2005)
 >>> Python/Zope Consulting and Support ...        http://www.egenix.com/
 >>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
 >>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

From andrewm at object-craft.com.au  Wed Jan  5 11:33:05 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Jan 2005 21:33:05 +1100
Subject: [Csv] Re: [Python-Dev] csv module TODO list 
In-Reply-To: <41DBBE92.4070106@egenix.com> 
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com>
	<20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
	<41DBB708.5030501@egenix.com>
	<20050105100325.A220D3C8E5@coffee.object-craft.com.au>
	<41DBBE92.4070106@egenix.com>
Message-ID: <20050105103305.AD80B3C8E5@coffee.object-craft.com.au>

>> Yep, although that means we wear the cost of decoding and encoding for
>> all 8 bit input.
>
>Right, but it makes the code very clean and straightforward.

I agree it makes for a very clean solution, and 99% of the time I'd
choose that option.

>Again, it depends on what you need. If performance is critical
>then you probably need a C version written using the same trick
>as _sre.c...
>
>> What does the _sre.c code do?
>
>It comes in two versions: one for 8-bit the other for Unicode.

That's what I thought. I think the motivations here are similar to those
that drove the _sre developers.

>> We are routinely dealing with multi-gigabyte csv files - which is why the
>> original 2001 vintage csv module was written as a C state machine. 
>
>I see, but are you sure that the typical Python user will have
>the same requirements to make it worth the effort (and
>complexity)?

This is open source, so I scratch my own itch (and that of my employers) - 
we need fast csv parsing more than we need unicode... 8-)

Okay, assuming we go the "produce two versions via evil macro tricks"
path, it's still not quite the same situation as _sre.c, which only has
to deal with the internal unicode representation.

One way to approach this would be to add an "encoding" keyword argument
to the readers and writers. If given, the parser would decode the input
stream to the internal representation before passing it through the
unicode state machine, which would yield tuples of unicode objects.

That leaves us with a bit of a problem where the source is already unicode
(eg, a list of unicode strings)... hmm.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From sjmachin at lexicon.net  Wed Jan  5 11:41:19 2005
From: sjmachin at lexicon.net (sjmachin at lexicon.net)
Date: Wed, 05 Jan 2005 21:41:19 +1100
Subject: [Csv] Re: [Python-Dev] csv module TODO list
In-Reply-To: <41DBB708.5030501@egenix.com>
References: <20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
Message-ID: <41DC5EFF.28236.36EF6D7@localhost>

On 5 Jan 2005 at 10:44, M.-A. Lemburg wrote:

> 
> Depends on your needs: CSV files tend to be small enough
> to do the decoding in one call in memory.
> 

The CSV format is often used for exchanging large data files, not just for spreadsheet 
output.

My experience: files with over a million rows are not uncommon. FWIW, no Unicode.

My (jaundiced, but based on experience) viewpoint on newlines inside quoted strings:

Prob (spreadsheet file with newlines inside data fields) = 0.001

Prob (some programmer has not quoted their quotes properly) = 0.999

Hence I suggest an option to specify this as a bug.

Regards,

John


From andrewm at object-craft.com.au  Wed Jan  5 12:08:49 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Jan 2005 22:08:49 +1100
Subject: [Csv] csv module TODO list 
In-Reply-To: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> 
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
Message-ID: <20050105110849.CBA843C8E5@coffee.object-craft.com.au>

>Also, review comments from Neal Norwitz, 22 Mar 2003 (some of these should
>already have been addressed):

I should apologise to Neal here for not replying to him at the time.

Okay, going though the issues Neal raised...

>* remove TODO comment at top of file--it's empty

Was fixed.

>* is CSV going to be maintained outside the python tree?
>  If not, remove the 2.2 compatibility macros for:
>         PyDoc_STR, PyDoc_STRVAR, PyMODINIT_FUNC, etc.

Does anyone think we should continue to maintain this 2.2 compatibility?

>* inline the following functions since they are used only in one place
>        get_string, set_string, get_nullchar_as_None, set_nullchar_as_None,
>        join_reset (maybe)

It was done that way as I felt we would be adding more getters and
setters to the dialect object in future.

>* rather than use PyErr_BadArgument, should you use assert?
>        (first example, Dialect_set_quoting, line 218)

You mean C assert()? I don't think I'm really following you here -
where would the type of the object be checked in a way the user could
recover from?

>* is it necessary to have Dialect_methods, can you use 0 for tp_methods?

I was assuming I would need to add methods at some point (in fact, I did
have methods, but removed them).

>* remove commented out code (PyMem_DEL) on line 261
>        Have you used valgrind on the test to find memory overwrites/leaks?

No, valgrind wasn't used.

>* PyString_AsString()[0] on line 331 could return NULL in which case
>        you are dereferencing a NULL pointer

Was fixed.

>* not sure why there are casts on 0 pointers
>        lines 383-393, 733-743, 1144-1154, 1164-1165

To make it easier when the time comes to add one of those members.

>* Reader_getiter() can be removed and use PyObject_SelfIter()

Okay, wasn't aware of PyObject_SelfIter - will fix.

>* I think you need PyErr_NoMemory() before returning on line 768, 1178

The examples I looked at in the Python core didn't do this - are you sure?
(now lines 832 and 1280). 

>* is PyString_AsString(self->dialect->lineterminator) on line 994
>        guaranteed not to return NULL?  If not, it could crash by
>        passing to memmove.
>* PyString_AsString() can return NULL on line 1048 and 1063, 
>        the result is passed to join_append()

Looking at the PyString_AsString implementation, it looks safe (we ensure
it's really a string elsewhere)?

>* iteratable should be iterable?  (line 1088)

Sorry, I don't know what you're getting at here? (now line 1162).

>* why doesn't csv_writerows() have a docstring?  csv_writerow does

Was fixed.

>* any PyUnicode_* methods should be protected with #ifdef Py_USING_UNICODE

Was fixed.

>* csv_unregister_dialect, csv_get_dialect could use METH_O 
>        so you don't need to use PyArg_ParseTuple

Was fixed.

>* in init_csv, recommend using 
>        PyModule_AddIntConstant and PyModule_AddStringConstant
>        where appropriate

Was fixed.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Wed Jan  5 12:14:02 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Jan 2005 22:14:02 +1100
Subject: [Csv] Re: [Python-Dev] csv module TODO list 
In-Reply-To: <41DC5EFF.28236.36EF6D7@localhost> 
References: <20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
	<41DC5EFF.28236.36EF6D7@localhost>
Message-ID: <20050105111402.A319C3C8E6@coffee.object-craft.com.au>

>The CSV format is often used for exchanging large data files, not just for
>spreadsheet output.
>
>My experience: files with over a million rows are not uncommon. FWIW, no
>Unicode.

Matches my experience also, but I suspect we both live in English speaking
countries. Elsewhere in the world, the ratios could be reversed.

There has also been some suggestion that the native string type in Python
will become Unicode at some point in the future.

>My (jaundiced, but based on experience) viewpoint on newlines inside
>quoted strings:
>
>Prob (spreadsheet file with newlines inside data fields) = 0.001
>
>Prob (some programmer has not quoted their quotes properly) = 0.999
>
>Hence I suggest an option to specify this as a bug.

I agree. What makes this extra exciting at the moment is that the CSV
module will happily sit there slurping the whole file into memory trying
to match a stray quote (of course, I only noticed this when trying to
read a multi-gigabyte file).

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From mal at egenix.com  Wed Jan  5 13:08:35 2005
From: mal at egenix.com (M.-A. Lemburg)
Date: Wed, 05 Jan 2005 13:08:35 +0100
Subject: [Csv] Re: [Python-Dev] csv module TODO list
In-Reply-To: <20050105111402.A319C3C8E6@coffee.object-craft.com.au>
References: <20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
	<41DC5EFF.28236.36EF6D7@localhost>
	<20050105111402.A319C3C8E6@coffee.object-craft.com.au>
Message-ID: <41DBD8C3.5090303@egenix.com>

Andrew McNamara wrote:
>>The CSV format is often used for exchanging large data files, not just for
>>spreadsheet output.
>>
>>My experience: files with over a million rows are not uncommon. FWIW, no
>>Unicode.
> 
> Matches my experience also, but I suspect we both live in English speaking
> countries. Elsewhere in the world, the ratios could be reversed.

Hmm, wasn't XML intended to replace CSV (among other formats) for
exchanging tons of data ;-)

As I mentioned before, there's no such thing as the one fits all
general CSV parser or writer.

If Unicode CSV data is not common enough, you might want to provide
a solution based on a UTF-8 string encoding - a decoder could
convert the input stream to UTF-8; you then process that data
using the existing CSV parser and then convert it back to Unicode
in the .next() method.

So far, I've only ever used Unicode CSV data for exchange with
Asian language spreadsheets.

> There has also been some suggestion that the native string type in Python
> will become Unicode at some point in the future.

Indeed :-)

>>My (jaundiced, but based on experience) viewpoint on newlines inside
>>quoted strings:
>>
>>Prob (spreadsheet file with newlines inside data fields) = 0.001
>>
>>Prob (some programmer has not quoted their quotes properly) = 0.999
>>
>>Hence I suggest an option to specify this as a bug.
> 
> I agree. What makes this extra exciting at the moment is that the CSV
> module will happily sit there slurping the whole file into memory trying
> to match a stray quote (of course, I only noticed this when trying to
> read a multi-gigabyte file).

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 05 2005)
 >>> Python/Zope Consulting and Support ...        http://www.egenix.com/
 >>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
 >>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

From magnus at hetland.org  Wed Jan  5 13:19:21 2005
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Wed, 5 Jan 2005 13:19:21 +0100
Subject: [Csv] Re: csv module TODO list
In-Reply-To: <20050105075506.314C93C8E5@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<16859.38960.9935.682429@montanaro.dyndns.org>
	<20050105075506.314C93C8E5@coffee.object-craft.com.au>
Message-ID: <20050105121921.GB24030@idi.ntnu.no>

Quite a while ago I posted some material to the csv-list about
problems using the csv module on Unix-style colon-separated files --
it just doesn't deal properly with backslash escaping and is quite
useless for this kind of file. I seem to recall the general view was
that it wasn't intended for this kind of thing -- only the sort of csv
that Microsoft Excel outputs/inputs, but if I am mistaken about this,
perhaps fixing this issue might be put on the TODO-list? I'll be happy
to re-send or summarize the relevant emails, if needed.
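
(For context, the kind of file and configuration in question - a rough sketch
using the existing dialect options; how faithfully the backslash escapes are
interpreted is exactly the issue being raised:)

    import csv

    # A Unix-style colon-separated record with a backslash-escaped colon.
    line = 'root:x:0:0:Super\\: User:/root:/bin/bash'
    reader = csv.reader([line], delimiter=':', quoting=csv.QUOTE_NONE,
                        escapechar='\\')
    fields = reader.next()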

-- 
Magnus Lie Hetland       Fallen flower I see / Returning to its branch
http://hetland.org       Ah! a butterfly.           [Arakida Moritake]

From andrewm at object-craft.com.au  Wed Jan  5 13:29:11 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Wed, 05 Jan 2005 23:29:11 +1100
Subject: [Csv] Re: csv module TODO list 
In-Reply-To: <20050105121921.GB24030@idi.ntnu.no> 
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<16859.38960.9935.682429@montanaro.dyndns.org>
	<20050105075506.314C93C8E5@coffee.object-craft.com.au>
	<20050105121921.GB24030@idi.ntnu.no>
Message-ID: <20050105122911.83EE93C8E5@coffee.object-craft.com.au>

>Quite a while ago I posted some material to the csv-list about
>problems using the csv module on Unix-style colon-separated files --
>it just doesn't deal properly with backslash escaping and is quite
>useless for this kind of file. I seem to recall the general view was
>that it wasn't intended for this kind of thing -- only the sort of csv
>that Microsoft Excel outputs/inputs, but if I am mistaken about this,
>perhaps fixing this issue might be put on the TODO-list? I'll be happy
>to re-send or summarize the relevant emails, if needed.

I think a related issue was included in my TODO list:

>* Address or document Francis Avila's issues as mentioned in this posting:
>
>    http://www.google.com.au/groups?selm=vsb89q1d3n5qb1%40corp.supernews.com

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From martin at v.loewis.de  Wed Jan  5 23:00:26 2005
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 05 Jan 2005 23:00:26 +0100
Subject: [Csv] Re: [Python-Dev] csv module TODO list
In-Reply-To: <20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com>
	<20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
Message-ID: <41DC637A.5050105@v.loewis.de>

Andrew McNamara wrote:
>>>Can you please elaborate on that? What needs to be done, and how is
>>>that going to be done? It might be possible to avoid considerable
>>>uglification.
> 
> 
> I'm not altogether sure there. The parsing state machine is all written in
> C, and deals with signed chars - I expect we'll need two versions of that
> (or one version that's compiled twice using pre-processor macros). Quite
> a large job. Suggestions gratefully received.

I'm still trying to understand what *needs* to be done - I would move to
how this is done only later. What APIs should be extended/changed, and
in what way?

Regards,
Martin

From fumanchu at amor.org  Wed Jan  5 18:38:52 2005
From: fumanchu at amor.org (Robert Brewer)
Date: Wed, 5 Jan 2005 09:38:52 -0800
Subject: [Python-Dev] Re: [Csv] csv module TODO list
Message-ID: <3A81C87DC164034AA4E2DDFE11D258E33980EE@exchange.hqamor.amorhq.net>

Skip Montanaro wrote:
>     Andrew> There's a bunch of jobs we (CSV module maintainers) have been
>     Andrew> putting off - attached is a list (in no particular order):
> 
>     ...
> 
> In addition, it occurred to me this evening that there's 
> functionality in the csv module I don't think anybody uses.
> ...
> I'm also not aware that anyone really uses the Sniffer class,
> though it does provide some useful functionality should you
> need to analyze random CSV files.

I used Sniffer quite heavily for my last contract. The client had
multiple multigig csv's which needed deduplicating, but they were all
from different sources and therefore in different formats. It would have
cost me many more hours without the Sniffer. Please keep it. <:)
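
(For anyone who hasn't used it, that workflow is roughly this - the file name
is made up:)

    import csv

    f = open('unknown_format.csv', 'rb')
    sample = f.read(8192)
    f.seek(0)

    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample)          # guesses delimiter, quoting, etc.
    has_header = sniffer.has_header(sample)  # True if row one looks like labels

    for row in csv.reader(f, dialect):
        pass  # deduplicate, etc.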


Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org

From andrewm at object-craft.com.au  Thu Jan  6 02:10:55 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 06 Jan 2005 12:10:55 +1100
Subject: [Csv] Re: [Python-Dev] csv module TODO list 
In-Reply-To: <41DC637A.5050105@v.loewis.de> 
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com>
	<20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
	<41DC637A.5050105@v.loewis.de>
Message-ID: <20050106011055.001163C8E5@coffee.object-craft.com.au>

>>>>Can you please elaborate on that? What needs to be done, and how is
>>>>that going to be done? It might be possible to avoid considerable
>>>>uglification.
>> 
>> I'm not altogether sure there. The parsing state machine is all written in
>> C, and deals with signed chars - I expect we'll need two versions of that
>> (or one version that's compiled twice using pre-processor macros). Quite
>> a large job. Suggestions gratefully received.
>
>I'm still trying to understand what *needs* to be done - I would move to
>how this is done only later. What APIs should be extended/changed, and
>in what way?

That's certainly the first step, and I have to admit that I don't have
a clear idea at this time - the unicode issue has been in the "too hard"
basket since we started.

Marc-Andre Lemburg mentioned that he has encountered UTF-16 encoded csv
files, so a reasonable starting point would be the ability to read and
parse, as well as the ability to generate, one of these.

The reader interface currently returns a row at a time, consuming as many
lines as necessary from the supplied iterable (with the most common iterable being
a file). This suggests to me that we will need an optional "encoding"
argument to the reader constructor, and that the reader will need to
decode the source lines. That said, I'm hardly a unicode expert, so I
may be overlooking something (could a utf-16 encoded character span a
line break, for example).  The writer interface probably should have
similar facilities.

However - a number of people have complained about the "iterator"
interface, wanting to supply strings (the iterable is necessary because a
CSV row can span multiple lines). It's also conceivable that the source
lines could already be unicode objects.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Thu Jan  6 03:03:08 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 06 Jan 2005 13:03:08 +1100
Subject: [Csv] Re: [Python-Dev] csv module TODO list 
In-Reply-To: <20050106011055.001163C8E5@coffee.object-craft.com.au> 
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com>
	<20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
	<41DC637A.5050105@v.loewis.de>
	<20050106011055.001163C8E5@coffee.object-craft.com.au>
Message-ID: <20050106020308.EBE5A3C8E5@coffee.object-craft.com.au>

>>I'm still trying to understand what *needs* to be done - I would move to
>>how this is done only later. What APIs should be extended/changed, and
>>in what way?
[...]
>The reader interface currently returns a row at a time, consuming as many
>lines as necessary from the supplied iterable (with the most common iterable being
>a file). This suggests to me that we will need an optional "encoding"
>argument to the reader constructor, and that the reader will need to
>decode the source lines. That said, I'm hardly a unicode expert, so I
>may be overlooking something (could a utf-16 encoded character span a
>line break, for example).  The writer interface probably should have
>similar facilities.

Ah - I see that the codecs module provides an EncodedFile class - better
to use this than add encoding/decoding cruft to the csv module.

So, do we duplicate the current reader and writer as UnicodeReader and
UnicodeWriter (how else do we know to use the unicode parser)? What about
the "dialects"? I guess if a dialect uses no unicode strings, it can be
applied to the current parser, but if it does include unicode strings,
then the parser would need to raise an exception.

The DictReader and DictWriter classes will probably need matching
UnicodeDictReader/UnicodeDictWriter versions (use common base class,
just specify alternate parser).
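
(Roughly the shape of the EncodedFile approach - untested, and decoding the
individual fields back to unicode is still left to the caller here; the
UTF-16 file name is just for illustration:)

    import csv, codecs

    # Recode a UTF-16 file to UTF-8 on the fly, feed the UTF-8 bytes to the
    # existing byte-oriented parser, then decode each field afterwards.
    f = codecs.EncodedFile(open('data-utf16.csv', 'rb'), 'utf-8', 'utf-16')
    for row in csv.reader(f):
        row = [field.decode('utf-8') for field in row]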

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From martin at v.loewis.de  Thu Jan  6 17:05:05 2005
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 06 Jan 2005 17:05:05 +0100
Subject: [Csv] Re: [Python-Dev] csv module TODO list
In-Reply-To: <20050106011055.001163C8E5@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<41DBA7D0.80101@v.loewis.de> <41DBAF06.6020401@egenix.com>
	<20050105093414.00DFF3C8E5@coffee.object-craft.com.au>
	<41DC637A.5050105@v.loewis.de>
	<20050106011055.001163C8E5@coffee.object-craft.com.au>
Message-ID: <41DD61B1.1030507@v.loewis.de>

Andrew McNamara wrote:
> Marc-Andre Lemburg mentioned that he has encountered UTF-16 encoded csv
> files, so a reasonable starting point would be the ability to read and
> parse, as well as the ability to generate, one of these.

I see. That would be reasonable, indeed. Notice that this is not so much
a "Unicode issue", but more an "encoding" issue. If you solve the
"arbitrary encodings" problem, you solve UTF-16 as a side effect.

> The reader interface currently returns a row at a time, consuming as many
> lines from the supplied iterable (with the most common iterable being
> a file). This suggests to me that we will need an optional "encoding"
> argument to the reader constructor, and that the reader will need to
> decode the source lines.

Ok. In this context, I see two possible implementation strategies:
1. Implement the csv module two times: once for bytes, and once for
    Unicode characters. It is likely that the source code would be
    the same for each case; you just need to make sure the "Dialect
    and Formatting Parameters" change their width accordingly.
    If you use the SRE approach, you would do

    #define CSV_ITEM_T char
    #define CSV_NAME_PREFIX byte_
    #include "csvimpl.c"
    #define CSV_ITEM_T Py_Unicode
    #define CSV_NAME_PREFIX unicode_
    #include "csvimpl.c"

2. Use just the existing _csv module, and represent non-byte encodings
    as UTF-8. This will work as long as the delimiters and other markup
    characters always have a single byte in UTF-8, which is the case
    for "':\, as well as for \r and \n. Then, when processing using
    an explicit encoding, first convert the input into Unicode objects.
    Then encode the Unicode objects into UTF-8, and pass it to _csv.
    For the results you get back, convert each element back from UTF-8
    to a Unicode object.

This could be implemented as

import codecs, itertools, _csv

def reader(f, encoding=None):
     if encoding is None: return _csv.reader(f)
     enc, dec, reader, writer = codecs.lookup(encoding)
     utf8_enc, utf8_dec, utf8_r, utf8_w = codecs.lookup("UTF-8")
     # Make a recoder which can only read
     utf8_stream = codecs.StreamRecoder(f, utf8_enc, None, reader, None)
     csv_reader = _csv.reader(utf8_stream)
     # For performance reasons, map_result could be implemented in C
     def map_result(t):
         result = [None]*len(t)
         for i, val in enumerate(t):
             result[i] = utf8_dec(val)[0]
         return tuple(result)
     return itertools.imap(map_result, csv_reader)
# This code is untested
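
For illustration, a call under this sketch might look like (file name made up):

    for row in reader(open('data-utf16.csv', 'rb'), 'utf-16'):
        pass    # each element of row would be a unicode object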

This approach has the disadvantage of performing three recodings:
from input charset to Unicode, from Unicode to UTF-8, from UTF-8
to Unicode. One could:
- skip the initial recoding if the encoding is already known
   to be _csv-safe (i.e. if it is a pure ASCII superset).
   This would be valid for ASCII, iso-8859-n, UTF-8, ...
- offer the user to keep the results in the input encoding,
   instead of always returning Unicode objects.

Apart from this disadvantage, I think this gives people what they want:
they can specify the encoding of the input, and they get the results not
only csv-separated, but also unicode-decoded. This approach is the same as
the one used for Python source code encodings: the source is first
recoded into UTF-8, then parsed, then recoded back.

> That said, I'm hardly a unicode expert, so I
> may be overlooking something (could a utf-16 encoded character span a
> line break, for example).

This cannot happen: \r, in UTF-16, is also 2 bytes (0D 00, if UTF-16LE).
There are issues that Unicode has additional line break characters,
which is probably irrelevant.

Regards,
Martin

From ajm at flonidan.dk  Thu Jan  6 17:22:12 2005
From: ajm at flonidan.dk (Anders J. Munch)
Date: Thu, 6 Jan 2005 17:22:12 +0100 
Subject: [Csv] Re: [Python-Dev] csv module TODO list 
Message-ID: <6D9E824FA10BD411BE95000629EE2EC3C6DE3C@FLONIDAN-MAIL>

Andrew McNamara wrote:
> 
> I'm not altogether sure there. The parsing state machine is all
> written in C, and deals with signed chars - I expect we'll need two
> versions of that (or one version that's compiled twice using
> pre-processor macros). Quite a large job. Suggestions gratefully
> received.

How about using UTF-8 internally?  Change nothing in _csv.c, but in
csv.py encode/decode any unicode strings into UTF-8 on the way to/from
_csv.  File-like objects passed in by the user can be wrapped in
proxies that take care of encoding and decoding user strings, as well
as trans-coding between UTF-8 and the user's chosen file encoding.

All that coding work may slow things down, but your original fast _csv
module will still be there when you need it.
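
A rough sketch of the writer-side half of such a proxy (the class name and
constructor signature are made up, and error handling is omitted):

    import csv, codecs

    class UnicodeWriterProxy:
        """Accept rows of unicode objects, hand UTF-8 bytes to the existing
        writer, and trans-code to the user's file encoding on the way out."""
        def __init__(self, f, encoding, dialect=csv.excel, **kwds):
            self.stream = codecs.EncodedFile(f, 'utf-8', encoding)
            self.writer = csv.writer(self.stream, dialect=dialect, **kwds)

        def writerow(self, row):
            # The byte-oriented writer never sees unicode objects
            # (assumes the caller passes unicode field values).
            self.writer.writerow([field.encode('utf-8') for field in row])

        def writerows(self, rows):
            for row in rows:
                self.writerow(row)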

- Anders

From skip at pobox.com  Wed Jan  5 21:21:18 2005
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 5 Jan 2005 14:21:18 -0600
Subject: [Csv] Re: csv module TODO list
In-Reply-To: <20050105121921.GB24030@idi.ntnu.no>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
        <16859.38960.9935.682429@montanaro.dyndns.org>
        <20050105075506.314C93C8E5@coffee.object-craft.com.au>
        <20050105121921.GB24030@idi.ntnu.no>
Message-ID: <16860.19518.824788.613286@montanaro.dyndns.org>


    Magnus> Quite a while ago I posted some material to the csv-list about
    Magnus> problems using the csv module on Unix-style colon-separated
    Magnus> files -- it just doesn't deal properly with backslash escaping
    Magnus> and is quite useless for this kind of file. I seem to recall the
    Magnus> general view was that it wasn't intended for this kind of thing
    Magnus> -- only the sort of csv that Microsoft Excel outputs/inputs, 

Yes, that's my recollection as well.  It's possible that we can extend the
interpretation of the escape char.

    Magnus> I'll be happy to re-send or summarize the relevant emails, if
    Magnus> needed.

Yes, that would be helpful.  Can you send me an example (three or four
lines) of the sort of file it won't grok?

Skip

From skip at pobox.com  Wed Jan  5 20:34:09 2005
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 5 Jan 2005 13:34:09 -0600
Subject: [Csv] csv module TODO list 
In-Reply-To: <20050105110849.CBA843C8E5@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
        <20050105110849.CBA843C8E5@coffee.object-craft.com.au>
Message-ID: <16860.16689.695012.975520@montanaro.dyndns.org>


    >> * is CSV going to be maintained outside the python tree?
    >> If not, remove the 2.2 compatibility macros for: PyDoc_STR,
    >> PyDoc_STRVAR, PyMODINIT_FUNC, etc.

    Andrew> Does anyone thing we should continue to maintain this 2.2
    Andrew> compatibility?

With the release of 2.4, 2.2 has officially dropped off the radar screen,
right (zero probability of a 2.2.n+1 release, though the probability was
vanishingly small before).  I'd say toss it.  Do just that in a single
checkin so someone who's interested can do a simple cvs diff to yield
an initial patch file for external maintenance of that feature.

    >> * inline the following functions since they are used only in one
    >> place get_string, set_string, get_nullchar_as_None,
    >> set_nullchar_as_None, join_reset (maybe)

    Andrew> It was done that way as I felt we would be adding more getters
    Andrew> and setters to the dialect object in future.

The only new dialect attribute I envision is an encoding attribute.

    >> * is it necessary to have Dialect_methods, can you use 0 for tp_methods?

    Andrew> I was assuming I would need to add methods at some point (in
    Andrew> fact, I did have methods, but removed them).

Dialect objects are really just data containers, right?  I don't see that
they would need any methods.

    >> * remove commented out code (PyMem_DEL) on line 261
    >> Have you used valgrind on the test to find memory overwrites/leaks?

    Andrew> No, valgrind wasn't used.

I have it here at work.  I'll try to find a few minutes to run the csv tests
under valgrind's control.

Skip

From andrewm at object-craft.com.au  Fri Jan  7 02:15:33 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Jan 2005 12:15:33 +1100
Subject: [Csv] CSV module brain surgery
Message-ID: <20050107011533.EA2183C8E5@coffee.object-craft.com.au>

The "dialect" type in the CSV module had been bugging me for a while -
it's used to hold the C-type representation of the parser config, and
barely exposed to the user (except as an attribute on the reader and
writer objects).

There were several problems with this internal dialect type - the
primary one was that you could write to its attributes, which meant
that cross-attribute validation was doomed. It also reported errors
terribly, typically raising something like "invalid type for builtin"
and no more information.

So, I rewrote it. The result is far more consistent about the types of
exceptions it raises, and provides more useful diagnostics to the user
(unfortunately, this means a minor user-visible change, but probably not
in any way that they will notice). The dialect type now does its own
validation of options, so these should better reflect what the parser
is capable of (the downside is that Skip's python validator could report
more than one error per exception, while the new version can only raise one).

Previously, the conversion from Python types to C types was done in
the setter (property) functions, and the type init function called
setattr to put its arguments onto the type, hence the wonky reporting
of type errors.  The new code makes the type's attributes read-only -
they are set directly from the init function, which makes cross-attribute
validation viable.

Note that the dialect type constructor takes either a class or an instance
(and looks on it for the appropriate attributes), and/or keyword arguments. This
makes it more complicated than I like, but means you can say stuff like
"excel, but tab delimited": csv.reader(file, 'excel', delimiter='\t').

I'm about ready to commit this (and some minor changes to the tests).
Comments, please?

 _csv.c |  423 +++++++-----------!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 1 files changed, 51 insertions(+), 72 deletions(-), 300 modifications(!)

Index: Modules/_csv.c
===================================================================
RCS file: /cvsroot/python/python/dist/src/Modules/_csv.c,v
retrieving revision 1.16
diff -u -r1.16 _csv.c
--- Modules/_csv.c	6 Jan 2005 02:25:41 -0000	1.16
+++ Modules/_csv.c	6 Jan 2005 12:39:50 -0000
@@ -73,7 +73,7 @@
 	char escapechar;	/* escape character */
 	int skipinitialspace;	/* ignore spaces following delimiter? */
 	PyObject *lineterminator; /* string to write between records */
-	QuoteStyle quoting;	/* style of quoting to write */
+	int quoting;		/* style of quoting to write */
 
 	int strict;		/* raise exception on bad CSV */
 } DialectObj;
@@ -130,17 +130,6 @@
         return dialect_obj;
 }
 
-static int
-check_delattr(PyObject *v)
-{
-	if (v == NULL) {
-		PyErr_SetString(PyExc_TypeError, 
-                                "Cannot delete attribute");
-		return -1;
-        }
-        return 0;
-}
-
 static PyObject *
 get_string(PyObject *str)
 {
@@ -148,25 +137,6 @@
         return str;
 }
 
-static int
-set_string(PyObject **str, PyObject *v)
-{
-        if (check_delattr(v) < 0)
-                return -1;
-        if (!PyString_Check(v)
-#ifdef Py_USING_UNICODE
-&& !PyUnicode_Check(v)
-#endif
-) {
-                PyErr_BadArgument();
-                return -1;
-        }
-        Py_XDECREF(*str);
-        Py_INCREF(v);
-        *str = v;
-        return 0;
-}
-
 static PyObject *
 get_nullchar_as_None(char c)
 {
@@ -178,48 +148,22 @@
                 return PyString_FromStringAndSize((char*)&c, 1);
 }
 
-static int
-set_None_as_nullchar(char * addr, PyObject *v)
-{
-        if (check_delattr(v) < 0)
-                return -1;
-        if (v == Py_None)
-                *addr = '\0';
-        else if (!PyString_Check(v) || PyString_Size(v) != 1) {
-                PyErr_BadArgument();
-                return -1;
-        }
-        else {
-		char *s = PyString_AsString(v);
-		if (s == NULL)
-			return -1;
-		*addr = s[0];
-	}
-        return 0;
-}
-
 static PyObject *
 Dialect_get_lineterminator(DialectObj *self)
 {
         return get_string(self->lineterminator);
 }
 
-static int
-Dialect_set_lineterminator(DialectObj *self, PyObject *value)
-{
-        return set_string(&self->lineterminator, value);
-}
-
 static PyObject *
 Dialect_get_escapechar(DialectObj *self)
 {
         return get_nullchar_as_None(self->escapechar);
 }
 
-static int
-Dialect_set_escapechar(DialectObj *self, PyObject *value)
+static PyObject *
+Dialect_get_quotechar(DialectObj *self)
 {
-        return set_None_as_nullchar(&self->escapechar, value);
+        return get_nullchar_as_None(self->quotechar);
 }
 
 static PyObject *
@@ -229,51 +173,109 @@
 }
 
 static int
-Dialect_set_quoting(DialectObj *self, PyObject *v)
+_set_bool(const char *name, int *target, PyObject *src, int dflt)
+{
+	if (src == NULL)
+		*target = dflt;
+	else
+		*target = PyObject_IsTrue(src);
+	return 0;
+}
+
+static int
+_set_int(const char *name, int *target, PyObject *src, int dflt)
+{
+	if (src == NULL)
+		*target = dflt;
+	else {
+		if (!PyInt_Check(src)) {
+			PyErr_Format(PyExc_TypeError, 
+				     "\"%s\" must be an integer", name);
+			return -1;
+		}
+		*target = PyInt_AsLong(src);
+	}
+	return 0;
+}
+
+static int
+_set_char(const char *name, char *target, PyObject *src, char dflt)
+{
+	if (src == NULL)
+		*target = dflt;
+	else {
+		if (src == Py_None)
+			*target = '\0';
+		else if (!PyString_Check(src) || PyString_Size(src) != 1) {
+			PyErr_Format(PyExc_TypeError, 
+				     "\"%s\" must be an 1-character string", 
+				     name);
+			return -1;
+		}
+		else {
+			char *s = PyString_AsString(src);
+			if (s == NULL)
+				return -1;
+			*target = s[0];
+		}
+	}
+        return 0;
+}
+
+static int
+_set_str(const char *name, PyObject **target, PyObject *src, const char *dflt)
+{
+	if (src == NULL)
+		*target = PyString_FromString(dflt);
+	else {
+		if (src == Py_None)
+			*target = NULL;
+		else if (!PyString_Check(src)
+#ifdef Py_USING_UNICODE
+		    && !PyUnicode_Check(src)
+#endif
+		) {
+			PyErr_Format(PyExc_TypeError, 
+				     "\"%s\" must be an string", name);
+			return -1;
+		} else {
+			Py_XDECREF(*target);
+			Py_INCREF(src);
+			*target = src;
+		}
+	}
+        return 0;
+}
+
+static int
+dialect_check_quoting(int quoting)
 {
-        int quoting;
         StyleDesc *qs = quote_styles;
 
-        if (check_delattr(v) < 0)
-                return -1;
-        if (!PyInt_Check(v)) {
-                PyErr_BadArgument();
-                return -1;
-        }
-        quoting = PyInt_AsLong(v);
 	for (qs = quote_styles; qs->name; qs++) {
-		if (qs->style == quoting) {
-                        self->quoting = quoting;
+		if (qs->style == quoting)
                         return 0;
-                }
         }
-        PyErr_BadArgument();
+	PyErr_Format(PyExc_TypeError, "bad \"quoting\" value");
         return -1;
 }
 
-static struct PyMethodDef Dialect_methods[] = {
-	{ NULL, NULL }
-};
-
 #define D_OFF(x) offsetof(DialectObj, x)
 
 static struct PyMemberDef Dialect_memberlist[] = {
-	{ "quotechar",          T_CHAR, D_OFF(quotechar) },
-	{ "delimiter",          T_CHAR, D_OFF(delimiter) },
-	{ "skipinitialspace",   T_INT, D_OFF(skipinitialspace) },
-	{ "doublequote",        T_INT, D_OFF(doublequote) },
-	{ "strict",             T_INT, D_OFF(strict) },
+	{ "delimiter",          T_CHAR, D_OFF(delimiter), READONLY },
+	{ "skipinitialspace",   T_INT, D_OFF(skipinitialspace), READONLY },
+	{ "doublequote",        T_INT, D_OFF(doublequote), READONLY },
+	{ "strict",             T_INT, D_OFF(strict), READONLY },
 	{ NULL }
 };
 
 static PyGetSetDef Dialect_getsetlist[] = {
-        { "escapechar", (getter)Dialect_get_escapechar, 
-                (setter)Dialect_set_escapechar },
-        { "lineterminator", (getter)Dialect_get_lineterminator, 
-                (setter)Dialect_set_lineterminator },
-        { "quoting", (getter)Dialect_get_quoting, 
-                (setter)Dialect_set_quoting },
-        {NULL},
+	{ "escapechar",		(getter)Dialect_get_escapechar},
+	{ "lineterminator",	(getter)Dialect_get_lineterminator},
+	{ "quotechar",		(getter)Dialect_get_quotechar},
+	{ "quoting",		(getter)Dialect_get_quoting},
+	{NULL},
 };
 
 static void
@@ -283,107 +285,158 @@
         self->ob_type->tp_free((PyObject *)self);
 }
 
+/*
+ * Return a new reference to a dialect instance
+ *
+ * If given a string, looks up the name in our dialect registry
+ * If given a class, instantiate (which runs python validity checks)
+ * If given an instance, return a new reference to the instance
+ */
+static PyObject *
+dialect_instantiate(PyObject *dialect)
+{
+	Py_INCREF(dialect);
+	/* If dialect is a string, look it up in our registry */
+	if (PyString_Check(dialect)
+#ifdef Py_USING_UNICODE
+		|| PyUnicode_Check(dialect)
+#endif
+		) {
+		PyObject * new_dia;
+		new_dia = get_dialect_from_registry(dialect);
+		Py_DECREF(dialect);
+		return new_dia;
+	}
+	/* A class rather than an instance? Instantiate */
+	if (PyObject_TypeCheck(dialect, &PyClass_Type)) {
+		PyObject * new_dia;
+		new_dia = PyObject_CallFunction(dialect, "");
+		Py_DECREF(dialect);
+		return new_dia;
+	}
+	/* Make sure we finally have an instance */
+	if (!PyInstance_Check(dialect)) {
+		PyErr_SetString(PyExc_TypeError, "dialect must be an instance");
+		Py_DECREF(dialect);
+		return NULL;
+	}
+	return dialect;
+}
+
+static char *dialect_kws[] = {
+	"dialect",
+	"delimiter",
+	"doublequote",
+	"escapechar",
+	"lineterminator",
+	"quotechar",
+	"quoting",
+	"skipinitialspace",
+	"strict",
+	NULL
+};
+
 static int
 dialect_init(DialectObj * self, PyObject * args, PyObject * kwargs)
 {
-        PyObject *dialect = NULL, *name_obj, *value_obj;
-
-	self->quotechar = '"';
-	self->delimiter = ',';
-	self->escapechar = '\0';
-	self->skipinitialspace = 0;
-        Py_XDECREF(self->lineterminator);
-	self->lineterminator = PyString_FromString("\r\n");
-        if (self->lineterminator == NULL)
+	int ret = -1;
+        PyObject *dialect = NULL;
+	PyObject *delimiter = NULL;
+	PyObject *doublequote = NULL;
+	PyObject *escapechar = NULL;
+	PyObject *lineterminator = NULL;
+	PyObject *quotechar = NULL;
+	PyObject *quoting = NULL;
+	PyObject *skipinitialspace = NULL;
+	PyObject *strict = NULL;
+
+	if (!PyArg_ParseTupleAndKeywords(args, kwargs,
+					 "|OOOOOOOOO", dialect_kws,
+					 &dialect,
+					 &delimiter,
+					 &doublequote,
+					 &escapechar,
+					 &lineterminator,
+					 &quotechar,
+					 &quoting,
+					 &skipinitialspace,
+					 &strict))
                 return -1;
-	self->quoting = QUOTE_MINIMAL;
-	self->doublequote = 1;
-	self->strict = 0;
 
-	if (!PyArg_UnpackTuple(args, "", 0, 1, &dialect))
-                return -1;
-        Py_XINCREF(dialect);
-        if (kwargs != NULL) {
-                PyObject * key = PyString_FromString("dialect");
-                PyObject * d;
-
-                d = PyDict_GetItem(kwargs, key);
-                if (d) {
-                        Py_INCREF(d);
-                        Py_XDECREF(dialect);
-                        PyDict_DelItem(kwargs, key);
-                        dialect = d;
-                }
-                Py_DECREF(key);
-        }
-        if (dialect != NULL) {
-                int i;
-                PyObject * dir_list;
+	Py_XINCREF(delimiter);
+	Py_XINCREF(doublequote);
+	Py_XINCREF(escapechar);
+	Py_XINCREF(lineterminator);
+	Py_XINCREF(quotechar);
+	Py_XINCREF(quoting);
+	Py_XINCREF(skipinitialspace);
+	Py_XINCREF(strict);
+	if (dialect != NULL) {
+		dialect = dialect_instantiate(dialect);
+		if (dialect == NULL)
+			goto err;
+#define DIALECT_GETATTR(v, n) \
+		if (v == NULL) \
+			v = PyObject_GetAttrString(dialect, n)
+
+		DIALECT_GETATTR(delimiter, "delimiter");
+		DIALECT_GETATTR(doublequote, "doublequote");
+		DIALECT_GETATTR(escapechar, "escapechar");
+		DIALECT_GETATTR(lineterminator, "lineterminator");
+		DIALECT_GETATTR(quotechar, "quotechar");
+		DIALECT_GETATTR(quoting, "quoting");
+		DIALECT_GETATTR(skipinitialspace, "skipinitialspace");
+		DIALECT_GETATTR(strict, "strict");
+		PyErr_Clear();
+		Py_DECREF(dialect);
+	}
 
-                /* If dialect is a string, look it up in our registry */
-                if (PyString_Check(dialect)
-#ifdef Py_USING_UNICODE
-		    || PyUnicode_Check(dialect)
-#endif
-			) {
-                        PyObject * new_dia;
-                        new_dia = get_dialect_from_registry(dialect);
-                        Py_DECREF(dialect);
-                        if (new_dia == NULL)
-                                return -1;
-                        dialect = new_dia;
-                }
-                /* A class rather than an instance? Instantiate */
-                if (PyObject_TypeCheck(dialect, &PyClass_Type)) {
-                        PyObject * new_dia;
-                        new_dia = PyObject_CallFunction(dialect, "");
-                        Py_DECREF(dialect);
-                        if (new_dia == NULL)
-                                return -1;
-                        dialect = new_dia;
-                }
-                /* Make sure we finally have an instance */
-                if (!PyInstance_Check(dialect) ||
-                    (dir_list = PyObject_Dir(dialect)) == NULL) {
-                        PyErr_SetString(PyExc_TypeError,
-                                        "dialect must be an instance");
-                        Py_DECREF(dialect);
-                        return -1;
-                }
-                /* And extract the attributes */
-                for (i = 0; i < PyList_GET_SIZE(dir_list); ++i) {
-			char *s;
-                        name_obj = PyList_GET_ITEM(dir_list, i);
-			s = PyString_AsString(name_obj);
-			if (s == NULL)
-				return -1;
-                        if (s[0] == '_')
-                                continue;
-                        value_obj = PyObject_GetAttr(dialect, name_obj);
-                        if (value_obj) {
-                                if (PyObject_SetAttr((PyObject *)self, 
-                                                     name_obj, value_obj)) {
-					Py_DECREF(value_obj);
-                                        Py_DECREF(dir_list);
-					Py_DECREF(dialect);
-                                        return -1;
-                                }
-                                Py_DECREF(value_obj);
-                        }
-                }
-                Py_DECREF(dir_list);
-                Py_DECREF(dialect);
-        }
-        if (kwargs != NULL) {
-                int pos = 0;
+	/* check types and convert to C values */
+#define DIASET(meth, name, target, src, dflt) \
+	if (meth(name, target, src, dflt)) \
+		goto err
+	DIASET(_set_char, "delimiter", &self->delimiter, delimiter, ',');
+	DIASET(_set_bool, "doublequote", &self->doublequote, doublequote, 1);
+	DIASET(_set_char, "escapechar", &self->escapechar, escapechar, 0);
+	DIASET(_set_str, "lineterminator", &self->lineterminator, lineterminator, "\r\n");
+	DIASET(_set_char, "quotechar", &self->quotechar, quotechar, '"');
+	DIASET(_set_int, "quoting", &self->quoting, quoting, QUOTE_MINIMAL);
+	DIASET(_set_bool, "skipinitialspace", &self->skipinitialspace, skipinitialspace, 0);
+	DIASET(_set_bool, "strict", &self->strict, strict, 0);
+
+	/* sanity check options */
+	if (dialect_check_quoting(self->quoting))
+		goto err;
+	if (self->delimiter == 0) {
+                PyErr_SetString(PyExc_TypeError, "delimiter must be set");
+		goto err;
+	}
+	if (self->quoting != QUOTE_NONE && self->quotechar == 0) {
+                PyErr_SetString(PyExc_TypeError, 
+				"quotechar must be set if quoting enabled");
+		goto err;
+	}
+	if (self->lineterminator == 0) {
+                PyErr_SetString(PyExc_TypeError, "lineterminator must be set");
+		goto err;
+	}
+	if (self->quoting == QUOTE_NONE && self->escapechar == 0) {
+                PyErr_SetString(PyExc_TypeError, 
+				"escapechar must be set if quoting disabled");
+		goto err;
+	}
 
-                while (PyDict_Next(kwargs, &pos, &name_obj, &value_obj)) {
-                        if (PyObject_SetAttr((PyObject *)self, 
-                                             name_obj, value_obj))
-                                return -1;
-                }
-        }
-        return 0;
+	ret = 0;
+err:
+	Py_XDECREF(delimiter);
+	Py_XDECREF(doublequote);
+	Py_XDECREF(escapechar);
+	Py_XDECREF(lineterminator);
+	Py_XDECREF(quotechar);
+	Py_XDECREF(quoting);
+	Py_XDECREF(skipinitialspace);
+	Py_XDECREF(strict);
+        return ret;
 }
 
 static PyObject *
@@ -433,7 +486,7 @@
         0,                                      /* tp_weaklistoffset */
         0,                                      /* tp_iter */
         0,                                      /* tp_iternext */
-        Dialect_methods,                        /* tp_methods */
+	0,					/* tp_methods */
         Dialect_memberlist,                     /* tp_members */
         Dialect_getsetlist,                     /* tp_getset */
 	0,					/* tp_base */
@@ -1332,7 +1385,7 @@
                 return NULL;
         }
         Py_INCREF(dialect_obj);
-        /* A class rather than an instance? Instanciate */
+        /* A class rather than an instance? Instantiate */
         if (PyObject_TypeCheck(dialect_obj, &PyClass_Type)) {
                 PyObject * new_dia;
                 new_dia = PyObject_CallFunction(dialect_obj, "");
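
For reference, the three forms dialect_instantiate() accepts map to these
Python-level calls - a quick sketch, not part of the patch, with "semi"
just a made-up example dialect:

    import csv

    class semi(csv.excel):          # a dialect class; instantiated for you
        delimiter = ';'

    csv.register_dialect('semi', semi)

    lines = ['a;b;"c;d"\r\n']
    print list(csv.reader(lines, dialect='semi'))   # registry name
    print list(csv.reader(lines, dialect=semi))     # the class itself
    print list(csv.reader(lines, dialect=semi()))   # an instance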

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Fri Jan  7 03:22:22 2005
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 6 Jan 2005 20:22:22 -0600
Subject: [Csv] CSV module brain surgery
In-Reply-To: <20050107011533.EA2183C8E5@coffee.object-craft.com.au>
References: <20050107011533.EA2183C8E5@coffee.object-craft.com.au>
Message-ID: <16861.62046.244101.873686@montanaro.dyndns.org>


    Andrew> The "dialect" type in the CSV module had been bugging me for a
    Andrew> while - it's used to hold the C-type representation of the
    Andrew> parser config, and barely exposed to the user (except as an
    Andrew> attribute on the reader and writer objects).

    Andrew> There were several problems with this internal dialect type -
    Andrew> the primary one was that you could write to its attributes,
    Andrew> which meant that cross-attribute validation was doomed. It also
    Andrew> reported errors terribly, typically raising something like
    Andrew> "invalid type for builtin" and no more information.

    Andrew> So, I rewrote it.

    ....

    Andrew> I'm about ready to commit this (and some minor changes to the
    Andrew> tests).  Comments, please?

As long as I can still pass a dialect class into the constructor and have it
interpreted properly, I don't really care what else happens. ;-)

Skip


From andrewm at object-craft.com.au  Fri Jan  7 04:08:33 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Jan 2005 14:08:33 +1100
Subject: [Csv] CSV module brain surgery 
In-Reply-To: <16861.62046.244101.873686@montanaro.dyndns.org> 
References: <20050107011533.EA2183C8E5@coffee.object-craft.com.au>
	<16861.62046.244101.873686@montanaro.dyndns.org>
Message-ID: <20050107030833.709DB3C8E5@coffee.object-craft.com.au>

>    Andrew> The "dialect" type in the CSV module had been bugging me for a
>    Andrew> while - it's used to hold the C-type representation of the
>    Andrew> parser config, and barely exposed to the user (except as an
>    Andrew> attribute on the reader and writer objects).
>
>    Andrew> There were several problems with this internal dialect type -
>    Andrew> the primary one was that you could write to its attributes,
>    Andrew> which meant that cross-attribute validation was doomed. It also
>    Andrew> reported errors terribly, typically raising something like
>    Andrew> "invalid type for builtin" and no more information.
>
>    Andrew> So, I rewrote it.
>
>As long as I can still pass a dialect class into the constructor and have it
>interpreted properly, I don't really care what else happens. ;-)

Yes, obviously the published interface should remain the same, although
the validation done by the Dialect base class is no longer needed (the
underlying dialect type does its own validation).

BTW, I've managed to fix several of the issues raised by:

    http://www.google.com.au/groups?selm=vsb89q1d3n5qb1%40corp.supernews.com

The tricky bit is assuring myself that I haven't introduced any regressions
in the process. 8-)

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Fri Jan  7 07:13:22 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Jan 2005 17:13:22 +1100
Subject: [Csv] csv module TODO list 
In-Reply-To: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> 
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
Message-ID: <20050107061322.A6A563C8E5@coffee.object-craft.com.au>

>There's a bunch of jobs we (CSV module maintainers) have been putting
>off - attached is a list (in no particular order): 
[...]
>Also, review comments from Jeremy Hylton, 10 Apr 2003:
>
>    I've been reviewing extension modules looking for C types that should
>    participate in garbage collection.  I think the csv ReaderObj and
>    WriterObj should participate.  The ReaderObj it contains a reference to
>    input_iter that could be an arbitrary Python object.  The iterator
>    object could well participate in a cycle that refers to the ReaderObj.
>    The WriterObj has a reference to a writeline callable, which could well
>    be a method of an object that also points to the WriterObj.

I finally got around to looking at this, only to realise Jeremy did the
work back in Apr 2003 (thanks). One question, however - the GC doco in
the Python/C API seems to suggest to me that PyObject_GC_Track should be
called on the newly minted object prior to returning from the initialiser
(and correspondingly PyObject_GC_UnTrack should be called prior to
dismantling). This isn't being done in the module as it stands. Is the
module wrong, or is my understanding of the reference manual incorrect?
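
For concreteness, the kind of cycle Jeremy describes can be created from
pure Python - a hypothetical sketch (CycleIter is just an illustration,
not anything in the module or tests):

    import csv, gc

    class CycleIter:
        # An iterator that ends up holding a reference back to the
        # reader consuming it, forming a reference cycle.
        def __init__(self, lines):
            self.lines = iter(lines)
            self.reader = None
        def __iter__(self):
            return self
        def next(self):
            return self.lines.next()

    it = CycleIter(['a,b,c\r\n'])
    reader = csv.reader(it)
    it.reader = reader          # reader -> it -> reader
    del reader, it
    gc.collect()    # only reclaimable if the reader object participates in GC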

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Fri Jan  7 08:54:54 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Jan 2005 18:54:54 +1100
Subject: [Csv] Minor change to behaviour of csv module
Message-ID: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au>

I'm considering a change to the csv module that could potentially break
some obscure uses of the module (but CSV files usually quote, rather
than escape, so the most common uses aren't affected).

Currently, with a non-default escapechar='\\', input like:

    field one,field \
    two,field three

Returns:

    ["field one", "field \\\ntwo", "field three"]

In the 2.5 series, I propose changing this to return:

    ["field one", "field \ntwo", "field three"]

Is this reasonable? Is the old behaviour desirable in any way (we could
add a switch to enable the new behaviour, but I feel that would only
allow the confusion to continue)?
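
For anyone who wants to poke at this from the interpreter, something like
the following (an untested sketch - the commented results are just the
current and proposed values quoted above) shows the difference:

    import csv

    lines = ['field one,field \\\n', 'two,field three\n']
    reader = csv.reader(lines, escapechar='\\')
    print reader.next()
    # currently:    ['field one', 'field \\\ntwo', 'field three']
    # proposed 2.5: ['field one', 'field \ntwo', 'field three']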

BTW, some of my other changes have changed the exceptions raised when
bad arguments were passed to the reader and writer factory functions - 
previously, the exceptions were semi-random, including TypeError,
AttributeError and csv.Error - they should now almost always be TypeError
(like most other argument passing errors). I can't see this being a
problem, but I'm prepared to listen to arguments.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Fri Jan  7 13:06:23 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 07 Jan 2005 23:06:23 +1100
Subject: [Csv] Re: [Python-Dev] Minor change to behaviour of csv module 
In-Reply-To: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au> 
References: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au>
Message-ID: <20050107120623.EC0673C8E5@coffee.object-craft.com.au>

>I'm considering a change to the csv module that could potentially break
>some obscure uses of the module (but CSV files usually quote, rather
>than escape, so the most common uses aren't affected).
>
>Currently, with a non-default escapechar='\\', input like:
>
>    field one,field \
>    two,field three
>
>Returns:
>
>    ["field one", "field \\\ntwo", "field three"]
>
>In the 2.5 series, I propose changing this to return:
>
>    ["field one", "field \ntwo", "field three"]
>
>Is this reasonable? Is the old behaviour desirable in any way (we could
>add a switch to enable the new behaviour, but I feel that would only
>allow the confusion to continue)?

Thinking about this further, I suspect we have to retain the current
behaviour, as broken as it is, as the default: it's conceivable that
someone somewhere is post-processing the result to remove the backslashes,
and if we fix the csv module, we'll break their code.

Note that PEP-305 had nothing to say about escaping, nor does the module
reference manual.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From magnus at hetland.org  Fri Jan  7 14:38:17 2005
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Fri, 7 Jan 2005 14:38:17 +0100
Subject: [Csv] Minor change to behaviour of csv module
In-Reply-To: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au>
References: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au>
Message-ID: <20050107133817.GB5503@idi.ntnu.no>

Andrew McNamara <andrewm at object-craft.com.au>:
>
[snip]
> Currently, with a non-default escapechar='\\', input like:
> 
>     field one,field \
>     two,field three
> 
> Returns:
> 
>     ["field one", "field \\\ntwo", "field three"]
> 
> In the 2.5 series, I propose changing this to return:
> 
>     ["field one", "field \ntwo", "field three"]

IMO this is the *only* reasonable behaviour. I don't understand why
the escape character should be left in; this is one of the reasons why
UNIX-style colon-separated values don't work with the current module.

If one wanted the first version, one would (I presume) write

   field one,field \\\
   two,field three

-- 
Magnus Lie Hetland       Fallen flower I see / Returning to its branch
http://hetland.org       Ah! a butterfly.           [Arakida Moritake]

From mcherm at mcherm.com  Fri Jan  7 14:45:20 2005
From: mcherm at mcherm.com (Michael Chermside)
Date: Fri,  7 Jan 2005 05:45:20 -0800
Subject: [Python-Dev] Re: [Csv] Minor change to behaviour of csv module
Message-ID: <1105105520.41de927049442@mcherm.com>

Andrew explains that in the CSV module, escape characters are not
properly removed.

Magnus writes:
> IMO this is the *only* reasonable behaviour. I don't understand why
> the escape character should be left in; this is one of the reasons why
> UNIX-style colon-separated values don't work with the current module.

Andrew writes back later:
> Thinking about this further, I suspect we have to retain the current
> behaviour, as broken as it is, as the default: it's conceivable that
> someone somewhere is post-processing the result to remove the backslashes,
> and if we fix the csv module, we'll break their code.

I'm with Magnus on this. No one has 4-year-old code using the CSV module.
The existing behavior is just simply WRONG. Sure, of course we should
try to maintain backward compatibility, but surely SOME cases don't
require it, right? Can't we treat this misbehavior as an outright bug?

-- Michael Chermside


From tim.peters at gmail.com  Fri Jan  7 17:00:42 2005
From: tim.peters at gmail.com (Tim Peters)
Date: Fri, 7 Jan 2005 11:00:42 -0500
Subject: [Python-Dev] Re: [Csv] csv module TODO list
In-Reply-To: <20050107061322.A6A563C8E5@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	 <20050107061322.A6A563C8E5@coffee.object-craft.com.au>
Message-ID: <1f7befae05010708005275e23d@mail.gmail.com>

[Andrew McNamara]
>> Also, review comments from Jeremy Hylton, 10 Apr 2003:
>>
>>    I've been reviewing extension modules looking for C types that should
>>    participate in garbage collection.  I think the csv ReaderObj and
>>    WriterObj should participate.  The ReaderObj it contains a reference to
>>    input_iter that could be an arbitrary Python object.  The iterator
>>    object could well participate in a cycle that refers to the ReaderObj.
>>    The WriterObj has a reference to a writeline callable, which could well
>>    be a method of an object that also points to the WriterObj.

> I finally got around to looking at this, only to realise Jeremy did the
> work back in Apr 2003 (thanks). One question, however - the GC doco in
> the Python/C API seems to suggest to me that PyObject_GC_Track should be
> called on the newly minted object prior to returning from the initialiser
> (and correspondingly PyObject_GC_UnTrack should be called prior to
> dismantling). This isn't being done in the module as it stands. Is the
> module wrong, or is my understanding of the reference manual incorrect?

The purpose of "tracking" and "untracking" is to let cyclic gc know
when it (respectively) is and isn't safe to call an object's
tp_traverse method.  Primarily, when an object is first created at the
C level, it may contain NULLs or heap trash in pointer slots, and then
the object's tp_traverse could segfault if it were called while the
object remained in an insane (wrt tp_traverse) state.  Similarly,
cleanup actions in the tp_dealloc may make a tp_traverse-sane object
tp_traverse-insane, so tp_dealloc should untrack the object before
that occurs.

If tracking is never done, then the object effectively never
participates in cyclic gc:  its tp_traverse will never get called, and
it will effectively act as an external root (keeping itself and
everything reachable from it alive).  So, yes, track it during
construction, but not before all the members referenced by its
tp_traverse are in a sane state.  Putting the track call "at the end"
of the constructor is usually best practice.

tp_dealloc should untrack it then.  In a debug build, that will
assert-fail if the object hasn't actually been tracked. 
PyObject_GC_Del will untrack it for you (if it's still tracked), but
it's risky to rely on that --  it's too easy to forget that Py_DECREFs
on contained objects can end up executing arbitrary Python code (via
__del__ and weakref callbacks, and via allowing other threads to run),
which can in turn trigger a round of cyclic gc *while* your tp_dealloc
is still running.  So it's safest to untrack the object very early in
tp_dealloc.

I doubt this happens in the csv module, but an untrack/track pair
should also be put around any block of method code that temporarily
puts the object into a tp_traverse-insane state and that contains any
C API calls that may end up triggering cyclic gc.  That's very rare.

From skip at pobox.com  Fri Jan  7 17:09:13 2005
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 7 Jan 2005 10:09:13 -0600
Subject: [Csv] Minor change to behaviour of csv module
In-Reply-To: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au>
References: <20050107075454.AC1A13C8E5@coffee.object-craft.com.au>
Message-ID: <16862.46121.778915.968964@montanaro.dyndns.org>


    Andrew> I'm considering a change to the csv module that could
    Andrew> potentially break some obscure uses of the module (but CSV files
    Andrew> usually quote, rather than escape, so the most common uses
    Andrew> aren't affected).

I'm with the other respondents.  This looks like a bug that should be
squashed.

Skip

From skip at pobox.com  Sat Jan  8 06:03:07 2005
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 7 Jan 2005 23:03:07 -0600
Subject: [Csv] valgrind output
Message-ID: <16863.27019.162437.881182@montanaro.dyndns.org>

I compiled Python in an up-to-date cvs sandbox and ran

    ./python ../Lib/test/regrtest.py test_csv

under control of "valgrind --tool=memcheck" with the default valgrind
suppression file that comes with the Python distribution.  I've attached the
output.  If you search for "csv" you'll see where the "test_csv" line is
emitted and where valgrind finds suspicious memory activity during the
test.

I'm not much of a valgrind person, having only used it once or twice, so I
didn't bother at this stage to dig into the output.  If there's more I can
do, let me know and I'll make some more runs.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: csvtest.log
Type: application/octet-stream
Size: 57154 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20050107/64914c09/attachment.obj 

From andrewm at object-craft.com.au  Mon Jan 10 01:40:06 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 10 Jan 2005 11:40:06 +1100
Subject: [Python-Dev] Re: [Csv] Minor change to behaviour of csv module 
In-Reply-To: <48F57F83-60B3-11D9-ADA4-000A95EFAE9E@aleax.it> 
References: <1105105520.41de927049442@mcherm.com>
	<48F57F83-60B3-11D9-ADA4-000A95EFAE9E@aleax.it>
Message-ID: <20050110004006.88CB63C8E5@coffee.object-craft.com.au>

>> Andrew explains that in the CSV module, escape characters are not
>> properly removed.
>>
>> Magnus writes:
>>> IMO this is the *only* reasonable behaviour. I don't understand why
>>> the escape character should be left in; this is one of the reasons why
>>> UNIX-style colon-separated values don't work with the current module.
>>
>> Andrew writes back later:
>>> Thinking about this further, I suspect we have to retain the current
>>> behaviour, as broken as it is, as the default: it's conceivable that
>>> someone somewhere is post-processing the result to remove the 
>>> backslashes,
>>> and if we fix the csv module, we'll break their code.
>>
>> I'm with Magnus on this. No one has 4 year old code using the CSV 
>> module.
>> The existing behavior is just simply WRONG. Sure, of course we should
>> try to maintain backward compatibility, but surely SOME cases don't
>> require it, right? Can't we treat this misbehavior as an outright bug?
>
>+1 -- the nonremoval of escape characters smells like a bug to me, too.

Okay, I'm glad the community agrees (less work, less crustification).

For what it's worth, it wasn't a bug so much as a misfeature. I was
explicitly adding the escape character back in. The intention was to
make the feature more forgiving of users who accidentally set the escape
character - in other words, only special (quoting, escaping, field
delimiter) characters received special treatment. With the benefit of
hindsight, that was an inadequately considered choice.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Mon Jan 10 04:41:09 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 10 Jan 2005 14:41:09 +1100
Subject: [Csv] valgrind output 
In-Reply-To: <16863.27019.162437.881182@montanaro.dyndns.org> 
References: <16863.27019.162437.881182@montanaro.dyndns.org>
Message-ID: <20050110034109.423273C889@coffee.object-craft.com.au>

>I compiled Python in an up-to-date cvs sandbox and ran
>
>    ./python ../Lib/test/regrtest.py test_csv
>
>under control of "valgrind --tool=memcheck" with the default valgrind
>suppression file that comes with the Python distribution.  I've attached the
>output.  If you search for "csv" you'll see where the "test_csv" line is
>emitted and where valgrind finds suspicious memory activity during the
>test.

Did you do the other things mentioned in Misc/README.valgrind (uncomment
Py_USING_MEMORY_DEBUGGER, uncomment PyObject_Free and PyObject_Realloc
suppressions)? When I do the things it suggests, and use the python
suppression file, I get no errors.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Mon Jan 10 05:44:41 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 10 Jan 2005 15:44:41 +1100
Subject: [Csv] csv module and universal newlines
In-Reply-To: <20050105070643.5915B3C8E5@coffee.object-craft.com.au> 
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
Message-ID: <20050110044441.250103C889@coffee.object-craft.com.au>

This item, from the TODO list, has been bugging me for a while:

>* Reader and universal newlines don't interact well, reader doesn't
>  honour Dialect's lineterminator setting. All outstanding bug id's
>  (789519, 944890, 967934 and 1072404) are related to this - it's 
>  a difficult problem and further discussion is needed.

The csv parser consumes lines from an iterator, but it also has its own
idea of end-of-line conventions, which is currently only used by the
writer, not the reader - a source of much confusion. The writer,
by default, also attempts to emit a \r\n sequence, which results in more
confusion unless the file is opened in binary mode.

I'm looking for suggestions for how we can mitigate these problems
(without breaking things for existing users).

The standard file iterator includes the end-of-line characters in the
returned string. One potential solution, then, is to ignore the line
chunking done by the file iterator, and logically concatenate the source
lines until the csv parser's idea of lineterminator is seen - but this
negates the benefits of using an iterator.
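
A rough sketch of that approach (a hypothetical helper, not something
that exists in the module) might look like:

    def records(lines, terminator='\r\n'):
        # Ignore the iterator's own line chunking and re-split the text
        # on the dialect's lineterminator instead.
        buf = ''
        for chunk in lines:
            buf += chunk
            while terminator in buf:
                record, buf = buf.split(terminator, 1)
                yield record + terminator
        if buf:
            yield buf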

Another option might be to provide a new interface that relies on a
file-like object being supplied. The lineterminator character would only
be used with this interface, with the current interface falling back to
using only \n. Rather a drastic solution.

Any other ideas?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From neal at metaslash.com  Tue Jan 11 00:31:26 2005
From: neal at metaslash.com (Neal Norwitz)
Date: Mon, 10 Jan 2005 18:31:26 -0500
Subject: [Csv] csv module TODO list
In-Reply-To: <20050105110849.CBA843C8E5@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<20050105110849.CBA843C8E5@coffee.object-craft.com.au>
Message-ID: <20050110233126.GA14363@janus.swcomplete.com>

On Wed, Jan 05, 2005 at 10:08:49PM +1100, Andrew McNamara wrote:
> >Also, review comments from Neal Norwitz, 22 Mar 2003 (some of these should
> >already have been addressed):
> 
> I should apologise to Neal here for not replying to him at the time.

Hey, I'm impressed you got to them.  :-) I completely forgot about it.

> >* rather than use PyErr_BadArgument, should you use assert?
> >        (first example, Dialect_set_quoting, line 218)
> 
> You mean C assert()? I don't think I'm really following you here -
> where would the type of the object be checked in a way the user could
> recover from?

IIRC, I meant C assert().  This goes back to a discussion a long time
ago about what is the preferred way to handle invalid arguments.
I doubt it's important to change.

> >* I think you need PyErr_NoMemory() before returning on line 768, 1178
> 
> The examples I looked at in the Python core didn't do this - are you sure?
> (now lines 832 and 1280). 

Originally, they were a plain PyObject_NEW().  Now they are a
PyObject_GC_New() so it seems no further change is necessary.

> >* is PyString_AsString(self->dialect->lineterminator) on line 994
> >        guaranteed not to return NULL?  If not, it could crash by
> >        passing to memmove.
> >* PyString_AsString() can return NULL on line 1048 and 1063, 
> >        the result is passed to join_append()
> 
> Looking at the PyString_AsString implementation, it looks safe (we ensure
> it's really a string elsewhere)?

Ok.  Then it should be fine.  I spot checked lineterminator and it
looked ok.

> >* iteratable should be iterable?  (line 1088)
> 
> Sorry, I don't know what you're getting at here? (now line 1162).

Heh, I had to read that twice myself.  It was a typo (assuming
I wasn't completely wrong)--an extra "at", but it doesn't exist
any longer.

I don't think there are any changes remaining to be done from my
original code review.

BTW, I always try to run valgrind before a release, especially
major releases.

Neal

From skip at pobox.com  Wed Jan 12 02:59:22 2005
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 11 Jan 2005 19:59:22 -0600
Subject: [Csv] csv module and universal newlines
In-Reply-To: <20050110044441.250103C889@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
        <20050110044441.250103C889@coffee.object-craft.com.au>
Message-ID: <16868.33914.837771.954739@montanaro.dyndns.org>


    Andrew> The csv parser consumes lines from an iterator, but it also has
    Andrew> it's own idea of end-of-line conventions, which are currently
    Andrew> only used by the writer, not the reader, which is a source of
    Andrew> much confusion. The writer, by default, also attempts to emit a
    Andrew> \r\n sequence, which results in more confusion unless the file
    Andrew> is opened in binary mode.

    Andrew> I'm looking for suggestions for how we can mitigate these
    Andrew> problems (without breaking things for existing users).

You can argue that reading csv data from, or writing csv data to, a file
on Windows that isn't opened in binary mode is an error.  Perhaps we should
enforce that in situations where it matters.  Would this be a start?

    terminators = {"darwin": "\r",
                   "win32": "\r\n"}

    if (dialect.lineterminator != terminators.get(sys.platform, "\n") and
       "b" not in getattr(f, "mode", "b")):
       raise IOError, ("%s not opened in binary mode" %
                       getattr(f, "name", "???"))

The elements of the postulated terminators dictionary may already exist
somewhere within the sys or os modules (if not, perhaps they should be
added).  The idea of the check is to enforce binary mode on those objects
that support a mode if the desired line terminator doesn't match the
platform's line terminator.

Skip

From andrewm at object-craft.com.au  Wed Jan 12 23:55:25 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 13 Jan 2005 09:55:25 +1100
Subject: [Python-Dev] Re: [Csv] csv module and universal newlines 
In-Reply-To: <16868.33914.837771.954739@montanaro.dyndns.org> 
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<20050110044441.250103C889@coffee.object-craft.com.au>
	<16868.33914.837771.954739@montanaro.dyndns.org>
Message-ID: <20050112225525.236BE3C889@coffee.object-craft.com.au>

>You can argue that reading csv data from/writing csv data to a file on
>Windows if the file isn't opened in binary mode is an error.  Perhaps we
>should enforce that in situations where it matters.  Would this be a start?
>
>    terminators = {"darwin": "\r",
>                   "win32": "\r\n"}
>
>    if (dialect.lineterminator != terminators.get(sys.platform, "\n") and
>       "b" not in getattr(f, "mode", "b")):
>       raise IOError, ("%s not opened in binary mode" %
>                       getattr(f, "name", "???"))
>
>The elements of the postulated terminators dictionary may already exist
>somewhere within the sys or os modules (if not, perhaps they should be
>added).  The idea of the check is to enforce binary mode on those objects
>that support a mode if the desired line terminator doesn't match the
>platform's line terminator.

Where that falls down, I think, is where you want to read an alien
file - in fact, under unix, most of the CSV files I read use \r\n for
end-of-line.

Also, I *really* don't like the idea of looking for a mode attribute
on the supplied iterator - it feels like a layering violation. We've
advertised the fact that it's an iterator, so we shouldn't be using
anything but the iterator protocol.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From Jack.Jansen at cwi.nl  Thu Jan 13 00:02:39 2005
From: Jack.Jansen at cwi.nl (Jack Jansen)
Date: Thu, 13 Jan 2005 00:02:39 +0100
Subject: [Python-Dev] Re: [Csv] csv module and universal newlines
In-Reply-To: <16868.33914.837771.954739@montanaro.dyndns.org>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<20050110044441.250103C889@coffee.object-craft.com.au>
	<16868.33914.837771.954739@montanaro.dyndns.org>
Message-ID: <0E6093F4-64EE-11D9-B7C6-000D934FF6B4@cwi.nl>


On 12-jan-05, at 2:59, Skip Montanaro wrote:
>     terminators = {"darwin": "\r",
>                    "win32": "\r\n"}
>
>     if (dialect.lineterminator != terminators.get(sys.platform, "\n") 
> and
>        "b" not in getattr(f, "mode", "b")):
>        raise IOError, ("%s not opened in binary mode" %
>                        getattr(f, "name", "???"))

On MacOSX you really want universal newlines. CSV files produced by 
older software (such as AppleWorks) will have \r line terminators, but 
lots of other programs will have files with normal \n terminators.
--
Jack Jansen, <Jack.Jansen at cwi.nl>, http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma 
Goldman


From skip at pobox.com  Thu Jan 13 03:36:54 2005
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Jan 2005 20:36:54 -0600
Subject: [Python-Dev] Re: [Csv] csv module and universal newlines 
In-Reply-To: <20050112225525.236BE3C889@coffee.object-craft.com.au>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
        <20050110044441.250103C889@coffee.object-craft.com.au>
        <16868.33914.837771.954739@montanaro.dyndns.org>
        <20050112225525.236BE3C889@coffee.object-craft.com.au>
Message-ID: <16869.57030.306263.612202@montanaro.dyndns.org>


    >> The idea of the check is to enforce binary mode on those objects that
    >> support a mode if the desired line terminator doesn't match the
    >> platform's line terminator.

    Andrew> Where that falls down, I think, is where you want to read an
    Andrew> alien file - in fact, under unix, most of the CSV files I read
    Andrew> use \r\n for end-of-line.

Well, you can either require 'b' in that situation or "know" that 'b' isn't
needed on Unix systems.

    Andrew> Also, I *really* don't like the idea of looking for a mode
    Andrew> attribute on the supplied iterator - it feels like a layering
    Andrew> violation. We've advertised the fact that it's an iterator, so
    Andrew> we shouldn't be using anything but the iterator protocol.

The fundamental problem is that the iterator protocol on files is designed
for use only with text mode (or universal newline mode, but that's just as
much of a problem in this context).  I think you either have to abandon the
iterator protocol or peek under the iterator's covers to make sure it reads
and writes in binary mode.  Right now, people on Windows create writers like
this

    writer = csv.writer(open("somefile", "w"))

and are confused when their csv files contain blank lines.  I think the
reader and writer objects have to at least emit a warning when they discover
a source or destination that violates the requirements.

Skip

From skip at pobox.com  Thu Jan 13 03:39:41 2005
From: skip at pobox.com (Skip Montanaro)
Date: Wed, 12 Jan 2005 20:39:41 -0600
Subject: [Python-Dev] Re: [Csv] csv module and universal newlines
In-Reply-To: <0E6093F4-64EE-11D9-B7C6-000D934FF6B4@cwi.nl>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
        <20050110044441.250103C889@coffee.object-craft.com.au>
        <16868.33914.837771.954739@montanaro.dyndns.org>
        <0E6093F4-64EE-11D9-B7C6-000D934FF6B4@cwi.nl>
Message-ID: <16869.57197.95323.656027@montanaro.dyndns.org>

    Jack> On MacOSX you really want universal newlines. CSV files produced
    Jack> by older software (such as AppleWorks) will have \r line
    Jack> terminators, but lots of other programs will have files with
    Jack> normal \n terminators.

Won't work.  You have to be able to write a Windows csv file on any
platform.  Binary mode is the only way to get that.

Skip


From bob at redivi.com  Thu Jan 13 03:56:05 2005
From: bob at redivi.com (Bob Ippolito)
Date: Wed, 12 Jan 2005 21:56:05 -0500
Subject: [Python-Dev] Re: [Csv] csv module and universal newlines
In-Reply-To: <16869.57197.95323.656027@montanaro.dyndns.org>
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<20050110044441.250103C889@coffee.object-craft.com.au>
	<16868.33914.837771.954739@montanaro.dyndns.org>
	<0E6093F4-64EE-11D9-B7C6-000D934FF6B4@cwi.nl>
	<16869.57197.95323.656027@montanaro.dyndns.org>
Message-ID: <AB07A6FA-650E-11D9-B569-000A95BA5446@redivi.com>


On Jan 12, 2005, at 21:39, Skip Montanaro wrote:

>     Jack> On MacOSX you really want universal newlines. CSV files 
> produced
>     Jack> by older software (such as AppleWorks) will have \r line
>     Jack> terminators, but lots of other programs will have files with
>     Jack> normal \n terminators.
>
> Won't work.  You have to be able to write a Windows csv file on any
> platform.  Binary mode is the only way to get that.

Isn't universal newlines only used for reading?

I have had no problems using the csv module for reading files with 
universal newlines by opening the file myself or providing an iterator.

Unicode, on the other hand, I have had problems with.

-bob


From andrewm at object-craft.com.au  Thu Jan 13 04:21:41 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 13 Jan 2005 14:21:41 +1100
Subject: [Python-Dev] Re: [Csv] csv module and universal newlines 
In-Reply-To: <AB07A6FA-650E-11D9-B569-000A95BA5446@redivi.com> 
References: <20050105070643.5915B3C8E5@coffee.object-craft.com.au>
	<20050110044441.250103C889@coffee.object-craft.com.au>
	<16868.33914.837771.954739@montanaro.dyndns.org>
	<0E6093F4-64EE-11D9-B7C6-000D934FF6B4@cwi.nl>
	<16869.57197.95323.656027@montanaro.dyndns.org>
	<AB07A6FA-650E-11D9-B569-000A95BA5446@redivi.com>
Message-ID: <20050113032141.78EB13C889@coffee.object-craft.com.au>

>Isn't universal newlines only used for reading?

That's right. And the CSV reader has its own version of universal newlines
anyway (from the py1.5 days).

>I have had no problems using the csv module for reading files with 
>universal newlines by opening the file myself or providing an iterator.

Neither have I, funnily enough.

>Unicode, on the other hand, I have had problems with.

Ah, so somebody does want it then? Good to hear. Hard to get motivated
to make radical changes without feedback.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Thu Jan 13 04:49:05 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Thu, 13 Jan 2005 14:49:05 +1100
Subject: [Csv] Something's fishy w/ Mac line endings... 
In-Reply-To: <16192.61853.960831.703844@montanaro.dyndns.org> 
References: <16189.38378.352326.481821@montanaro.dyndns.org>
	<20030818010256.31A8E3CA49@coffee.object-craft.com.au>
	<16192.16871.968296.398935@montanaro.dyndns.org>
	<20030818042033.A219F3CA49@coffee.object-craft.com.au>
	<16192.61853.960831.703844@montanaro.dyndns.org>
Message-ID: <20050113034905.18AF43C889@coffee.object-craft.com.au>

Just going back through old mail, and I came across this from last time we
considered this issue:

On Mon, 18 Aug 2003, Skip Montanaro wrote:
>Unfortunately, I think the correct fix is to not require a NUL following
>every \r or \n character encountered.  I think that places the ball in your
>court for the moment.  Can you evaluate how hard that would be?

This would actually result in us losing data, unfortunately (the data
between the \r and the "end-of-string" \0 is part of the file).

What's happening is that the file iterator on the mac is not recognising
\r as end-of-line, and it's presumably returning the whole file as
one line.

I could make the csv parser treat \r as end-of-line and continue
processing the string, but papering over it in the CSV module is only
going to lead to worse problems (what happens if someone tries to read a
2GB file?) - better the user knows they've made an error earlier rather
than later. The problem is that the error message doesn't obviously lead
one to the cause.

I suspect the only answer is to add a caveats or usage section to the
reference manual.
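
For the record, one reading-side workaround is to let universal newlines
do the line splitting before the csv reader sees the data - a sketch,
assuming Python 2.3 or later, with "appleworks.csv" just a placeholder
name:

    import csv

    # 'U' makes the file object treat \r, \n and \r\n as line endings,
    # so an old Mac-style (\r-terminated) file is split into lines
    # before the csv reader ever sees it.
    for row in csv.reader(open("appleworks.csv", "rU")):
        print row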


-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Thu Jan 20 22:04:13 2005
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 20 Jan 2005 15:04:13 -0600
Subject: [Csv] csv module generating an invalid line?
Message-ID: <16880.7373.8765.516395@montanaro.dyndns.org>


We use the csv module in the SpamBayes project as an interchange format (*).
It's generating, in part, a file like this:

    ...
    simplymaya,0,1
    entitled,1,1
    "subject:          
",0,1
    depression.,1,0
    ...

Note the CR inside the quoted field (third line).  When I try to read that
file, blammo!  This example consists of a junk.csv file with just the above
four lines:

    >>> for row in csv.reader(open("junk.csv")):
    ...   print row
    ... 
    ['simplymaya', '0', '1']
    ['entitled', '1', '1']
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    Error: newline inside string

I think one way or the other the csv module is broken.  Either it should be
able to read this csv file or it should somehow generate it differently.

I've confirmed this with Python from CVS (as of Jan 5 05), the 2.4
maintenance branch (as of Dec 26 04) and Python 2.3.4.

Thoughts?

Skip

* See the sb_dbexpimp.py script:

    http://cvs.sourceforge.net/viewcvs.py/spambayes/spambayes/scripts/sb_dbexpimp.py?rev=1.17&view=log

The above has this import:

    try:
	import csv
	# might get the old object craft csv module - has no reader attr 
	if not hasattr(csv, "reader"): 
	    raise ImportError 
    except ImportError:
	import spambayes.compatcsv as csv

Note that I am getting the Python-sourced csv file, not the compatibility
module that's part of the SpamBayes code.

From skip at pobox.com  Thu Jan 20 23:13:05 2005
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 20 Jan 2005 16:13:05 -0600
Subject: [Csv] csv module generating an invalid line?
In-Reply-To: <16880.7373.8765.516395@montanaro.dyndns.org>
References: <16880.7373.8765.516395@montanaro.dyndns.org>
Message-ID: <16880.11505.110836.163654@montanaro.dyndns.org>


    Skip>     ...
    Skip>     simplymaya,0,1
    Skip>     entitled,1,1
    Skip>     "subject:          
    Skip> ",0,1
    Skip>     depression.,1,0
    Skip>     ...

Ack...  Attached to prevent email corruption...

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: junk.csv
Type: application/octet-stream
Size: 73 bytes
Desc: not available
Url : http://mail.python.org/pipermail/csv/attachments/20050120/6269cb94/attachment.obj 

From sjmachin at lexicon.net  Fri Jan 21 02:17:47 2005
From: sjmachin at lexicon.net (sjmachin at lexicon.net)
Date: Fri, 21 Jan 2005 12:17:47 +1100
Subject: [Csv] csv module generating an invalid line?
In-Reply-To: <16880.7373.8765.516395@montanaro.dyndns.org>
Message-ID: <41F0F2EB.23584.13AC6A1@localhost>

On 20 Jan 2005 at 15:04, Skip Montanaro wrote:

> 
> We use the csv module in the SpamBayes project as an interchange
> format (*). It's generating, in part, a file like this:
> 
>     ...
>     simplymaya,0,1
>     entitled,1,1
>     "subject:          
> ",0,1
>     depression.,1,0
>     ...
> 
> Note the CR inside the quoted field (third line).  When I try to read
> that file, blammo!  This example consists of a junk.csv file with just
> the above four lines:
> 
>     >>> for row in csv.reader(open("junk.csv")):
>     ...   print row
>     ... 
>     ['simplymaya', '0', '1']
>     ['entitled', '1', '1']
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     Error: newline inside string
> 
> I think one way or the other the csv module is broken.  Either it
> should be able to read this csv file or it should somehow generate it
> differently.
> 
> I've confirmed this with Python from CVS (as of Jan 5 05), the 2.4
> maintenance branch (as of Dec 26 04) and Python 2.3.4.
> 
> Thoughts?
> 
> Skip


>>> file('junk.csv', 'rb').read()
'simplymaya,0,1\r\nentitled,1,1\r\n"subject:          \r",0,1\r\ndepression.,1,0\r\n'

Your junk.csv appears to be a valid csv file. The field containing the embedded \r is 
quoted properly. 

It's the _reader_ that's broken. Doubly so: (1) chucking an exception (2) calling \r a 
"newline".

As you say, it's broken in 2.3 as well:

Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
>>> import csv
>>> r = csv.reader(file('junk.csv','rb'))
>>> contents = list(r)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
_csv.Error: newline inside string
>>>

Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32
>>> import csv
>>> list(csv.reader(file('junk.csv', 'rb')))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
_csv.Error: newline inside string
>>>


From skip at pobox.com  Fri Jan 21 03:31:59 2005
From: skip at pobox.com (Skip Montanaro)
Date: Thu, 20 Jan 2005 20:31:59 -0600
Subject: [Csv] Test case
Message-ID: <16880.27039.553666.583956@montanaro.dyndns.org>

Here's a test script for the problem I described earlier:

    #!/usr/bin/env python

    import csv
    import os

    row = ["sub\rject:","0","1"]

    writer = csv.writer(open("tmp.csv", "wb"))
    writer.writerow(row)
    del writer
    reader = csv.reader(open("tmp.csv", "rb"))
    for row in reader:
        try:
            print row
        except csv.Error, msg:
            print msg
    del reader
    os.remove("tmp.csv")

I cvs up'd my Python source and confirmed that the problem is fixed there.
It's a problem in 2.3 and 2.4 though.  Any chance this can be fixed in time
for 2.3.5?

Skip

From sjmachin at lexicon.net  Sat Jan 22 00:06:57 2005
From: sjmachin at lexicon.net (sjmachin at lexicon.net)
Date: Sat, 22 Jan 2005 10:06:57 +1100
Subject: [Csv] bugs in parsing csv?
Message-ID: <41F225C1.10004.C6B5A1@localhost>

I came across this example in the online version of "Programming in Lua" by Roberto
Ierusalimschy:

>>> weird = '"hello "" hello", "",""\r\n'

This is not IMHO a correctly formed CSV string. It would not be produced by csv.writer.

However csv.reader accepts it without complaint:
>>> import csv
>>> rdr = csv.reader([weird])
>>> weird2 = rdr.next()
>>> weird2
['hello " hello', ' ""', '']

>>> wtr = csv.writer(file('weird2.csv', 'wb'))
>>> wtr.writerow(weird2)
>>> del wtr
>>> file('weird2.csv', 'rb').read()
'"hello "" hello"," """"",\r\n'
# correctly quoted.

Here are some more examples:

>>> csv.reader([' "\r\n']).next()
[' "']
>>> csv.reader([' ""\r\n']).next()
[' ""']
>>> csv.reader(['x ""\r\n']).next()
['x ""']
>>> csv.reader(['x "\r\n']).next()
['x "']

Looks like we don't give a damn if the field doesn't start with a quote. In the real world 
this result might be OK for a field like 'Pat O"Brien' but it does indicate that the data 
source is probably _NOT_ quoting at all.

However a not-infrequent mistake made by people generating what they call csv files is 
to wrap quotes around some/all fields without doubling any pre-existing quotes:

>>> csv.reader(['"Pat O"Brien"\r\n']).next()
['Pat OBrien"'] <<<<<<<<<<<============== aarrbejaysus!!!

Further examples of where the data source needs head alignment and csv.reader 
doesn't complain, giving an unfortunate result:

>>> csv.reader(['spot",the",mistake"\r\n']).next()
['spot"', 'the"', 'mistake"']

>>> csv.reader(['"attempt", "at", "pretty", "formatting"\r\n']).next()
['attempt', ' "at"', ' "pretty"', ' "formatting"']

From skip at pobox.com  Sat Jan 22 21:32:06 2005
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 22 Jan 2005 14:32:06 -0600
Subject: [Csv] List/email migration coming up for mail services host by
	mojam.com
Message-ID: <16882.47174.580103.928772@montanaro.dyndns.org>


Folks,

Mojam.com has a new email server.  This note is a heads up to let everybody
know that I plan to migrate all email services (including mailing lists)
hosted on mail.mojam.com (aka manatee.mojam.com) in the next week or two.
My current preferred date is Saturday, January 29th.  If that presents a
problem for anyone, let me know.  At that time I will make the following
changes:

    * All POP mailboxes will be moved to the new machine

    * Any mailing lists of the form <somelist>@manatee.mojam.com will be
      converted to <somelist>@mail.mojam.com.

Since I'm migrating from a rather old Mandrake Linux machine running
Sendmail as its MTA to a new Fedora Core 2 machine running Postfix as its
MTA, I expect email hosting/forwarding and mailing lists to be unavailable
for a good part of the day.  I will send a message out when I start the
migration and another message once I've finished (hopefully just to tell you
that all changes were successful).  Chris D, I may need to get some phone
time with you to discuss how this will affect any non-Mojam domains you are
responsible for.

If you have any questions, feel free to drop me a note (skip at pobox.com) or
give me a call (847-971-7098), especially if your need is urgent and I've
failed to respond to an email in a timely (< 1 day) fashion.

-- 
Skip Montanaro
skip at mojam.com
http://www.mojam.com/

From andrewm at object-craft.com.au  Mon Jan 24 00:00:28 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 24 Jan 2005 10:00:28 +1100
Subject: [Csv] Test case 
In-Reply-To: <16880.27039.553666.583956@montanaro.dyndns.org> 
References: <16880.27039.553666.583956@montanaro.dyndns.org>
Message-ID: <20050123230028.807C63C889@coffee.object-craft.com.au>

>I cvs up'd my Python source and confirmed that the problem is fixed there.
>It's a problem in 2.3 and 2.4 though.  Any chance this can be fixed in time
>for 2.3.5?

The fix involved some radical surgery, so I doubt it's appropriate for
2.3.5 - sorry.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Mon Jan 24 13:09:09 2005
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 24 Jan 2005 06:09:09 -0600
Subject: [Csv] Test case 
In-Reply-To: <20050123230028.807C63C889@coffee.object-craft.com.au>
References: <16880.27039.553666.583956@montanaro.dyndns.org>
        <20050123230028.807C63C889@coffee.object-craft.com.au>
Message-ID: <16884.58725.780104.976776@montanaro.dyndns.org>


    >> I cvs up'd my Python source and confirmed that the problem is fixed
    >> there.  It's a problem in 2.3 and 2.4 though.  Any chance this can be
    >> fixed in time for 2.3.5?

    Andrew> The fix involved some radical surgery, so I doubt it's
    Andrew> appropriate for 2.3.5 - sorry.

Bummer.  Okay, we have a workaround in SpamBayes, and it is a pretty rare
corner case for that app.  Since we've bumped into this before, do you think
it warrants a note in the docs?

Skip

From andrewm at object-craft.com.au  Mon Jan 24 14:17:23 2005
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 25 Jan 2005 00:17:23 +1100
Subject: [Csv] Test case 
In-Reply-To: <16884.58725.780104.976776@montanaro.dyndns.org> 
References: <16880.27039.553666.583956@montanaro.dyndns.org>
	<20050123230028.807C63C889@coffee.object-craft.com.au>
	<16884.58725.780104.976776@montanaro.dyndns.org>
Message-ID: <20050124131723.E63F73C889@coffee.object-craft.com.au>

>    >> I cvs up'd my Python source and confirmed that the problem is fixed
>    >> there.  It's a problem in 2.3 and 2.4 though.  Any chance this can be
>    >> fixed in time for 2.3.5?
>
>    Andrew> The fix involved some radical surgery, so I doubt it's
>    Andrew> appropriate for 2.3.5 - sorry.
>
>Bummer.  

For reference, the parser was partially doing EOL processing in the
line iterator code, partially in the state machine. This meant the EOL
processing had no idea whether it was in a quoted field or not. In 2.5, I
moved all the EOL processing into the state machine. 

>Okay, we have a workaround in SpamBayes, and it is a pretty rare
>corner case for that app.  Since we've bumped into this before do you think
>it warrants a note in the docs?

Possibly. There's a bunch of other stuff of a similar nature that could do
with documenting. I'm inclined to think of it as a bug - while documenting
bugs is a nice thing, there doesn't seem to be much of a precedent for
it in the reference manual.. 8-)

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Sat Jan 29 23:09:02 2005
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 29 Jan 2005 16:09:02 -0600
Subject: [Csv] testing 1 2 3 ...
Message-ID: <16892.2430.53152.587960@montanaro.dyndns.org>


This is a test message from skip to see if the new email server/mailman is
processing mail well (or at all).  Please disregard.

Skip


From skip at pobox.com  Sun Jan 30 00:00:23 2005
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 29 Jan 2005 17:00:23 -0600
Subject: [Csv] List/email migration complete
Message-ID: <16892.5511.804630.221374@montanaro.dyndns.org>


I believe the new Mojam.com mail server is up and running.  If you have
saved addresses of the form

    somewhere at manatee.mojam.com

please change them to

    somewhere at mojam.com

or

    somewhere at mail.mojam.com

Manatee will continue to forward email for a while, so there's no immediate
urgency.  Still, I would like to shut off mail server on that machine in the
next couple weeks, so tend to your housekeeping now.

If you notice that any of the Mojam.com mailing lists or email addresses
you normally use seem to be a black hole, or if you have any other questions
about the Mojam.com mail server, drop me a note directly (skip at pobox.com).

-- 
Skip Montanaro
skip at mojam.com
http://www.mojam.com/