From skip at pobox.com  Sat Aug 16 04:24:42 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri, 15 Aug 2003 21:24:42 -0500
Subject: [Csv] Something's fishy w/ Mac line endings...
Message-ID: <16189.38378.352326.481821@montanaro.dyndns.org>

Folks,

Here's a bug reported against the csv module:

    http://python.org/sf/789519

There seems to be a problem with what it expects to see after the \r
character.  It wants to see either a NUL or a \n followed by a NUL.  In this
case, it sees the '0' which starts the next line.
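
A stripped-down (untested) reproduction along the lines of the report would be
something like:

    import csv

    # write a small file with old-style Mac (\r only) line endings...
    f = open("mac.csv", "wb")
    f.write("1,2,3\r4,5,6\r")
    f.close()

    # ...then read it back the way the docs currently suggest (binary mode)
    for row in csv.reader(open("mac.csv", "rb")):
        print row

The reader trips over the character that follows the bare \r instead of
treating it as the start of a new record.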

I've assigned it to myself for now and I'll try to take a look at it over
the weekend, but Andrew or Dave are welcome to investigate.

Skip

From andrewm at object-craft.com.au  Mon Aug 18 03:02:56 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 18 Aug 2003 11:02:56 +1000
Subject: [Csv] Something's fishy w/ Mac line endings... 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<16189.38378.352326.481821@montanaro.dyndns.org> 
References: <16189.38378.352326.481821@montanaro.dyndns.org> 
Message-ID: <20030818010256.31A8E3CA49@coffee.object-craft.com.au>

>Here's a bug reported against the csv module:
>
>    http://python.org/sf/789519
>
>There seems to be a problem with what it expects to see after the \r
>character.  It wants to see either a NUL or a \n followed by a NUL.  In this
>case, it sees the '0' which starts the next line.

I wonder if it's a unicode issue?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Mon Aug 18 05:03:03 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 17 Aug 2003 22:03:03 -0500
Subject: [Csv] Something's fishy w/ Mac line endings... 
In-Reply-To: <20030818010256.31A8E3CA49@coffee.object-craft.com.au>
References: <16189.38378.352326.481821@montanaro.dyndns.org>
        <20030818010256.31A8E3CA49@coffee.object-craft.com.au>
Message-ID: <16192.16871.968296.398935@montanaro.dyndns.org>


    >> There seems to be a problem with what it expects to see after the \r
    >> character.  It wants to see either a NUL or a \n followed by a NUL.
    >> In this case, it sees the '0' which starts the next line.

    Andrew> I wonder if it's a unicode issue?

Shouldn't be.  The test case the submitter posted only uses ASCII.

Looking at the problem a bit, I see this call chain:

    Reader_iternext ->
        PyIter_Next ->
            file_iternext ->
                readahead_get_line_skip

readahead_get_line_skip notes the presence of \n and NUL-terminates the line
it returns, but not the presence of \r.

I see two possible solutions:

    1. See if readahead_get_line_skip should special-case \r when not
       followed by \n.  I think this may be the "most correct" approach.

    2. Change Reader_iternext to not rely on a NUL following the putative
       end-of-line character.

Skip

From skip at pobox.com  Mon Aug 18 05:35:52 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun, 17 Aug 2003 22:35:52 -0500
Subject: [Csv] Something's fishy w/ Mac line endings... 
Message-ID: <16192.18840.483109.460137@montanaro.dyndns.org>

I wrote:

    Looking at the problem a bit, I see this call chain:

        Reader_iternext ->
            PyIter_Next ->
                file_iternext ->
                    readahead_get_line_skip

On second thought, I think the problem may be that we're calling PyIter_Next
at all.  That's probably only supposed to work if the file is opened in text
mode.  Since we expect files to be opened in binary mode, Reader_iternext
should probably be doing its own EOL detection based upon the setting of the
lineterminator.  That's a lot of extra labor, but may be the correct
solution.
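
In pure Python the splitting I have in mind would look roughly like this
(just a sketch of the idea - the real work would live in _csv.c, and the
names here are made up):

    def iter_records(fileobj, lineterminator="\r\n", bufsize=8192):
        # Yield records split on an arbitrary terminator from a file
        # opened in binary mode.
        pending = ""
        while True:
            chunk = fileobj.read(bufsize)
            if not chunk:
                break
            pending += chunk
            pieces = pending.split(lineterminator)
            pending = pieces.pop()   # last piece may be incomplete
            for piece in pieces:
                yield piece
        if pending:
            yield pending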

Skip

From andrewm at object-craft.com.au  Mon Aug 18 06:20:33 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 18 Aug 2003 14:20:33 +1000
Subject: [Csv] Something's fishy w/ Mac line endings... 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<16192.16871.968296.398935@montanaro.dyndns.org> 
References: <16189.38378.352326.481821@montanaro.dyndns.org>
	<20030818010256.31A8E3CA49@coffee.object-craft.com.au>
	<16192.16871.968296.398935@montanaro.dyndns.org> 
Message-ID: <20030818042033.A219F3CA49@coffee.object-craft.com.au>

>    Andrew> I wonder if it's a unicode issue?
>
>Shouldn't be.  The test case the submitter posted only uses ASCII.

However, OS X deals with Unicode natively - the standard terminal window
interprets UTF-8 correctly and, presumably, can also generate it as input
to a character-mode application...

>readahead_get_line_skip notes the presence of \n and NUL-terminates the line
>it returns, but not the presence of \r.
>
>I see two possible solutions:
>
>    1. See if readahead_get_line_skip should special-case \r when not
>       followed by \n.  I think this may be the "most correct" approach.

I can't remember - is the EOL character a property of the Reader?

We need to do a more comprehensive update for Unicode (while making the
string handling 8 bit clean), but the most expedient fix is appropriate
for Python 2.3.1.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Mon Aug 18 08:35:12 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 18 Aug 2003 16:35:12 +1000
Subject: [Csv] Something's fishy w/ Mac line endings... 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<16192.18840.483109.460137@montanaro.dyndns.org> 
References: <16192.18840.483109.460137@montanaro.dyndns.org> 
Message-ID: <20030818063512.C7FAB3CA49@coffee.object-craft.com.au>

>I wrote:
>
>    Looking at the problem a bit, I see this call chain:
>
>        Reader_iternext ->
>            PyIter_Next ->
>                file_iternext ->
>                    readahead_get_line_skip
>
>On second thought, I think the problem may be that we're calling PyIter_Next
>at all.  That's probably only supposed to work if the file is opened in text
>mode.  Since we expect files to be opened in binary mode, Reader_iternext
>should probably be doing its own EOL detection based upon the setting of the
>lineterminator.  That's a lot of extra labor, but may be the correct
>solution.

I think the intention was that by using PyIter_Next, we'd get the advantage
of the universal EOL support in 2.3 - in which case, maybe we should drop
our own EOL detection...

I wonder if the user's problems go away when they open their file in text
mode?
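
That is, something like this (untested), letting 2.3's universal newline
support do the line splitting for us:

    import csv

    # 'U' (universal newlines, new in 2.3) makes the file object translate
    # \r, \n and \r\n to \n before the csv reader ever sees them
    for row in csv.reader(open("mac.csv", "rU")):
        print row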

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Mon Aug 18 17:27:41 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 18 Aug 2003 10:27:41 -0500
Subject: [Csv] Something's fishy w/ Mac line endings... 
In-Reply-To: <20030818063512.C7FAB3CA49@coffee.object-craft.com.au>
References: <16192.18840.483109.460137@montanaro.dyndns.org>
        <20030818063512.C7FAB3CA49@coffee.object-craft.com.au>
Message-ID: <16192.61549.621429.454836@montanaro.dyndns.org>


    Andrew> I think the intention was that by using PyIter_Next, we'd get
    Andrew> the advantage of the universal EOL support in 2.3 - in which
    Andrew> case, maybe we should drop our own EOL detection...

I think we would sacrifice 2.2 compatibility and the ability to set any eol
besides \n, \r\n or \r.

    Andrew> I wonder if the user's problems go away when they open their
    Andrew> file in text mode?

The author's test did open the files in text mode.  I added the 'b' to make
the test conform to our current expectations.

How hard would it be for you to modify _csv to not require a NUL after the
putative EOL character?

Skip

From skip at pobox.com  Mon Aug 18 17:32:45 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 18 Aug 2003 10:32:45 -0500
Subject: [Csv] Something's fishy w/ Mac line endings... 
In-Reply-To: <20030818042033.A219F3CA49@coffee.object-craft.com.au>
References: <16189.38378.352326.481821@montanaro.dyndns.org>
        <20030818010256.31A8E3CA49@coffee.object-craft.com.au>
        <16192.16871.968296.398935@montanaro.dyndns.org>
        <20030818042033.A219F3CA49@coffee.object-craft.com.au>
Message-ID: <16192.61853.960831.703844@montanaro.dyndns.org>


    Andrew> I can't remember - is the EOL character a property of the Reader?

It's a property of the dialect object.  Currently, I don't think we restrict
the lineterminator attribute, so it would probably be valid for it to be
":", \b or '47'.

    Andrew> We need to do a more comprehensive update for Unicode (while
    Andrew> making the string handling 8 bit clean), but the most expedient
    Andrew> fix is appropriate for Python 2.3.1.

Unfortunately, I think the correct fix is to not require a NUL following
every \r or \n character encountered.  I think that places the ball in your
court for the moment.  Can you evaluate how hard that would be?

I note that ReaderObj does contain a dialect field, so you do have access to
the lineterminator while reading the file.
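
For instance, nothing currently stops a dialect like this (untested sketch,
names invented):

    import csv

    class colon_eol(csv.excel):
        # arbitrary terminator, just to illustrate the lack of restriction
        lineterminator = ':'

    w = csv.writer(open("odd.csv", "wb"), dialect=colon_eol)
    w.writerow(["a", "b", "c"])     # writes "a,b,c:"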

Skip

From andrewm at object-craft.com.au  Tue Aug 19 04:15:03 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 19 Aug 2003 12:15:03 +1000
Subject: [Csv] Something's fishy w/ Mac line endings... 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<16192.61549.621429.454836@montanaro.dyndns.org> 
References: <16192.18840.483109.460137@montanaro.dyndns.org>
	<20030818063512.C7FAB3CA49@coffee.object-craft.com.au>
	<16192.61549.621429.454836@montanaro.dyndns.org> 
Message-ID: <20030819021503.7BFE83CA49@coffee.object-craft.com.au>

>    Andrew> I think the intention was that by using PyIter_Next, we'd get
>    Andrew> the advantage of the universal EOL support in 2.3 - in which
>    Andrew> case, maybe we should drop our own EOL detection...
>
>I think we would sacrifice 2.2 compatibility and the ability to set any eol
>besides \n, \r\n or \r.

I still think it's the right thing to do: there should only be one
line-splitting implementation in Python. If the user has conventions
that don't match, they're a) not dealing with a csv file, and b) able to
provide their own line iterator (which is a more general solution anyway).
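
Something like this (untested and naive - it slurps the whole file) would do:

    import csv

    def records(fileobj, terminator):
        # split the raw data on the caller's own record terminator and
        # hand the pieces to the reader as an iterable of strings
        return iter(fileobj.read().split(terminator))

    for row in csv.reader(records(open("data.csv", "rb"), ":")):
        print row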

And as there is no separate distribution of the new csv module, the 2.2
compatibility is pretty moot (you'd have to download 2.3 and extract
the module yourself).

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From skip at pobox.com  Tue Aug 19 05:31:05 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon, 18 Aug 2003 22:31:05 -0500
Subject: [Csv] Something's fishy w/ Mac line endings... 
In-Reply-To: <20030819021503.7BFE83CA49@coffee.object-craft.com.au>
References: <16192.18840.483109.460137@montanaro.dyndns.org>
        <20030818063512.C7FAB3CA49@coffee.object-craft.com.au>
        <16192.61549.621429.454836@montanaro.dyndns.org>
        <20030819021503.7BFE83CA49@coffee.object-craft.com.au>
Message-ID: <16193.39417.902830.721032@montanaro.dyndns.org>


    Andrew> And as there is no separate distribution of the new csv module,
    Andrew> the 2.2 compatibility is pretty moot (you'd have to download 2.3
    Andrew> and extract the module yourself).

One of the arguments for making new modules work with the previous minor
release is that they get adopted faster.  If people are stuck on 2.2.x for
some reason, they can still parse csv files without having to wait for 2.3
or change the way they do it once 2.3 is released.

There's also the problem that 2.3.1 is supposed to be a bugfix release.
Even though the csv module has only been around a short time and we aren't
likely to break much, if any, code, changing the semantics needs to be
considered carefully.  The assumption here is that to fix the bug properly
we have to change the module's semantics.

Also, what about writing?  If a user says they want Mac line endings, we
have to guarantee that, right?  That means for writing we still require
files be opened as 'wb', not 'wU', otherwise \r would get translated into
the platform's actual EOL sequence.
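
That is, something like (untested):

    import csv

    # binary mode so the platform doesn't rewrite line endings on the way out
    w = csv.writer(open("out.csv", "wb"), lineterminator="\r")
    w.writerow(["a", "b", "c"])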

Skip

From andrewm at object-craft.com.au  Tue Aug 19 05:56:50 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Tue, 19 Aug 2003 13:56:50 +1000
Subject: [Csv] Something's fishy w/ Mac line endings... 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<16193.39417.902830.721032@montanaro.dyndns.org> 
References: <16192.18840.483109.460137@montanaro.dyndns.org>
	<20030818063512.C7FAB3CA49@coffee.object-craft.com.au>
	<16192.61549.621429.454836@montanaro.dyndns.org>
	<20030819021503.7BFE83CA49@coffee.object-craft.com.au>
	<16193.39417.902830.721032@montanaro.dyndns.org> 
Message-ID: <20030819035650.DB9C83CA4A@coffee.object-craft.com.au>

>    Andrew> And as there is no separate distribution of the new csv module,
>    Andrew> the 2.2 compatibility is pretty moot (you'd have to download 2.3
>    Andrew> and extract the module yourself).
>
>One of the arguments for making new modules work with the previous minor
>release is that they get adopted faster.  If people are stuck on 2.2.x for
>some reason, they can still parse csv files and either not have to wait for
>2.3 or change the way they do that when 2.3 is released.
>
>There's also the problem that 2.3.1 is supposed to be a bugfix release.
>Even though the csv module has only been around a short time and we aren't
>likely to break much, if any, code, changing the semantics needs to be
>considered carefully.  The assumption here is that to fix the bug properly
>we have to change the module's semantics.
>
>Also, what about writing?  If a user says they want Mac line endings, we
>have to guarantee that, right?  That means for writing we still require
>files be opened as 'wb', not 'wU', otherwise \r would get translated into
>the platform's actual EOL sequence.

The problem is that our end of line processing is incompatible with the
use of an iterator as the source of input lines - there is no satisfactory
answer that allows us to retain both.

The requirement that the input file be opened in binary mode for what
is obviously a text format is going to be a never-ending source of surprise
for people using the module, and seems like a bigger wart than the one
we're now facing.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From sjmachin at lexicon.net  Thu Aug 21 00:59:24 2003
From: sjmachin at lexicon.net (sjmachin at lexicon.net)
Date: Thu, 21 Aug 2003 08:59:24 +1000
Subject: [Csv] Something's fishy w/ Mac line endings... 
In-Reply-To: <20030819035650.DB9C83CA4A@coffee.object-craft.com.au>
References: Message from Skip Montanaro <skip@pobox.com>
	<16193.39417.902830.721032@montanaro.dyndns.org> 
Message-ID: <3F4489EC.31161.BC1B6A@localhost>

On 19 Aug 2003 at 13:56, Andrew McNamara wrote:

> The problem is that our end of line processing is incompatible with the
> use of an iterator as the source of input lines - there is no satisfactory
> answer that allows us to retain both.

Using an iterator as a source of what? Lines, you say? The documentation says it 
"iterates over lines" [what does that mean?] and that the iterator should return "strings", 
without saying what they should contain, how they should be terminated, etc. See 
examples with commentary below.

> 
> The requirement that the input file be opened in binary mode for what
> is obviously a text format is going to a never ending source of suprise
> for people using the module, and seems like a bigger wart than the one
> we're now facing.
> 

I agree on the surprise factor with binary mode. It's not obvious what the purpose is. 
How does Excel on the Mac terminate lines in CSV files? CR or CRLF?

>>> alist= ['aaa,bbb,ccc', 'ddd,eee', 'fff']
>>> [x for x in csv.reader(alist)]
[['aaa', 'bbb', 'ccc'], ['ddd', 'eee'], ['fff']]
# so we don't need line terminators

>>> blist= ['aaa,bbb,ccc\n', 'ddd,eee\n', 'fff\n']
>>> [x for x in csv.reader(blist)]
[['aaa', 'bbb', 'ccc'], ['ddd', 'eee'], ['fff']]
# but if they are supplied, they are ignored

>>> clist= ['aaa,bbb\nccc\n', 'ddd,eee\n', 'fff\n']
>>> [x for x in csv.reader(clist)]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
_csv.Error: newline inside string
# except when embedded in an unquoted string/line

>>> dlist= ['aaa,"bbb\nccc",qqq\n', 'ddd,eee\n', 'fff\n']
>>> [x for x in csv.reader(dlist)]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
_csv.Error: newline inside string
# whoops, we really do have to pretend we are reading a file in
# *TEXT* mode (see next example)

>>> elist= ['aaa,"bbb\n', 'ccc",qqq\n', 'ddd,eee\n', 'fff\n']
>>> [x for x in csv.reader(elist)]
[['aaa', 'bbb\nccc', 'qqq'], ['ddd', 'eee'], ['fff']]
# Wow, how do we explain all that to J. Random Newbie?

From sdyer at dyermail.net  Wed Aug 20 21:40:39 2003
From: sdyer at dyermail.net (Shawn Dyer)
Date: Wed, 20 Aug 2003 14:40:39 -0500 (CDT)
Subject: [Csv] PEP 305
Message-ID: <13129.204.167.177.68.1061408439.squirrel@dyermail.net>

In studying the new CSV module, I find two problems, particularly in
interpreting csv files used for database import/export. Currently we use
our own csv parsing/writing utility, but would like to use the language
supported facility if possible.

1. When reading a field with adjacent delimiters (an empty field), your
code always maps that to an empty string. When interpreting DB output (at
least for DB2), an empty string is a pair of quotes. An empty field
represents NULL in the database and we parse that as the Python object
None (same result as from an SQL query). Using the csv module as is, an
empty string and None export identically. If this behavior were encoded
into the dialect, we could easily modify this behavior to suit our needs.

2. The other problem for my application is the differentiation between
numeric data and strings of numbers in the csv file (this again is related
to DB2 import/export files). Our needs are to map anything with quotes in
the csv to a string (even if it is numeric). Anything without quotes
should map to a Python numeric type (or, as mentioned above, None when
adjacent delimiters appear). Of course, this would imply the possibility
of a ValueError when reading a csv. Again, it seems this behavior could be
parameterized out into the dialect.
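
To illustrate the first point (untested), both of these rows come out of the
current writer looking identical:

    import csv, sys

    w = csv.writer(sys.stdout)
    w.writerow(['a', None, 'c'])   # a,,c  -- NULL from the database
    w.writerow(['a', '', 'c'])     # a,,c  -- empty string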

Possibly both items could be addressed by a map_to_python_object
parameter.

If you are interested in including these modifications, I can try to come
up with a patch.

From bdelmee at advalvas.be  Thu Aug 21 20:35:17 2003
From: bdelmee at advalvas.be (Bernard Delmée)
Date: Thu, 21 Aug 2003 20:35:17 +0200
Subject: [Csv] How to use a non-default delimiter with DictReader?
Message-ID: <003d01c36812$fa7b4af0$6702a8c0@shazam.be>

Hello,

I am not sure this is the right place to post - if not, please let me know.
Is this address dedicated to the development of the CSV
module, or does it also cover questions about using it?

Anyway, I can't seem to specify the delimiter when
building a DictReader().

I can do:

    inf = file('data.csv')
    rd = csv.reader( inf, delimiter=';' )
    for row in rd:
        # ...
        
But this is rejected:

    inf = file('data.csv')
    headers = inf.readline().split(';')
    rd = csv.DictReader( inf, headers, delimiter=';' )
    for row in rd:
        # ...

The DictReader constructor fails with a TypeError: 
__init__() got an unexpected keyword argument 'delimiter'
Maybe I am missing something here?

One rather convoluted workaround is the following:

    inf = file('data.csv')
    d = csv.Sniffer().sniff(inf.read())
    inf.seek(0)
    headers = inf.readline().split(';')
    rd = csv.DictReader( inf, headers, dialect=d )
    for row in rd:
        # ...
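
A less convoluted alternative (untested) seems to be a one-line dialect
subclass:

    import csv

    class semicolon(csv.excel):
        delimiter = ';'

    inf = file('data.csv')
    headers = inf.readline().split(';')
    rd = csv.DictReader( inf, headers, dialect=semicolon )
    for row in rd:
        # ...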

If DictReader does indeed not accept the optional "fmtparam"
keyword arguments, then at least the documentation needs fixing ;-)
But then again I may just be misreading it...

TIA,

Bernard.





From andrewm at object-craft.com.au  Fri Aug 22 02:58:49 2003
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Fri, 22 Aug 2003 10:58:49 +1000
Subject: [Csv] PEP 305 
In-Reply-To: Message from "Shawn Dyer" <sdyer@dyermail.net> 
	<13129.204.167.177.68.1061408439.squirrel@dyermail.net> 
References: <13129.204.167.177.68.1061408439.squirrel@dyermail.net> 
Message-ID: <20030822005849.3CA3B3CA4A@coffee.object-craft.com.au>

>In studying the new CSV module, I find two problems, particularly in
>interpreting csv files used for database import/export. Currently we use
>our own csv parsing/writing utility, but would like to use the language
>supported facility if possible.
>
>1. When reading a field with adjacent delimiters (an empty field), your
>code always maps that to an empty string. When interpreting DB output (at
>least for DB2), an empty string is a pair of quotes. An empty field
>represents NULL in the database and we parse that as the Python object
>None (same result as from an SQL query). Using the csv module as is, an
>empty string and None export identically. If this behavior were encoded
>into the dialect, we could easily modify this behavior to suit our needs.
>
>2. The other problem for my application, is the differentiation between
>numeric data and strings of numbers in the csv file (this again is related
>to DB2 import/export files). Our needs are to map anything with quotes in
>the csv to a string (even if it is numeric). Anything without quotes
>should map to a Python numeric type (or, as mentioned above, None when
>adjacent delimiters appear). Of course, this would imply the possibility
>of a ValueError when reading a csv. Again, it seems this behavior could be
>parameterized out into the dialect.
>
>Possibly both items could be addressed by a map_to_python_object
>parameter.

You raise valid points, and it's something we argued over for some time
when preparing the module for Python 2.3. I tend to agree that a switch
of some sort should enable this behaviour, but I suspect it will need
to be at least partially implemented in the underlying C parser (which
makes it a little less trivial).

As you note, there are two separate problems here - the first is that it
is impossible to distinguish between an empty field and an empty string:
this will need changes to the C parser. The second is that of typing the
results: I'm not convinced this belongs in the csv module - the database
user probably has a better idea of the required types than the csv module
could ever have. A layer on top of the csv parser that takes hints from
the database and casts columns to the appropriate type would be the best
option - possibly a list of type converters would be passed in.
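
Roughly this sort of thing (untested sketch, names invented):

    import csv

    def typed_rows(rows, converters):
        # apply one caller-supplied converter per column
        for row in rows:
            yield [conv(field) for conv, field in zip(converters, row)]

    # e.g. for a table declared as (INTEGER, VARCHAR, FLOAT):
    for row in typed_rows(csv.reader(open("dump.csv", "rb")),
                          [int, str, float]):
        print row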

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/