From skip at pobox.com  Tue Mar  2 16:23:38 2004
From: skip at pobox.com (Skip Montanaro)
Date: Tue, 2 Mar 2004 09:23:38 -0600
Subject: [Csv] Re: csv bugs
In-Reply-To: <slrnc48oph.8ob.mlh@furu.idi.ntnu.no>
References: <slrnc48oph.8ob.mlh@furu.idi.ntnu.no>
Message-ID: <16452.42746.753582.600239@montanaro.dyndns.org>


(A better place for this discussion would probably be csv at mail.mojam.com.
I'm adding it to the cc list.)

    Magnus> It seems that when a line termination is escaped (using the
    Magnus> current escape character), csv.reader treats it as a line
    Magnus> continuation, which is well and good -- but it doesn't discard
    Magnus> the escape character; instead, it escapes it implicitly. This
    Magnus> seems like a bug to me. E.g.

    Magnus>   foo:bar:baz\
    Magnus>   frozz:bozz

    Magnus> with separator ':' and escape character '\\' is parsed into

    Magnus>   ['foo', 'bar', 'baz\\\nfrozz', 'bozz']

    Magnus> In my opinion, it *ought* to be parsed into

    Magnus>   ['foo', 'bar', 'baz\nfrozz', 'bozz']

    Magnus> As far as I know, this is the UNIX convention, as used in (e.g.)
    Magnus> /etc/passwd.

That may be; however, development of the csv module's parser was driven by
how Microsoft Excel behaves.  The assumption was (rightly I think) that
Excel reads or writes more CSV files than anything else.  I don't believe it
does anything with backslashes.
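
The report above can be checked with a short script. This is a minimal
sketch, assuming a current CPython 3 csv module; on the releases this was
written against, the escape character in front of a newline is dropped, so
the 2004 result quoted above may not reproduce. The input is Magnus's own
example, passed as a list of physical lines.

```python
import csv

# Magnus's example: the first physical line ends in an escaped newline.
lines = ["foo:bar:baz\\\n", "frozz:bozz\n"]

reader = csv.reader(lines, delimiter=":", escapechar="\\",
                    quoting=csv.QUOTE_NONE)
rows = list(reader)

# Print the repr of each field so a retained backslash would be visible.
for row in rows:
    print([repr(field) for field in row])
```

Printing the repr of each field makes it easy to tell 'baz\\\nfrozz' (escape
kept) from 'baz\nfrozz' (escape dropped).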

    Magnus> Am I off target here? If the current behaviour is desirable
    Magnus> (although I can't see why it should be) then at least I think
    Magnus> there should be a way of implementing "normal" line
    Magnus> continuations (as in my example), which is the standard UNIX
    Magnus> behavior, and the behavior of Python source, for that
    Magnus> matter. Otherwise, csv can't be used to parse (e.g.)
    Magnus> /etc/passwd...

You're welcome to submit a patch.  I don't have time for it.

    Magnus> And another thing: Perhaps a 'passwd' dialect could be added
    Magnus> alongside 'excel'? Something like:

    Magnus> class passwd(Dialect):
    Magnus>     delimiter = ':'
    Magnus>     doublequote = False
    Magnus>     escapechar = '\\'
    Magnus>     lineterminator = '\n'
    Magnus>     quotechar = '?'
    Magnus>     quoting = QUOTE_NONE
    Magnus>     skipinitialspace = False
    Magnus> register_dialect("passwd", passwd)

I'll take a look at that.
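
For reference, the proposed dialect can be tried as a self-contained script.
This is a sketch of the suggestion as posted, not something shipped with the
csv module; the sample line is made up for illustration.

```python
import csv

class passwd(csv.Dialect):
    """Colon-separated, escape-only dialect as proposed in this thread."""
    delimiter = ":"
    doublequote = False
    escapechar = "\\"
    lineterminator = "\n"
    quotechar = "?"            # placeholder; unused because quoting is off
    quoting = csv.QUOTE_NONE
    skipinitialspace = False

csv.register_dialect("passwd", passwd)

# Made-up sample line in /etc/passwd layout.
sample = ["root:x:0:0:root:/root:/bin/bash\n"]
parsed = list(csv.reader(sample, dialect="passwd"))
print(parsed)
```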

    Magnus> For some reason you *have* to supply a quotechar, even if you
    Magnus> set QUOTE_NONE... I guess that's a bug too, in my book.

Maybe.  Maybe just a feature.

    Magnus> If there are no objections, I might submit some of this as a bug
    Magnus> report or two (or even a patch).

Please do.

Skip

From magnus at hetland.org  Tue Mar  2 18:24:46 2004
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Tue, 2 Mar 2004 18:24:46 +0100
Subject: [Csv] Re: csv bugs
Message-ID: <20040302172446.GA17004@idi.ntnu.no>

> (A better place for this discussion would probably be
> csv at mail.mojam.com.  I'm adding it to the cc list.)

Ah -- sorry. I wasn't aware of the list. I've subscribed now.

[snip]

> That may be; however, development of the csv module's parser was
> driven by how Microsoft Excel behaves.

But wasn't allowing "full" customization also a driving force?

> The assumption was (rightly I think) that Excel reads or writes more
> CSV files than anything else. I don't believe it does anything with
> backslashes.

I'm sure you're right. The point is that the csv module supports
escape characters, and I believe the thing I pointed out is a missing
piece of functionality for those.

In other words: The Excel dialect uses quoting to deal with in-field
separators, quotes and newlines. The passwd dialect uses escapes to
deal with these. *However*, the csv module only supports dealing with
separators and escape characters using the escape character (quotes
are a non-issue, of course), not newlines. In other words, if you
choose to use an escape character rather than quotes, you can't have
newlines in your fields.

Almost, anyway. The fact is, as far as I can see, that you *can*
escape newlines, but in that special case, the escape character
*isn't* removed (as it is when you escape separators or escape
characters). This seems inconsistent, and has nothing to do with
backslashes in particular, just how escape characters should behave.
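
The two cases where the escape character is removed can be illustrated
with a small helper. This is a minimal sketch; the function name parse is
only for illustration, and ':' and '\\' are the separator and escape
character used in the examples of this thread.

```python
import csv

def parse(line):
    """Parse one colon-separated line with backslash as the escape character."""
    reader = csv.reader([line], delimiter=":", escapechar="\\",
                        quoting=csv.QUOTE_NONE)
    return next(reader)

escaped_separator = parse("a\\:b:c")    # backslash before the separator
escaped_escape = parse("a\\\\b:c")      # backslash before a backslash
print(escaped_separator)
print(escaped_escape)
```

In both cases the escape character itself does not appear in the resulting
field; only the character it protected does.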

[snip]
> You're welcome to submit a patch.  I don't have time for it.

OK -- I guess I'm mainly looking for some feedback about whether this
seems like a reasonable behavior. (I'm quite thoroughly convinced that
it is, but I may very well be wrong :)

I haven't looked at the C implementation, so no promises about a patch
there... :/

> > And another thing: Perhaps a 'passwd' dialect could be added
> > alongside 'excel'? Something like:
[snip]
> I'll take a look at that.

Not sure about setting the quote character to '?' here, but since it
doesn't matter and you need to have one, it seemed like a natural
choice. (None wasn't allowed.)

> > For some reason you *have* to supply a quotechar, even if you
> > set QUOTE_NONE... I guess that's a bug too, in my book.
>
> Maybe.  Maybe just a feature.

Well, maybe ;)

But if you don't need an escape character when you're using quotes, I
don't think you should need quotes when you're using an escape
character.
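
A quick way to test whether a quote character is still required is to pass
quotechar=None together with QUOTE_NONE. This is a sketch, assuming a
current CPython 3, where that combination appears to be accepted; the
module discussed in this thread rejected None.

```python
import csv

# A double quote in the data is ordinary text when quoting is switched off.
lines = ['say "hi":second\n']
reader = csv.reader(lines, delimiter=":", escapechar="\\",
                    quotechar=None, quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows)
```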

Then again: I guess you do use an escape character (i.e. a double
quote) in the quoted mode as well, which may be what's complicating
the semantics and confusing me. Not sure how

  "foo "
  bar"

should be interpreted, for example. In this case removing the quote
may not make sense.

And... Adding another switch (or something) dictating the behavior of
the escape character doesn't seem good...

> Skip

-- 
Magnus Lie Hetland           "The mind is not a vessel to be filled,
http://hetland.org            but a fire to be lighted."  [Plutarch]

From mamo19 at handelsbanken.se  Fri Mar  5 17:48:54 2004
From: mamo19 at handelsbanken.se (Marjaneh Mojaverian)
Date: Fri, 5 Mar 2004 17:48:54 +0100
Subject: [Csv] PEP 305
Message-ID: <OFC9AF8E98.C8EB7CF7-ONC1256E4E.005C43E6-C1256E4E.005C5D4C@notes.handelsbanken.se>

I would like to know which extension library to use if I need
to use the csv module in an external application.

Regards

Marjaneh

From magnus at hetland.org  Sat Mar 13 16:48:08 2004
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Sat, 13 Mar 2004 16:48:08 +0100
Subject: [Csv] Thoughts about a patch
Message-ID: <20040313154808.GA8421@idi.ntnu.no>

(First some background -- see bottom for my suggestion.)

I mentioned offering a patch to support Unix-style password syntax.
I've had a look at the _csv.c code, and it seems quite possible to do.
I've also had the following pointed out to me:

http://groups.google.com/groups?selm=vsb89q1d3n5qb1%40corp.supernews.com

I guess I could take a look at some of the issues there too, while I'm
at it (since they seem related)?

The syntax I'm thinking about is described in ESR's newest book (The
Art of Unix Programming):

http://www.catb.org/~esr/writings/taoup/html/ch05s02.html#id2901882

Basically there is no quoting, only escaping. Also, escaped newlines
are ignored (or so he says -- I expect the convention here would be to
treat it as a single space character) and it is possible to include
C-style backslash escapes (just like in Python strings; \n is a
newline and so forth).

The current behavior of _csv.c is simply to put in the escape
character verbatim, unless it precedes another escape character, a
delimiter, or a quote character. So I guess it's only necessary to
expand slightly the 'case ESCAPED_CHAR:' bit.

Suggestion (ignoring the bugs reported in the Usenet post above, for
now):

Don't go all-out on this. Simply interpret '\\\n' as '\n', just like
we interpret '\\:' as ':' (if ':' is the field separator). After all,
'\n' (or, in general, the record separator) is just as much a special
character in need of quoting as the other three (escape, delimiter,
and quote character).

C-style escapes, however, aren't as integral to the CSV language --
they can be handled afterward, when interpreting the contents.
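
Handling C-style escapes afterward could look roughly like this. It is a
sketch only: decode_c_escapes is a hypothetical helper, not part of the csv
module, and it assumes the parser has left the backslashes of such
sequences in the field. Only a few escapes are mapped; unknown ones are
left untouched.

```python
import re

# Small, deliberately incomplete table of C-style escapes.
_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\"}

def decode_c_escapes(field):
    """Replace backslash sequences such as \\n or \\t in an already parsed field."""
    def replace(match):
        # Unknown sequences are returned unchanged.
        return _ESCAPES.get(match.group(1), match.group(0))
    return re.sub(r"\\(.)", replace, field)

print(repr(decode_c_escapes("col1\\tcol2")))
print(repr(decode_c_escapes("literal \\\\n stays")))
```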

Does this seem OK? It would mean slight backward-breakage, but it
seems odd that someone should have escaped newlines and still wanted
the escape character to be left in place, doesn't it?

If this seems OK I'll be happy to write up a patch.

-- 
Magnus Lie Hetland           "The mind is not a vessel to be filled,
http://hetland.org            but a fire to be lighted."  [Plutarch]

From magnus at hetland.org  Sat Mar 13 16:54:04 2004
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Sat, 13 Mar 2004 16:54:04 +0100
Subject: [Csv] Thoughts about a patch
Message-ID: <20040313155404.GA9516@idi.ntnu.no>

I guess I just haven't understood the code well enough yet, but in the
parsing code there are comparisons of the type

  if (c == '\n')

I suppose the newlines are normalized versions of lineterminator? In
other words, no matter what the line terminator is, it is safe to
pretend that it has been changed to '\n' in the parsing case
statement? Or? (I mean, I've tried to use lineterminator='|' and that
worked just nicely, but I don't see the use of lineterminator in the
case statement anywhere.)

-- 
Magnus Lie Hetland           "The mind is not a vessel to be filled,
http://hetland.org            but a fire to be lighted."  [Plutarch]

From andrewm at object-craft.com.au  Mon Mar 15 00:21:58 2004
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 15 Mar 2004 10:21:58 +1100
Subject: [Csv] Thoughts about a patch 
In-Reply-To: Message from Magnus Lie Hetland <magnus@hetland.org> 
   of "Sat, 13 Mar 2004 16:48:08 BST." <20040313154808.GA8421@idi.ntnu.no> 
References: <20040313154808.GA8421@idi.ntnu.no> 
Message-ID: <20040314232158.7C7543C0BA@coffee.object-craft.com.au>

>Don't go all-out on this. Simply interpret '\\\n' as '\n', just like
>we interpret '\\:' as ':' (if ':' is the field separator). After all,
>'\n' (or, in general, the record separator) is just as much a special
>character in need of quoting as the other three (escape, delimiter,
>and quote character).

I guess that sounds reasonable. 

It's often very difficult to make changes to code that is in the standard
distribution - there always seems to be someone relying on the previous
behaviour... 8-)

You might want to make sure that, inside quotes, the special meaning of
the escape character is removed (on the basis that Excel uses quotes
exclusively, with no escape character). However, I suspect we didn't get
this right and still honour the escape within a quoted string. If you
find that we do, your change should too (to remain consistent).

Did that make any sense?

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From andrewm at object-craft.com.au  Mon Mar 15 00:26:51 2004
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 15 Mar 2004 10:26:51 +1100
Subject: [Csv] Thoughts about a patch 
In-Reply-To: Message from Magnus Lie Hetland <magnus@hetland.org> 
   of "Sat, 13 Mar 2004 16:54:04 BST." <20040313155404.GA9516@idi.ntnu.no> 
References: <20040313155404.GA9516@idi.ntnu.no> 
Message-ID: <20040314232651.E48C43C0BA@coffee.object-craft.com.au>

>I guess I just haven't understood the code well enough yet, but in the
>parsing code there are comparisons of the type
>
>  if (c == '\n')
>
>I suppose the newlines are normalized versions of lineterminator? In
>other words, no matter what the line terminator is, it is safe to
>pretend that it has been changed to '\n' in the parsing case
>statement? Or? (I mean, I've tried to use lineterminator='|' and that
>worked just nicely, but I don't see the use of lineterminator in the
>case statement anywhere.)

One thing to bear in mind is the history of the CSV module - it dates back
to Python 1.5 times, when Python didn't have universal newline support.

If I remember correctly, lineterminator is only used when generating CSV
output, not when parsing input. On input, the value of lineterminator
is ignored, and \r and \n are hard-coded.
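
The asymmetry is easy to see by writing with a custom lineterminator and
reading the result back with the same parameter. A minimal sketch, using
the '|' terminator from Magnus's experiment:

```python
import csv
import io

# The writer honours lineterminator.
buf = io.StringIO()
csv.writer(buf, lineterminator="|").writerows([["a", "b"], ["c", "d"]])
written = buf.getvalue()
print(repr(written))

# The reader accepts the parameter but does not split records on it.
rows = list(csv.reader([written], lineterminator="|"))
print(rows)
```

The reader returns a single row in which '|' is ordinary field content.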

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From magnus at hetland.org  Mon Mar 15 08:44:33 2004
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Mon, 15 Mar 2004 08:44:33 +0100
Subject: [Csv] Thoughts about a patch
In-Reply-To: <20040314232651.E48C43C0BA@coffee.object-craft.com.au>
References: <20040313155404.GA9516@idi.ntnu.no>
	<20040314232651.E48C43C0BA@coffee.object-craft.com.au>
Message-ID: <20040315074433.GA16067@idi.ntnu.no>

Andrew McNamara <andrewm at object-craft.com.au>:
>
> >I guess I just haven't understood the code well enough yet, but in the
> >parsing code there are comparisons of the type
> >
> >  if (c == '\n')
> >
> >I suppose the newlines are normalized versions of lineterminator? In
> >other words, no matter what the line terminator is, it is safe to
> >pretend that it has been changed to '\n' in the parsing case
> >statement? Or? (I mean, I've tried to use lineterminator='|' and that
> >worked just nicely, but I don't see the use of lineterminator in the
> >case statement anywhere.)
> 
> One thing to bear in mind is the history of the CSV module - it
> dates back to Python 1.5 times, when Python didn't have universal
> newline support.

I see. Even so -- I don't see how universal newline support is needed
for this...?

> If I remember correctly, lineterminator is only used when generating CSV
> output, not when parsing input. On input, the value of lineterminator
> is ignored, and \r and \n are hard-coded.

Oh -- how unfortunate :]

Is this documented in the PEP/standard docs? I've just browsed them,
but couldn't find the distinction between parameters that affect
reading and those affecting writing. To quote the PEP:

  "In addition to the dialect argument, both the reader and writer
   constructors take several specific formatting parameters, specified
   as keyword parameters."

One of the parameters listed under this (which, then, applies to the
reader) is:

  "lineterminator specifies the character sequence which should
   terminate rows."

It seems highly natural to me that reader and writer should be
completely symmetrical here -- i.e. you should *definitely* be able to
read back your own output, using the same Dialect (IMO).

(I do see something hinting at this problem in item 5 of the issue
list, though.)

I guess I had my eyes crossed when I did my experiment with
lineterminator set to '|' -- I thought it worked when reading, but
you're right -- it doesn't.

In other words, a potential patch should probably also add support for
parsing arbitrary line terminators -- or?
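
Until the reader supports this, one possible workaround is to split the
input on the terminator before handing it to csv.reader, which accepts any
iterable of strings. This is a sketch with a hypothetical helper,
split_records; it assumes the terminator never occurs inside a field,
quoted or escaped.

```python
import csv

def split_records(text, terminator):
    """Return the pieces of text that lie between occurrences of terminator."""
    pieces = text.split(terminator)
    if pieces and pieces[-1] == "":
        pieces.pop()    # drop the empty piece after a trailing terminator
    return pieces

data = "a,b|c,d|"
rows = list(csv.reader(split_records(data, "|")))
print(rows)
```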

It could, of course, be that I should simply write the parsing code
into my own projects in Python. It just seems a shame not to use the
csv module when it exists. It seems to sit on the brink of generality,
just a tad biased toward the Microsoft dialect (which was, I gather,
part of the original design goals).

-- 
Magnus Lie Hetland           "The mind is not a vessel to be filled,
http://hetland.org            but a fire to be lighted."  [Plutarch]

From magnus at hetland.org  Mon Mar 15 09:09:45 2004
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Mon, 15 Mar 2004 09:09:45 +0100
Subject: [Csv] Thoughts about a patch
In-Reply-To: <20040314232158.7C7543C0BA@coffee.object-craft.com.au>
References: <20040313154808.GA8421@idi.ntnu.no>
	<20040314232158.7C7543C0BA@coffee.object-craft.com.au>
Message-ID: <20040315080945.GC16067@idi.ntnu.no>

Andrew McNamara <andrewm at object-craft.com.au>:
>
> >Don't go all-out on this. Simply interpret '\\\n' as '\n', just like
> >we interpret '\\:' as ':' (if ':' is the field separator). After all,
> >'\n' (or, in general, the record separator) is just as much a special
> >character in need of quoting as the other three (escape, delimiter,
> >and quote character).
> 
> I guess that sounds reasonable. 

OK. Now, this applies to reading, so it would imply making
lineterminator work for readers as well.

> It's often very difficult to make changes to code that is in the
> standard distribution - there always seems to be someone relying on
> the previous behaviour... 8-)

Yes, indeed. I've been thinking about that. Perhaps there should be
some flag or mode or something that decides how things work? For
example, there could be a "compatibility" flag that is True by
default; or there could be an "ESCAPE_ONLY" value for quoting... Or
even separate functions or a separate submodule... I don't know.

It seems that, perhaps, even though this is a relatively minor issue,
it might warrant a PEP...?

> You might want to make sure that, inside quotes, the special meaning
> of the escape character is removed (on the basis that Excel uses
> quotes exclusively, with no escape character).

Hm. How about a quoted field like this, then?

  "Foo bar \" baz"

With '"' as quotechar and '\\' as escapechar. Wouldn't it be natural
to allow this, and to interpret '\\"' as '"'? I mean, if you *didn't*
want this behavior, you'd set escapechar to None -- or?

> However - I suspect we didn't get this right, and still honour the
> escape within a quoted string - if you find that we still honour the
> escape within a quoted string, your change should too (to remain
> consistent).

I'm not sure exactly how you mean it should behave. I understand that,
for example

  "foo \, bar"

should become

  ['foo \\, bar']

and not

  ['foo , bar']

But still,

  "foo \" bar"

should become

  ['foo " bar']

in my opinion. Don't you agree?

However, as it is, "foo \, bar" is interpreted as ['foo , bar'].
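
Both quoted-field cases can be reproduced with the default (Excel) dialect
plus an escape character. A minimal sketch; parse is an illustrative
helper name.

```python
import csv

def parse(line):
    """Parse one line with the default dialect and backslash as escapechar."""
    return next(csv.reader([line], escapechar="\\"))

escaped_comma = parse('"foo \\, bar"')   # escape before a comma, inside quotes
escaped_quote = parse('"foo \\" bar"')   # escape before a quote, inside quotes
print(escaped_comma)
print(escaped_quote)
```

In both cases the escape is honoured inside the quoted string and the
backslash is dropped.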

It almost seems like this should be dialect-dependent -- but, then
again, lots of interacting parameters is a recipe for (combinatorial)
disaster. (And the vagueness and complexity of the Microsoft CSV
dialect isn't helping :)

> Did that make any sense?

Sure. I think the core issue, IMO, is what the escape character really
means, and whether that meaning can be constant or whether it must
depend on something else. 

OTOH: It could be possible to say that the behavior when using quoting
*and* an escape character together is undefined -- that quoting and
escaping are two mutually exclusive ways of dealing with separators
(both field and record (i.e. line) separators) in fields.

Does that seem reasonable? One could even issue a warning if the user
has quotechar and escapechar set at the same time, maybe? Then we'd
get away from the pesky interactions between the two... (Similar
warnings would apply to doublequote, of course.)

And the behavior of the escape character, when quotes are out of the
picture, could be defined as something like: "when preceding either
separator, lineterminator or escapechar, the escapechar is removed and
the separator/lineterminator/escapechar is included verbatim in the
field."
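
As a rough illustration of that definition, here is a pure-Python sketch
of an escape-only parser. The function parse_escaped and its parameters
are hypothetical and are not how _csv.c is written. An escape character in
front of anything other than the three special sequences is kept verbatim,
which is the backward-compatible choice.

```python
def parse_escaped(text, delimiter=":", escapechar="\\", lineterminator="\n"):
    """Split text into records of fields using escaping only (no quoting)."""
    specials = (lineterminator, delimiter, escapechar)
    records, fields, field = [], [], []
    i = 0
    while i < len(text):
        if text.startswith(escapechar, i):
            rest = i + len(escapechar)
            for special in specials:
                if text.startswith(special, rest):
                    field.append(special)      # drop escape, keep escaped text
                    i = rest + len(special)
                    break
            else:
                field.append(escapechar)       # not special: keep verbatim
                i = rest
        elif text.startswith(lineterminator, i):
            fields.append("".join(field))      # end of record
            records.append(fields)
            fields, field = [], []
            i += len(lineterminator)
        elif text.startswith(delimiter, i):
            fields.append("".join(field))      # end of field
            field = []
            i += len(delimiter)
        else:
            field.append(text[i])
            i += 1
    if field or fields:                        # last record had no terminator
        fields.append("".join(field))
        records.append(fields)
    return records

print(parse_escaped("foo:bar:baz\\\nfrozz:bozz\n"))
```

Because the lineterminator is matched as a whole string, a multi-character
terminator such as '\r\n' is escaped by a single escape character in this
sketch.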

There would still be two remaining issues, however:

 1. How should an escapechar preceding some *other* character be
    interpreted? The most backward-compatible approach would simply be
    to include the escape character verbatim -- but then escaping the
    escape character becomes redundant. It would also make it hard to
    interpret special sequences such as \n or \t for the client code,
    because the backslash in these sequences would end up at the same
    "escape level" as the \\. For example,

      foo \\n bar \n

    would be read in as "foo \n bar \n" -- and the client code
    couldn't tell the two apart. Not good.
 
 2. Is it really okay for an escape character to escape a
    multi-character sequence? If it is to escape the lineterminator,
    it must work for multi-character sequences such as '\r\n'. This
    *might* lead to confusion, as the convention for escape characters
    is to escape only the following character.

A possibility is to let the escape character mean "reproduce the
following character verbatim and remove me, no matter what". Then '\n'
and '\t' would simply mean 'n' and 't' -- possibly surprising -- and
each character in the line terminator would be escaped separately.
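
The "remove me, no matter what" rule can be compared against whatever
interpreter is at hand. This is a sketch, assuming a current CPython 3
csv module, whose reader appears to follow this rule; the helper name
parse is illustrative.

```python
import csv

def parse(line):
    """Parse one colon-separated line with backslash as the escape character."""
    reader = csv.reader([line], delimiter=":", escapechar="\\",
                        quoting=csv.QUOTE_NONE)
    return next(reader)

plain_n = parse("foo \\n bar")         # backslash before an ordinary 'n'
escaped_slash = parse("foo \\\\n bar") # escaped backslash followed by 'n'
print(plain_n)
print(escaped_slash)
```

Under this rule client code can tell the two inputs apart, at the price of
'\n' in the data meaning a plain 'n'.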

Oh, well. Maybe I should just go with XML after all. <sigh/wink>

-- 
Magnus Lie Hetland           "The mind is not a vessel to be filled,
http://hetland.org            but a fire to be lighted."  [Plutarch]

From magnus at hetland.org  Mon Mar 15 09:20:45 2004
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Mon, 15 Mar 2004 09:20:45 +0100
Subject: [Csv] Thoughts about a patch
In-Reply-To: <20040315080945.GC16067@idi.ntnu.no>
References: <20040313154808.GA8421@idi.ntnu.no>
	<20040314232158.7C7543C0BA@coffee.object-craft.com.au>
	<20040315080945.GC16067@idi.ntnu.no>
Message-ID: <20040315082045.GA19927@idi.ntnu.no>

Magnus Lie Hetland <magnus at hetland.org>:
[snip]
> Does that seem reasonable? One could even issue a warning if the user
> has quotechar and escapechar set at the same time, maybe? Then we'd
> get away from the pesky interactions between the two... (Similar
> warnings would apply to doublequote, of course.)
> 
> And the behavior of the escape character, when quotes are out of the
> picture, could be defined as something like: "when preceding either
> separator, lineterminator or escapechar, the escapechar is removed and
> the separator/lineterminator/escapechar is included verbatim in the
> field."

A possible "strictification" would be to disallow the use of the
escape character *except* in the places where it has an obvious
meaning (in front of a field/record separator or another escape char).

Still not sure how to escape with multi-character line terminators,
though.

-- 
Magnus Lie Hetland           "The mind is not a vessel to be filled,
http://hetland.org            but a fire to be lighted."  [Plutarch]