From tom.brown.code at gmail.com  Sun Oct 19 01:15:01 2008
From: tom.brown.code at gmail.com (Tom Brown)
Date: Sat, 18 Oct 2008 16:15:01 -0700
Subject: [Csv] skipfinalspace
Message-ID: <9789242b0810181615w5b1325c6h2149990854cff83d@mail.gmail.com>

Hello python csv gurus!
I use the csv module pretty heavily in
http://code.google.com/p/googletransitdatafeed/source/browse/trunk/python/transitfeed.py
and someone recently complained that it doesn't handle white space before
and after fields. I can fix this with skipinitialspace and a little
post-processing to remove trailing whitespace, but thought it would be nice
to add skipfinalspace to the csv module.
We have 476 feeds generated by different tools (plugins to various
proprietary software, Python's csv, by hand, Excel, ...). Of these 9 have a
space after fields in the header and 22 have spaces before fields in the
header.

I downloaded the 2.6 source tar ball, but is it too late for new features to
get into versions <3?

How would you feel about adding the following tests to Lib/test/test_csv.py
and getting them to pass?

Also http://www.python.org/doc/2.5.2/lib/csv-fmt-params.html says
"*skipinitialspace *When True, whitespace immediately following the
delimiter is ignored."
but my tests show whitespace at the start of any field is ignored, including
the first field.

Thanks,
Tom

class TestDialectOption(TestCsvBase):
    @staticmethod
    def makeDialect(dct):
      name = "dialect-%d" % (hash(tuple(dct.items())))
      return type(name, (csv.excel, object), dct)()

    def test_no_skip(self):
      self.dialect = self.makeDialect({})
      self.readerAssertEqual(' foo,bar', [[' foo', 'bar']])
      self.readerAssertEqual('foo, bar', [['foo', ' bar']])
      self.readerAssertEqual(' foo, bar', [[' foo', ' bar']])
      self.readerAssertEqual(' foo , bar', [[' foo ', ' bar']])
      self.readerAssertEqual(' foo , bar ', [[' foo ', ' bar ']])

    def test_skip_initial(self):
      self.dialect = self.makeDialect({"skipinitialspace": True})
      self.readerAssertEqual(' foo,bar', [['foo', 'bar']])
      self.readerAssertEqual('foo, bar', [['foo', 'bar']])
      self.readerAssertEqual(' foo, bar', [['foo', 'bar']])
      self.readerAssertEqual(' foo , bar', [['foo ', 'bar']])
      self.readerAssertEqual(' foo , bar ', [['foo ', 'bar ']])

    def test_skip_final(self):
      self.dialect = self.makeDialect({"skipfinalspace": True})
      self.readerAssertEqual(' foo,bar', [[' foo', 'bar']])
      self.readerAssertEqual('foo, bar', [['foo', ' bar']])
      self.readerAssertEqual(' foo, bar', [[' foo', ' bar']])
      self.readerAssertEqual(' foo , bar', [[' foo', ' bar']])
      self.readerAssertEqual(' foo , bar ', [[' foo', ' bar']])

    def test_skip_both(self):
      self.dialect = self.makeDialect({"skipinitialspace": True,
                                       "skipfinalspace": True})
      self.readerAssertEqual(' foo,bar', [['foo', 'bar']])
      self.readerAssertEqual('foo, bar', [['foo', 'bar']])
      self.readerAssertEqual(' foo, bar', [['foo', 'bar']])
      self.readerAssertEqual(' foo , bar', [['foo', 'bar']])
      self.readerAssertEqual(' foo , bar ', [['foo', 'bar']])
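
The workaround mentioned above (skipinitialspace plus a little
post-processing) could look roughly like this. It is only a sketch, assuming
Python 3; read_stripped is a hypothetical helper name and not part of the csv
module.

```python
import csv

def read_stripped(lines, **fmtparams):
    """Yield rows with spaces removed from both ends of every field.

    Leading spaces are skipped by the csv module itself; trailing spaces
    are removed afterwards. read_stripped is a hypothetical helper name.
    """
    reader = csv.reader(lines, skipinitialspace=True, **fmtparams)
    for row in reader:
        # Caveat: this also strips trailing spaces that were inside quotes.
        yield [field.rstrip(" ") for field in row]

rows = list(read_stripped([" foo , bar "]))
print(rows)
```

Because the trailing spaces are stripped after parsing, a quoted field such as
"foo " loses its trailing space too, which a parser-level skipfinalspace
option might choose to preserve.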

From andrewm at object-craft.com.au  Mon Oct 20 01:46:04 2008
From: andrewm at object-craft.com.au (Andrew McNamara)
Date: Mon, 20 Oct 2008 10:46:04 +1100
Subject: [Csv] skipfinalspace
In-Reply-To: <9789242b0810181615w5b1325c6h2149990854cff83d@mail.gmail.com> 
References: <9789242b0810181615w5b1325c6h2149990854cff83d@mail.gmail.com>
Message-ID: <20081019234604.B70DB59C001@longblack.object-craft.com.au>

>I downloaded the 2.6 source tar ball, but is it too late for new features to
>get into versions <3?

Yep.

>How would you feel about adding the following tests to Lib/test/test_csv.py
>and getting them to pass?
>
>Also http://www.python.org/doc/2.5.2/lib/csv-fmt-params.html says
>"*skipinitialspace *When True, whitespace immediately following the
>delimiter is ignored."
>but my tests show whitespace at the start of any field is ignored, including
>the first field.

I suspect (but I haven't checked) that it means "after the delimiter and
before any quoted field" (or some variation on that).

All of the "dialect" parameters are there to allow parsing of a specific
common form of CSV file. Because there is no formal definition of the
format, the module simply aims to parse (and produce the same result)
as common applications such as Excel and Access. Changing the behaviour
in any non-backwards compatible way is sure to get screams of anguish
from many users. Even when the behaviour appears to be a bug, you can
be sure people are counting on it working like that.

BTW, this discussion probably should move to python-dev.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/

From tom.brown.code at gmail.com  Mon Oct 20 07:06:51 2008
From: tom.brown.code at gmail.com (Tom Brown)
Date: Sun, 19 Oct 2008 22:06:51 -0700
Subject: [Csv] skipfinalspace
In-Reply-To: <9789242b0810192154w557dedd8seba60c3deb168f12@mail.gmail.com>
References: <9789242b0810181615w5b1325c6h2149990854cff83d@mail.gmail.com>
	<20081019234604.B70DB59C001@longblack.object-craft.com.au>
	<9789242b0810192154w557dedd8seba60c3deb168f12@mail.gmail.com>
Message-ID: <9789242b0810192206n318fb7cao8bba9341695af053@mail.gmail.com>

(Continuing thread started at
http://mail.python.org/pipermail/csv/2008-October/000688.html)
On Sun, Oct 19, 2008 at 16:46, Andrew McNamara
<andrewm at object-craft.com.au> wrote:

> >I downloaded the 2.6 source tar ball, but is it too late for new features
> to
> >get into versions <3?
>
> Yep.
>
> >How would you feel about adding the following tests to
> Lib/test/test_csv.py
> >and getting them to pass?
> >
> >Also http://www.python.org/doc/2.5.2/lib/csv-fmt-params.html says
> >"*skipinitialspace *When True, whitespace immediately following the
> >delimiter is ignored."
> >but my tests show whitespace at the start of any field is ignored,
> including
> >the first field.
>
> I suspect (but I haven't checked) that it means "after the delimiter and
> before any quoted field (or some variation on that).

I agree that whitespace after the delimiter and before any quoted field is
skipped. Also whitespace after the start of the line and before any quoted
field is skipped.
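
A small check of the behaviour described here, assuming Python 3 and the
default excel dialect. The sample line is made up for illustration.

```python
import csv

# One line whose two fields are quoted and preceded by a space,
# including the first field at the start of the line.
line = [' "foo, x", "bar"']

# Default: the leading space starts an unquoted field, so the quote
# characters are treated as ordinary data and the inner comma splits.
plain = next(csv.reader(line))

# skipinitialspace=True: the space is dropped, so the quote character
# opens a quoted field and the inner comma is kept.
skipped = next(csv.reader(line, skipinitialspace=True))

print(plain)
print(skipped)
```

The second result has two fields, the first of which still contains its
comma, so the option matters for more than cosmetic stripping.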

>
>
> All of the "dialect" parameters are there to allow parsing of a specific
> common form of CSV file. Because there is no formal definition of the
> format, the module simply aims to parse (and produce the same result)
> as common applications such as Excel and Access. Changing the behaviour
> in any non-backwards compatible way is sure to get screams of anguish
> from many users. Even when the behaviour appears to be a bug, you can
> be sure people are counting on it working like that.

skipinitialspace defaults to false and by the same logic skipfinalspace
should default to false to preserve compatibility with the csv module in
2.6. On the other hand, the switch to version 3 is as good a time as any to
break backwards compatibility to adopt something that works better for new
users.

Based on my experience parsing several hundred csv generated by many
different people I think it would be nice to at least have a dialect that is
excel + skipinitialspace=True + skipfinalspace=True.
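
Half of such a dialect can already be written today. The sketch below assumes
Python 3; the class name and the registered name are hypothetical, and
skipfinalspace is left out because, as far as I know, the csv module has no
such option.

```python
import csv

class ExcelSkipInitial(csv.excel):
    """excel dialect with leading spaces skipped (hypothetical name)."""
    skipinitialspace = True

# Register under a made-up name so it can be selected by string.
csv.register_dialect("excel-skipinitial", ExcelSkipInitial)

row = next(csv.reader([" foo , bar "], dialect="excel-skipinitial"))
print(row)  # trailing spaces are still present in each field
```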

>
> BTW, this discussion probably should move to python-dev.
>
> --
> Andrew McNamara, Senior Developer, Object Craft
> http://www.object-craft.com.au/
>

From sjmachin at lexicon.net  Mon Oct 20 09:48:10 2008
From: sjmachin at lexicon.net (John Machin)
Date: Mon, 20 Oct 2008 18:48:10 +1100
Subject: [Csv] skipfinalspace
In-Reply-To: <9789242b0810192206n318fb7cao8bba9341695af053@mail.gmail.com>
References: <9789242b0810181615w5b1325c6h2149990854cff83d@mail.gmail.com>	<20081019234604.B70DB59C001@longblack.object-craft.com.au>	<9789242b0810192154w557dedd8seba60c3deb168f12@mail.gmail.com>
	<9789242b0810192206n318fb7cao8bba9341695af053@mail.gmail.com>
Message-ID: <48FC37BA.20303@lexicon.net>

Tom Brown wrote:
> (Continuing thread started at 
> http://mail.python.org/pipermail/csv/2008-October/000688.html)
> 
> On Sun, Oct 19, 2008 at 16:46, Andrew McNamara 
> <andrewm at object-craft.com.au <mailto:andrewm at object-craft.com.au>> wrote:
> 
>      >I downloaded the 2.6 source tar ball, but is it too late for new
>     features to
>      >get into versions <3?
> 
>     Yep.
> 
>      >How would you feel about adding the following tests to
>     Lib/test/test_csv.py
>      >and getting them to pass?
>      >
>      >Also http://www.python.org/doc/2.5.2/lib/csv-fmt-params.html says
>      >"*skipinitialspace *When True, whitespace immediately following the
>      >delimiter is ignored."
>      >but my tests show whitespace at the start of any field is ignored,
>     including
>      >the first field.
> 
>     I suspect (but I haven't checked) that it means "after the delimiter and
>     before any quoted field (or some variation on that).
> 
> I agree that whitespace after the delimiter and before any quoted field 
> is skipped. Also whitespace after the start of the line and before any 
> quoted field is skipped.

>     All of the "dialect" parameters are there to allow parsing of a specific
>     common form of CSV file. Because there is no formal definition of the
>     format, the module simply aims to parse (and produce the same result)
>     as common applications such as Excel and Access. Changing the behaviour
>     in any non-backwards compatible way is sure to get screams of anguish
>     from many users. Even when the behaviour appears to be a bug, you can
>     be sure people are counting on it working like that.
> 
> 
> skipinitialspace defaults to false and by the same logic skipfinalspace 
> should default to false to preserve compatibility with the csv module in 
> 2.6. On the other hand, the switch to version 3 is as good a time as any 
> to break backwards compatibility to adopt something that works better 
> for new users.

Read Andrew's lips: They don't want "better", they want "the same as MS".

> Based on my experience parsing several hundred csv generated by many 
> different people I think it would be nice to at least have a dialect 
> that is excel + skipinitialspace=True + skipfinalspace=True.

Based on my experience extracting data from innumerable csv files (and 
infinite varieties thereof), spreadsheet files, and database tables, in 
99.99% of cases one should automatically apply the following 
transformations to each text field:
    * strip leading whitespace
    * strip trailing whitespace
    * replace embedded runs of whitespace by a single space
and one needs to ensure that the definition of whitespace includes the 
no-break space (NBSP) character.

As this "space normalisation" is needed for all input sources, the csv 
module is IMHO the wrong place to put it. A string method would be a 
better idea.
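
The three transformations listed above can be sketched in a few lines. This
assumes Python 3 str values, where \s in the re module also matches the
no-break space; normalise_space is a hypothetical name, not an existing
string method.

```python
import re

def normalise_space(text):
    """Collapse runs of whitespace to one space and strip both ends.

    Hypothetical helper; for Python 3 str input, \\s matches Unicode
    whitespace, which includes the no-break space (U+00A0).
    """
    return re.sub(r"\s+", " ", text).strip()

print(normalise_space("\u00a0 foo \u00a0 bar\t\n"))
```

With Python 2 byte strings the NBSP would not be matched unless the data is
decoded to unicode first and re.UNICODE is used.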

Cheers,
John

From tom.brown.code at gmail.com  Tue Oct 21 09:21:50 2008
From: tom.brown.code at gmail.com (Tom Brown)
Date: Tue, 21 Oct 2008 00:21:50 -0700
Subject: [Csv] skipfinalspace
In-Reply-To: <9789242b0810210021re3dd771o3a7a19f177d8be41@mail.gmail.com>
References: <9789242b0810181615w5b1325c6h2149990854cff83d@mail.gmail.com>
	<20081019234604.B70DB59C001@longblack.object-craft.com.au>
	<9789242b0810192154w557dedd8seba60c3deb168f12@mail.gmail.com>
	<9789242b0810192206n318fb7cao8bba9341695af053@mail.gmail.com>
	<48FC37BA.20303@lexicon.net>
	<9789242b0810210021re3dd771o3a7a19f177d8be41@mail.gmail.com>
Message-ID: <9789242b0810210021v344a6d86sec0859633a639724@mail.gmail.com>

On Mon, Oct 20, 2008 at 00:48, John Machin <sjmachin at lexicon.net> wrote:

> Tom Brown wrote:
>
>> (Continuing thread started at
>> http://mail.python.org/pipermail/csv/2008-October/000688.html)
>>
>> On Sun, Oct 19, 2008 at 16:46, Andrew McNamara <
>> andrewm at object-craft.com.au <mailto:andrewm at object-craft.com.au>> wrote:
>>
>>     >I downloaded the 2.6 source tar ball, but is it too late for new
>>    features to
>>     >get into versions <3?
>>
>>    Yep.
>>
>>     >How would you feel about adding the following tests to
>>    Lib/test/test_csv.py
>>     >and getting them to pass?
>>     >
>>     >Also http://www.python.org/doc/2.5.2/lib/csv-fmt-params.html says
>>     >"*skipinitialspace *When True, whitespace immediately following the
>>     >delimiter is ignored."
>>     >but my tests show whitespace at the start of any field is ignored,
>>    including
>>     >the first field.
>>
>>    I suspect (but I haven't checked) that it means "after the delimiter
>> and
>>    before any quoted field (or some variation on that).
>>
>> I agree that whitespace after the delimiter and before any quoted field is
>> skipped. Also whitespace after the start of the line and before any quoted
>> field is skipped.
>>
>
>     All of the "dialect" parameters are there to allow parsing of a
>> specific
>>    common form of CSV file. Because there is no formal definition of the
>>    format, the module simply aims to parse (and produce the same result)
>>    as common applications such as Excel and Access. Changing the behaviour
>>    in any non-backwards compatible way is sure to get screams of anguish
>>    from many users. Even when the behaviour appears to be a bug, you can
>>    be sure people are counting on it working like that.
>>
>>
>> skipinitialspace defaults to false and by the same logic skipfinalspace
>> should default to false to preserve compatibility with the csv module in
>> 2.6. On the other hand, the switch to version 3 is as good a time as any to
>> break backwards compatibility to adopt something that works better for new
>> users.
>>
>
> Read Andrew's lips: They don't want "better", they want "the same as MS".

okay.

>
>
>  Based on my experience parsing several hundred csv generated by many
>> different people I think it would be nice to at least have a dialect that is
>> excel + skipinitialspace=True + skipfinalspace=True.
>>
>
> Based on my experience extracting data from innumerable csv files (and
> infinite varieties thereof),

Wow, that is a _lot_ of files :-P

> spreadsheet files, and database tables, in 99.99% of cases one should
> automatically apply the following transformations to each text field:
>   * strip leading whitespace
>   * strip trailing whitespace
>   * replace embedded runs of whitespace by a single space
> and one needs to ensure that the definition of whitespace includes the
> no-break space (NBSP) character.
>
> As this "space normalisation" is needed for all input sources, the csv
> module is IMHO the wrong place to put it. A string method would be a better
> idea.

I agree that strip() and something like re.sub(r"\s+", " ", ...) are handy.
If 99.99% of csv readers should be applying these fixes to every field,
perhaps there should be an easy-to-enable option to apply them. Why force
almost everyone to discover they need the transformations and put a line of
code around csv reader?

From magnus at hetland.org  Tue Oct 21 13:03:41 2008
From: magnus at hetland.org (Magnus Lie Hetland)
Date: Tue, 21 Oct 2008 13:03:41 +0200
Subject: [Csv] skipfinalspace
In-Reply-To: <48FC37BA.20303@lexicon.net>
References: <9789242b0810181615w5b1325c6h2149990854cff83d@mail.gmail.com>	<20081019234604.B70DB59C001@longblack.object-craft.com.au>	<9789242b0810192154w557dedd8seba60c3deb168f12@mail.gmail.com>
	<9789242b0810192206n318fb7cao8bba9341695af053@mail.gmail.com>
	<48FC37BA.20303@lexicon.net>
Message-ID: <87772659-8D79-4CEF-BF7A-E633A38D4A25@hetland.org>

On Oct 20, 2008, at 09:48, John Machin wrote:

> Based on my experience extracting data from innumerable csv files  
> (and infinite varieties thereof), spreadsheet files, and database  
> tables, in 99.99% of cases one should automatically apply the  
> following transformations to each text field:
>   * strip leading whitespace
>   * strip trailing whitespace
>   * replace embedded runs of whitespace by a single space
> and one needs to ensure that the definition of whitespace includes  
> the no-break space (NBSP) character.
>
> As this "space normalisation" is needed for all input sources, the  
> csv module is IMHO the wrong place to put it. A string method would  
> be a better idea.

Hm. It seems quite familiar, somehow...

You could certainly do the following (for each field)...

   " ".join(field.split())

... but I seem to recall running across something that did this?  
(Maybe I'm confusing it with some other issue, with the  
string.capwords function versus str.title :)
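
For comparison, here is how the three behave on the same input, assuming
Python 3. The sample field is made up. string.capwords splits on whitespace
and rejoins with single spaces, so it normalises spacing as a side effect,
while str.title leaves the spacing alone; whether that is the thing being
half-remembered above is only a guess.

```python
import string

field = "  new   york\u00a0city "

joined = " ".join(field.split())    # normalise spacing only
capped = string.capwords(field)     # normalise spacing and capitalise
titled = field.title()              # capitalise, spacing untouched

print(repr(joined))
print(repr(capped))
print(repr(titled))
```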

-- 
Magnus Lie Hetland
http://hetland.org