Re: [Python-Dev] [Csv] skipfinalspace
(Continuing thread started at
http://mail.python.org/pipermail/csv/2008-October/000688.html)
On Sun, Oct 19, 2008 at 16:46, Andrew McNamara wrote:
I downloaded the 2.6 source tar ball, but is it too late for new features to get into versions <3?
Yep.
How would you feel about adding the following tests to Lib/test/test_csv.py and getting them to pass?
Also http://www.python.org/doc/2.5.2/lib/csv-fmt-params.html says "*skipinitialspace *When True, whitespace immediately following the delimiter is ignored." but my tests show whitespace at the start of any field is ignored, including the first field.
I suspect (but I haven't checked) that it means "after the delimiter and before any quoted field" (or some variation on that).
I agree that whitespace after the delimiter and before any quoted field is skipped. Also whitespace after the start of the line and before any quoted field is skipped.
All of the "dialect" parameters are there to allow parsing of a specific common form of CSV file. Because there is no formal definition of the format, the module simply aims to parse (and produce the same result) as common applications such as Excel and Access. Changing the behaviour in any non-backwards compatible way is sure to get screams of anguish from many users. Even when the behaviour appears to be a bug, you can be sure people are counting on it working like that.
skipinitialspace defaults to false and by the same logic skipfinalspace should default to false to preserve compatibility with the csv module in 2.6. On the other hand, the switch to version 3 is as good a time as any to break backwards compatibility to adopt something that works better for new users. Based on my experience parsing several hundred csv generated by many different people I think it would be nice to at least have a dialect that is excel + skipinitialspace=True + skipfinalspace=True.
BTW, this discussion probably should move to python-dev.
-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/
I downloaded the 2.6 source tar ball, but is it too late for new features to get into versions <3?
Yep.
Sigh - I should slow down and actually read the e-mail I'm replying to. It is not too late to get features into versions <3. It is, however, too late to get features into 2.6, which was not what you asked, but what I was answering "Yep" to.
How would you feel about adding the following tests to Lib/test/test_csv.py and getting them to pass?
I have no real objection to someone adding a skipfinalspace parameter and associated tests, although I have no time to do it myself at the moment.
Also http://www.python.org/doc/2.5.2/lib/csv-fmt-params.html says "*skipinitialspace *When True, whitespace immediately following the delimiter is ignored." but my tests show whitespace at the start of any field is ignored, including the first field.
I suspect (but I haven't checked) that it means "after the delimiter and before any quoted field" (or some variation on that).
I agree that whitespace after the delimiter and before any quoted field is skipped. Also whitespace after the start of the line and before any quoted field is skipped.
I'm not sure if we're talking about the same thing - it seems to work as I expect it to work:

>>> list(csv.reader([' foo, bar']))
[[' foo', ' bar']]
>>> list(csv.reader([' foo, bar'], skipinitialspace=1))
[['foo', 'bar']]

BTW, I think the reason "skipinitialspace" exists at all is to support this:

>>> list(csv.reader([' foo, " bar"']))
[[' foo', ' " bar"']]
>>> list(csv.reader([' foo, " bar"'], skipinitialspace=1))
[['foo', ' bar']]

The quoting is only valid if the quote is the first character encountered in the field (this is how Excel works). However, some other CSV generators insert a space after the comma, and expect the parser to still treat it as a quoted field - so skipinitialspace eats the space leading up to the quote, but does not eat any space after the quote (hence the "initial" in the name).

For symmetry, a "skipfinalspace" option should do the same - only eat space after the quote (if quotes are used) - however this will be rather hard to implement as the parser state has already rolled on, and you no longer know whether the field was quoted. Eating spaces that appeared within the quotes is the wrong thing to do.
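A minimal sketch of the difficulty described above, using only the existing csv.reader API. Once a field has been parsed, a trailing space that sat inside the quotes, one that followed the closing quote, and one in an unquoted field all look the same, so stripping after the fact cannot respect the quoting. The sample line is made up for illustration.

```python
import csv

# Three fields that differ only in where the trailing space sits:
# inside the quotes, after the closing quote, and in an unquoted field.
line = '"bar ","bar" ,bar '

row = next(csv.reader([line]))

# With the default (non-strict) dialect the three fields come back
# identical, so the caller cannot tell which space was quoted.
print(row)
```

Any post-processing step such as field.rstrip() would therefore also eat the space that appeared within the quotes.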
skipinitialspace defaults to false and by the same logic skipfinalspace should default to false to preserve compatibility with the csv module in 2.6. On the other hand, the switch to version 3 is as good a time as any to break backwards compatibility to adopt something that works better for new users.
No, by default it needs to work like Excel, because this is the de facto standard.
Based on my experience parsing several hundred csv generated by many different people I think it would be nice to at least have a dialect that is excel + skipinitialspace=True + skipfinalspace=True.
Once the "skipfinalspace" parameter is implemented, there is nothing stopping you creating such a dialect in your code, but I don't support adding it to the standard library - the dialects in the std lib should be well defined (in some way).

BTW, it's not necessary to create dialect objects: as I've done above, users can pass keyword parameters to the parser if it's more convenient.

-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/
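As a rough sketch of what such a user-side dialect could look like: the class name ExcelSkipInitialSpace, the registered name "excel-skipinitialspace" and the helper read_rstripped are all hypothetical, and since "skipfinalspace" is only a proposal in this thread, the trailing whitespace is removed here by a plain rstrip() after parsing. Note that this workaround also strips spaces that were inside quotes, which is exactly what the proposed parameter is meant to avoid.

```python
import csv


class ExcelSkipInitialSpace(csv.excel):
    """Hypothetical user-defined dialect: excel plus skipinitialspace."""
    skipinitialspace = True


# The registered name is an arbitrary choice for this example.
csv.register_dialect("excel-skipinitialspace", ExcelSkipInitialSpace)


def read_rstripped(lines, dialect="excel-skipinitialspace"):
    """Yield rows with trailing whitespace removed from every field.

    This is a stand-in for the proposed skipfinalspace option, not an
    implementation of it: rstrip() cannot tell quoted spaces from
    unquoted ones.
    """
    for row in csv.reader(lines, dialect=dialect):
        yield [field.rstrip() for field in row]


rows = list(read_rstripped([' foo , " bar" ']))

# The same reader settings without a dialect object, as keyword parameters.
kw_rows = list(csv.reader([' foo, " bar"'], skipinitialspace=True))
```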
Tom Brown wrote:
(Continuing thread started at http://mail.python.org/pipermail/csv/2008-October/000688.html)
On Sun, Oct 19, 2008 at 16:46, Andrew McNamara <andrewm@object-craft.com.au> wrote:
> I downloaded the 2.6 source tar ball, but is it too late for new features to get into versions <3?
Yep.
> How would you feel about adding the following tests to Lib/test/test_csv.py and getting them to pass?
>
> Also http://www.python.org/doc/2.5.2/lib/csv-fmt-params.html says "*skipinitialspace *When True, whitespace immediately following the delimiter is ignored." but my tests show whitespace at the start of any field is ignored, including the first field.
I suspect (but I haven't checked) that it means "after the delimiter and before any quoted field" (or some variation on that).
I agree that whitespace after the delimiter and before any quoted field is skipped. Also whitespace after the start of the line and before any quoted field is skipped.
All of the "dialect" parameters are there to allow parsing of a specific common form of CSV file. Because there is no formal definition of the format, the module simply aims to parse (and produce the same result) as common applications such as Excel and Access. Changing the behaviour in any non-backwards compatible way is sure to get screams of anguish from many users. Even when the behaviour appears to be a bug, you can be sure people are counting on it working like that.
skipinitialspace defaults to false and by the same logic skipfinalspace should default to false to preserve compatibility with the csv module in 2.6. On the other hand, the switch to version 3 is as good a time as any to break backwards compatibility to adopt something that works better for new users.
Read Andrew's lips: They don't want "better", they want "the same as MS".
Based on my experience parsing several hundred csv generated by many different people I think it would be nice to at least have a dialect that is excel + skipinitialspace=True + skipfinalspace=True.
Based on my experience extracting data from innumerable csv files (and infinite varieties thereof), spreadsheet files, and database tables, in 99.99% of cases one should automatically apply the following transformations to each text field:

* strip leading whitespace
* strip trailing whitespace
* replace embedded runs of whitespace by a single space

and one needs to ensure that the definition of whitespace includes the no-break space (NBSP) character.

As this "space normalisation" is needed for all input sources, the csv module is IMHO the wrong place to put it. A string method would be a better idea.

Cheers,
John
On Oct 20, 2008, at 09:48, John Machin wrote:
Based on my experience extracting data from innumerable csv files (and infinite varieties thereof), spreadsheet files, and database tables, in 99.99% of cases one should automatically apply the following transformations to each text field:

* strip leading whitespace
* strip trailing whitespace
* replace embedded runs of whitespace by a single space

and one needs to ensure that the definition of whitespace includes the no-break space (NBSP) character.
As this "space normalisation" is needed for all input sources, the csv module is IMHO the wrong place to put it. A string method would be a better idea.
Hm. It seems quite familiar, somehow... You could certainly do the following (for each field)...

" ".join(field.split())

... but I seem to recall running across something that did this? (Maybe I'm confusing it with some other issue, with the string.capwords function versus str.title :)

-- Magnus Lie Hetland http://hetland.org
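A small sketch of the " ".join(field.split()) idea applied to parsed rows. The function names normalise_space and normalise_rows are made up for this example, and it assumes Python 3 str objects, where split() with no argument treats the no-break space (U+00A0) as whitespace; byte strings in 2.x behave differently.

```python
import csv


def normalise_space(field):
    """Strip leading/trailing whitespace and collapse inner runs to one space.

    Assumes a Python 3 str: split() with no argument splits on any
    Unicode whitespace, including the no-break space U+00A0.
    """
    return " ".join(field.split())


def normalise_rows(lines, **reader_kwargs):
    """Yield csv rows with every field passed through normalise_space."""
    for row in csv.reader(lines, **reader_kwargs):
        yield [normalise_space(field) for field in row]


sample = ['  foo \xa0 bar ,baz\t\tqux  ']
cleaned = list(normalise_rows(sample))
```

Because the normalisation runs after parsing, it applies equally to data from spreadsheets or database tables, which is the argument for keeping it out of the csv module.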
participants (4)
- Andrew McNamara
- John Machin
- Magnus Lie Hetland
- Tom Brown