'r' vs 'rb' in csv (was Re: Python SHA-1 as a method for unique file identification ? [help!])
sjmachin at lexicon.net
Tue Jun 27 02:37:48 CEST 2006
On 27/06/2006 6:39 AM, Mike Orr wrote:
> Tim Peters wrote:
>> [EP <eric.pederson at gmail.com>]
>>> This inquiry may either turn out to be about the suitability of the
>>> SHA-1 (160 bit digest) for file identification, the sha function in
>>> Python ... or about some error in my script
>> It's your script. Always open binary files in binary mode. It's a
>> disaster on Windows if you don't (if you open a file in text mode on
>> Windows, the OS pretends that EOF occurs at the first instance of byte
>> chr(26) -- this is an ancient Windows behavior that made an odd kind
>> of sense in the mists of history, and has persisted in worship of
>> Backward Compatibility despite that the original reason for it went
>> away _long_ ago).
> On a semi-related note, I have a database on Linux that imports from a
> Macintosh CSV file. The 'csv' module says to always open files in
> binary mode, but this didn't work in my case: I had to open it as 'rU'
> (text with universal newlines) or 'csv' misparsed it. I'd like the
> program to be portable to Windows and Mac. Is there a way around this?
> Will I really burn in hell for using 'rU'?
Yes, you will burn in hell for using any old kludge that gets results
(by accident) instead of reading the manual to find a principled solution:
    lineterminator
        The string used to terminate lines in the CSV file. It defaults
        to '\r\n'.
In the case of a Mac CSV file, '\r' is probably required.
You will burn in hell for asking questions w/o supplying sufficient
information, like (a) repr(first few lines of your Mac CSV file) (b)
what was the result from the csv module ("didn't work" doesn't cut it).
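For (a), a minimal sketch of how to capture that information, written for a current Python 3 rather than the Python of this thread. The helper name peek_raw and the sample data are hypothetical; the sample merely stands in for a CR-terminated Mac export.

```python
import os
import tempfile

def peek_raw(path, nbytes=200):
    """Return the first nbytes of a file exactly as stored on disk."""
    with open(path, "rb") as f:  # binary mode: no newline translation
        return f.read(nbytes)

# Hypothetical sample standing in for a classic-Mac CSV export (CR-only line ends).
sample = b"id,name\r1,alpha\r2,beta\r"
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(sample)

head = peek_raw(path)
os.remove(path)

# repr() makes the line terminators visible instead of letting the terminal eat them.
print(repr(head))
print("CR count:", head.count(b"\r"), "LF count:", head.count(b"\n"))
```

Pasting the repr() output into the question shows at a glance whether the file uses '\r', '\n' or '\r\n'.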
> What was the odd bit of sense? I know you end console input by typing
> ctrl-Z, but I thought it was just like Unix ctrl-D which ends the input
> but doesn't actually insert that character.
Pace timbot, the "ancient Windows behavior" was inherited via MS-DOS
from CP/M. Sectors on disk were 128 bytes. File sizes were recorded as
numbers of sectors, not numbers of bytes, so the exact length of the data
in the last sector was not recorded. The convention was that the end of a
text file was indicated by ^Z.
You are correct: modern software shouldn't, and usually doesn't,
gratuitously write ^Z to files, but there is some software out there
that still does, hence the preservation of the convention on reading.
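To tie this back to the SHA-1 question, here is a rough simulation of what the read-side convention does to a binary file. It does not call any Windows text-mode API; read_until_ctrl_z is a hypothetical helper that simply cuts the data at the first chr(26), and the payloads are made up.

```python
import hashlib

CTRL_Z = b"\x1a"  # chr(26), the CP/M / MS-DOS end-of-text marker

def read_until_ctrl_z(data):
    """Simulate the legacy convention: treat the first ^Z as end of file."""
    head, _, _ = data.partition(CTRL_Z)
    return head

# Two different "files" that happen to share everything before the first ^Z.
file_a = b"header" + CTRL_Z + b"rest of a binary file"
file_b = b"header" + CTRL_Z + b"a completely different tail"

seen_a = read_until_ctrl_z(file_a)
seen_b = read_until_ctrl_z(file_b)

# Hashing the full bytes distinguishes the files ...
full_a = hashlib.sha1(file_a).hexdigest()
full_b = hashlib.sha1(file_b).hexdigest()

# ... hashing only what a ^Z-honouring read returns does not.
cut_a = hashlib.sha1(seen_a).hexdigest()
cut_b = hashlib.sha1(seen_b).hexdigest()

print("full digests differ:", full_a != full_b)
print("truncated digests equal:", cut_a == cut_b)
```

The digest function is not at fault in the second case; it was simply never given the bytes after the ^Z.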
More importantly for CSV files, the data may contain *embedded* CRs and
LFs that the users had in their spreadsheet file. Reading that with "r"
or "rU" will certainly result in "didn't work".
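A small sketch of the embedded-newline problem, with made-up data. It assumes a current Python 3, where the csv documentation's counterpart to the 'rb' advice is to open the file with newline='' so that CR and LF reach the csv module untranslated; the splitlines() version only stands in for any approach that decides where lines end before csv sees the data.

```python
import csv
import io

# One header row plus one record whose second field contains an embedded CR LF.
raw = b'id,note\r\n1,"line one\r\nline two"\r\n'

# Naive approach: decide where the lines end first, then split on commas.
naive_rows = [line.split(",") for line in raw.decode("ascii").splitlines()]

# Untranslated approach: newline="" hands CR and LF to the csv module as-is,
# so the quoted field keeps its embedded line break.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline="")
rows = list(csv.reader(text))

print("naive:", naive_rows)
print("csv:  ", rows)
```

The naive version reports three rows and leaves stray quote characters in the data; the csv reader reports two rows, with the line break preserved inside the second field.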