
Hi,

On Tue, Apr 5, 2011 at 4:12 PM, Christopher Barker <Chris.Barker@noaa.gov> wrote:
On 4/5/11 3:36 PM, josef.pktd@gmail.com wrote:
I disagree that U makes no sense for binary file reading.
I wasn't saying that it made no sense to have a "U" mode for binary file reading, what I meant is that by the python2 definition, it made no sense. In Python 2, the ONLY difference between binary and text mode is line-feed translation.
I think it's right to say that the difference between a text and a binary file in python 2 is - none for unix, and '\r\n' -> '\n' translation in windows. The difference between 'rt' and 'U' is (this is for my own benefit): For 'rt', a '\r' does not cause a line break - with 'U' - it does. For 'rt' _not_ on Windows, '\r\n' stays the same - it is stripped to '\n' with 'U'.
As for Python 3:
In python 3:
'b' means "return byte objects"
't' means "return decoded strings"
'U' means two things:
1) When iterating by line, split lines at any of '\r', '\r\n', '\n'
2) When returning lines split this way, convert '\r' and '\r\n' to '\n'
a) 'U' is default -- it's essentially the same as 't' (in PY3), so 't' means "return decoded and line-feed translated unicode objects"
Right - my argument is that the behavior implied by 'U' and 't' is conceptually separable. 'U' is for how to do line-breaks, and line-termination translations, 't' is for whether to decode the text or not. In python 3.
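To make the separation concrete: in Python 3 the built-in open() takes the newline handling (newline=) and the decoding (encoding=) as separate arguments, although newline= is only accepted in text mode. A minimal sketch, assuming a small temporary file with mixed line endings (the file name and contents are made up for illustration):

```python
import os
import tempfile

raw = b"one\rtwo\r\nthree\n"  # old-Mac, Windows and Unix endings mixed

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mixed.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(raw)

    # newline=None (the default): split at any ending AND translate to '\n'
    with open(path, "r", encoding="ascii", newline=None) as f:
        translated = list(f)

    # newline='': split at any ending, but return the endings untouched
    with open(path, "r", encoding="ascii", newline="") as f:
        untranslated = list(f)

    # binary mode: no decoding, and the newline argument is rejected
    try:
        open(path, "rb", newline="")
        binary_accepts_newline = True
    except ValueError:
        binary_accepts_newline = False

print(translated)
print(untranslated)
print(binary_accepts_newline)
```

So the two halves of 'U' (splitting and translating) can already be picked apart with newline='', but only once you have agreed to decode.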
b) I think the line-feed conversion is done regardless of whether you are iterating by lines, i.e. also with a full-on .read(). At least that's how it works in py2 -- not running py3 here to test.
Yes, that looks right.
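A quick way to check this in Python 3 without touching the disk is to wrap an in-memory byte buffer in io.TextIOWrapper, which is the type open() returns in text mode. This is only a sketch with made-up sample bytes:

```python
import io

raw = b"one\rtwo\r\nthree\n"  # sample bytes, mixed line endings

# newline=None is the default "universal newlines with translation" mode
text = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=None)

# A single read() of the whole stream, no line iteration involved
whole = text.read()
print(repr(whole))
```

The '\r' and '\r\n' endings come back as '\n' even though nothing iterated by line.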
If you support returning lines from a binary file (which python 3 does), then I think 'U' is a sensible thing to allow - as in this case.
but what is a "binary file"?
In python 3 a binary file is a file which is not decoded, and returns bytes. It still has a concept of a 'line', as defined by line terminators - you can iterate over one, or do .readlines(). In python 2, as you say, a binary file is essentially the same as a text file, with the single exception of the windows \r\n -> \n translation.
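Here is a small sketch of what "line" means for bytes in Python 3, using an in-memory buffer and made-up sample data. Iterating a binary stream splits at b'\n' only, while bytes.splitlines() splits at '\r', '\n' and '\r\n' without any decoding, which is roughly the 'U'-for-binary behaviour being discussed:

```python
import io

raw = b"one\rtwo\r\nthree\n"  # sample bytes, mixed line endings

# Iterating a binary file object: only b'\n' ends a line
binary_lines = list(io.BytesIO(raw))

# bytes.splitlines: universal-newline splitting on undecoded bytes
universal = raw.splitlines(keepends=True)

# Translate every ending to b'\n' by hand. This assumes every line is
# terminated; an unterminated last line would gain a b'\n' here.
normalized = [line.rstrip(b"\r\n") + b"\n" for line in universal]

print(binary_lines)
print(universal)
print(normalized)
```

This only makes sense for encodings where the line terminators are the single bytes 0x0D and 0x0A, such as ascii, latin-1 or utf-8.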
I THINK what you are proposing is that we'd want to be able to have both linefeed translation and no decoding done. But I think that's impossible -- aren't the linefeeds themselves encoded differently with different encodings?
Right - so obviously if you open a utf-16 file as binary, terrible things may happen - this was what Pauli was pointing out before. His point was that utf-8 is the standard, and that we probably would not hit many other encodings. I agree with you if you are saying that it would be good to be able to deal with them if we can - presumably by allowing 'rt' file objects, producing python 3 strings.
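To illustrate the "terrible things" with utf-16, here is a sketch with a tiny made-up string. In utf-16-le the newline is the two bytes b'\n\x00', so splitting the raw bytes at b'\n' cuts code units in half:

```python
import io

encoded = "a\nb\n".encode("utf-16-le")  # b"a\x00\n\x00b\x00\n\x00"

# Binary iteration splits after each 0x0A byte, in the middle of a code unit
chunks = list(io.BytesIO(encoded))

# The first chunk has an odd number of bytes, so it no longer decodes
try:
    chunks[0].decode("utf-16-le")
    first_chunk_decodes = True
except UnicodeDecodeError:
    first_chunk_decodes = False

# Decoding first and splitting afterwards gives the expected lines
decoded_lines = list(io.TextIOWrapper(io.BytesIO(encoded), encoding="utf-16-le"))

print(chunks)
print(first_chunk_decodes)
print(decoded_lines)
```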
U looks appropriate in this case, better than the workarounds. However, to me the python 3.2 docs seem to say that U only works for text mode
Agreed -- but I don't see the problem -- your files are either encoded in something that might treat newlines differently (UCS32, maybe?), in which case you'd want it decoded, or you are working with ascii or ansi or utf-8, in which case you can specify the encoding anyway.
I don't understand why we'd want a binary blob for text parsing -- the parsing code is going to have to know something about the encoding to work -- it might as well get passed in to the file open call, and work with unicode. I suppose if we still want to assume ascii for parsing, then we could use 't' and then re-encode to ascii to work with it. Which I agree does seem heavy handed just for fixing newlines.
Also, one problem I've often had with encodings is what happens if I think I have ascii, but really have a couple characters above 127 -- then the default is to get an error in decoding. I'd like to be able to pass in a flag that either skips the un-decodable characters or replaces them with something, but it doesn't look like you can do that with the file open function in py3.
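For reference, the built-in open() in Python 3 does accept an errors argument alongside encoding, with values such as 'strict' (the default), 'replace' and 'ignore'. A minimal sketch, assuming a temporary file that is nearly ascii but contains one byte above 127 (the file name, contents and the read_with helper are made up for illustration):

```python
import os
import tempfile

raw = b"caf\xe9 au lait\n"  # 0xE9 is not valid ascii


def read_with(path, errors):
    """Hypothetical helper: read the whole file as ascii with a given error policy."""
    with open(path, "r", encoding="ascii", errors=errors) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "almost_ascii.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Default policy: the stray byte raises an error
    try:
        read_with(path, "strict")
        strict_ok = True
    except UnicodeDecodeError:
        strict_ok = False

    replaced = read_with(path, "replace")  # bad byte becomes U+FFFD
    ignored = read_with(path, "ignore")    # bad byte is dropped

print(strict_ok)
print(repr(replaced))
print(repr(ignored))
```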
The line terminator is always b'\n' for binary files;
Once you really make the distinction between text and binary, the concept of a "line terminator" doesn't really make sense anyway.
Well - I was arguing that, given we can iterate over lines in binary files, then there must be the concept of what a line is, in a binary file, and that means that we need the concept of a line terminator. I realize this is a discussion that would have to happen on the python-dev list...

See you,

Matthew