
On 4/5/11 3:36 PM, josef.pktd@gmail.com wrote:
> I disagree that U makes no sense for binary file reading.
I wasn't saying that it made no sense to have a "U" mode for binary file reading; what I meant is that by the Python 2 definition, it makes no sense. In Python 2, the ONLY difference between binary and text mode is line-feed translation.

In Python 3:
'b' means "return bytes objects"; 't' means "return decoded strings".

'U' means two things:

1) When iterating by line, split lines at any of '\r', '\r\n', '\n'
2) When returning lines split this way, convert '\r' and '\r\n' to '\n'

a) 'U' is the default -- it's essentially the same as 't' (in py3), so 't' means "return decoded and line-feed-translated unicode objects".
b) I think the line-feed conversion is done regardless of whether you are iterating by lines, i.e. with a full-on .read(). At least that's how it works in py2 -- not running py3 here to test.
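Point (b) is easy to check in py3 with io.TextIOWrapper over an in-memory buffer (newline=None asks for the universal-newlines behavior; the buffer and names here are just for illustration):

```python
import io

# raw bytes mixing all three line-ending conventions
raw = io.BytesIO(b"one\r\ntwo\rthree\n")

# newline=None requests universal-newlines translation (the 'U' behavior)
text = io.TextIOWrapper(raw, encoding="ascii", newline=None)

# the translation happens even on a full-on .read(), not just line iteration
print(repr(text.read()))  # 'one\ntwo\nthree\n'
```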
> If you support returning lines from a binary file (which python 3 does), then I think 'U' is a sensible thing to allow - as in this case.
But what is a "binary file"? I THINK what you are proposing is that we'd want to be able to have both line-feed translation and no decoding. But I think that's impossible -- aren't the line feeds themselves encoded differently under different encodings?
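They are -- a small sketch of both points: the line feed is a different byte sequence under the wide encodings, and binary-mode line iteration splits only on the raw b'\n' byte:

```python
import io

# '\n' is one byte in utf-8 but two (or four) bytes in the wide encodings
print("\n".encode("utf-8"))      # b'\n'
print("\n".encode("utf-16-le"))  # b'\n\x00'
print("\n".encode("utf-32-le"))  # b'\n\x00\x00\x00'

# iterating a binary stream by "lines" splits on the single byte b'\n' only;
# '\r' and '\r\n' pass through untouched
buf = io.BytesIO(b"a\r\nb\rc\n")
print(list(buf))  # [b'a\r\n', b'b\rc\n']
```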
> U looks appropriate in this case, better than the workarounds. However, to me the python 3.2 docs seem to say that U only works for text mode.
Agreed -- but I don't see the problem -- your files are either encoded in something that might treat newlines differently (UCS32, maybe?), in which case you'd want them decoded, or you are working with ascii or ansi or utf-8, in which case you can specify the encoding anyway.

I don't understand why we'd want a binary blob for text parsing -- the parsing code is going to have to know something about the encoding to work -- it might as well get passed in to the file open call, and work with unicode. I suppose if we still want to assume ascii for parsing, then we could use 't' and then re-encode to ascii to work with it. Which, I agree, does seem heavy-handed just for fixing newlines.

Also, one problem I've often had with encodings is what happens if I think I have ascii, but really have a couple of characters above 127 -- then the default is to get an error in decoding. I'd like to be able to pass in a flag that either skips the un-decodable characters or replaces them with something, but it doesn't look like you can do that with the file open function in py3.
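(For what it's worth, py3's open() and io.TextIOWrapper do take an errors argument that does exactly this. A sketch, with an in-memory buffer standing in for a latin-1 file mistakenly read as utf-8:)

```python
import io

latin1_bytes = b"caf\xe9 latte"  # '\xe9' is latin-1 'e-acute', invalid as utf-8

# the default (errors="strict") raises UnicodeDecodeError
try:
    io.TextIOWrapper(io.BytesIO(latin1_bytes), encoding="utf-8").read()
except UnicodeDecodeError as err:
    print("strict mode barfs:", err.reason)

# errors="replace" swaps each undecodable byte for U+FFFD
text = io.TextIOWrapper(io.BytesIO(latin1_bytes), encoding="utf-8",
                        errors="replace").read()
print(repr(text))  # 'caf\ufffd latte'

# errors="ignore" just drops the bad bytes
text = io.TextIOWrapper(io.BytesIO(latin1_bytes), encoding="utf-8",
                        errors="ignore").read()
print(repr(text))  # 'caf latte'
```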
> The line terminator is always b'\n' for binary files;
Once you really make the distinction between text and binary, the concept of a "line terminator" doesn't really make sense anyway.

In the ansi world, everyone should always have used 'U' for text. It probably would have been the default if it had been there from the beginning. People got away without it because:

1) dos line feeds have a "\n" in them anyway
2) most of the time it doesn't matter that there is an extra whitespace character in there
3) darn few of us ever had to deal with the mac "\r"

Now that we are in a unicode world (at least a little), there is simply no way around the fact that you can't reliably read a file without knowing how it is encoded.

My thought at this point is to say that the numpy text file reading stuff only works on 1-byte, ansi encodings (and maybe only ascii), and be done with it. utf-8 would be OK for the ascii subset, but a latin-1 file with any bytes above 127 will make a utf-8 decode barf. Reading everything as latin-1, on the other hand, never fails -- you may not get the non-ascii symbols right, but that's better than barfing.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R           (206) 526-6959 voice
7600 Sand Point Way NE  (206) 526-6329 fax
Seattle, WA 98115       (206) 526-6317 main reception

Chris.Barker@noaa.gov