[Numpy-discussion] ANN: Numpy 1.6.0 beta 2
Chris.Barker at noaa.gov
Tue Apr 5 19:12:20 EDT 2011
On 4/5/11 3:36 PM, josef.pktd at gmail.com wrote:
>> I disagree that U makes no sense for binary file reading.
I wasn't saying that it made no sense to have a "U" mode for binary file
reading, what I meant is that by the python2 definition, it made no
sense. In Python 2, the ONLY difference between binary and text mode is
line-feed translation.
As for Python 3:
>> In python 3:
>> 'b' means, "return byte objects"
>> 't' means "return decoded strings"
>> 'U' means two things:
>> 1) When iterating by line, split lines at any of '\r', '\r\n', '\n'
>> 2) When returning lines split this way, convert '\r' and '\r\n' to '\n'
a) 'U' is the default -- it's essentially the same as 't' (in py3), so 't'
means "return decoded and line-feed translated unicode objects"
b) I think the line-feed conversion is done regardless of whether you are
iterating by lines, i.e. even with a full-on .read(). At least that's how
it works in py2 -- not running py3 here to test.
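(A quick py3 check of point (b), using a throwaway temp file -- the file
contents are made up for illustration:)

```python
import os
import tempfile

# Write raw bytes containing all three newline conventions.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"one\r\ntwo\rthree\n")
    path = f.name

# Text mode ('t' is the default) decodes AND translates newlines,
# even on a full-on .read(), not only when iterating by line.
with open(path) as f:
    translated = f.read()   # 'one\ntwo\nthree\n'

# Binary mode hands the bytes back untouched.
with open(path, "rb") as f:
    raw = f.read()          # b'one\r\ntwo\rthree\n'

os.remove(path)
```

So yes, in py3 the translation happens on .read() as well, not just on
line iteration.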
>> If you support returning lines from a binary file (which python 3
>> does), then I think 'U' is a sensible thing to allow - as in this
but what is a "binary file"?
I THINK what you are proposing is that we'd want to be able to have both
linefeed translation and no decoding done. But I think that's impossible
-- aren't the linefeeds themselves encoded differently with different
encodings?
> U looks appropriate in this case, better than the workarounds.
> However, to me the python 3.2 docs seem to say that U only works for
> text mode
Agreed -- but I don't see the problem -- your files are either encoded
in something that might treat newlines differently (UCS32, maybe?), in
which case you'd want it decoded, or you are working with ascii or ansi
or utf-8, in which case you can specify the encoding anyway.
I don't understand why we'd want a binary blob for text parsing -- the
parsing code is going to have to know something about the encoding to
work -- it might as well get passed in to the file open call, and work
with unicode. I suppose if we still want to assume ascii for parsing,
then we could use 't' and then re-encode to ascii to work with it. Which
I agree does seem heavy handed just for fixing newlines.
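(The heavy-handed round trip would look something like this -- a sketch,
with made-up file contents, assuming the parsing code wants ascii bytes:)

```python
import os
import tempfile

# A dos-style text file, as raw bytes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"1, 2, 3\r\n4, 5, 6\r\n")
    path = f.name

# Open in text mode: decode AND fix the newlines in one step.
with open(path, encoding="ascii") as f:
    text = f.read()         # '1, 2, 3\n4, 5, 6\n'

# Re-encode to ascii so byte-oriented parsing code can use it;
# the newlines are plain b'\n' now.
data = text.encode("ascii")
lines = data.splitlines()   # [b'1, 2, 3', b'4, 5, 6']

os.remove(path)
```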
Also, one problem I've often had with encodings is what happens if I
think I have ascii, but really have a couple characters above 127 --
then the default is to get an error in decoding. I'd like to be able to
pass in a flag that either skips the un-decodable characters or replaces
them with something -- and it turns out you can: py3's open() takes an
errors argument ('ignore', 'replace', etc.) for exactly this.
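(It turns out py3's open() does accept an errors argument for this case.
A sketch, with a made-up file containing one byte above 127:)

```python
import tempfile

# 0xe9 is 'é' in latin-1, but is not a valid utf-8 sequence here.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"caf\xe9\n")
    path = f.name

# The default, errors='strict', raises on the stray byte:
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True    # True

# errors='replace' swaps bad bytes for U+FFFD, errors='ignore' drops them:
with open(path, encoding="utf-8", errors="replace") as f:
    replaced = f.read()     # 'caf\ufffd\n'
with open(path, encoding="utf-8", errors="ignore") as f:
    ignored = f.read()      # 'caf\n'
```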
> The line terminator is always b'\n' for binary files;
Once you really make the distinction between text and binary, the concept
of a "line terminator" doesn't really make sense anyway.
In the ansi world, everyone should always have used 'U' for text. It
probably would have been the default if it had been there from the
beginning. People got away without it because:
1) dos line feeds have a "\n" in them anyway
2) most of the time it doesn't matter that there is an extra
whitespace character in there
3) darn few of us ever had to deal with the mac "\r"
Now that we are in a unicode world (at least a little) there is simply
no way around the fact that you can't reliably read a file without
knowing how it is encoded.
My thought at this point is to say that the numpy text file reading
stuff only works on 1-byte, ansi encodings (and maybe only ascii), and be
done with it. utf-8 might be OK -- I don't know if there are any valid
files in, say, latin-1 that utf-8 will barf on -- you may not get the
non-ascii symbols right, but that's better than barfing.
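(For the record: there are such files -- any latin-1 byte above 127 is
invalid utf-8, while latin-1 itself can never barf, since it maps every
possible byte to a character:)

```python
# latin-1 assigns a character to every byte 0-255, so decoding with it
# always succeeds (though non-ascii symbols may come out wrong):
everything = bytes(range(256)).decode("latin-1")   # 256 characters

# utf-8, on the other hand, rejects a bare latin-1 high byte:
try:
    b"r\xe9sum\xe9".decode("utf-8")   # 'résumé' encoded as latin-1
    utf8_barfed = False
except UnicodeDecodeError:
    utf8_barfed = True                 # True
```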
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov