[Numpy-discussion] ANN: Numpy 1.6.0 beta 2

Christopher Barker Chris.Barker at noaa.gov
Tue Apr 5 19:12:20 EDT 2011


On 4/5/11 3:36 PM, josef.pktd at gmail.com wrote:
>> I disagree that U makes no sense for binary file reading.

I wasn't saying that it makes no sense to have a "U" mode for binary file 
reading; what I meant is that by the Python 2 definition, it made no 
sense. In Python 2, the ONLY difference between binary and text mode is 
line-feed translation.

As for Python 3:

>> In python 3:
>>
>> 'b' means, "return byte objects"
>> 't' means "return decoded strings"
>>
>> 'U' means two things:
>>
>> 1) When iterating by line, split lines at any of '\r', '\r\n', '\n'
>> 2) When returning lines split this way, convert '\r' and '\r\n' to '\n'

a) 'U' is default -- it's essentially the same as 't' (in PY3), so 't' 
means "return decoded and line-feed translated unicode objects"

b) I think the line-feed conversion is done regardless of whether you 
are iterating by lines, i.e. also with a full-on .read(). At least that's 
how it works in py2 -- I'm not running py3 here to test.
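
A quick way to check point (b) under Python 3 is sketched below. It assumes 
the default newline=None behaviour of the built-in open(); the sample bytes 
and variable names are made up for illustration.

```python
import os
import tempfile

# Raw bytes with all three line-ending styles: DOS, old Mac, Unix.
raw = b"a\r\nb\rc\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Text mode, default newline=None (universal newlines), one full read().
    with open(path, "r", encoding="ascii") as f:
        whole = f.read()
    # Text mode again, this time iterating by line.
    with open(path, "r", encoding="ascii") as f:
        lines = list(f)
    # Binary mode: the bytes come back untouched.
    with open(path, "rb") as f:
        untouched = f.read()
finally:
    os.remove(path)

print(repr(whole))
print(lines)
print(untouched)
```

If this behaves as expected, the translation happens in .read() as well as 
during line iteration, and binary mode does no translation at all.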

>> If you support returning lines from a binary file (which python 3
>> does), then I think 'U' is a sensible thing to allow - as in this
>> case.

but what is a "binary file"?

I THINK what you are proposing is that we'd want to be able to have both 
linefeed translation and no decoding done. But I think that's impossible 
-- aren't the linefeeds themselves encoded differently with different 
encodings?
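
To make the question concrete, here is a small sketch that prints how a 
newline is encoded under a few encodings, and what byte-level line 
splitting does to UTF-16 data. The list of encodings is just an 
illustrative selection.

```python
import io

# The same newline character under several encodings.
encodings = ("ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le")
encoded = {enc: "\n".encode(enc) for enc in encodings}
for enc, data in encoded.items():
    print(enc, data)

# Binary-mode line splitting only looks for the single byte 0x0A,
# so it cuts UTF-16 text in the middle of a two-byte code unit.
utf16_bytes = "ab\ncd\n".encode("utf-16-le")
byte_lines = io.BytesIO(utf16_bytes).readlines()
print(byte_lines)
```

For the ASCII-compatible encodings the newline is the same single byte, 
so splitting on b'\n' without decoding works; for UTF-16 and UTF-32 it 
does not, because each chunk after the first starts with a stray zero byte.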

> U looks appropriate in this case, better than the workarounds.
> However, to me the python 3.2 docs seem to say that U only works for
> text mode

Agreed -- but I don't see the problem -- your files are either encoded 
in something that encodes newlines differently (UTF-32, maybe?), in 
which case you'd want it decoded, or you are working with ascii or ansi 
or utf-8, in which case you can specify the encoding anyway.

I don't understand why we'd want a binary blob for text parsing -- the 
parsing code is going to have to know something about the encoding to 
work -- it might as well get passed in to the file open call, and work 
with unicode. I suppose if we still want to assume ascii for parsing, 
then we could use 't' and then re-encode to ascii to work with it. Which 
I agree does seem heavy handed just for fixing newlines.

Also, one problem I've often had with encodings is what happens if I 
think I have ascii, but really have a couple characters above 127 -- 
then the default is to get an error in decoding. I'd like to be able to 
pass in a flag that either skips the un-decodable characters or replaces 
them with something, but it doesn't look like you can do that with the 
file open function in py3.
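
For reference, the Python 3 built-in open() does accept an errors argument 
in text mode, which covers both the "skip" and the "replace" cases. A 
minimal sketch, with made-up sample data:

```python
import os
import tempfile

# Mostly ASCII, with one byte above 127 (0xE9).
raw = b"caf\xe9 1.5\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Replace undecodable bytes with U+FFFD.
    with open(path, "r", encoding="ascii", errors="replace") as f:
        replaced = f.read()
    # Silently drop undecodable bytes.
    with open(path, "r", encoding="ascii", errors="ignore") as f:
        skipped = f.read()
    # Default (strict) handling raises on the bad byte.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True
finally:
    os.remove(path)

print(repr(replaced), repr(skipped), strict_failed)
```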

> The line terminator is always b'\n' for binary files;

Once you really make the distinction between text and binary, the concept 
of a "line terminator" doesn't really make sense anyway.

In the ansi world, everyone should always have used 'U' for text. It 
probably would have been the default if it had been there from the 
beginning. People got away without it because:
  1) dos line feeds have a "\n" in them anyway
  2) most of the time it doesn't matter that there is an extra 
whitespace character in there
  3) darn few of us ever had to deal with the mac "\r"
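
The three points above can be illustrated with a short sketch. The sample 
data is made up, and splitting on "\n" stands in for what a reader without 
'U' mode effectively does:

```python
# The same two rows of numbers with DOS and old-Mac line endings.
dos = "1.0 2.0\r\n3.0 4.0\r\n"
mac = "1.0 2.0\r3.0 4.0\r"

# 1) DOS lines still split on "\n"; each line keeps a trailing "\r".
dos_lines = [ln for ln in dos.split("\n") if ln]

# 2) The stray "\r" is whitespace, so split() and float() ignore it.
dos_values = [[float(tok) for tok in ln.split()] for ln in dos_lines]
stray = float("2.0\r")

# 3) Mac-style "\r" endings are not seen at all: everything is one "line".
mac_lines = [ln for ln in mac.split("\n") if ln]

print(dos_lines)
print(dos_values, stray)
print(len(mac_lines))
```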

Now that we are in a unicode world (at least a little) there is simply 
no way around the fact that you can't reliably read a file without 
knowing how it is encoded.

My thought at this point is to say that the numpy text file reading 
stuff only works on 1-byte, ansi encodings (and maybe only ascii), and be 
done with it. utf-8 might be OK -- I don't know if there are any valid 
files in, say, latin-1 that utf-8 will barf on -- you may not get the 
non-ascii symbols right, but that's better than barfing.
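
On the latin-1 versus utf-8 question, a quick check (a sketch with a 
made-up sample string) suggests that valid latin-1 data can indeed fail to 
decode as utf-8, while latin-1 itself accepts every byte value:

```python
# "cafe" with an accented e, stored as latin-1: the last byte is 0xE9.
raw = "caf\u00e9".encode("latin-1")

# Strict utf-8 decoding rejects the lone 0xE9 byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as latin-1 recovers the text; utf-8 with errors="replace"
# does not fail, but loses the accented character.
as_latin1 = raw.decode("latin-1")
lossy = raw.decode("utf-8", errors="replace")

# latin-1 maps all 256 byte values, so decoding with it never fails.
every_byte = bytes(range(256)).decode("latin-1")

print(utf8_ok, repr(as_latin1), repr(lossy), len(every_byte))
```

So if "never barf" is the goal for 1-byte data, decoding as latin-1 (or 
utf-8 with a non-strict error handler) looks like the safer assumption.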

-Chris




-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov


