[Numpy-discussion] ANN: Numpy 1.6.0 beta 2

Wed Apr 6 11:46:39 EDT 2011

Sorry to keep harping on this, but for history's sake, I was one of the 
folks that got 'U' introduced in the first place. I was dealing with a 
nightmare of unix, mac and dos text files, and 'U' was a godsend.

On 4/5/11 4:51 PM, Matthew Brett wrote:
> The difference between 'rt' and 'U' is (this is for my own benefit):
>
> For 'rt', a '\r' does not cause a line break - with 'U' - it does.

Perhaps semantics, but what 'U' does is actually change any of the line 
breaks to '\n' -- any line breaking happens after the fact. In Py2, the 
difference between 'U' and 't' is that 't' assumes that any file read 
uses the native line endings -- a bad idea, IMHO. Back in the day, Guido 
argued that text file line ending conversion was the job of file 
transfer tools. The reality, however, is that users don't always use 
file transfer tools correctly, nor even understand the implications of 
line endings.
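
To make the translation concrete, here is a minimal sketch in Python 3 terms, where the `newline` argument does the job the old 'U' flag did. The sample bytes and the `read_text` helper are made up for illustration; they are not from any numpy code.

```python
import io

# Hypothetical sample mixing old-Mac (\r), DOS (\r\n) and Unix (\n) endings.
raw = b"mac\rdos\r\nunix\n"

def read_text(data, newline):
    """Decode bytes as ASCII text using the given newline handling."""
    wrapper = io.TextIOWrapper(io.BytesIO(data), encoding="ascii", newline=newline)
    return wrapper.read()

# newline=None: universal newlines, every ending is translated to '\n'.
translated = read_text(raw, None)

# newline='': all three endings are still recognised, but nothing is rewritten.
untranslated = read_text(raw, "")

print(repr(translated))
print(repr(untranslated))
```

Splitting into lines is a separate step that happens on the already-translated text, which is the "after the fact" part.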

All that being said, mac-style files are pretty rare these days 
(though I bet I've got a few still kicking around)

> Right - my argument is that the behavior implied by 'U' and 't' is
> conceptually separable.   'U' is for how to do line-breaks, and
> line-termination translations, 't' is for whether to decode the text
> or not.  In python 3.

but 't' and 'U' are the same in python 3 -- there is no distinction. It 
seems you are arguing that there could/should be a way to translate line 
termination without decoding the text, but ...

> In python 3 a binary file is a file which is not decoded, and returns
> bytes.  It still has a concept of a 'line', as defined by line
> terminators - you can iterate over one, or do .readlines().

I'll take your word for it that it does, but that's not really a binary 
file then, it's a file that you are assuming is encoded in an 
ascii-compatible way.
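
For reference, a small sketch of what line iteration over a binary stream looks like in Python 3, using an in-memory `io.BytesIO` as a stand-in for a file opened with 'rb' (the sample bytes are hypothetical):

```python
import io

# Hypothetical sample mixing old-Mac (\r), DOS (\r\n) and Unix (\n) endings.
raw = b"mac\rdos\r\nunix\n"

# In binary mode only the byte b'\n' ends a line: the lone '\r' does not
# split anything, and the '\r' of '\r\n' is left in the data.
binary_lines = io.BytesIO(raw).readlines()

print(binary_lines)
```

So the "line" here is defined purely by the 0x0A byte, with no line-ending translation at all.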

While I know that "practicality beats purity", we really should be 
opening the file as a text file (it is text, after all), and specifying 
utf-8 or latin-1 or something as the encoding.

However, IIUC, the issue here is that later on down the line, numpy uses 
regular old C code, which expects ascii strings. In that case, we could 
encode the text as ascii, into a bytes object.
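
A rough sketch of that idea, assuming Python 3 and a latin-1 input; the data and variable names are invented for illustration, and this is not how numpy's readers are actually written:

```python
import io

# Hypothetical input: latin-1 encoded numbers with DOS line endings.
raw = "1.0, 2.0\r\n3.0, 4.0\r\n".encode("latin-1")

# Open as text with an explicit encoding; newline=None also gives us the
# line-ending translation for free.
text_file = io.TextIOWrapper(io.BytesIO(raw), encoding="latin-1", newline=None)

# Re-encode each line as ASCII bytes for code that expects ASCII strings.
# Any non-ASCII character would raise UnicodeEncodeError at this point.
ascii_lines = [line.encode("ascii") for line in text_file]

print(ascii_lines)
```

The decode-then-encode round trip on every line is the overhead in question.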

That's a lot of overhead for line ending translation, so probably not 
worth it. But if nothing else, we should be clear in the docs that numpy 
text file reading code is expecting ascii-compatible data.

(and it would be nice to get the line-ending translation)

> Right - so obviously if you open a utf-16 file as binary, terrible
> things may happen - this was what Pauli was pointing out before.  His
> point was that utf-8 is the standard,

but it's not the standard -- it's a common use, but not a standard -- 
ideally numpy wouldn't enforce any particular encoding (though it could 
default to one, and utf-8 would be a good choice for that)

>> Once you really make the distinction between text and binary, the concept
>> of a "line terminator" doesn't really make sense anyway.
>
> Well - I was arguing that, given we can iterate over lines in binary
> files, then there must be the concept of what a line is, in a binary
> file, and that means that we need the concept of a line terminator.

maybe, but that concept is built on an assumption that your file is 
ascii-compatible (for \n anyway), and you know what they say about 
assumptions...
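
A quick illustration of where that assumption bites, as a sketch with made-up data: the same two-line text encoded as utf-16 and then split the way a binary file would split it.

```python
import io

# Two short lines, encoded as little-endian UTF-16 (no BOM).
utf16 = "a\nb\n".encode("utf-16-le")

# Binary line iteration splits on the single byte 0x0A, which lands in the
# middle of each two-byte '\n', so every "line" after the first is misaligned.
broken_lines = io.BytesIO(utf16).readlines()

print(broken_lines)
```

None of the pieces after the first decodes cleanly as utf-16 on its own.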

> I realize this is a discussion that would have to happen on the
> python-dev list...

I'm not sure -- I was thinking that python missed something here, but I 
don't think so anymore. In the unicode world, there is no choice but to 
be explicit about encodings, and if you do that, then python's "text or 
binary" distinction makes sense. .readline() for a binary file doesn't, 
but so be it.

Honestly, I've never been sure in this discussion what code actually 
needs fixing, so I'm done now -- we've talked enough that the issues 
MUST have been covered by now!

-Chris