
Sorry to keep harping on this, but for history's sake: I was one of the folks that got 'U' introduced in the first place. I was dealing with a nightmare of unix, mac, and dos text files, and 'U' was a godsend. On 4/5/11 4:51 PM, Matthew Brett wrote:
The difference between 'rt' and 'U' is (this is for my own benefit):
For 'rt', a '\r' does not cause a line break; with 'U', it does.
Perhaps semantics, but what 'U' actually does is change any of the line endings to '\n' -- any line breaking happens after the fact. In Py2, the difference between 'U' and 't' is that 't' assumes that any file read uses the native line endings -- a bad idea, IMHO. Back in the day, Guido argued that text-file line-ending conversion was the job of file transfer tools. The reality, however, is that users don't always use file transfer tools correctly, nor even understand the implications of line endings. All that being said, mac-style files are pretty rare these days (though I bet I've got a few still kicking around).
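For the record, Python 3 kept this translation behavior under the `newline` parameter of `open()`; a minimal sketch (the file contents here are made up for illustration):

```python
import os
import tempfile

# Python 3's open(..., newline=None) -- the default for text mode --
# is the old 'U' behavior: '\r', '\r\n', and '\n' all become '\n'.
fd, path = tempfile.mkstemp()
os.write(fd, b"mac\rdos\r\nunix\n")
os.close(fd)

with open(path, "r", newline=None) as f:   # universal newlines (the default)
    translated = f.read()

with open(path, "r", newline="") as f:     # translation turned off
    untranslated = f.read()

os.remove(path)
print(translated)    # 'mac\ndos\nunix\n'
print(untranslated)  # 'mac\rdos\r\nunix\n'
```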
Right - my argument is that the behavior implied by 'U' and 't' is conceptually separable: 'U' is for how to do line breaks and line-termination translation; 't' is for whether to decode the text or not, at least in Python 3.
But 't' and 'U' are the same in Python 3 -- there is no distinction. It seems you are arguing that there could/should be a way to translate line termination without decoding the text, but ...
In Python 3, a binary file is a file which is not decoded and returns bytes. It still has a concept of a 'line', as defined by line terminators - you can iterate over one, or call .readlines().
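To illustrate (a minimal sketch, nothing numpy-specific): line iteration on a binary stream splits on the byte b'\n' only, so a lone mac-style '\r' does not end a line:

```python
import io

# Binary "lines" are defined purely by the byte b'\n'; a lone b'\r'
# (mac-style) does not terminate a line.
data = io.BytesIO(b"one\rtwo\r\nthree\n")
lines = data.readlines()
print(lines)  # [b'one\rtwo\r\n', b'three\n']
```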
I'll take your word for it that it does, but that's not really a binary file, then; it's a file that you are assuming is encoded in an ascii-compatible way. While I know that "practicality beats purity", we really should be opening the file as a text file (it is text, after all) and specifying utf-8 or latin-1 or something as the encoding. However, IIUC, the issue here is that later on down the line, numpy uses regular old C code, which expects ascii strings. In that case, we could encode the text as ascii, into a bytes object. That's a lot of overhead for line-ending translation, so probably not worth it. But if nothing else, we should be clear in the docs that numpy's text-file reading code expects ascii-compatible data (and it would be nice to get the line-ending translation).
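A sketch of that approach (the data here is made up, and this is just one way to do it): read the file as text with an explicit ascii-compatible encoding, let Python do the line-ending translation, then encode each line back to bytes for the C-level parser:

```python
import io

# Open as *text* with an explicit ascii-compatible encoding (latin-1
# here); universal newlines translates '\r\n' and '\r' to '\n'.  Each
# line is then encoded to ascii bytes for C code that expects ascii.
text = io.TextIOWrapper(io.BytesIO(b"1.0 2.0\r\n3.0 4.0\r"), encoding="latin-1")
rows = [line.encode("ascii") for line in text]
print(rows)  # [b'1.0 2.0\n', b'3.0 4.0\n']
```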
Right - so obviously if you open a utf-16 file as binary, terrible things may happen - this was what Pauli was pointing out before. His point was that utf-8 is the standard,
but it's not the standard -- it's a common use, but not a standard -- ideally numpy wouldn't enforce any particular encoding (though it could default to one, and utf-8 would be a good choice for that)
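To make the "terrible things" concrete (a minimal sketch): utf-16 encodes each ascii character as two bytes, one of them null, so byte-level line splitting cuts lines in the wrong places:

```python
# In utf-16-le, b'\n' arrives interleaved with null bytes, so naive
# byte-level line splitting mangles the data.
data = "a\nb\n".encode("utf-16-le")
print(data)  # b'a\x00\n\x00b\x00\n\x00'
lines = data.splitlines(keepends=True)
print(lines)  # each "line" carries a stray null byte
```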
Once you really make the distinction between text and binary, the concept of a "line terminator" doesn't really make sense anyway.
Well - I was arguing that, given we can iterate over lines in binary files, there must be a concept of what a line is in a binary file, and that means we need the concept of a line terminator.
Maybe, but that concept is built on an assumption that your file is ascii-compatible (for '\n', anyway), and you know what they say about assumptions...
I realize this is a discussion that would have to happen on the python-dev list...
I'm not sure -- I was thinking that Python missed something here, but I don't think so anymore. In the unicode world, there is no choice but to be explicit about encodings, and if you do that, then Python's text-or-binary distinction makes sense. .readline() for a binary file doesn't, but so be it. Honestly, I've never been sure in this discussion what code actually needs fixing, so I'm done now -- we've talked enough that the issues MUST have been covered by now! -Chris