[Numpy-discussion] ANN: Numpy 1.6.0 beta 2
Christopher Barker
Chris.Barker at noaa.gov
Wed Apr 6 11:46:39 EDT 2011
Sorry to keep harping on this, but for history's sake, I was one of the
folks who got 'U' introduced in the first place. I was dealing with a
nightmare of unix, mac and dos test files, and 'U' was a godsend.
On 4/5/11 4:51 PM, Matthew Brett wrote:
> The difference between 'rt' and 'U' is (this is for my own benefit):
>
> For 'rt', a '\r' does not cause a line break - with 'U' - it does.
Perhaps semantics, but what 'U' does is actually change any of the line
breaks to '\n' -- any line breaking happens after the fact. In Py2, the
difference between 'U' and 't' is that 't' assumes that any file read
uses the native line endings -- a bad idea, IMHO. Back in the day, Guido
argued that text file line ending conversion was the job of file
transfer tools. The reality, however, is that users don't always use
file transfer tools correctly, nor even understand the implications of
line endings.
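To make the translation concrete, here is a minimal sketch for Python 3,
where the 'newline' argument plays the role that 'U' played in Py2. The
sample bytes in 'raw' and the helper 'read_text' are made up for
illustration; an in-memory buffer stands in for a real file.

```python
import io

# Hypothetical sample mixing unix (\n), dos (\r\n) and old-mac (\r) endings.
raw = b"unix\ndos\r\nmac\rend"


def read_text(data, newline):
    """Decode bytes as ascii text using the given newline handling."""
    wrapper = io.TextIOWrapper(io.BytesIO(data), encoding="ascii", newline=newline)
    return wrapper.read()


# newline=None (the default): every line ending is translated to '\n' on read.
translated = read_text(raw, None)

# newline="": the endings are left exactly as they are in the file.
untranslated = read_text(raw, "")

# Any line breaking happens after the fact, on the translated text.
lines = translated.split("\n")
print(lines)
```

With newline=None the caller only ever sees '\n', whatever platform wrote
the file.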
All that being said, mac-style files are pretty rare these days.
(though I bet I've got a few still kicking around)
> Right - my argument is that the behavior implied by 'U' and 't' is
> conceptually separable. 'U' is for how to do line-breaks, and
> line-termination translations, 't' is for whether to decode the text
> or not. In python 3.
but 't' and 'U' are the same in python 3 -- there is no distinction. It
seems you are arguing that there could/should be a way to translate line
termination without decoding the text, but ...
> In python 3 a binary file is a file which is not decoded, and returns
> bytes. It still has a concept of a 'line', as defined by line
> terminators - you can iterate over one, or do .readlines().
I'll take your word for it that it does, but that's not really a binary
file then, it's a file that you are assuming is encoded in an
ascii-compatible way.
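For what it's worth, a small sketch of what iterating over a binary file
does in Python 3, assuming an in-memory buffer in place of a real file (the
contents of 'raw' are invented for the example):

```python
import io

# Hypothetical sample mixing unix (\n), dos (\r\n) and old-mac (\r) endings.
raw = b"unix\ndos\r\nmac\rend"

# Iterating a binary stream splits on the single byte b"\n" and nothing else.
binary_lines = list(io.BytesIO(raw))
print(binary_lines)
```

The lone '\r' does not end a line here, and the '\r' of the dos ending stays
attached to its line, so the notion of a "line" is tied to the byte 0x0A.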
While I know that "practicality beats purity", we really should be
opening the file as a text file (it is text, after all), and specifying
utf-8 or latin-1 or something as the encoding.
However, IIUC, the issue here is that later on down the line, numpy uses
regular old C code, which expects ascii strings. In that case, we could
encode the text as ascii, into a bytes object.
That's a lot of overhead for line ending translation, so probably not
worth it. But if nothing else, we should be clear in the docs that numpy's
text file reading code expects ascii-compatible data.
(and it would be nice to get the line-ending translation)
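A rough sketch of that idea, assuming Python 3: decode with an explicit
encoding, let text mode translate the line endings, then hand ascii bytes to
whatever comes next. The function name 'normalized_ascii_lines' and the
sample data are hypothetical, and this is not what numpy actually does.

```python
import io


def normalized_ascii_lines(data, encoding="utf-8"):
    """Decode `data`, translate line endings to '\n', re-encode each line as ascii."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline=None)
    # .encode("ascii") raises UnicodeEncodeError on non-ascii characters,
    # which makes the ascii-only expectation explicit instead of silent.
    return [line.encode("ascii") for line in text]


# Hypothetical utf-16 input with dos, old-mac and unix line endings.
raw = "1.0, 2.0\r\n3.0, 4.0\r5.0, 6.0\n".encode("utf-16")
rows = normalized_ascii_lines(raw, encoding="utf-16")
print(rows)
```

The cost is a full decode and re-encode of the file, which is the overhead
mentioned above.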
> Right - so obviously if you open a utf-16 file as binary, terrible
> things may happen - this was what Pauli was pointing out before. His
> point was that utf-8 is the standard,
but it's not the standard -- it's a common use, but not a standard --
ideally numpy wouldn't enforce any particular encoding (though it could
default to one, and utf-8 would be a good choice for that)
>> Once you really make the distinction between text and binary, the concept
>> of a "line terminator" doesn't really make sense anyway.
>
> Well - I was arguing that, given we can iterate over lines in binary
> files, then there must be the concept of what a line is, in a binary
> file, and that means that we need the concept of a line terminator.
maybe, but that concept is built on an assumption that your file is
ascii-compatible (for \n anyway), and you know what they say about
assumptions...
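A quick illustration of where the assumption breaks, assuming Python 3 and a
made-up two-line utf-16 string (the helper 'decodes_cleanly' is just for this
example):

```python
import io

# In utf-16-le, '\n' is the two bytes 0x0A 0x00, not the single byte 0x0A.
data = "a\nb".encode("utf-16-le")

# Binary iteration splits after the 0x0A byte, in the middle of a code unit.
chunks = list(io.BytesIO(data))


def decodes_cleanly(chunk, encoding):
    """Return True if `chunk` is valid on its own in `encoding`."""
    try:
        chunk.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Neither chunk is valid utf-16-le by itself any more.
results = [decodes_cleanly(chunk, "utf-16-le") for chunk in chunks]

# Decoding first and splitting afterwards gives the intended lines.
text_lines = data.decode("utf-16-le").splitlines()
print(chunks, results, text_lines)
```

Splitting on b'\n' only works when the encoding maps '\n' to that one byte
and never uses that byte for anything else, as ascii, latin-1 and utf-8 do.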
> I realize this is a discussion that would have to happen on the
> python-dev list...
I'm not sure -- I was thinking that python missed something here, but I
don't think so anymore. In the unicode world, there is no choice but to
be explicit about encodings, and if you do that, then python's "text or
binary" distinction makes sense. .readline() for a binary file doesn't,
but so be it.
Honestly, I've never been sure in this discussion what code actually
needs fixing, so I'm done now -- we've talked enough that the issues
MUST have been covered by now!
-Chris