[Numpy-discussion] using loadtxt to load a text file in to a numpy array
Charles R Harris
charlesr.harris at gmail.com
Sat Jan 25 11:33:40 EST 2014
On Thu, Jan 23, 2014 at 11:49 AM, Chris Barker <chris.barker at noaa.gov> wrote:
> Thanks for poking into this all. I've lost track a bit, but I think:
> The 'S' type is clearly broken on py3 (at least). I think that gives us
> room to change it, and backward compatibility is less of an issue because it's
> broken already -- do we need to preserve bug-for-bug compatibility? Maybe,
> but I suspect in this case, not -- the code that "works fine" on py3 with
> the 'S' type is probably only lucky that it hasn't encountered the issues
> And no matter how you slice it, code being ported to py3 needs to deal
> with text handling issues.
> But here is where we stand:
> The 'S' dtype:
> - was designed for one-byte-per-char text data.
> - was mapped to the py2 string type.
> - used the classic C null-terminated approach.
> - can be used for arbitrary bytes (as the py2 string type can), but not
> quite, as it truncates null bytes -- so it really is a bad idea to use it that way.
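
The null-byte caveat in the list above can be seen directly. This is only a minimal sketch, assuming a reasonably recent NumPy; the variable names are illustrative and not from the thread:

```python
import numpy as np

# A binary payload that happens to end in null bytes.
payload = b"ab\x00\x00"

# Stored as 'S4', the trailing nulls are treated as padding and are
# stripped again when the element is read back.
as_s = np.array([payload], dtype="S4")
read_back = as_s[0]

# uint8 keeps every byte, so it is a safer container for raw binary data.
as_u8 = np.frombuffer(payload, dtype=np.uint8)
round_trip = as_u8.tobytes()

print(len(payload), len(read_back), len(round_trip))
```

The 'S' element comes back shorter than what went in, while the uint8 view round-trips unchanged.
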
> Under py3:
> The 'S' type maps to the py3 bytes type, because that's the closest to
> the py2 string type. But it also does some inconsistent things with
> encoding, and does treat a lot of other things as text. But the py3 bytes
> type does not have the same text handling as the py2 string type, so things
> s = 'a string'
> np.array((s,), dtype='S') == s
> Gives you False, rather than True as on py2. This is because a py3 string is
> translated to the 'S' type (presumably with the default encoding, another
> maybe-not-a-good-idea), but returns a bytes object, which does not compare
> true to a py3 string. You can work around this with various calls to
> encode() and decode(), and/or using b'a string', but that is ugly, kludgy,
> and doesn't work well with the py3 text model.
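
For reference, the comparison described above and the workarounds it mentions might look like the sketch below. It assumes Python 3 and a recent NumPy; the exact return value of the mismatched comparison (a scalar False with a warning, or an array of False) has varied between NumPy releases, so the sketch only checks that nothing compares equal.

```python
import warnings

import numpy as np

s = "a string"
arr = np.array((s,), dtype="S")  # the str is encoded to bytes on the way in

# Comparing the bytes-backed array with a py3 str never matches.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # older releases emit a FutureWarning here
    mismatch = bool(np.any(arr == s))

# Workaround 1: compare against a bytes literal instead.
match_bytes = bool(np.all(arr == b"a string"))

# Workaround 2: decode the element explicitly before comparing.
decoded = arr[0].decode("ascii")

# The 'U' dtype avoids the problem, because its elements are py3 strings.
arr_u = np.array((s,), dtype="U")
match_text = bool(np.all(arr_u == s))

print(mismatch, match_bytes, decoded == s, match_text)
```
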
> The py2 => py3 transition separated bytes and strings: strings are
> unicode, and bytes are not to be used for text (directly). While there is
> some text-related functionality still in bytes, the core devs are quite
> clear that that is for special cases only, and not for general text
> I don't think numpy should fight this, but rather embrace the py3 text
> model. The most natural way to do that is to use the existing 'U' dtype for
> text. Really the best solution for most cases. (Like the above case)
> However, there is a use case for a more efficient way to deal with text.
> There are a couple ways to go about that that have been brought up here:
> 1: have a more efficient unicode dtype: variable length,
> multiple encoding options, etc....
> - This is a fine idea that would support better text handling in
> numpy, and _maybe_ better interaction with external libraries (HDF, etc...)
> 2: Have a one-byte-per-char text dtype:
> - This would be much easier to implement and fit into the current numpy
> model, and satisfy a lot of common use cases for scientific data sets.
> We could certainly do both, but I'd like to see (2) get done sooner than later.
This is pretty much my sense of things at the moment. I think 1) is needed
in the long term but that 2) is a quick fix that solves most problems in
the short term.
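
To put a rough number on the efficiency argument for (2): the existing 'U' dtype stores four bytes per character, while a one-byte-per-char dtype such as the current 'S' stores one. A small sketch, with made-up sample data:

```python
import numpy as np

# Hypothetical sample data; any short ASCII labels would do.
words = ["alpha", "beta", "gamma"]

as_text = np.array(words, dtype="U5")
as_bytes = np.array([w.encode("ascii") for w in words], dtype="S5")

# Per-element and total storage for the two representations.
print(as_text.itemsize, as_bytes.itemsize)
print(as_text.nbytes, as_bytes.nbytes)
```

For ASCII-only data sets the 'U' array is four times larger than the one-byte representation.
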
> A related issue is whether numpy needs a dtype analogous to py3 bytes --
> I'm still not sure of the use-case there, so can't comment -- would it need
> to be fixed length (fitting into the numpy data model better) or variable
> length, or ??? Some folks are (apparently) using the current 'S' type in
> this way, but I think that's ripe for errors, due to the null bytes issue.
> Though maybe there is a null-bytes-are-special binary format that isn't
> text -- I have no idea.
> So what do we do with 'S'? It really is pretty broken, so we have a
> couple choices:
> (1) deprecate it, so that it stays around for backward compatibility
> but encourage people to either use 'U' for text, or one of the new dtypes
> that are yet to be implemented (maybe 's' for a one-byte-per-char dtype),
> and use either uint8 or the new bytes dtype that is yet to be implemented.
> (2) fix it -- in this case, I think we need to be clear what it is:
> -- A one-byte-char-text type? If so, it should map to a py3 string,
> and have a defined encoding (ascii or latin-1, probably), or even better a
> settable encoding (but only for one-byte-per-char encodings -- I don't
> think utf-8 is a good idea here, as a utf-8 encoded string is of unknown
> length). (There is some room for debate here, as the 'S' type is fixed
> length and truncates anyway; maybe it's fine for it to truncate utf-8 -- as
> long as it doesn't partially truncate in the middle of a character.)
I think we should make it a one character encoded type compatible with str
in python 2, and maybe latin-1 in python 3. I'm thinking latin-1 because of
pep 393 where it is effectively a UCS-1, but ascii might be a bit more
flexible because it is a subset of utf-8 and might serve better in python 2.
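
The trade-offs between these encodings can be checked with plain Python. This is a sketch for illustration only; the sample string is an arbitrary choice:

```python
# "café", written with an escape so the source stays pure ASCII.
text = "caf\u00e9"

latin1 = text.encode("latin-1")
utf8 = text.encode("utf-8")

# latin-1 is one byte per character; utf-8 needs two bytes for the accented e.
print(len(text), len(latin1), len(utf8))

# ASCII text has the same bytes under ascii and utf-8.
ascii_is_utf8_subset = "cafe".encode("ascii") == "cafe".encode("utf-8")

# Cutting utf-8 at a fixed byte count can split a character in half,
# which leaves bytes that no longer decode.
try:
    utf8[:4].decode("utf-8")
    truncation_safe = True
except UnicodeDecodeError:
    truncation_safe = False

print(ascii_is_utf8_subset, truncation_safe)
```

A fixed-width field sized in bytes therefore maps cleanly onto latin-1 or ascii, but not onto utf-8 unless truncation is made character-aware.
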
> -- a bytes type? In which case, we should clean out all the
> automatic conversion to and from text that is in it now.
I'm not sure what to do about a bytes type.
> I vote for it being our one-byte text type -- it almost is already, and it
> would make the easiest transition for folks from py2 to py3. But backward
> compatibility is backward compatibility.
Not sure what to do here. It would be nice if S was a string type of given
encoding. Might be worth an experiment to see how much breaks.
>>> numpy arrays need a decode and encode method
>> I'm not sure that they do. Rather there needs to be a text dtype that
>> knows what encoding to use in order to have a binary interface as
>> exposed by .tostring() and friends but produce unicode strings
>> when indexed from Python code. Having both a text and a binary
>> interface to the same data implies having an encoding.
> I agree with Oscar here -- let's not conflate encoded and decoded data --
> the py3 text model is a fine one, we should work with it as much
> as practical.
> UNLESS: if we do add a bytes dtype, then it would be a reasonable use case
> to use it to store encoded text (just like the py3 bytes types), in which
> case it would be good to have encode() and decode() methods or ufuncs --
> probably ufuncs. But that should be for special-purpose,
> at-the-I/O-interface kind of stuff.
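
For what it's worth, element-wise conversion at the I/O boundary can already be approximated with the existing np.char.decode and np.char.encode functions (ordinary element-wise functions, not ufuncs). A minimal sketch, assuming the stored bytes are latin-1; the sample values are made up:

```python
import numpy as np

# Encoded text as it might arrive from a file: latin-1 bytes in an 'S' array.
raw = np.array([b"caf\xe9", b"na\xefve"], dtype="S5")

# Decode once at the boundary and work with py3 strings from there on.
text = np.char.decode(raw, "latin-1")

# Encode again only when writing back out.
back = np.char.encode(text, "latin-1")

print(text.dtype.kind, back.dtype.kind)
```

Keeping the decode and encode steps at the edges leaves the rest of the code working on a 'U' array, in line with the py3 text model.
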