[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Thu Jan 23 13:49:42 EST 2014

Thanks for poking into this all. I've lost track a bit, but I think:

The 'S' type is clearly broken on py3 (at least). I think that gives us
room to change it, and backward compatibility is less of an issue because
it's broken already -- do we need to preserve bug-for-bug compatibility?
Maybe, but I suspect in this case, not -- code that "works fine" on py3
with the 'S' type is probably only lucky that it hasn't encountered the
issues yet.

And no matter how you slice it, code being ported to py3 needs to deal with
text handling issues.

But here is where we stand:

The 'S' dtype:

 - was designed for one-byte-per-char text data.
 - was mapped to the py2 string type.
 - used the classic C null-terminated approach.
 - can be used for arbitrary bytes (as the py2 string type can), but not
quite, as it truncates null bytes -- so it really is a bad idea to use it
that way.
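
A minimal sketch of the null-byte problem, assuming a reasonably current
numpy on py3 (the variable names are just for illustration). The
fixed-width slot keeps every byte, but trailing nulls disappear as soon as
you index an element:

```python
import numpy as np

raw = b"ab\x00\x00"                  # four bytes, two of them trailing nulls
arr = np.array([raw], dtype="S4")    # fixed-width slot of four bytes

stored = arr.tobytes()               # the underlying buffer, nulls included
fetched = arr[0]                     # indexing strips the trailing nulls

print(len(raw), len(stored), len(fetched))
```

So the data is not lost in memory, but it does not round-trip through
normal element access.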

Under py3:
  The 'S' type maps to the py3 bytes type, because that's the closest to
the py2 string type. But it also does some inconsistent things with
encoding, and does treat a lot of other things as text. But the py3 bytes
type does not have the same text handling as the py2 string type, so things
like:

import numpy as np

s = 'a string'
np.array((s,), dtype='S')[0] == s

Gives you False, rather than True as on py2. This is because a py3 string
is translated to the 'S' type (presumably with the default encoding, which
is maybe not a good idea either), but indexing returns a bytes object, which
does not compare equal to a py3 string. You can work around this with
various calls to encode() and decode(), and/or by using b'a string', but
that is ugly, kludgy, and doesn't work well with the py3 text model.
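
For concreteness, here is roughly what those workarounds look like. This is
only a sketch, assuming the text is plain ASCII; the names are illustrative:

```python
import numpy as np

s = "a string"
arr = np.array((s,), dtype="S")     # the str is encoded to bytes on the way in

item = arr[0]                       # a bytes object comes back out
same_as_str = bool(item == s)       # bytes and str never compare equal on py3
same_as_bytes = bool(item == s.encode("ascii"))   # workaround 1: encode the str
same_after_decode = item.decode("ascii") == s     # workaround 2: decode the item

print(same_as_str, same_as_bytes, same_after_decode)
```

Every comparison site needs one of these conversions, which is the
kludginess I mean.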

The py2 => py3 transition separated bytes and strings: strings are unicode,
and bytes are not to be used for text (directly). While there is some
text-related functionality still in bytes, the core devs are quite clear
that that is for special cases only, and not for general text processing.

I don't think numpy should fight this, but rather embrace the py3 text
model. The most natural way to do that is to use the existing 'U' dtype for
text. That really is the best solution for most cases (like the one above).
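
The same comparison with the 'U' dtype, as a small sketch (names are
illustrative):

```python
import numpy as np

s = "a string"
arr = np.array((s,), dtype="U")   # stored as unicode text

item = arr[0]
round_trips = bool(item == s)     # compares equal to the original str
is_text = isinstance(item, str)   # the element is a str subclass

print(round_trips, is_text)
```

No encode() or decode() calls are needed anywhere.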

However, there is a use case for a more efficient way to deal with text.
There are a couple ways to go about that that have been brought up here:

1: have a more efficient unicode dtype: variable length,
multiple encoding options, etc....
    - This is a fine idea that would support better text handling in numpy,
and _maybe_ better interaction with external libraries (HDF, etc...)

2: Have a one-byte-per-char text dtype:
  - This would be much easier to implement and fit into the current numpy
model, and would satisfy a lot of common use cases for scientific data sets.

We could certainly do both, but I'd like to see (2) get done sooner than
later....
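
To put a number on "more efficient": the sketch below compares per-element
storage for the existing 'U' dtype against one-byte-per-char storage. The
latin-1 encoding here is my assumption for the illustration, not a decided
design, and the 'S' array just stands in for the proposed dtype:

```python
import numpy as np

text = "temperature"
as_unicode = np.array([text], dtype="U")
as_bytes = np.array([text.encode("latin-1")], dtype="S")

chars = len(text)
unicode_bytes_per_char = as_unicode.dtype.itemsize // chars  # fixed 4 bytes per char
one_byte_per_char = as_bytes.dtype.itemsize // chars         # 1 byte per char

print(unicode_bytes_per_char, one_byte_per_char)
```

For big arrays of short ASCII-ish labels, that factor of four is the whole
motivation for (2).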

A related issue is whether numpy needs a dtype analogous to py3 bytes --
I'm still not sure of the use-case there, so can't comment -- would it need
to be fixed length (fitting into the numpy data model better) or variable
length, or ??? Some folks are (apparently) using the current 'S' type in
this way, but I think that's ripe for errors, due to the null bytes issue.
Though maybe there is a null-bytes-are-special binary format that isn't
text -- I have no idea.

So what do we do with 'S'? It really is pretty broken, so we have a couple
choices:

 (1)  deprecate it, so that it stays around for backward compatibility
but encourage people to either use 'U' for text, or one of the new dtypes
that are yet to be implemented (maybe 's' for a one-byte-per-char dtype),
and use either uint8 or the new bytes dtype that is yet to be implemented.
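
As a sketch of the uint8 route for arbitrary binary data (the payload and
names are made up for the example):

```python
import numpy as np

payload = b"\x01\x00\x02\x00"                # binary data ending in a null byte

as_s = np.array([payload], dtype="S4")[0]    # 'S' drops the trailing null
as_u8 = np.frombuffer(payload, dtype=np.uint8)
restored = as_u8.tobytes()                   # uint8 gives back every byte

print(as_s, as_u8, restored)
```

uint8 has no notion of "one item is a whole byte string", though, which is
presumably where a real bytes dtype would come in.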

 (2) fix it -- in this case, I think we need to be clear what it is:
     -- A one-byte-char-text type? If so, it should map to a py3 string,
and have a defined encoding (ascii or latin-1, probably), or even better a
settable encoding (but only for one-byte-per-char encodings -- I don't
think utf-8 is a good idea here, as a utf-8 encoded string is of unknown
length). (There is some room for debate here: as the 'S' type is fixed
length and truncates anyway, maybe it's fine for it to truncate utf-8 -- as
long as it doesn't truncate in the middle of a character.)

   -- a bytes type? In which case, we should clean out all the
automatic conversion to and from text that is in it now.
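
Here is a small sketch of the utf-8 truncation hazard with today's 'S'
type. The word and the width are chosen only to force a cut inside a
two-byte character:

```python
import numpy as np

encoded = "na\u00efve".encode("utf-8")   # 6 bytes: the i-diaeresis takes two
arr = np.array([encoded], dtype="S3")    # fixed width cuts after 3 bytes

cut = arr[0]                             # ends with half of a character
try:
    cut.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

print(len(encoded), cut, decodes_cleanly)
```

With a one-byte-per-char encoding such as latin-1, any truncation point is
a character boundary, so this cannot happen.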

I vote for it being our one-byte text type -- it almost is already, and it
would make the easiest transition for folks from py2 to py3. But backward
compatibility is backward compatibility.

> numpy arrays need a decode and encode method
>
> I'm not sure that they do. Rather there needs to be a text dtype that
> knows what encoding to use in order to have a binary interface as
> exposed by .tostring() and friends but produce unicode strings
> when indexed from Python code. Having both a text and a binary
> interface to the same data implies having an encoding.

I agree with Oscar here -- let's not conflate encoded and decoded data --
the py3 text model is a fine one, and we should work with it as much
as practical.

UNLESS: if we do add a bytes dtype, then it would be a reasonable use case
to use it to store encoded text (just like the py3 bytes type), in which
case it would be good to have encode() and decode() methods or ufuncs --
probably ufuncs. But that should be for special-purpose, I/O-interface
kind of stuff.
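
For what it's worth, numpy already has elementwise np.char.encode() and
np.char.decode() functions (not true ufuncs, as far as I know) that
approximate this I/O-boundary usage. A sketch, with latin-1 picked
arbitrarily and 'S' standing in for the not-yet-existing bytes dtype:

```python
import numpy as np

names = np.array(["north", "s\u00fcd"], dtype="U")   # text inside the program

encoded = np.char.encode(names, "latin-1")    # text -> bytes, e.g. before writing
decoded = np.char.decode(encoded, "latin-1")  # bytes -> text, e.g. after reading

print(encoded.dtype, decoded.dtype)
```

Everything between the two calls stays in the py3 text model.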

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov