[Numpy-discussion] Text array dtype for numpy

Chris Barker chris.barker at noaa.gov
Fri Jan 24 17:43:46 EST 2014


Oscar,

Cool stuff, thanks!

I'm wondering, though, what the use case really is. The py3 text model
(actually the py2 one, too) is quite clear that you want users to think
of, and work with, text as text -- and not care how things are encoded in
the underlying implementation. You only want the user to think about
encodings on I/O -- transferring stuff between systems where you can't
avoid it. And you might choose different encodings based on different needs.

So why have a different, the-user-needs-to-think-about-encodings numpy
dtype? We already have 'U' for full-on unicode support for text. There is
a good argument for a more compact internal representation for text
compatible with one-byte-per-char encoding, thus the suggestion for such a
dtype. But I don't see the need for quite this. Maybe I'm not being a
creative enough thinker.
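
To put rough numbers on the compactness argument, here is a small sketch in
plain Python (no numpy needed). It relies on the fact that numpy's 'U' dtype
stores 4 bytes per character, and uses latin-1 as a stand-in for "some
one-byte-per-char encoding"; the sample string is just an illustration.

```python
# Compare storage for 4-bytes-per-char text vs. a one-byte-per-char encoding.
text = "temperature"

# UTF-32 without a BOM uses 4 bytes per code point, the same width per
# character that numpy's 'U' dtype uses.
ucs4 = text.encode("utf-32-le")

# latin-1 is one example of a one-byte-per-char encoding.
one_byte = text.encode("latin-1")

print(len(text), "characters")
print(len(ucs4), "bytes at 4 bytes per char")
print(len(one_byte), "bytes at 1 byte per char")
```

So for mostly-ASCII data the one-byte representation is a 4x saving, which is
the whole case for such a dtype.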

Also, we may want numpy to interact at a low level with other libs that
might have binary encoded text (HDF, etc) -- in which case we need a bytes
dtype that can store that data, and perhaps encoding and decoding ufuncs.

If we want a more efficient and compact unicode implementation, then the
py3 one is a good place to start -- it's pretty slick! Though it may be harder
to do in numpy, as text in numpy probably wouldn't be immutable.
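
For anyone who hasn't looked at it, the py3 approach (CPython 3.3 and later)
picks 1, 2, or 4 bytes per character for each string based on the widest
character in it. A quick way to see this, assuming CPython (sys.getsizeof is
implementation-specific, and bytes_per_char is just a helper name made up for
this sketch):

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two strings of
    different lengths, so the fixed object header cancels out."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

print(bytes_per_char("a"))            # ASCII
print(bytes_per_char("\u20ac"))       # euro sign, outside latin-1
print(bytes_per_char("\U0001F600"))   # outside the Basic Multilingual Plane
```

That choice is made once when the string is created, which works because py3
strings are immutable. A mutable array element could need to be widened on
assignment, which is the part that looks harder in numpy.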

> To make a slightly more concrete proposal, I've implemented a pure
> Python ndarray subclass that I believe can consistently handle
> text/bytes in Python 3.


this scares me right there -- is it text or bytes??? We really don't want
something that is both.


> The idea is that the array has an encoding. It stores strings as
> bytes. The bytes are encoded/decoded on insertion/access. Methods
> accessing the binary content of the array will see the encoded bytes.
> Methods accessing the elements of the array will see unicode strings.
>
> I believe it would not be as hard to implement as the proposals for
> variable length string arrays.


except that with some encodings, the number of bytes required is a function
of what the content of the text is -- so it either has to be variable
length, or a fixed number of bytes, which is not a fixed number
of characters. The latter requires careful truncation (a pain) and gives
surprising results for users: "Why can't I fit 10 characters in a length-10
text object? And why can I if they are different characters?"
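
A minimal sketch of the problem in plain Python, assuming UTF-8 as the
encoding and a 10-byte field (truncate_utf8 is a made-up helper for
illustration, not anything from the proposed subclass):

```python
def truncate_utf8(s, width):
    """Fit s into at most `width` bytes of UTF-8 without leaving a
    partial character at the end."""
    raw = s.encode("utf-8")[:width]
    # The input was valid UTF-8, so the only possible damage is an
    # incomplete character at the cut; errors="ignore" drops it.
    return raw.decode("utf-8", errors="ignore")

ascii_text = "abcdefghij"            # 10 characters, 10 bytes in UTF-8
accented = "a" + "\u00e9" * 9        # 10 characters, 19 bytes in UTF-8

print(len(ascii_text.encode("utf-8")), len(accented.encode("utf-8")))
print(len(truncate_utf8(ascii_text, 10)))   # all characters survive
print(len(truncate_utf8(accented, 10)))     # only some of them fit
```

Both strings are 10 characters long, but only one of them fits in the 10-byte
field, and a naive byte slice of the other would cut a character in half.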


> The one caveat is that it will strip
> null characters from the end of any string.


which is fatal, but you do want a new dtype after all, which presumably
wouldn't do that.
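
To spell out why the stripping happens: if values are stored in fixed-width
fields padded with nulls, then on the way out padding and real trailing nulls
look the same. A sketch in plain Python (store and load are made-up names that
mimic null-padded storage; this is not numpy's actual code, though the
existing 'S' dtype shows the same symptom):

```python
WIDTH = 5  # fixed field width in bytes, chosen arbitrarily for the example

def store(value):
    """Pad a bytes value with nulls up to the fixed field width."""
    if len(value) > WIDTH:
        raise ValueError("value does not fit in the field")
    return value.ljust(WIDTH, b"\x00")

def load(field):
    """Recover the value by stripping the null padding."""
    return field.rstrip(b"\x00")

print(load(store(b"abc")))        # round-trips fine
print(load(store(b"abc\x00")))    # the real trailing null is lost
```

A new dtype could avoid this by keeping a length per element rather than
inferring it from the padding.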

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov