[Numpy-discussion] Text array dtype for numpy
chris.barker at noaa.gov
Fri Jan 24 17:43:46 EST 2014
Cool stuff, thanks!
I'm wondering, though, what the use case really is. The py3 text model
(actually the py2 one, too) is quite clear that you want users to think
of, and work with, text as text -- and not care how things are encoded in
the underlying implementation. You only want the user to think about
encodings on I/O -- transferring stuff between systems where you can't
avoid it. And you might choose different encodings based on different needs.
So why have a different, the-user-needs-to-think-about-encodings numpy
dtype? We already have 'U' for full-on unicode support for text. There is
a good argument for a more compact internal representation for text
compatible with one-byte-per-char encoding, thus the suggestion for such a
dtype. But I don't see the need for quite this. Maybe I'm not being a
creative enough thinker.
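
To put a number on the compactness argument, here is a minimal sketch, assuming a reasonably current NumPy, comparing the per-element storage of the existing 'U' and 'S' dtypes. The variable names are only for illustration.

```python
import numpy as np

# 'U10' holds up to 10 code points, stored as 4 bytes each (UCS-4).
u = np.array(["hello"], dtype="U10")
# 'S10' holds up to 10 raw bytes, one byte per slot.
s = np.array([b"hello"], dtype="S10")

u_itemsize = u.dtype.itemsize
s_itemsize = s.dtype.itemsize
print(u_itemsize, s_itemsize)
```

So the same five-letter ASCII word costs four times as much storage per element in 'U' as in 'S', which is the motivation for a one-byte-per-char text dtype.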
Also, we may want numpy to interact at a low level with other libs that
might have binary encoded text (HDF, etc) -- in which case we need a bytes
dtype that can store that data, and perhaps encoding and decoding ufuncs.
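
As a rough sketch of what that could look like with what numpy already has: `np.char.decode` and `np.char.encode` apply an encoding element-wise (they are not true ufuncs). The sample byte strings below are made up, standing in for data read from some external library.

```python
import numpy as np

# UTF-8 encoded bytes, as they might arrive from a binary file format.
raw = np.array([b"caf\xc3\xa9", b"na\xc3\xafve"], dtype="S6")

# Decode at the boundary: the result is a 'U' (unicode) array.
text = np.char.decode(raw, "utf-8")

# Encode again when handing the data back to the other library.
back = np.char.encode(text, "utf-8")

print(text.dtype, back.dtype)
```

The point is that the encoding only shows up at the two conversion calls; everything in between works on text.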
If we want a more efficient and compact unicode implementation then the
py3 one is a good place to start -- it's pretty slick! Though maybe harder
to do in numpy, as text in numpy probably wouldn't be immutable.
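
For anyone who hasn't looked at it, a small sketch of the py3 behaviour in question. This assumes CPython 3.3 or later, where each string is stored with the narrowest width that fits its widest character; the exact sizes reported are implementation details, so only their ordering matters here.

```python
import sys

ascii_text = "a" * 100            # every character fits in 1 byte
bmp_text = "\u20ac" * 100         # euro sign, needs 2 bytes per character
astral_text = "\U0001F600" * 100  # outside the BMP, needs 4 bytes per character

# Same number of characters, increasingly wide internal storage.
sizes = [sys.getsizeof(t) for t in (ascii_text, bmp_text, astral_text)]
print(sizes)
```

Because a py3 string is immutable, the width can be chosen once at creation. A mutable numpy array element could need a wider representation after an assignment, which is the difficulty mentioned above.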
> To make a slightly more concrete proposal, I've implemented a pure
> Python ndarray subclass that I believe can consistently handle
> text/bytes in Python 3.
this scares me right there -- is it text or bytes??? We really don't want
something that is both.
> The idea is that the array has an encoding. It stores strings as
> bytes. The bytes are encoded/decoded on insertion/access. Methods
> accessing the binary content of the array will see the encoded bytes.
> Methods accessing the elements of the array will see unicode strings.
> I believe it would not be as hard to implement as the proposals for
> variable length string arrays.
except that with some encodings, the number of bytes required is a function
of what the content of the text is -- so it either has to be variable
length, or a fixed number of bytes, which is not a fixed number
of characters. That means both careful truncation (a pain) and
surprising results for users: "Why can't I fit 10 characters in a length-10
text object? And I can if they are different characters?"
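
A small pure-Python illustration of both problems, using UTF-8 and a hypothetical fixed 10-byte field (the field width and sample strings are mine, not from the proposal):

```python
ascii_text = "abcdefghij"        # 10 characters
mixed_text = "a" + "\u00e9" * 9  # also 10 characters: 'a' plus nine e-acute

ascii_bytes = ascii_text.encode("utf-8")
mixed_bytes = mixed_text.encode("utf-8")
print(len(ascii_bytes), len(mixed_bytes))

# Naively cutting to the field width splits a two-byte character in half.
field_width = 10
truncated = mixed_bytes[:field_width]
try:
    truncated.decode("utf-8")
    split_character = False
except UnicodeDecodeError:
    split_character = True

# "Careful truncation" here means dropping the dangling partial character.
safe_text = truncated.decode("utf-8", errors="ignore")
print(split_character, len(safe_text))
```

Ten ASCII characters fit the field exactly, while the ten-character mixed string needs 19 bytes and only five of its characters survive a safe truncation.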
> The one caveat is that it will strip
> null characters from the end of any string.
which is fatal, but you do want a new dtype after all, which presumably
wouldn't do that.
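
For reference, the existing 'S' dtype already behaves this way on element access. Whether the proposed subclass strips nulls by the same mechanism is my assumption; this sketch only shows current numpy behaviour.

```python
import numpy as np

arr = np.array([b"abc\x00"], dtype="S4")

# The raw buffer still holds all four bytes, including the trailing null...
stored = arr.tobytes()
# ...but element access strips it, so the null cannot be round-tripped.
element = arr[0]

print(stored, element)
```

The trailing null is indistinguishable from padding, so data that legitimately ends in a null byte comes back changed.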
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov