[Numpy-discussion] A one-byte string dtype?

Oscar Benjamin oscar.j.benjamin at gmail.com
Mon Jan 20 16:27:48 EST 2014


On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.harris at gmail.com>
wrote:
>
> I think we may want something like PEP 393. The S datatype may be the
> wrong place to look, we might want a modification of U instead so as to
> transparently get the benefit of python strings.

The approach taken in PEP 393 (the Flexible String Representation, or
FSR) makes more sense for str than it does for numpy arrays, for two
reasons: str is immutable and str is opaque.

Since str is immutable, the maximum code point in the string can be
determined once, when the string is created, before anything else can get
a pointer to the string buffer.
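
As a rough illustration of that point, here is a minimal sketch that
measures how many bytes CPython spends per character for strings with
different maximum code points. It assumes CPython 3.3 or later (where
PEP 393 is implemented); bytes_per_char is a helper name made up for this
example, and sys.getsizeof reports implementation-specific numbers.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage for a str made only of `ch`."""
    # Subtracting two sizes cancels the fixed object header, leaving the
    # cost of one extra character.
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# ASCII, Latin-1, BMP and non-BMP sample characters.
for ch in ("a", "\u00e9", "\u03b1", "\U0001f600"):
    print("U+%04X: %d byte(s) per character" % (ord(ch), bytes_per_char(ch)))
```

The width is picked from the largest code point at creation time, which
is only safe because the string can never change afterwards.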

Since it is opaque, no one can rightly expect it to expose a particular
binary format, so it is free to choose one without compromising any
expected semantics.

A numpy array is neither. If someone can call buffer() on an array, then
switching it to the FSR changes the semantics of what they get back.

If a numpy 'U' array used the FSR and consisted only of ASCII characters
then it would have a one byte per char buffer. What then happens if you put
a higher code point in? The buffer needs to be resized and the data copied
over. But then what happens to any buffer objects or array views? They
would be pointing at the old buffer from before the resize. Subsequent
modifications to the resized array would not show up in other views and
vice versa.
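
The stale-view problem can be sketched with standard-library buffers
standing in for the array storage. This is only an analogy under stated
assumptions: the bytearray plays the role of a hypothetical one byte per
char 'U' buffer, the memoryview plays the role of an existing view, and
none of this is how numpy actually stores 'U' data.

```python
from array import array

# Stand-in for an all-ASCII array stored at one byte per character.
narrow = bytearray(b"abc")
view = memoryview(narrow)      # someone else now points at this buffer

# Python refuses to resize a buffer in place while it is exported.
try:
    narrow.extend(b"\x00")
    resized_in_place = True
except BufferError:
    resized_in_place = False

# So storing U+03B1 means allocating a wider buffer and copying the data.
wide = array("H", list(narrow))   # new memory, wider items
wide[1] = 0x03B1

# The old view still points at the old buffer, and writes through it do
# not reach the new one.
view[0] = ord("X")
print(resized_in_place)
print(view.tobytes(), [hex(c) for c in wide])
```

After the widening step the view and the new buffer have silently
diverged, in both directions.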

I don't think that this can be done transparently since users of a numpy
array need to know about the binary representation. That's why I suggest a
dtype that has an encoding. Only in that way can it consistently have both
a binary and a text interface.
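
To make the "binary and text interface" idea concrete, here is a minimal
sketch using what numpy already offers: an 'S' array holding encoded
bytes, with np.char.encode and np.char.decode as the text interface. The
choice of latin-1 is just an assumption for the example. Note that the
caller has to remember the encoding, which is exactly the gap that a
dtype carrying its own encoding would close.

```python
import numpy as np

text = np.array(["na\u00efve", "caf\u00e9"])   # 'U' dtype, 4 bytes per char
raw = np.char.encode(text, "latin-1")          # 'S' dtype, 1 byte per char
back = np.char.decode(raw, "latin-1")          # text view of the same data

print(text.dtype, text.itemsize)
print(raw.dtype, raw.itemsize)
print(raw.tobytes())                           # well-defined binary layout
```

The byte layout of raw is fixed and documented by the (externally known)
encoding, so buffers and views over it stay valid whatever text is stored,
as long as it fits the encoding and the item size.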

Oscar