[Numpy-discussion] A one-byte string dtype?

Charles R Harris charlesr.harris at gmail.com
Tue Jan 21 08:55:29 EST 2014


On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas <
aldcroft at head.cfa.harvard.edu> wrote:

>
>
>
> On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris <
> charlesr.harris at gmail.com> wrote:
>
>>
>>
>>
>> On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <
>> charlesr.harris at gmail.com> wrote:
>>
>>>
>>>
>>>
>>> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>>
>>>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
>>>> <charlesr.harris at gmail.com> wrote:
>>>> >
>>>> >
>>>> >
>>>> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin
>>>> > <oscar.j.benjamin at gmail.com> wrote:
>>>> >>
>>>> >>
>>>> >> On Jan 20, 2014 8:35 PM, "Charles R Harris"
>>>> >> <charlesr.harris at gmail.com> wrote:
>>>> >> >
>>>> >> > I think we may want something like PEP 393. The S datatype may
>>>> >> > be the wrong place to look, we might want a modification of U
>>>> >> > instead so as to transparently get the benefit of python
>>>> >> > strings.
>>>> >>
>>>> >> The approach taken in PEP 393 (the FSR) makes more sense for str
>>>> >> than it does for numpy arrays for two reasons: str is immutable
>>>> >> and opaque.
>>>> >>
>>>> >> Since str is immutable the maximum code point in the string can be
>>>> >> determined once when the string is created before anything else
>>>> >> can get a pointer to the string buffer.
>>>> >>
>>>> >> Since it is opaque no one can rightly expect it to expose a
>>>> >> particular binary format so it is free to choose without
>>>> >> compromising any expected semantics.
>>>> >>
>>>> >> If someone can call buffer on an array then the FSR is a semantic
>>>> >> change.
>>>> >>
>>>> >> If a numpy 'U' array used the FSR and consisted only of ASCII
>>>> >> characters then it would have a one byte per char buffer. What
>>>> >> then happens if you put a higher code point in? The buffer needs
>>>> >> to be resized and the data copied over. But then what happens to
>>>> >> any buffer objects or array views? They would be pointing at the
>>>> >> old buffer from before the resize. Subsequent modifications to the
>>>> >> resized array would not show up in other views and vice versa.
>>>> >>
>>>> >> I don't think that this can be done transparently since users of a
>>>> >> numpy array need to know about the binary representation. That's
>>>> >> why I suggest a dtype that has an encoding. Only in that way can
>>>> >> it consistently have both a binary and a text interface.
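
To make the view problem above concrete, here is a minimal sketch (an
illustration only, not code from this thread). It shows that a 'U' array
currently stores a fixed four bytes per code point and that a view aliases
the same buffer, which is the property an FSR-style resize would break.
The variable names are made up for the example.

```python
import numpy as np

# A 'U3' array reserves 3 code points per element, 4 bytes each.
a = np.array(["abc", "de"], dtype="U3")
print(a.itemsize)  # 12

# Reinterpret the same memory as native-endian 32-bit code points.
# No data is copied: `v` is a view onto the buffer owned by `a`.
v = a.view(np.uint32).reshape(len(a), -1)
assert np.shares_memory(a, v)

# Writing through the view is visible through the original array.
v[0, 0] = ord("x")
print(a[0])  # xbc
```

If storing a higher code point forced `a` onto a new, wider buffer, `v`
would keep pointing at the old one and the two would silently diverge.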
>>>> >
>>>> >
>>>> > I didn't say we should change the S type, but that we should have
>>>> > something, say 's', that appeared to python as a string. I think
>>>> > if we want transparent string interoperability with python
>>>> > together with a compressed representation, and I think we need
>>>> > both, we are going to have to deal with the difficulties of utf-8.
>>>> > That means raising errors if the string doesn't fit in the
>>>> > allotted size, etc. Mind, this is a workaround for the mass of
>>>> > ascii data that is already out there, not a substitute for 'U'.
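
As a rough sketch of what "raising errors if the string doesn't fit in the
allotted size" could look like, the following emulates a fixed-width utf-8
field on top of the existing 'S' dtype. The helpers `encode_fixed_utf8` and
`decode_fixed_utf8` are hypothetical names invented for this example; they
are not part of numpy and not the proposed 's' dtype itself.

```python
import numpy as np

def encode_fixed_utf8(strings, nbytes):
    """Store strings as utf-8 in an 'S<nbytes>' array (hypothetical helper).

    Raises ValueError instead of truncating, since cutting a utf-8
    sequence at a byte boundary could leave an invalid partial character.
    """
    encoded = [s.encode("utf-8") for s in strings]
    for s, b in zip(strings, encoded):
        if len(b) > nbytes:
            raise ValueError(
                "%r needs %d bytes as utf-8, field holds %d"
                % (s, len(b), nbytes)
            )
    return np.array(encoded, dtype="S%d" % nbytes)

def decode_fixed_utf8(arr):
    """Turn an 'S' array of utf-8 bytes back into Python str objects."""
    return [b.decode("utf-8") for b in arr.tolist()]

arr = encode_fixed_utf8(["abc", "é"], 3)   # "é" takes 2 bytes in utf-8
print(decode_fixed_utf8(arr))
```

Note that the width is in bytes, not characters, so how many characters fit
depends on the data. The 'S' dtype also drops trailing NUL bytes, so this
sketch cannot round-trip strings that end in "\0".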
>>>>
>>>> If we're going to be taking that much trouble, I'd suggest going ahead
>>>> and adding a variable-length string type (where the array itself
>>>> contains a pointer to a lookaside buffer, maybe with an optimization
>>>> for stashing short strings directly). The fixed-length requirement is
>>>> pretty onerous for lots of applications (e.g., pandas always uses
>>>> dtype="O" for strings -- and that might be a good workaround for some
>>>> people in this thread for now). The use of a lookaside buffer would
>>>> also make it practical to resize the buffer when the maximum code
>>>> point changed, for that matter...
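
For anyone wanting to try the dtype="O" workaround mentioned above, here is
a small sketch contrasting it with a fixed-width 'U' array. It only
illustrates existing numpy behaviour, not the proposed variable-length
type, and the variable names are made up for the example.

```python
import numpy as np

words = ["a", "a much longer string"]
replacement = "now longer than before, no truncation"

# Fixed-width: every element reserves room for the longest string (20
# code points here), and longer assignments are silently truncated.
fixed = np.array(words)
fixed[0] = replacement
print(len(fixed[0]))  # 20

# Object dtype: the array holds pointers to ordinary Python str objects,
# so each element can have any length.
obj = np.array(words, dtype=object)
obj[0] = replacement
print(len(obj[0]) == len(replacement))  # True
```

The cost is that each element is a separate Python object, so the array
is no longer a single contiguous block of character data.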
>>>>
>>>
>> The more I think about it, the more I think we may need to do that. Note
>> that dynd has ragged arrays and I think they are implemented as pointers to
>> buffers. The easy way for us to do that would be a specialization of object
>> arrays to string types only as you suggest.
>>
>
> Is this approach intended to be in *addition to* the latin-1 "s" type
> originally proposed by Chris, or *instead of* that?
>
>
Well, that's open for discussion. The problem is to have something that is
both compact (latin-1) and interoperates transparently with python 3
strings (utf-8). A latin-1 type would be easier to implement and would
probably be a better choice for something available in both python 2 and
python 3, but unless the python 3 developers come up with something clever
I don't see how to make it behave transparently as a string in python 3.
OTOH, it's not clear to me how to make utf-8 operate transparently with
python 2 strings, especially as the unicode representation choices in
python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8
is unlikely to be backported. The problem may be unsolvable in a completely
satisfactory way.
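
To put rough numbers on the compactness side of the tradeoff, here is a
small sketch comparing encoded sizes. `encoded_sizes` is a hypothetical
helper written for this example, and "utf-32-le" stands in for the four
bytes per code point that the 'U' dtype uses.

```python
def encoded_sizes(s):
    """Bytes needed for `s` in each encoding, or None if not encodable."""
    sizes = {}
    for name, codec in (("latin-1", "latin-1"),
                        ("utf-8", "utf-8"),
                        ("utf-32", "utf-32-le")):  # -le avoids the BOM
        try:
            sizes[name] = len(s.encode(codec))
        except UnicodeEncodeError:
            sizes[name] = None
    return sizes

for sample in ("numpy", "café", "日本語"):
    print(sample, encoded_sizes(sample))
```

For pure ascii data latin-1 and utf-8 are the same size, a quarter of the
four-byte form; latin-1 stays at one byte per character but cannot
represent text outside its 256 code points.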

Chuck