[Numpy-discussion] A one-byte string dtype?

Mon Jan 20 17:35:12 EST 2014

On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
>
> On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <oscar.j.benjamin at gmail.com>
> wrote:
>>
>>
>> On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.harris at gmail.com>
>> wrote:
>> >
>> > I think we may want something like PEP 393. The S datatype may be the
>> > wrong place to look; we might want a modification of U instead, so as to
>> > transparently get the benefit of Python strings.
>>
>> The approach taken in PEP 393 (the FSR) makes more sense for str than it
>> does for numpy arrays for two reasons: str is immutable and opaque.
>>
>> Since str is immutable the maximum code point in the string can be
>> determined once when the string is created before anything else can get a
>> pointer to the string buffer.
>>
>> Since it is opaque no one can rightly expect it to expose a particular
>> binary format so it is free to choose without compromising any expected
>> semantics.
>>
>> If someone can call buffer on an array then the FSR is a semantic change.
>>
>> If a numpy 'U' array used the FSR and consisted only of ASCII characters
>> then it would have a one byte per char buffer. What then happens if you put
>> a higher code point in? The buffer needs to be resized and the data copied
>> over. But then what happens to any buffer objects or array views? They would
>> be pointing at the old buffer from before the resize. Subsequent
>> modifications to the resized array would not show up in other views and vice
>> versa.
>>
>> I don't think that this can be done transparently since users of a numpy
>> array need to know about the binary representation. That's why I suggest a
>> dtype that has an encoding. Only in that way can it consistently have both a
>> binary and a text interface.
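
To make the stale-view problem described above concrete, here is a small
sketch using today's fixed-width 'U' dtype. It is not an FSR implementation:
astype() to a wider dtype merely stands in for the reallocation that a change
of character width would force, and the variable names are made up for
illustration.

```python
import numpy as np

# A fixed-width unicode array and a raw view onto the same memory.
a = np.array(["abc", "def"], dtype="U3")
view = a.view(np.uint32)      # 2 items * 3 chars -> 6 UCS-4 code points

# Writes through the array are visible in the view: same buffer.
a[0] = "xyz"
first_code_point = int(view[0])        # ord("x")

# Stand-in for an FSR-style resize: the data moves to a new buffer.
wider = a.astype("U5")
wider[0] = "hello"

# The old array and its view still point at the old buffer and never
# see the write made to the reallocated copy.
still_shared = np.shares_memory(a, view)       # True
decoupled = not np.shares_memory(wider, a)     # True
```

Anything that grabbed the buffer before the "resize" keeps seeing "xyz",
which is the semantic change being objected to.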
>
>
> I didn't say we should change the S type, but that we should have something,
> say 's', that appeared to Python as a string. I think if we want transparent
> string interoperability with Python together with a compressed
> representation, and I think we need both, we are going to have to deal with
> the difficulties of UTF-8. That means raising errors if the string doesn't
> fit in the allotted size, etc. Mind, this is a workaround for the mass of
> ASCII data that is already out there, not a substitute for 'U'.
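
A rough sketch of what "raise if it doesn't fit" could mean for UTF-8 in a
fixed-size slot. The helper names encode_fixed_utf8 and decode_fixed_utf8 are
hypothetical, and NUL padding is an assumption borrowed from how 'S' behaves
today; nothing like this exists in numpy.

```python
def encode_fixed_utf8(text, nbytes):
    """Encode text as UTF-8 into exactly nbytes, NUL-padded (hypothetical)."""
    data = text.encode("utf-8")
    if len(data) > nbytes:
        # The size limit is in bytes, not characters, so whether a string
        # fits depends on which characters it contains.
        raise ValueError(
            f"{text!r} needs {len(data)} bytes, slot holds {nbytes}"
        )
    return data.ljust(nbytes, b"\x00")


def decode_fixed_utf8(slot):
    """Inverse of encode_fixed_utf8: strip NUL padding and decode."""
    return slot.rstrip(b"\x00").decode("utf-8")


ascii_slot = encode_fixed_utf8("naive", 5)       # 5 chars, 5 bytes: fits
try:
    encode_fixed_utf8("na\u00efve", 5)           # 5 chars, 6 bytes: too big
    overflowed = False
except ValueError:
    overflowed = True
```

Pure ASCII data costs one byte per character, but the same five-character
slot rejects a string as soon as one character needs two bytes.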

If we're going to be taking that much trouble, I'd suggest going ahead
and adding a variable-length string type (where the array itself
contains a pointer to a lookaside buffer, maybe with an optimization
for stashing short strings directly). The fixed-length requirement is
pretty onerous for lots of applications (e.g., pandas always uses
dtype="O" for strings -- and that might be a good workaround for some
people in this thread for now). The use of a lookaside buffer would
also make it practical to resize the buffer when the maximum code
point changed, for that matter...
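
For anyone who wants to try the dtype="O" workaround, a quick sketch of how
it compares with a fixed-width 'U' array (the values are made up):

```python
import numpy as np

# Fixed-width: the item size is set by the longest string at creation time.
fixed = np.array(["spam", "eggs"])       # 4 characters per item
fixed_itemsize = fixed.dtype.itemsize    # 4 chars * 4 bytes (UCS-4) = 16
fixed[0] = "a much longer string"        # silently truncated to 4 characters

# Object dtype: each element is a pointer to an ordinary Python str,
# so the length is unconstrained.
flexible = np.array(["spam", "eggs"], dtype=object)
flexible[0] = "a much longer string"
```

The object array holds only references, so the string data lives outside
the array buffer and most vectorized string operations are not available.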

Though, IMO any new dtype here would need a cleanup of the dtype code
first so that it doesn't require yet more massive special cases all
over umath.so.

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org