On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
<charlesr.harris@gmail.com> wrote:
>
>
>
> On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com>
> wrote:
>>
>>
>> On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.harris@gmail.com>
>> wrote:
>> >
>> > I think we may want something like PEP 393. The S datatype may be the
>> > wrong place to look, we might want a modification of U instead so as to
>> > transparently get the benefit of python strings.
>>
>> The approach taken in PEP 393 (the FSR) makes more sense for str than it
>> does for numpy arrays for two reasons: str is immutable and opaque.
>>
>> Since str is immutable the maximum code point in the string can be
>> determined once when the string is created before anything else can get a
>> pointer to the string buffer.
>>
>> Since it is opaque no one can rightly expect it to expose a particular
>> binary format so it is free to choose without compromising any expected
>> semantics.
>>
>> If someone can call buffer on an array then the FSR is a semantic change.
>>
>> If a numpy 'U' array used the FSR and consisted only of ASCII characters
>> then it would have a one byte per char buffer. What then happens if you put
>> a higher code point in? The buffer needs to be resized and the data copied
>> over. But then what happens to any buffer objects or array views? They would
>> be pointing at the old buffer from before the resize. Subsequent
>> modifications to the resized array would not show up in other views and vice
>> versa.
>>
>> I don't think that this can be done transparently since users of a numpy
>> array need to know about the binary representation. That's why I suggest a
>> dtype that has an encoding. Only in that way can it consistently have both a
>> binary and a text interface.
>
>
> I didn't say we should change the S type, but that we should have something,
> say 's', that appeared to python as a string. I think if we want transparent
> string interoperability with python together with a compressed
> representation, and I think we need both, we are going to have to deal with
> the difficulties of utf-8. That means raising errors if the string doesn't
> fit in the allotted size, etc. Mind, this is a workaround for the mass of
> ascii data that is already out there, not a substitute for 'U'.

If we're going to be taking that much trouble, I'd suggest going ahead
and adding a variable-length string type (where the array itself
contains a pointer to a lookaside buffer, maybe with an optimization
for stashing short strings directly). The fixed-length requirement is
pretty onerous for lots of applications (e.g., pandas always uses
dtype="O" for strings -- and that might be a good workaround for some
people in this thread for now). The use of a lookaside buffer would
also make it practical to resize the buffer when the maximum code
point changed, for that matter...
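
For anyone wanting to try the dtype="O" workaround mentioned above, here is a
minimal sketch of how it differs from the fixed-length 'U' dtype. The variable
names (words, fixed, boxed) are made up for illustration; only the stock NumPy
array constructor is used.

```python
import numpy as np

words = ["a", "numpy", "a considerably longer string"]
longest = max(len(w) for w in words)

# Fixed-length unicode: every element reserves room for the longest string,
# at 4 bytes per code point (UCS-4).
fixed = np.array(words)

# Object dtype: every element is a pointer to an ordinary Python str,
# so each string keeps its own length.
boxed = np.array(words, dtype="O")

print(fixed.dtype.itemsize)   # 4 * longest bytes per element
print(boxed.dtype.itemsize)   # size of one pointer

# Assigning a longer string shows the cost of the fixed-length requirement:
# the 'U' array silently truncates, the object array does not.
fixed[0] = "x" * 40
boxed[0] = "x" * 40
print(len(fixed[0]), len(boxed[0]))
```

The object array gives up the contiguous buffer of characters, which is
roughly the trade-off a lookaside-buffer design would also have to make.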

Though, IMO any new dtype here would need a cleanup of the dtype code
first so that it doesn't require yet more massive special cases all
over umath.so.

Worth thinking about. As another alternative, what is the minimum we need
to make a restricted encoding, say latin-1, appear transparently as a
Unicode string to Python? I know the Python folks don't like this much,
but I suspect something along those lines will eventually be required for
the HTTP folks.
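
To make the latin-1 idea concrete, here is a rough sketch of what the
conversion amounts to today, using an existing 'S' array and np.char.decode.
It is not a proposal for how a new dtype would be implemented, and the sample
data and variable names are made up. The point is that latin-1 maps each byte
value 0-255 onto the code point with the same number, so decoding can never
fail and is just a widening of each byte.

```python
import numpy as np

# One byte per character, fixed width of 5, as in the existing 'S' dtype.
raw = np.array([b"caf\xe9", b"na\xefve", b"ascii"], dtype="S5")

# Explicit decode to a 'U' array (4 bytes per character).
text = np.char.decode(raw, "latin-1")

# The same result by plain widening: view the bytes as uint8, cast each one
# to a 32-bit code point, and reinterpret the result as fixed-width unicode.
n = raw.dtype.itemsize
codes = raw.view(np.uint8).reshape(len(raw), n).astype(np.uint32)
widened = codes.view(f"U{n}").ravel()

print(text.tolist())
print(widened.tolist())

# Going back is lossless as long as every code point is below 256.
roundtrip = np.char.encode(text, "latin-1")
print(np.array_equal(roundtrip, raw))
```

Encoding in the other direction is where the errors would have to be raised,
since any code point above U+00FF has no latin-1 representation.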

Chuck