[Numpy-discussion] A one-byte string dtype?

Charles R Harris charlesr.harris at gmail.com
Mon Jan 20 17:58:26 EST 2014

On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs at pobox.com> wrote:

> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
> >
> >
> >
> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin
> > <oscar.j.benjamin at gmail.com> wrote:
> >>
> >>
> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.harris at gmail.com>
> >> wrote:
> >> >
> >> > I think we may want something like PEP 393. The S datatype may be the
> >> > wrong place to look, we might want a modification of U instead so as
> >> > to transparently get the benefit of python strings.
> >>
> >> The approach taken in PEP 393 (the FSR) makes more sense for str than it
> >> does for numpy arrays for two reasons: str is immutable and opaque.
> >>
> >> Since str is immutable the maximum code point in the string can be
> >> determined once when the string is created before anything else can get
> >> a pointer to the string buffer.
> >>
> >> Since it is opaque no one can rightly expect it to expose a particular
> >> binary format so it is free to choose without compromising any expected
> >> semantics.
> >>
> >> If someone can call buffer on an array then the FSR is a semantic
> >> change.
> >>
> >> If a numpy 'U' array used the FSR and consisted only of ASCII characters
> >> then it would have a one byte per char buffer. What then happens if you
> >> put a higher code point in? The buffer needs to be resized and the data
> >> copied over. But then what happens to any buffer objects or array views?
> >> They would be pointing at the old buffer from before the resize.
> >> Subsequent modifications to the resized array would not show up in other
> >> views and vice versa.
> >>
> >> I don't think that this can be done transparently since users of a numpy
> >> array need to know about the binary representation. That's why I
> >> suggest a dtype that has an encoding. Only in that way can it
> >> consistently have both a binary and a text interface.
> >
> >
> > I didn't say we should change the S type, but that we should have
> > something, say 's', that appeared to python as a string. I think if we
> > want transparent string interoperability with python together with a
> > compressed representation, and I think we need both, we are going to have
> > to deal with the difficulties of utf-8. That means raising errors if the
> > string doesn't fit in the allotted size, etc. Mind, this is a workaround
> > for the mass of ascii data that is already out there, not a substitute
> > for 'U'.
>
> If we're going to be taking that much trouble, I'd suggest going ahead
> and adding a variable-length string type (where the array itself
> contains a pointer to a lookaside buffer, maybe with an optimization
> for stashing short strings directly). The fixed-length requirement is
> pretty onerous for lots of applications (e.g., pandas always uses
> dtype="O" for strings -- and that might be a good workaround for some
> people in this thread for now). The use of a lookaside buffer would
> also make it practical to resize the buffer when the maximum code
> point changed, for that matter...
>
> Though, IMO any new dtype here would need a cleanup of the dtype code
> first so that it doesn't require yet more massive special cases all
> over umath.so.

Worth thinking about. As another alternative, what is the minimum we need
to make a restricted encoding, say latin-1, appear transparently as a
unicode string to Python? I know the Python folks don't like this much, but
I suspect something along those lines will eventually be required for the
HTTP folks.
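
For concreteness, here is a rough sketch of where things stand today,
assuming a NumPy that provides np.char.encode and np.char.decode. The sample
words are made up for illustration. Latin-1 data fits in an 'S' array at one
byte per character, but elements come back to Python as bytes and have to be
decoded explicitly, while 'U' hands back str at four bytes per character:

```python
import numpy as np

# Illustrative sample text; every code point is below 256, so latin-1 fits.
words = ["caf\u00e9", "na\u00efve", "ascii"]

as_unicode = np.array(words, dtype="U5")           # UCS-4: 4 bytes per char
as_latin1 = np.char.encode(as_unicode, "latin-1")  # 'S' dtype: 1 byte per char

print(as_unicode.itemsize)  # 20 bytes per element
print(as_latin1.itemsize)   # 5 bytes per element

# Elements of the 'S' array are bytes, not str.
first = as_latin1[0]
print(isinstance(first, bytes), isinstance(first, str))

# Getting text back requires an explicit decode with the right encoding.
roundtrip = np.char.decode(as_latin1, "latin-1")
print(roundtrip.tolist() == words)
```

The explicit decode step at the end is the part a transparent latin-1 dtype
would have to hide from the user.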

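As a small illustration of Oscar's objection quoted above, the following
sketch uses only the standard library. It is CPython-specific and treats
sys.getsizeof as a rough proxy for storage width, which is an assumption on
my part; the helper name bytes_per_char is hypothetical. It shows that
PEP 393 strings pick their per-character width from the largest code point,
and that Python's own resizable buffer, bytearray, refuses to resize while a
memoryview of it is still exported:

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Estimate CPython's per-character storage for strings made of `ch`."""
    # Comparing two lengths cancels out the fixed object header.
    return (sys.getsizeof(ch * 200) - sys.getsizeof(ch * 100)) // 100


widths = {
    "ascii": bytes_per_char("a"),
    "latin-1": bytes_per_char("\u00e9"),
    "bmp": bytes_per_char("\u20ac"),
    "astral": bytes_per_char("\U0001f600"),
}
print(widths)

# A mutable buffer with an outstanding view cannot be resized in place.
buf = bytearray(b"abc")
view = memoryview(buf)
try:
    buf.extend(b"d")
    resize_blocked = False
except BufferError:
    resize_blocked = True
finally:
    view.release()
print(resize_blocked)
```

An immutable str never has to widen after creation, which is why the FSR
works there; a mutable array with live views runs into exactly the resize
problem shown in the second half.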
