[Numpy-discussion] A one-byte string dtype?

Aldcroft, Thomas aldcroft at head.cfa.harvard.edu
Tue Jan 21 07:54:21 EST 2014


On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:

>
>
>
> On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
>
>>
>>
>>
>> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
>>> <charlesr.harris at gmail.com> wrote:
>>> >
>>> >
>>> >
>>> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin
>>> > <oscar.j.benjamin at gmail.com> wrote:
>>> >>
>>> >>
>>> >> On Jan 20, 2014 8:35 PM, "Charles R Harris"
>>> >> <charlesr.harris at gmail.com> wrote:
>>> >> >
>>> >> > I think we may want something like PEP 393. The S datatype may be the
>>> >> > wrong place to look, we might want a modification of U instead so as to
>>> >> > transparently get the benefit of python strings.
>>> >>
>>> >> The approach taken in PEP 393 (the FSR) makes more sense for str than it
>>> >> does for numpy arrays for two reasons: str is immutable and opaque.
>>> >>
>>> >> Since str is immutable the maximum code point in the string can be
>>> >> determined once when the string is created before anything else can get a
>>> >> pointer to the string buffer.
>>> >>
>>> >> Since it is opaque no one can rightly expect it to expose a particular
>>> >> binary format so it is free to choose without compromising any expected
>>> >> semantics.
>>> >>
>>> >> If someone can call buffer on an array then the FSR is a semantic change.
>>> >>
>>> >> If a numpy 'U' array used the FSR and consisted only of ASCII characters
>>> >> then it would have a one byte per char buffer. What then happens if you put
>>> >> a higher code point in? The buffer needs to be resized and the data copied
>>> >> over. But then what happens to any buffer objects or array views? They would
>>> >> be pointing at the old buffer from before the resize. Subsequent
>>> >> modifications to the resized array would not show up in other views and vice
>>> >> versa.
>>> >>
>>> >> I don't think that this can be done transparently since users of a numpy
>>> >> array need to know about the binary representation. That's why I suggest a
>>> >> dtype that has an encoding. Only in that way can it consistently have both a
>>> >> binary and a text interface.
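
A minimal sketch of the resize problem described above, using only the
Python standard library. The bytearray stands in for an array's data buffer
and the memoryview for an outstanding view; both are stand-ins chosen for
illustration and say nothing about how a numpy 'U' array is actually
implemented.

```python
# Stand-ins (assumptions): bytearray = array data buffer, memoryview = a view.
buf = bytearray("abc".encode("latin-1"))   # pure ASCII: 1 byte per character
view = memoryview(buf)                     # someone now holds a pointer to buf

# Storing U+20AC needs a wider representation, e.g. 4 bytes per code point.
wider = "ab\u20ac".encode("utf-32-le")

try:
    buf[:] = wider                         # try to resize the buffer in place
    resized_in_place = True
except BufferError:
    # CPython refuses to resize a bytearray while a view is exported.
    resized_in_place = False

# The only alternative is a fresh, wider buffer, which the old view never sees.
new_buf = bytearray(wider)
stale = bytes(view)                        # still the old one-byte-per-char data

print(resized_in_place, stale, len(new_buf))
```

Either the resize is refused, or the data moves to a new buffer and existing
views silently go stale, which is the semantic change being pointed out.
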
>>> >
>>> >
>>> > I didn't say we should change the S type, but that we should have something,
>>> > say 's', that appeared to python as a string. I think if we want transparent
>>> > string interoperability with python together with a compressed
>>> > representation, and I think we need both, we are going to have to deal with
>>> > the difficulties of utf-8. That means raising errors if the string doesn't
>>> > fit in the allotted size, etc. Mind, this is a workaround for the mass of
>>> > ascii data that is already out there, not a substitute for 'U'.
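
To make "raising errors if the string doesn't fit in the allotted size"
concrete, here is a rough sketch of a fixed-width utf-8 field in plain
Python. The helper names pack_utf8 and unpack_utf8 are hypothetical and are
not part of numpy; NUL padding is an assumption borrowed from how
fixed-width byte strings are commonly stored.

```python
def pack_utf8(text: str, nbytes: int) -> bytes:
    """Encode text as utf-8 into a fixed-size, NUL-padded field (hypothetical)."""
    encoded = text.encode("utf-8")
    if len(encoded) > nbytes:
        # The field is sized in bytes, not characters, so non-ASCII text
        # can overflow even when the character count looks small enough.
        raise ValueError(
            f"{text!r} needs {len(encoded)} bytes, field holds {nbytes}"
        )
    return encoded.ljust(nbytes, b"\x00")


def unpack_utf8(field: bytes) -> str:
    """Decode a NUL-padded utf-8 field; trailing NULs are treated as padding."""
    return field.rstrip(b"\x00").decode("utf-8")


print(pack_utf8("abc", 4))               # three ASCII characters fit
print(unpack_utf8(pack_utf8("n\u00e9", 4)))  # 2 characters, 3 bytes: still fits
try:
    pack_utf8("\u20ac\u20ac", 4)         # 2 characters, 6 bytes: does not fit
except ValueError as exc:
    print("error:", exc)
```

The awkward part is visible in the last call: whether a string fits depends
on its content, not just its length in characters.
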
>>>
>>> If we're going to be taking that much trouble, I'd suggest going ahead
>>> and adding a variable-length string type (where the array itself
>>> contains a pointer to a lookaside buffer, maybe with an optimization
>>> for stashing short strings directly). The fixed-length requirement is
>>> pretty onerous for lots of applications (e.g., pandas always uses
>>> dtype="O" for strings -- and that might be a good workaround for some
>>> people in this thread for now). The use of a lookaside buffer would
>>> also make it practical to resize the buffer when the maximum code
>>> point changed, for that matter...
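
A toy sketch of the lookaside-buffer idea above, in plain Python. The class
name VarStringColumn and its layout (fixed-size (offset, length) slots
pointing into one shared byte heap) are assumptions made for illustration;
this is not an existing numpy or pandas type, and it omits the short-string
optimization and any reclaiming of dead heap space.

```python
class VarStringColumn:
    """Hypothetical variable-length string column (illustration only)."""

    def __init__(self):
        self._heap = bytearray()   # lookaside buffer holding utf-8 payloads
        self._slots = []           # fixed-size records: (offset, length)

    def append(self, text: str) -> None:
        data = text.encode("utf-8")
        self._slots.append((len(self._heap), len(data)))
        self._heap += data

    def __getitem__(self, i: int) -> str:
        offset, length = self._slots[i]
        return self._heap[offset:offset + length].decode("utf-8")

    def __setitem__(self, i: int, text: str) -> None:
        # A longer replacement is simply written at the end of the heap and
        # the slot is repointed; the old bytes are left behind as garbage.
        data = text.encode("utf-8")
        self._slots[i] = (len(self._heap), len(data))
        self._heap += data

    def __len__(self) -> int:
        return len(self._slots)


col = VarStringColumn()
col.append("abc")
col.append("\u20ac")
col[0] = "a much longer string than before"
print(len(col), col[0], col[1])
```

The slots stay a fixed size regardless of string length, so the
fixed-length requirement moves from the strings to the pointers.
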
>>>
>>
> The more I think about it, the more I think we may need to do that. Note
> that dynd has ragged arrays and I think they are implemented as pointers to
> buffers. The easy way for us to do that would be a specialization of object
> arrays to string types only as you suggest.
>

Is this approach intended to be in *addition to* the latin-1 "s" type
originally proposed by Chris, or *instead of* that?

- Tom


>
> <snip>
>
> Chuck
>