[Numpy-discussion] A one-byte string dtype?

Sebastian Berg sebastian at sipsolutions.net
Tue Jan 21 10:10:01 EST 2014


On Tue, 2014-01-21 at 07:48 -0700, Charles R Harris wrote:
> 
> 
> 
> On Tue, Jan 21, 2014 at 7:37 AM, Aldcroft, Thomas
> <aldcroft at head.cfa.harvard.edu> wrote:
>         
>         
>         
>         On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris
>         <charlesr.harris at gmail.com> wrote:
>                 
>                 
>                 
>                 On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas
>                 <aldcroft at head.cfa.harvard.edu> wrote:
>                         
>                         
>                         
>                         On Mon, Jan 20, 2014 at 6:12 PM, Charles R
>                         Harris <charlesr.harris at gmail.com> wrote:
>                                 
>                                 
>                                 
>                                 On Mon, Jan 20, 2014 at 3:58 PM,
>                                 Charles R Harris
>                                 <charlesr.harris at gmail.com> wrote:
>                                         
>                                         
>                                         
>                                         On Mon, Jan 20, 2014 at 3:35
>                                         PM, Nathaniel Smith
>                                         <njs at pobox.com> wrote:
>                                                 On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
>                                                 <charlesr.harris at gmail.com> wrote:
>                                                 >
>                                                 >
>                                                 >
>                                                 > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <oscar.j.benjamin at gmail.com>
>                                                 > wrote:
>                                                 >>
>                                                 >>
>                                                 >> On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.harris at gmail.com>
>                                                 >> wrote:
>                                                 >> >
>                                                 >> > I think we may want something like PEP 393. The S datatype may be the
>                                                 >> > wrong place to look, we might want a modification of U instead so as to
>                                                 >> > transparently get the benefit of python strings.
>                                                 >>
>                                                 >> The approach taken in PEP 393 (the FSR) makes more sense for str than it
>                                                 >> does for numpy arrays for two reasons: str is immutable and opaque.
>                                                 >>
>                                                 >> Since str is immutable the maximum code point in the string can be
>                                                 >> determined once when the string is created before anything else can get a
>                                                 >> pointer to the string buffer.
>                                                 >>
>                                                 >> Since it is opaque no one can rightly expect it to expose a particular
>                                                 >> binary format so it is free to choose without compromising any expected
>                                                 >> semantics.
>                                                 >>
>                                                 >> If someone can call buffer on an array then the FSR is a semantic change.
>                                                 >>
>                                                 >> If a numpy 'U' array used the FSR and consisted only of ASCII characters
>                                                 >> then it would have a one byte per char buffer. What then happens if you put
>                                                 >> a higher code point in? The buffer needs to be resized and the data copied
>                                                 >> over. But then what happens to any buffer objects or array views? They would
>                                                 >> be pointing at the old buffer from before the resize. Subsequent
>                                                 >> modifications to the resized array would not show up in other views and vice
>                                                 >> versa.
>                                                 >>
>                                                 >> I don't think that this can be done transparently since users of a numpy
>                                                 >> array need to know about the binary representation. That's why I suggest a
>                                                 >> dtype that has an encoding. Only in that way can it consistently have both a
>                                                 >> binary and a text interface.
>                                                 >
>                                                 >
>                                                 > I didn't say we should change the S type, but that we should have something,
>                                                 > say 's', that appeared to python as a string. I think if we want transparent
>                                                 > string interoperability with python together with a compressed
>                                                 > representation, and I think we need both, we are going to have to deal with
>                                                 > the difficulties of utf-8. That means raising errors if the string doesn't
>                                                 > fit in the allotted size, etc. Mind, this is a workaround for the mass of
>                                                 > ascii data that is already out there, not a substitute for 'U'.
>                                                 
>                                                 
>                                                 If we're going to be taking that much trouble, I'd suggest going ahead
>                                                 and adding a variable-length string type (where the array itself
>                                                 contains a pointer to a lookaside buffer, maybe with an optimization
>                                                 for stashing short strings directly). The fixed-length requirement is
>                                                 pretty onerous for lots of applications (e.g., pandas always uses
>                                                 dtype="O" for strings -- and that might be a good workaround for some
>                                                 people in this thread for now). The use of a lookaside buffer would
>                                                 also make it practical to resize the buffer when the maximum code
>                                                 point changed, for that matter...
>                                 
>                                 
>                                 The more I think about it, the more I
>                                 think we may need to do that. Note
>                                 that dynd has ragged arrays and I
>                                 think they are implemented as pointers
>                                 to buffers. The easy way for us to do
>                                 that would be a specialization of
>                                 object arrays to string types only as
>                                 you suggest.
>                                 
>                         
>                         
>                         Is this approach intended to be in *addition
>                         to* the latin-1 "s" type originally proposed
>                         by Chris, or *instead of* that?
>                         
>                         
>                 
>                 
>                 Well, that's open for discussion. The problem is to
>                 have something that is both compact (latin-1) and
>                 interoperates transparently with python 3 strings
>                 (utf-8). A latin-1 type would be easier to implement
>                 and would probably be a better choice for something
>                 available in both python 2 and python 3, but unless
>                 the python 3 developers come up with something clever
>                 I don't see how to make it behave transparently as a
>                 string in python 3. OTOH, it's not clear to me how to
>                 make utf-8 operate transparently with python 2
>                 strings, especially as the unicode representation
>                 choices in python 2 are ucs-2 or ucs-4 and the python
>                 3 work adding utf-16 and utf-8 is unlikely to be
>                 backported. The problem may be unsolvable in a
>                 completely satisfactory way.
>                 
>         
>         
>         Since it's open for discussion, I'll put in my vote for
>         implementing the easier latin-1 version in the short term to
>         facilitate Python 2 / 3 interoperability.  This would solve my
>         use-case (giga-rows of short fixed length strings), and
>         presumably allow things like memory mapping of large data
>         files (like for FITS files in astropy.io.fits).
>         
>         
>         I don't have a clue how the current 'U' dtype works under the
>         hood, but from my user perspective it seems to work just fine
>         in terms of interacting with Python 3 strings.  Is there a
>         technical problem with doing basically the same thing for an
>         's' dtype, but using latin-1 instead of UCS-4?
> 
> 
> I think there is a technical problem. We may be able to masquerade
> latin-1 as utf-8 for some subset of characters or fool python 3 in
> some other way. But in any case, I think it needs some research to see
> what the possibilities are.
> 
I am not quite sure, but shouldn't it even be possible to attach an
encoding to the metadata of the string dtype, and allow it to be set to
any 1-byte wide encoding that python understands? If the metadata is not
None, all entry points to and from the array (Object->string,
string->Object conversions) would then decode or encode using the usual
python string decode and encode.
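
A rough sketch of what I mean, written with plain helper functions on
top of the existing "S" dtype. The "encoding" metadata key and all the
function names are made up for illustration; numpy gives the key no
meaning, and a real implementation would live inside the dtype's
conversion code.

```python
import numpy as np

# Assumption: the encoding is stored under an "encoding" key in the dtype
# metadata. This key is a made-up convention; NumPy gives it no meaning.

def tagged_dtype(width, encoding):
    """Fixed-width bytes dtype carrying a (hypothetical) encoding tag."""
    return np.dtype("S%d" % width, metadata={"encoding": encoding})

def encoding_of(arr):
    """Return the tagged encoding, or None for a plain "S" array."""
    md = arr.dtype.metadata
    return md.get("encoding") if md else None

def from_strings(strings, width, encoding):
    """string -> array entry point: encode each str before storing it."""
    encoded = [s.encode(encoding) for s in strings]
    return np.array(encoded, dtype=tagged_dtype(width, encoding))

def to_strings(arr):
    """array -> Object exit point: decode if tagged, else return bytes."""
    enc = encoding_of(arr)
    items = arr.tolist()
    return items if enc is None else [b.decode(enc) for b in items]

# Store two short strings in 5 bytes per item, then read them back as str.
names = from_strings(["caf\u00e9", "abc"], width=5, encoding="latin-1")
roundtrip = to_strings(names)
```

The storage stays one byte per character, and only the two helper
functions ever touch the encoding.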

Of course it would still be a lot of work, since the string comparisons
would need to know about comparing different encodings, dtype
equivalence would be wrong, and all the conversions would need to be
carefully checked... Most string tools probably don't care about the
encoding as long as it is a fixed 1-byte width, though one would have to
check that they don't lose the encoding information by creating a new
"S" array...
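
To illustrate the two concerns, here is a small sketch under the same
made-up convention (an "encoding" key in the dtype metadata, with
hypothetical helper names). It shows identical bytes that mean different
characters under different encodings, and a cast to a plain "S" dtype
that returns an array without the tag.

```python
import numpy as np

def encoding_of(arr):
    """Return the (hypothetical) encoding tag, or None if there is none."""
    md = arr.dtype.metadata
    return md.get("encoding") if md else None

def text_equal(a, b):
    """Element-wise comparison after decoding each side with its own tag."""
    ta = [x.decode(encoding_of(a)) for x in a.tolist()]
    tb = [x.decode(encoding_of(b)) for x in b.tolist()]
    return [x == y for x, y in zip(ta, tb)]

# The same single byte, tagged with two different 1-byte encodings.
latin = np.array(
    [b"\xe9"], dtype=np.dtype("S1", metadata={"encoding": "latin-1"}))
cyrillic = np.array(
    [b"\xe9"], dtype=np.dtype("S1", metadata={"encoding": "iso8859_5"}))

same_bytes = latin.tobytes() == cyrillic.tobytes()  # identical storage
same_text = text_equal(latin, cyrillic)             # different characters

# Casting to a plain "S" dtype yields an array without the tag.
plain = latin.astype("S1")
lost = encoding_of(plain) is None
```

So a byte-wise comparison would call these two arrays equal even though
the text differs, and any tool that builds its result with a plain "S"
dtype silently drops the encoding.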

- Sebastian


> Chuck
> 
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion




