[Numpy-discussion] A one-byte string dtype?
oscar.j.benjamin at gmail.com
Tue Jan 21 06:13:36 EST 2014
On Mon, Jan 20, 2014 at 04:12:20PM -0700, Charles R Harris wrote:
> > On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <charlesr.harris at gmail.com> wrote:
> > On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs at pobox.com> wrote:
> >> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris <charlesr.harris at gmail.com> wrote:
> >> >
> >> > I didn't say we should change the S type, but that we should have
> >> > something, say 's', that appeared to python as a string. I think if
> >> > we want transparent string interoperability with python together
> >> > with a compressed representation, and I think we need both, we are
> >> > going to have to deal with the difficulties of utf-8. That means
> >> > raising errors if the string doesn't fit in the allotted size, etc.
> >> > Mind, this is a workaround for the mass of ascii data that is
> >> > already out there, not a substitute for 'U'.
> >> If we're going to be taking that much trouble, I'd suggest going ahead
> >> and adding a variable-length string type (where the array itself
> >> contains a pointer to a lookaside buffer, maybe with an optimization
> >> for stashing short strings directly). The fixed-length requirement is
> >> pretty onerous for lots of applications (e.g., pandas always uses
> >> dtype="O" for strings -- and that might be a good workaround for some
> >> people in this thread for now). The use of a lookaside buffer would
> >> also make it practical to resize the buffer when the maximum code
> >> point changed, for that matter...
> The more I think about it, the more I think we may need to do that. Note
> that dynd has ragged arrays and I think they are implemented as pointers to
> buffers. The easy way for us to do that would be a specialization of object
> arrays to string types only as you suggest.
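For reference, the object-dtype workaround mentioned above (what pandas does) can be used today. A minimal sketch contrasting it with the fixed-width 'U' dtype (the example strings are illustrative):

```python
import numpy as np

# Fixed-width 'U' dtype: every element is padded to the longest string
# and stored as UTF-32, i.e. 4 bytes per character.
fixed = np.array(["a", "bc", "longer string"])
assert fixed.dtype.kind == "U"
assert fixed.dtype.itemsize == 4 * len("longer string")

# Object dtype: each element is a pointer to an ordinary Python str,
# so short strings are not padded -- at the cost of per-object overhead
# and heap fragmentation.
ragged = np.array(["a", "bc", "longer string"], dtype=object)
assert ragged.dtype == object
assert isinstance(ragged[0], str)
```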
This wouldn't necessarily help for the gigarows-of-short-text-strings use case
(depending on what "short" means). Also, even if it technically saves memory,
you may pay a greater cost from fragmenting your array all over the heap.
On my 64 bit Linux system the size of a Python 3.3 str containing only ASCII
characters is 49+N bytes. For the 'U' dtype it's 4N bytes. You get a memory
saving over dtype='U' only if the strings are 17 characters or more. To get a
50% saving over dtype='U' you'd need strings of at least 49 characters.
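Those figures can be checked directly (the 49-byte constant is specific to CPython's compact ASCII strings on a 64-bit build, so the sketch measures it rather than hard-coding it):

```python
import sys

import numpy as np

N = 20
s = "a" * N

# Per-object overhead of a compact ASCII str on 64-bit CPython:
# getsizeof(s) == overhead + N, where overhead == getsizeof("").
overhead = sys.getsizeof("")
assert sys.getsizeof(s) == overhead + N

# The fixed-width 'U' dtype stores UTF-32: 4 bytes per character.
assert np.dtype("U%d" % N).itemsize == 4 * N

# Break-even: a Python str is smaller than a 'U' element once
# overhead + N < 4N, i.e. N > overhead / 3 (17+ chars when overhead is 49).
break_even = overhead // 3 + 1
assert overhead + break_even < 4 * break_even
```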
If the NumPy array managed the buffers itself, that per-string memory overhead
would be eliminated in exchange for an 8-byte pointer and at least 1 byte to
represent the length of the string (assuming you can somehow use Pascal-style
length-prefixed strings when short enough - null terminators cannot be used).
This gives an overhead of 9 bytes per string (or 5 on 32-bit). In this case you
save memory if the strings are more than 3 characters long, and you get at
least a 50% saving for strings of 9 characters or more.
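The arithmetic behind those break-even points, as a sketch (the 8-byte pointer and 1-byte length prefix are the assumptions stated above, not an existing NumPy layout):

```python
# Hypothetical per-element cost of a pointer-plus-length scheme versus
# the fixed-width 'U' dtype, for an ASCII string of length n.
POINTER = 8  # bytes on a 64-bit platform
LENGTH = 1   # bytes for a Pascal-style length prefix


def managed_bytes(n):
    return POINTER + LENGTH + n  # 9 + n


def u_dtype_bytes(n):
    return 4 * n  # UTF-32: 4 bytes per code point


# Saves memory once the string exceeds 3 characters: 9 + n < 4n for n > 3.
assert managed_bytes(3) >= u_dtype_bytes(3)
assert managed_bytes(4) < u_dtype_bytes(4)

# At least a 50% saving from 9 characters on: 9 + n <= 2n for n >= 9.
assert managed_bytes(9) <= u_dtype_bytes(9) // 2
```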
Using utf-8 in the buffers eliminates the need to go around checking maximum
code points etc., so I would guess that would be simpler to implement (CPython
has now had to triple all of its code paths that actually access the string
data).
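To illustrate the point about utf-8 versus fixed-width storage (the sizes below are plain Python measurements, not any proposed dtype):

```python
import numpy as np

text = "hello"  # pure ASCII

# utf-8 encodes ASCII in 1 byte per character, regardless of what the
# maximum code point elsewhere in the array happens to be.
assert len(text.encode("utf-8")) == 5

# The 'U' dtype always pays 4 bytes per character (UTF-32).
assert np.array([text]).dtype.itemsize == 20

# A non-ASCII character costs more in utf-8, but only in that string:
assert len("héllo".encode("utf-8")) == 6
```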