[Numpy-discussion] proposal: smaller representation of string arrays

Tue Apr 25 15:52:06 EDT 2017

On Apr 25, 2017 10:13 AM, "Anne Archibald" <peridot.faceted at gmail.com>
wrote:

On Tue, Apr 25, 2017 at 6:05 PM Chris Barker <chris.barker at noaa.gov> wrote:

> Anyway, I think I made the mistake of mingling possible solutions in with
> the use-cases, so I'm not sure if there is any consensus on the use cases
> -- which I think we really do need to nail down first -- as Robert has made
> clear.
>

I would make my use-cases more user-specific:

1) User wants an array with numpy indexing tricks that can hold python
strings but doesn't care about the underlying representation.
-> Solvable with object arrays, or Robert's string-specific object arrays;
underlying representation is python objects on the heap. Sadly UCS-4, so
zillions are going to be a memory problem.

It's possible to do much better than this when defining a specialized
variable-width string dtype. E.g. make the itemsize 8 bytes (like an object
array, assuming a 64 bit system), but then for strings that can be encoded
in 7 bytes or less of utf8 store them directly in the array; else store a
pointer to a raw utf8 string on the heap. (Possibly with a reference count
- there are some interesting tradeoffs there. I suspect 1-byte reference
counts might be the way to go; if a logical copy would make it overflow
then make an actual copy instead.) Anything involving the heap is going to
have some overhead, but we don't need full fledged Python objects and once
we give up mmap compatibility then there's a lot of room to tune.

-n
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20170425/1ee5e0e5/attachment.html>