On Apr 25, 2017 10:13 AM, "Anne Archibald" <peridot.faceted@gmail.com> wrote:

On Tue, Apr 25, 2017 at 6:05 PM Chris Barker <chris.barker@noaa.gov> wrote:
Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear.

I would make my use-cases more user-specific:

1) User wants an array with numpy indexing tricks that can hold python strings but doesn't care about the underlying representation.
-> Solvable with object arrays, or Robert's string-specific object arrays; underlying representation is Python objects on the heap. Sadly UCS-4, so zillions of strings are going to be a memory problem.

It's possible to do much better than this when defining a specialized variable-width string dtype. E.g. make the itemsize 8 bytes (like an object array, assuming a 64-bit system), but then for strings that can be encoded in 7 bytes or fewer of UTF-8, store them directly in the array; otherwise store a pointer to a raw UTF-8 string on the heap. (Possibly with a reference count; there are some interesting tradeoffs there. I suspect 1-byte reference counts might be the way to go: if a logical copy would make the count overflow, make an actual copy instead.) Anything involving the heap is going to have some overhead, but we don't need full-fledged Python objects, and once we give up mmap compatibility there's a lot of room to tune.
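To make the layout concrete, here is a rough pure-Python sketch of the 8-byte slot idea. It is only an illustration, not a proposed implementation: the names (ToyHeap, pack, unpack, copy_slot), the tag byte value 0xFF, and the use of a list index in place of a real pointer are all assumptions made up for this example. A real dtype would do this in C and would need some other trick (alignment bits, or the unused high bits of a 64-bit address) to fit the tag next to a genuine pointer.

```python
ITEMSIZE = 8                 # one slot per element, same width as a 64-bit pointer
INLINE_MAX = ITEMSIZE - 1    # 7 bytes of UTF-8 payload plus 1 tag byte
HEAP_TAG = 0xFF              # hypothetical marker: the slot holds a heap handle


class ToyHeap:
    """Stand-in for malloc'd raw UTF-8 buffers, each with a 1-byte refcount."""

    MAX_REFS = 255  # largest value a 1-byte counter can hold

    def __init__(self):
        self.buffers = []  # each entry is [refcount, utf8_bytes]

    def alloc(self, data):
        self.buffers.append([1, data])
        return len(self.buffers) - 1  # the list index plays the role of a pointer

    def share(self, handle):
        """Logical copy: bump the count, or make a real copy if it would overflow."""
        entry = self.buffers[handle]
        if entry[0] == self.MAX_REFS:
            return self.alloc(entry[1])
        entry[0] += 1
        return handle


def pack(text, heap):
    """Encode one string into an 8-byte slot."""
    data = text.encode("utf-8")
    if len(data) <= INLINE_MAX:
        # Short string: the tag byte is the length, the payload is zero-padded.
        return bytes([len(data)]) + data.ljust(INLINE_MAX, b"\0")
    # Long string: store the raw UTF-8 on the heap, keep only a handle here.
    handle = heap.alloc(data)
    return bytes([HEAP_TAG]) + handle.to_bytes(INLINE_MAX, "little")


def unpack(slot, heap):
    """Decode an 8-byte slot back into a Python str."""
    tag = slot[0]
    if tag == HEAP_TAG:
        handle = int.from_bytes(slot[1:], "little")
        return heap.buffers[handle][1].decode("utf-8")
    return slot[1:1 + tag].decode("utf-8")


def copy_slot(slot, heap):
    """Copy an element: inline slots copy by value, heap slots share the buffer."""
    if slot[0] != HEAP_TAG:
        return slot
    handle = heap.share(int.from_bytes(slot[1:], "little"))
    return bytes([HEAP_TAG]) + handle.to_bytes(INLINE_MAX, "little")
```

With this layout, pack("numpy", heap) never touches the heap, while pack("variable-width", heap) allocates one buffer. Copying a heap-backed slot only bumps the counter until it reaches 255, after which copy_slot falls back to a real copy.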

-n