[Numpy-discussion] proposal: smaller representation of string arrays

Wed Apr 26 11:39:46 EDT 2017

> I think we can implement viewers for strings as ndarray subclasses. Then one
> could do `my_string_array.view(latin_1)`, and so on. Essentially that just
> changes the default encoding of the 'S' array. That could also work for
> uint8 arrays if needed.
>
> Chuck

To handle structured data-types containing encoded strings, we'd also
need to subclass `np.void`.

Things would get messy when a structured dtype contains two strings in
different encodings (or, more likely, one byte string and one text
string): we'd need some way to specify which fields are in which
encoding, and using subclasses means that this can't be contained
within the dtype information.
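
To make the problem concrete, here is a minimal sketch using today's
NumPy (Python 3 assumed). The `field_encodings` dict and the
`get_field` helper are hypothetical names made up for illustration;
the point is only that the per-field encoding has to live somewhere
outside the dtype:

```python
import numpy as np

# Two fixed-width 'S' fields; the dtype itself records no encoding.
dt = np.dtype([("raw", "S4"), ("name", "S8")])
arr = np.array([(b"\xff\x00\x01\x02", "café".encode("latin-1"))], dtype=dt)

# Hypothetical side table: which fields hold text, and in what encoding.
# "raw" is absent, so it is treated as plain bytes.
field_encodings = {"name": "latin-1"}

def get_field(record, field):
    """Return the field decoded if an encoding is registered, else raw bytes."""
    value = record[field]
    enc = field_encodings.get(field)
    return value.decode(enc) if enc is not None else value

name = get_field(arr[0], "name")  # text
raw = get_field(arr[0], "raw")    # still bytes
```

Anything that copies or views `arr` has to carry `field_encodings`
along by hand, which is the bookkeeping a dtype-level encoding would
avoid.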

So I think there's a strong argument for solving this with `dtype`s
rather than subclasses. This really doesn't seem hard though.
Something like (C-but-as-python):

def ENCSTRING_getitem(ptr, arr):  # a PyArray_ArrFuncs slot
    encoded = STRING_getitem(ptr, arr)
    return encoded.decode(arr.dtype.encoding)

def ENCSTRING_setitem(val, ptr, arr):  # a PyArray_ArrFuncs slot
    val = val.encode(arr.dtype.encoding)
    # todo: handle "safe" truncation, where safe might mean keep
    # codepoints, keep graphemes, or never allow
    STRING_setitem(val, ptr, arr)
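
For anyone who wants to play with the idea, the same pair can be
emulated in pure Python over a flat `bytearray`. This is only a
sketch: `ITEMSIZE`, `ENCODING`, `getitem` and `setitem` are made-up
stand-ins (an integer index replaces the C pointer), and the
truncation shown is just the "keep codepoints" policy, which relies on
UTF-8 decoding with `errors="ignore"` dropping an incomplete trailing
sequence:

```python
ITEMSIZE = 4        # bytes per element, like dtype 'S4'
ENCODING = "utf-8"  # stand-in for the proposed arr.dtype.encoding

def getitem(buf, index):
    """Read one fixed-width element and decode it to text."""
    raw = bytes(buf[index * ITEMSIZE:(index + 1) * ITEMSIZE])
    return raw.rstrip(b"\x00").decode(ENCODING)  # 'S' semantics: strip NUL padding

def setitem(buf, index, text):
    """Encode text and store it, truncating on a codepoint boundary."""
    encoded = text.encode(ENCODING)
    if len(encoded) > ITEMSIZE:
        # Cut at the byte limit, then discard any partial trailing codepoint.
        cut = encoded[:ITEMSIZE]
        encoded = cut.decode(ENCODING, errors="ignore").encode(ENCODING)
    buf[index * ITEMSIZE:(index + 1) * ITEMSIZE] = encoded.ljust(ITEMSIZE, b"\x00")

buf = bytearray(2 * ITEMSIZE)
setitem(buf, 0, "abc")
setitem(buf, 1, "aaaé")  # 5 bytes in UTF-8: the split 'é' is dropped whole
```

A "never allow" policy would raise instead of truncating, and "keep
graphemes" would need grapheme cluster data that the standard library
does not provide.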

We'd probably need to be careful to do a decode/encode dance when
copying from one encoding to another, but we [already have
bugs](https://github.com/numpy/numpy/issues/3258) in those cases
anyway.
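
With today's 'S' arrays that dance has to be done by hand. A rough
sketch, assuming Python 3 and using the existing `np.char.decode` and
`np.char.encode` helpers (the variable names are just for
illustration):

```python
import numpy as np

latin1 = np.array(["café".encode("latin-1"), b"abc"])  # 'S4', Latin-1 bytes

# A raw byte copy keeps the Latin-1 bytes, which are not valid UTF-8.
naive = latin1.astype("S8")
try:
    naive[0].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# The dance: decode with the source encoding, re-encode with the target one.
text = np.char.decode(latin1, "latin-1")  # unicode ('U') array
utf8 = np.char.encode(text, "utf-8")      # 'S' array, now holding UTF-8 bytes
```

Note that the re-encoded array needs a wider itemsize than the
original, so a dtype-aware cast would also have to decide how to size
the destination.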

Is it reasonable that the user of such an array would want to work
with plain built-in `unicode` objects (`str` on Python 3), rather than
some special numpy scalar type?

Eric