[Numpy-discussion] proposal: smaller representation of string arrays

Chris Barker chris.barker at noaa.gov
Thu Apr 20 13:06:31 EDT 2017


Thanks so much for reviving this conversation -- we really do need to
address this.

My thoughts:

What people apparently want is a string type for Python3 which uses less
> memory for the common science use case which rarely needs more than
> latin1 encoding.
>

Yes -- I think there is a real demand for that.





To please everyone I think we need to go with a dtype that supports
> multiple encodings via metadata, similar to how datetime supports
> multiple units.
> E.g.: 'U10[latin1]' are 10 characters in latin1 encoding
>

I wonder if we really need that -- as you say, there is real demand for a
compact string type, but for many use cases, 1 byte per character is
enough. So to keep things really simple, I think a single 1-byte-per-char
encoding would meet most people's needs.
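
(For context, the datetime64 analogy: the unit is carried as metadata inside
the dtype, which is what the proposed 'U10[latin1]' spelling would mirror. A
quick illustration with current numpy -- the encoding-in-brackets syntax
itself is only hypothetical:)

    import numpy as np

    # existing precedent: datetime64 carries its unit as dtype metadata
    ms = np.dtype('datetime64[ms]')   # millisecond resolution
    us = np.dtype('datetime64[us]')   # microsecond resolution
    print(ms == us)                   # False -- the unit is part of the dtype

    # the proposal would put an encoding in the same slot, e.g. 'U10[latin1]',
    # which numpy does not accept today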

What should that encoding be?

latin-1 is obvious (and has the very nice property of being able to
round-trip arbitrary bytes -- at least with Python's implementation) and
scientific data sets tend to use the latin alphabet (with its ascii roots
and all).
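
(A quick sketch of that round-trip property, in plain Python:)

    # every byte value 0-255 maps to a code point in latin-1, so arbitrary
    # bytes survive a decode/encode round trip unchanged
    raw = bytes(range(256))
    assert raw.decode('latin-1').encode('latin-1') == raw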

But there is now latin-9:

https://en.wikipedia.org/wiki/ISO/IEC_8859-15

Maybe a better option?

Encodings we should support are:
> - latin1 (1 byte):
> it is compatible with ascii and adds extra characters used in the
> western world.
> - utf-32 (4 bytes):
> can represent every character, equivalent to np.unicode
>

IIUC, datetime64 is, well, always 64 bits. So it may be better to have a
given dtype always be the same bitwidth.

So the utf-32 dtype would be a different dtype, which also keeps it really
simple: we have a latin-* dtype and a full-on unicode dtype -- that's it.

Encodings we should maybe support:
> - utf-16 with explicitly disallowing surrogate pairs (2 bytes):
> this covers a very large range of possible characters in a reasonably
> compact representation
>

I think UTF-16 is, very simply, the worst of both worlds. If we want a
two-byte character set, then it should be UCS-2 -- i.e. explicitly
rejecting any code point that takes more than two bytes to represent (or
maybe that's what you mean by explicitly disallowing surrogate pairs). In
any case, it should certainly give you an encoding error if you try to pass
in a unicode character that cannot fit into two bytes.
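
(A sketch of what that rejection could look like, in plain Python --
encode_ucs2 is a hypothetical helper, not an existing codec:)

    def encode_ucs2(s):
        # reject anything outside the BMP so every character fits in
        # exactly two bytes, i.e. no surrogate pairs
        for i, ch in enumerate(s):
            if ord(ch) > 0xFFFF:
                raise UnicodeEncodeError('ucs-2', s, i, i + 1,
                                         'code point does not fit in two bytes')
        return s.encode('utf-16-le')

    encode_ucs2("héllo")         # fine: every character fits in two bytes
    # encode_ucs2("\U0001F600")  # would raise UnicodeEncodeError (outside the BMP)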

So: is there actually a demand for this? If so, then I think it should be a
separate 2-byte string type, with the encoding always the same.


> - utf-8 (4 bytes):
> variable-length encoding with a minimum size of 1 byte, but we would need
> to assume the worst case of 4 bytes so it would not save anything
> compared to utf-32 but may allow third parties to replace an encoding step
> with trailing null trimming on serialization.
>

Yeach -- utf-8 is great for interchange and streaming data, but not for
internal storage, particularly given numpy's requirement that every item
have the same number of bytes. So if someone wants to work with utf-8 they
can store it in a byte array, and encode and decode as they pass it to/from
Python. That's going to have to happen anyway, even if under the hood. And
it's risky business -- if you truncate a utf-8 bytestring, you may get
invalid data -- it really does not belong in numpy.
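
(A quick illustration of the truncation hazard, in plain Python:)

    # fixed-byte-width slicing of utf-8 data -- which is effectively what a
    # numpy 'S' field would do -- can cut a multi-byte character in half
    data = "naïve".encode('utf-8')   # 6 bytes: the 'ï' takes two
    truncated = data[:3]
    try:
        truncated.decode('utf-8')
    except UnicodeDecodeError as err:
        print(err)                   # the slice split a multi-byte character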


> - Add a new dtype, e.g. npy.realstring
>

I think that's the way to go. Backwards compatibility is really key. Though
could we make the existing string dtype an always-latin-1 type without
breaking too much? Or maybe deprecate it and get there in the future?

It has the cosmetic disadvantage that it makes the np.unicode dtype
> obsolete and is more busywork to implement.
>

I think the np.unicode type should remain as the 4-bytes-per-char encoding.
But that only makes sense if you follow my idea that we don't have a
variable-number-of-bytes-per-char dtype.
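
(Concretely, with current numpy:)

    import numpy as np

    # the existing unicode dtype stores UCS-4: four bytes per character
    a = np.array(['abc'], dtype='U3')
    print(a.dtype, a.itemsize)   # e.g. <U3 12 -- three characters * 4 bytes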

So my proposal is:

 - Create a new one-byte-per-char dtype that is always latin-9 encoded (a
   rough approximation with today's tools is sketched after this list).
    - in python3 it would map to a string (i.e. unicode)
 - Keep the 4-byte per char unicode string type

Optionally (if there is really demand)
 - Create a new two-byte per char dtype that is always UCS-2 encoded.
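
A rough approximation of the first item with today's tools would be to store
one byte per character in an 'S' array and decode on the way out. This is
just a sketch, not the proposed dtype; latin-1 is used here for simplicity
(latin-9, 'iso8859-15' in Python, differs in only eight code points):

    import numpy as np

    names = ["café", "naïve"]
    # one byte per character, fixed width of 10 bytes per element
    a = np.array([s.encode('latin-1') for s in names], dtype='S10')
    # decode back to Python 3 str on the way out
    decoded = [b.decode('latin-1') for b in a]
    print(decoded)   # ['café', 'naïve']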


Is there any way to leverage Python3's nifty string type? I'm thinking not.
At least not for numpy arrays that can play well with C code, etc.

All that being said, an encoding-specified string dtype would be nice too --
I just think it's more complex than it needs to be. Numpy is not the tool
for text processing...

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov