[Numpy-discussion] proposal: smaller representation of string arrays

Thu Apr 20 12:47:09 EDT 2017

On Thu, Apr 20, 2017 at 3:17 PM Julian Taylor <jtaylor.debian at googlemail.com>
wrote:

> To please everyone I think we need to go with a dtype that supports
> multiple encodings via metadata, similar to how datetime supports
> multiple units.
> E.g.: 'U10[latin1]' are 10 characters in latin1 encoding
>
> Encodings we should support are:
> - latin1 (1 byte):
> it is compatible with ascii and adds extra characters used in the
> western world.
> - utf-32 (4 bytes):
> can represent every character, equivalent to np.unicode
>
> Encodings we should maybe support:
> - utf-16 with explicitly disallowing surrogate pairs (2 bytes):
> this covers a very large range of possible characters in a reasonably
> compact representation
> - utf-8 (4 bytes):
> variable length encoding with minimum size of 1 byte, but we would need
> to assume the worst case of 4 bytes so it would not save anything
> compared to utf-32 but may allow third parties to replace an encoding step
> with trailing null trimming on serialization.
>

I should say first that I've never used even non-Unicode string arrays, but
is there any reason not to support all Unicode encodings that python does,
with the same names and semantics? This would surely be the simplest to
understand.
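
For a rough sense of the sizes involved, here is a small sketch that uses only
Python's built-in codecs. The sample characters and the helper name
encoded_size are arbitrary choices for illustration, not part of any proposal:

```python
# Per-character encoded size under a few of Python's standard codecs.
# The "-le" variants are used so that no byte-order mark is counted.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, e-acute, euro sign, emoji
encodings = ["latin-1", "utf-8", "utf-16-le", "utf-32-le"]


def encoded_size(ch, encoding):
    """Return the number of bytes ch needs in encoding, or None if unencodable."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


sizes = {ch: {enc: encoded_size(ch, enc) for enc in encodings} for ch in samples}

for ch, row in sizes.items():
    print(repr(ch), row)
```

Characters outside latin1 (the euro sign, for instance) simply cannot be
encoded there, and a character outside the Basic Multilingual Plane takes 4
bytes in UTF-16, which is the surrogate-pair case mentioned above.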

Also, if latin1 is going to be the only practical 8-bit encoding, maybe
check with some non-Western users to make sure it's not going to wreck
their lives? I'd have selected ASCII as an encoding to treat specially, if
any, because Unicode already does that and the consequences are familiar.
(I'm used to writing and reading French without accents because it's passed
through ASCII, for example.)

Variable-length encodings, of which UTF-8 is obviously the one that makes
good handling essential, are indeed more complicated. But is it strictly
necessary that string arrays hold fixed-length *strings*, or can the
encoding length be fixed instead? That is, currently if you try to assign a
longer string than will fit, the string is truncated to the number of
characters in the data type. Instead, for encoded Unicode, the string could
be truncated so that the encoding fits. Of course this is not completely
trivial for variable-length encodings, but it should be doable, and it
would allow UTF-8 to be used just the way it usually is - as an encoding
that's almost 8-bit.
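
To make the idea concrete, here is a minimal sketch in plain Python of
truncating a string so that its UTF-8 encoding fits a fixed byte budget without
splitting a character. The function name truncate_utf8 is hypothetical, and
this is not how NumPy does or would implement it:

```python
def truncate_utf8(s, nbytes):
    """Encode s as UTF-8, dropping trailing characters so the result fits in nbytes."""
    encoded = s.encode("utf-8")
    if len(encoded) <= nbytes:
        return encoded
    end = nbytes
    # encoded[end] is the first byte that would be dropped. If it is a
    # continuation byte (bit pattern 10xxxxxx), the cut falls in the middle
    # of a character, so back up to the start of that character.
    while end > 0 and (encoded[end] & 0xC0) == 0x80:
        end -= 1
    return encoded[:end]


print(truncate_utf8("caf\u00e9", 4))  # the 2-byte e-acute does not fit and is dropped
print(truncate_utf8("caf\u00e9", 5))  # the whole string fits
```

The stored field may then hold fewer characters than an ASCII-only string of
the same byte length would, which is the trade-off of fixing the encoded length
rather than the character count.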

All this said, it seems to me that the important use cases for string
arrays involve interaction with existing binary formats, so people who have
to deal with such data should have the final say. (My own closest approach
to this is the FITS format, which is restricted by the standard to ASCII.)

Anne