[Numpy-discussion] proposal: smaller representation of string arrays

Julian Taylor jtaylor.debian at googlemail.com
Thu Apr 20 09:15:27 EDT 2017


Hello,
As you probably know, numpy does not deal well with strings in Python 3.
The np.string type is actually zero-terminated bytes, not a string.
In Python 2 this happened to work out because bytes and strings are
treated the same way, but in Python 3 the type is pretty hard to work
with: every item you get from a numpy bytes array needs to be decoded
before you have a string.
The only string type available in Python 3 is np.unicode, which uses the
4-byte UTF-32 encoding and is therefore deemed to use too much memory to
see much use.
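
A quick illustration of both points with today's numpy:

    import numpy as np

    # 'S' arrays hold zero-terminated bytes, so indexing on Python 3
    # returns bytes objects that need an explicit decode to become str:
    a = np.array(['foo', 'bar'], dtype='S3')
    print(a[0])           # b'foo'
    print(a[0].decode())  # 'foo'

    # the only real string dtype, 'U', stores UTF-32, i.e. 4 bytes per
    # character:
    u = np.array(['foo', 'bar'], dtype='U3')
    print(a.itemsize, u.itemsize)  # 3 12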

What people apparently want is a string type for Python 3 which uses less
memory for the common science use case, which rarely needs more than
latin1 encoding.
As we have been told, we cannot change the np.string type to actually be
strings, because existing programs interpret its content as bytes, even
though that is quite broken due to its null-terminating property (it
ignores all trailing nulls).
Also, eight years of third parties working around numpy's poor Python 3
support decisions probably make the 'return bytes' behaviour impossible
to change now.
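
For illustration, the trailing-null behaviour in today's numpy:

    import numpy as np

    # 'S' arrays drop trailing NUL bytes on access, so arbitrary byte
    # content cannot be round-tripped through them:
    b = np.array([b'ab\x00\x00'], dtype='S4')
    print(b[0])        # b'ab' -- the trailing nulls are silently stripped
    print(b.itemsize)  # 4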

So we need a new dtype that can represent strings in numpy arrays and is
smaller than the existing 4-byte UTF-32.

To please everyone I think we need to go with a dtype that supports
multiple encodings via metadata, similar to how datetime supports
multiple units.
E.g.: 'U10[latin1]' would be 10 characters in latin1 encoding.
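
For comparison, datetime64 already works this way, the unit being part of
the dtype.  The string spellings below for the proposed dtype are only a
sketch of the idea, not existing numpy API:

    import numpy as np

    # datetime64 parametrizes a single dtype with metadata (the unit):
    print(np.dtype('datetime64[ms]'))                               # datetime64[ms]
    print(np.dtype('datetime64[ms]') == np.dtype('datetime64[s]'))  # False

    # the proposal would do the same for strings (hypothetical spellings):
    #   np.dtype('U10[latin1]')  -> 10 characters, 1 byte each
    #   np.dtype('U10[utf32]')   -> 10 characters, 4 bytes each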

Encodings we should support are:
- latin1 (1 byte):
it is compatible with ASCII and adds the extra characters used in the
Western world.
- utf-32 (4 bytes):
can represent every character, equivalent to np.unicode

Encodings we should maybe support:
- utf-16 with surrogate pairs explicitly disallowed (2 bytes):
this covers a very large range of possible characters in a reasonably
compact representation.
- utf-8 (4 bytes worst case):
a variable-length encoding with a minimum size of 1 byte per character,
but we would need to assume the worst case of 4 bytes, so it would not
save anything compared to utf-32; it may, however, allow third parties to
replace an encoding step with trailing-null trimming on serialization.
(A rough per-character size comparison follows below.)
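
As a rough illustration of the per-character cost of these candidates,
using plain stdlib codecs (nothing numpy-specific):

    # encoded sizes in bytes for a few sample strings; latin1 and utf-32
    # are fixed width (1 and 4 bytes per character), utf-16 and utf-8 vary
    for text in ('abc', 'Grüße', '漢字'):
        print(text,
              len(text.encode('latin1', errors='replace')),  # lossy outside latin1
              len(text.encode('utf-16-le')),                 # 2 bytes/char without surrogates
              len(text.encode('utf-8')),                     # 1-4 bytes/char
              len(text.encode('utf-32-le')))                 # 4 bytes/char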


To actually do this we have two options, both of which break our ABI
unless we resort to ugly hacks.

- Add a new dtype, e.g. npy.realstring
By not modifying an existing type, the only programs broken would be
those using NPY_CHAR; the most notable case of this is f2py.
It has the cosmetic disadvantage that it makes the np.unicode dtype
obsolete, and it is more busywork to implement.

- Modify np.unicode to have encoding metadata
This allows us to reuse all the type boilerplate, so it is more
convenient to implement, and by extending an existing type instead of
making one obsolete it results in a much nicer API.
The big drawback is that it will explicitly break any third party that
receives an array with a new encoding while assuming that the buffer of
an array of type np.unicode has a character itemsize of 4 bytes.
To ease this problem we would need to add APIs to numpy now to get the
itemsize and encoding, so third parties can error out cleanly.
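
To sketch what such accessors could look like (the names below are made
up, not existing numpy API; today every 'U' array is UTF-32, so the
values are constant and only become interesting once the encoding is
stored in the dtype metadata):

    import numpy as np

    def string_encoding(dtype):
        # with the proposal this would be read from the dtype's metadata
        return 'utf-32' if dtype.kind == 'U' else None

    def bytes_per_character(dtype):
        # 4 today; would depend on the encoding metadata
        return 4 if dtype.kind == 'U' else None

    a = np.array(['abc'], dtype='U3')
    print(string_encoding(a.dtype), bytes_per_character(a.dtype))  # utf-32 4

A third party that does not understand a given encoding could then raise
a clear error instead of silently misreading the buffer.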

The implementation of it is not that big a deal; I have already created
a prototype that adds latin1 metadata to np.unicode, and it works quite
well. It is imo realistic to get this into 1.14, should we be able to
make a decision on which way to implement it.

Do you have comments on how to go forward, in particular with regard to
a new dtype vs. modifying np.unicode?

cheers,
Julian
