[Numpy-discussion] proposal: smaller representation of string arrays

Thu Apr 20 14:15:49 EDT 2017

I probably should have formulated my goal with the proposal a bit better:
I am not very interested in repeating the debate over which encoding to use.
In the end, whatever is done will allow any encoding via a dtype with
metadata, like datetime.
This allows any codec (including truncated utf8) to be added easily (if
Python supports it) and allows sidestepping the debate.

My main concern is whether it should be a new dtype or a modification of
the unicode dtype, though the backward compatibility argument is strongly in
favour of adding a new dtype that makes the np.unicode type redundant.
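
To make the datetime comparison concrete, here is a small sketch of how
datetime64 already carries a unit as dtype metadata. It only uses existing
numpy calls and shows the model, not the proposed string dtype, which does
not exist yet.

```python
import numpy as np

# Same type and same 8-byte storage, but different unit metadata.
seconds = np.dtype("M8[s]")
millis = np.dtype("M8[ms]")

print(seconds.itemsize, millis.itemsize)   # 8 8
print(np.datetime_data(seconds))           # ('s', 1)
print(np.datetime_data(millis))            # ('ms', 1)
print(seconds == millis)                   # False: the metadata is part of the dtype

# The same instant is stored as a different integer under each unit.
t_s = np.datetime64("2017-04-20T14:15:49", "s")
t_ms = t_s.astype(millis)
print(int(t_s.astype("i8")), int(t_ms.astype("i8")))
```

An encoding parameter on a string dtype would play the same role the unit
plays here: it changes how the stored bytes are interpreted.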

On 20.04.2017 15:15, Julian Taylor wrote:
> Hello,
> As you probably know numpy does not deal well with strings in Python3.
> The np.string type is actually zero terminated bytes and not a string.
> In Python2 this happened to work out as it treats bytes and strings the
> same way. But in Python3 this type is pretty hard to work with as each
> time you get an item from a numpy bytes array it needs decoding to
> receive a string.
> The only string type available in Python3 is np.unicode, which uses a
> 4-byte utf-32 encoding that is deemed to use too much memory to
> actually see much use.
> 
> What people apparently want is a string type for Python3 which uses less
> memory for the common science use case which rarely needs more than
> latin1 encoding.
> As we have been told, we cannot change the np.string type to actually be
> strings, as existing programs do interpret its content as bytes despite
> this being very broken due to its null terminating property (it will
> ignore all trailing nulls).
> Also 8 years of working around numpy's poor python3 support decisions in
> third parties probably make the 'return bytes' behaviour impossible to
> change now.
> 
> So we need a new dtype that can represent strings in numpy arrays which
> is smaller than the existing 4 byte utf-32.
> 
> To please everyone I think we need to go with a dtype that supports
> multiple encodings via metadata, similar to how datetime supports
> multiple units.
> E.g.: 'U10[latin1]' are 10 characters in latin1 encoding
> 
> Encodings we should support are:
> - latin1 (1 byte):
> it is compatible with ascii and adds extra characters used in the
> western world.
> - utf-32 (4 bytes):
> can represent every character, equivalent with np.unicode
> 
> Encodings we should maybe support:
> - utf-16 with surrogate pairs explicitly disallowed (2 bytes):
> this covers a very large range of possible characters in a reasonably
> compact representation
> - utf-8 (4 bytes):
> variable length encoding with a minimum size of 1 byte, but we would need
> to assume the worst case of 4 bytes so it would not save anything
> compared to utf-32 but may allow third parties to replace an encoding step
> with trailing null trimming on serialization.
> 
> 
> To actually do this we have two options, both of which break our ABI
> unless we resort to ugly hacks.
> 
> - Add a new dtype, e.g. npy.realstring
> By not modifying an existing type, we only break programs using
> NPY_CHAR. The most notable case of this is f2py.
> It has the cosmetic disadvantage that it makes the np.unicode dtype
> obsolete and is more busywork to implement.
> 
> - Modify np.unicode to have encoding metadata
> This allows us to reuse all of the type boilerplate so it is more
> convenient to implement and by extending an existing type instead of
> making one obsolete it results in a much nicer API.
> The big drawback is that it will explicitly break any third party that
> receives an array with a new encoding and assumes that the buffer of an
> array of type np.unicode will have a character itemsize of 4 bytes.
> To ease this problem we would need to add APIs to get the itemsize and
> encoding to numpy now so third parties can error out cleanly.
> 
> The implementation of it is not that big a deal, I have already created
> a prototype for adding latin1 metadata to np.unicode which works quite
> well. It is imo realistic to get this into 1.14 should we be able to
> make a decision on which way to implement it.
> 
> Do you have comments on how to go forward, in particular in regards to
> new dtype vs modify np.unicode?
> 
> cheers,
> Julian
> 
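
For anyone who has not run into the problem described in the quoted mail,
here is a minimal sketch of the current behaviour under Python 3. It uses the
'S' and 'U' dtype codes for the types called np.string and np.unicode above;
the sample values are made up for illustration.

```python
import numpy as np

# 'S' arrays hold bytes: every item has to be decoded to get a str.
raw = np.array([b"abc"], dtype="S5")
item = raw[0]
print(isinstance(item, bytes))      # True
text = item.decode("latin1")        # explicit decoding step on every access
print(text)                         # abc

# Trailing nulls are treated as padding and dropped on item access.
padded = np.array([b"ab\x00\x00"], dtype="S4")
print(padded[0])                    # b'ab'

# 'U' arrays hold real strings but spend 4 bytes on every character.
uni = np.array(["abc"], dtype="U5")
print(raw.dtype.itemsize)           # 5
print(uni.dtype.itemsize)           # 20
```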

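The per-character sizes in the encoding list can be checked with plain
Python codecs. The sketch below is only an illustration: `encoded_size`,
`WORST_CASE_WIDTH` and `itemsize` are hypothetical helpers, not numpy API,
and the width table simply restates the worst-case numbers from the proposal.

```python
def encoded_size(char, encoding):
    """Return the encoded length of one character, or None if the codec cannot represent it."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None

# Worst-case bytes per character, as assumed in the proposal (hypothetical table).
WORST_CASE_WIDTH = {"latin1": 1, "utf-16": 2, "utf-8": 4, "utf-32": 4}

def itemsize(n_chars, encoding):
    """Fixed-width storage needed for n_chars characters, e.g. for 'U10[latin1]'."""
    return n_chars * WORST_CASE_WIDTH[encoding]

# The "-le" codec variants are used so that no byte order mark is counted.
for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    sizes = {enc: encoded_size(ch, enc)
             for enc in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")}
    print(repr(ch), sizes)

print(itemsize(10, "latin1"))   # 10
print(itemsize(10, "utf-32"))   # 40
```

The last sample character lies outside the basic multilingual plane, so
utf-16 needs a surrogate pair (4 bytes) for it, which is the case the
proposal would disallow to keep a fixed 2-byte width.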