[Numpy-discussion] proposal: smaller representation of string arrays

Julian Taylor jtaylor.debian at googlemail.com
Thu Apr 20 15:40:12 EDT 2017


On 20.04.2017 20:59, Anne Archibald wrote:
> On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor
> <jtaylor.debian at googlemail.com>
> wrote:
> 
>     I probably should have formulated my goal with the proposal a bit
>     better; I am not very interested in a repetition of the debate over
>     which encoding to use.
>     In the end what will be done allows any encoding via a dtype with
>     metadata like datetime.
>     This allows any codec (including truncated utf8) to be added easily (if
>     python supports it) and allows sidestepping the debate.
> 
>     My main concern is whether it should be a new dtype or modifying the
>     unicode dtype. Though the backward compatibility argument is strongly in
>     favour of adding a new dtype that makes the np.unicode type redundant.
> 
> 
> Creating a new dtype to handle encoded unicode, with the encoding
> specified in the dtype, sounds perfectly reasonable to me. Changing the
> behaviour of the existing unicode dtype seems like it's going to lead to
> massive headaches unless exactly nobody uses it. The only downside to a
> new type is having to find an obvious name that isn't already in use.
> (And having to actively maintain/deprecate the old one.)
> 
> Anne
> 

We wouldn't really be changing the behaviour of the unicode dtype. Only
programs that access the data buffer directly and try to decode it
themselves would need to be changed.

I assume this can happen for programs that do serialization + reencoding
of numpy string arrays at the C level (at the python level you would be
fine).
These programs would be broken, but only when they actually receive a
string array that does not have the default utf32 encoding.
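
To make the failure mode concrete, here is a rough sketch in plain Python.
It does not use numpy at all: a fixed-width, NUL-padded byte record stands
in for one element of a string array, and the helper names (pack,
naive_reader, codec_aware_reader) are made up for this illustration. The
only numpy fact assumed is that the current unicode dtype stores 4 bytes
per character (UTF-32).

```python
# Illustration only: these helpers are hypothetical, not numpy API.

def pack(text, width, codec):
    """Encode text and pad it with NUL bytes to a fixed width in bytes."""
    raw = text.encode(codec)
    if len(raw) > width:
        raise ValueError("text does not fit into the fixed-width item")
    return raw.ljust(width, b"\x00")


def naive_reader(buf):
    """A C-level style consumer that hard-codes the UTF-32 layout."""
    return buf.decode("utf-32-le").rstrip("\x00")


def codec_aware_reader(buf, codec):
    """A consumer that takes the encoding from (assumed) dtype metadata."""
    return buf.decode(codec).rstrip("\x00")


def naive_reader_works(buf):
    """Return True if the hard-coded UTF-32 reader can decode buf."""
    try:
        naive_reader(buf)
    except UnicodeDecodeError:
        return False
    return True


# Today's layout: 5 characters take 20 bytes.
utf32_item = pack("numpy", 20, "utf-32-le")

# A smaller representation of the same text: 8 bytes with latin-1.
latin1_item = pack("numpy", 8, "latin-1")

print(naive_reader(utf32_item))                    # old data still works
print(naive_reader_works(latin1_item))             # hard-coded reader breaks
print(codec_aware_reader(latin1_item, "latin-1"))  # metadata-aware reader is fine
```

The hard-coded reader keeps working as long as it only ever sees UTF-32
data, which is the point above: breakage only shows up once such a program
is handed an array with a different encoding.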

I really don't like that a completely new dtype means adding more junk
and extra code paths to numpy.
But modifying the existing dtype is probably too big a compatibility
break to accept just to keep our code clean.
