[Numpy-discussion] proposal: smaller representation of string arrays
Julian Taylor
jtaylor.debian at googlemail.com
Thu Apr 20 15:40:12 EDT 2017
On 20.04.2017 20:59, Anne Archibald wrote:
> On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor
> <jtaylor.debian at googlemail.com>
> wrote:
>
> I probably should have formulated my goal with the proposal a bit better; I am
> not very interested in a repetition of the which-encoding-to-use debate.
> In the end what will be done allows any encoding via a dtype with
> metadata like datetime.
> This allows any codec (including truncated utf8) to be added easily (if
> python supports it) and allows sidestepping the debate.
>
> My main concern is whether it should be a new dtype or modifying the
> unicode dtype. Though the backward compatibility argument is strongly in
> favour of adding a new dtype that makes the np.unicode type redundant.
>
>
> Creating a new dtype to handle encoded unicode, with the encoding
> specified in the dtype, sounds perfectly reasonable to me. Changing the
> behaviour of the existing unicode dtype seems like it's going to lead to
> massive headaches unless exactly nobody uses it. The only downside to a
> new type is having to find an obvious name that isn't already in use.
> (And having to actively maintain/deprecate the old one.)
>
> Anne
>
We wouldn't really be changing the behaviour of the unicode dtype. Only
programs accessing the databuffer directly and trying to decode would
need to be changed.
I assume this can happen for programs that do serialization + reencoding
of numpy string arrays at the C level (at the python level you would be
fine).
These programs would be broken, but only when they actually receive a
string array that does not have the default utf32 encoding.
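To illustrate the kind of breakage I mean, here is a rough sketch in plain Python. It does not use numpy at all: pack_fixed and naive_reader are hypothetical helpers that only mimic a fixed-width string buffer and a consumer that hardcodes the utf32 layout, and the little-endian byte order is an assumption made to keep the example short.

```python
# Hypothetical helpers, not numpy API: they only mimic a fixed-width
# string buffer and a consumer that assumes the utf32 layout.

def pack_fixed(strings, width, codec):
    """Encode each string and null-pad it to `width` bytes per item."""
    out = bytearray()
    for s in strings:
        data = s.encode(codec)
        if len(data) > width:
            raise ValueError(f"{s!r} does not fit in {width} bytes as {codec}")
        out += data.ljust(width, b"\x00")
    return bytes(out)


def naive_reader(buf, width):
    """A consumer that hardcodes utf32 (little-endian assumed) for every item."""
    items = []
    for start in range(0, len(buf), width):
        item = buf[start:start + width]
        items.append(item.decode("utf-32-le").rstrip("\x00"))
    return items


words = ["abc", "dé"]

# Layout the reader expects: 4 bytes per character, 3 characters per item.
utf32_buf = pack_fixed(words, 12, "utf-32-le")
roundtrip = naive_reader(utf32_buf, 12)

# Same strings in a smaller encoding: 3 bytes per item instead of 12.
utf8_buf = pack_fixed(words, 3, "utf-8")
reader_broke = False
try:
    naive_reader(utf8_buf, 3)
except UnicodeDecodeError:
    # The hardcoded utf32 decode fails on the utf8 buffer.
    reader_broke = True

print(len(utf32_buf), len(utf8_buf), roundtrip, reader_broke)
```

The reader keeps working for as long as it only ever sees the default encoding, and fails only once it is handed a buffer in another encoding, which is the situation described above.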
I really don't like that a fully new dtype means adding more junk and
extra code paths to numpy.
But it is probably too big of a compatibility break to accept just to keep our
code clean.