[Numpy-discussion] proposal: smaller representation of string arrays
wieser.eric+numpy at gmail.com
Thu Apr 20 15:15:33 EDT 2017
Perhaps `np.encoded_str[encoding]` as the name for the new type, if we
decide a new type is necessary?
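
For illustration, the bracketed parameter in that name mirrors how `datetime64` already carries its unit inside the dtype. The sketch below also shows the storage gap that motivates this thread, using `np.char.encode` purely as a stand-in for an encoded string type; `np.encoded_str` does not exist and is not used here, and the sample words are made up.

```python
import numpy as np

# Existing precedent: datetime64 stores its unit as part of the dtype.
dt_ms = np.dtype("datetime64[ms]")
unit, count = np.datetime_data(dt_ms)

# Today's unicode dtype stores every character as 4 bytes (UCS-4).
words = np.array(["numpy", "dtype", "ascii"])
ucs4_bytes_per_item = words.dtype.itemsize

# Stand-in for an encoded representation: fixed-width UTF-8 bytes.
encoded = np.char.encode(words, "utf-8")
utf8_bytes_per_item = encoded.dtype.itemsize

# Decoding with the same codec recovers the original strings.
decoded = np.char.decode(encoded, "utf-8")

print(unit, ucs4_bytes_per_item, utf8_bytes_per_item)
```

For ASCII-only data like this, the encoded array needs a quarter of the space; the catch is that the bytes array by itself does not record which codec was used, which is what an encoding-aware dtype would add.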
Am I right in thinking that the general problem here is that it's very easy
to discard metadata when working with dtypes, and that by adding metadata
to `unicode_`, we risk existing code carelessly dropping it? Is this a
problem in both C and python, or just C?
If that's the case, can we end up with a compromise where being careless
just causes old code to promote to UCS-4 (UTF-32)?
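
A minimal sketch of the concern, assuming the encoding were stored under a hypothetical `"encoding"` key in the dtype's `metadata` (that key and the helper `promote_to_ucs4` are illustrative only, not an existing NumPy convention). Rebuilding a dtype from its string form is one common way metadata gets lost; the helper shows the kind of fallback meant by promoting to the existing 4-byte-per-character unicode dtype.

```python
import numpy as np

# Hypothetical tagging: attach an encoding to a unicode dtype via metadata.
tagged = np.dtype("U8", metadata={"encoding": "utf-8"})
tag = tagged.metadata["encoding"]

# Careless code: the string form of a dtype has no slot for metadata,
# so rebuilding the dtype from it silently drops the tag.
rebuilt = np.dtype(tagged.str)
lost = rebuilt.metadata is None


def promote_to_ucs4(encoded, encoding):
    """Hypothetical fallback: decode encoded bytes to the existing 'U' dtype."""
    return np.char.decode(encoded, encoding)


raw = np.array([b"caf\xc3\xa9"], dtype="S5")  # UTF-8 bytes, 5 per item
promoted = promote_to_ucs4(raw, "utf-8")

print(tag, lost, promoted.dtype.kind)
```

Under this assumption, code that ignores the encoding would cost extra memory rather than silently misinterpret bytes.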
On Thu, 20 Apr 2017 at 20:09 Anne Archibald <peridot.faceted at gmail.com> wrote:
> On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor <
> jtaylor.debian at googlemail.com> wrote:
>> I probably should have formulated the goal of my proposal a bit better; I am
>> not very interested in a repetition of the which-encoding-to-use debate.
>> In the end, what will be done is to allow any encoding via a dtype with
>> metadata, like datetime.
>> This allows any codec (including truncated utf8) to be added easily (if
>> python supports it) and allows sidestepping the debate.
>> My main concern is whether it should be a new dtype or modifying the
>> unicode dtype. Though the backward compatibility argument is strongly in
>> favour of adding a new dtype that makes the np.unicode type redundant.
> Creating a new dtype to handle encoded unicode, with the encoding
> specified in the dtype, sounds perfectly reasonable to me. Changing the
> behaviour of the existing unicode dtype seems like it's going to lead to
> massive headaches unless exactly nobody uses it. The only downside to a new
> type is having to find an obvious name that isn't already in use. (And
> having to actively maintain/deprecate the old one.)