Perhaps `np.encoded_str[encoding]` as the name for the new type, if we decide a new type is necessary?

Am I right in thinking that the general problem here is that it's very easy to discard metadata when working with dtypes, and that by adding metadata to `unicode_`, we risk existing code carelessly dropping it? Is this a problem in both C and Python, or just C? If that's the case, can we end up with a compromise where being careless just causes old code to promote to UCS-4?

On Thu, 20 Apr 2017 at 20:09 Anne Archibald <peridot.faceted@gmail.com> wrote:
On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor <jtaylor.debian@googlemail.com> wrote:
I probably should have formulated my goal with the proposal a bit better. I am not very interested in a repetition of the "which encoding to use" debate. In the end, what will be done allows any encoding, via a dtype with metadata like datetime has. This allows any codec (including truncated utf8) to be added easily (if Python supports it) and allows sidestepping the debate.
My main concern is whether it should be a new dtype or modifying the unicode dtype. Though the backward compatibility argument is strongly in favour of adding a new dtype that makes the np.unicode type redundant.
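For readers who have not looked at how datetime does this, here is a rough sketch of the analogy. The first half uses only existing NumPy calls; the second half is purely illustrative: `encode_fixed` is a made-up helper that emulates "encoding stored alongside fixed-width data" with a plain bytes dtype, and is not the proposed dtype or any NumPy API.

```python
import numpy as np

# datetime64 carries its unit as part of the dtype itself.
ms = np.dtype("datetime64[ms]")
sec = np.dtype("datetime64[s]")

print(np.datetime_data(ms))  # (unit, count) stored on the dtype
print(ms == sec)             # dtypes with different units are different dtypes


def encode_fixed(strings, encoding, itemsize):
    """Hypothetical helper: encode each string with the given codec and
    store the result in a fixed-width bytes array of `itemsize` bytes."""
    encoded = [text.encode(encoding) for text in strings]
    return np.array(encoded, dtype=f"S{itemsize}")


# "caf\u00e9" needs 5 bytes in UTF-8, so 8 bytes per item is enough here.
arr = encode_fixed(["caf\u00e9", "abc"], "utf-8", 8)
decoded = [raw.decode("utf-8") for raw in arr.tolist()]
print(arr.dtype, decoded)
```

Note that with a bytes dtype the codec has to be remembered by the caller; the point of putting it in the dtype, as datetime does with its unit, would be that the array itself knows how to decode its contents.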
Creating a new dtype to handle encoded unicode, with the encoding specified in the dtype, sounds perfectly reasonable to me. Changing the behaviour of the existing unicode dtype seems like it's going to lead to massive headaches unless exactly nobody uses it. The only downside to a new type is having to find an obvious name that isn't already in use. (And having to actively maintain/deprecate the old one.)
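To make the "careless code" worry concrete, here is a small sketch using the generic `metadata` argument of `np.dtype`. The `"encoding"` key is invented for this example, and this is not how an encoding-aware dtype would necessarily be implemented; it only shows how easily dtype metadata is lost when code rebuilds a dtype from its kind and item size.

```python
import numpy as np

# Tag a fixed-width bytes dtype with an illustrative "encoding" entry.
# The key name is an assumption made for this example only.
tagged = np.dtype("S8", metadata={"encoding": "utf-8"})

arr = np.zeros(2, dtype=tagged)
arr[:] = [b"abc", b"defg"]
print(arr.dtype.metadata["encoding"])  # the tag is visible on the array's dtype

# A common pattern in existing code: rebuild the dtype from kind + itemsize.
# Nothing here knows about the metadata, so it is silently lost.
rebuilt = np.dtype(f"{arr.dtype.kind}{arr.dtype.itemsize}")
print(rebuilt.metadata)  # None
```

Any code written like the last step would quietly turn an encoding-tagged array back into an untagged one, which is one reason a separate type looks safer than changing the existing one.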
Anne