On 20.04.2017 20:59, Anne Archibald wrote:
On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor <mailto:jtaylor.debian@googlemail.com> wrote:

I should probably have formulated the goal of my proposal a bit better; I am not very interested in a repetition of the which-encoding-to-use debate. In the end, what will be done allows any encoding via a dtype carrying metadata, like datetime. This allows any codec (including truncated utf8) to be added easily (as long as Python supports it) and lets us sidestep the debate.
My main concern is whether it should be a new dtype or a modification of the existing unicode dtype, though the backward-compatibility argument is strongly in favour of adding a new dtype that makes the np.unicode type redundant.
Creating a new dtype to handle encoded unicode, with the encoding specified in the dtype, sounds perfectly reasonable to me. Changing the behaviour of the existing unicode dtype seems like it will lead to massive headaches unless exactly nobody uses it. The only downsides to a new type are having to find an obvious name that isn't already in use, and having to actively maintain or deprecate the old one.
Anne
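For context on the metadata approach Julian describes: NumPy dtypes already accept an arbitrary `metadata` mapping at construction time. The sketch below uses that existing mechanism purely for illustration; the `'encoding'` key is hypothetical and not an API anyone in the thread is proposing:

```python
import numpy as np

# np.dtype's `metadata` keyword attaches a read-only mapping to a dtype
# instance. The 'encoding' key here is a hypothetical illustration of how
# a codec name could travel with the type, in the spirit of the proposal.
dt = np.dtype('S16', metadata={'encoding': 'utf-8'})

print(dt.metadata['encoding'])  # the attached codec name
print(dt.itemsize)              # base storage is unchanged: 16 bytes
```

Note that `metadata` is a passive annotation: NumPy itself does not interpret it, so an actual encoded-string dtype would still need new casting and item-access code that consults it.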
We wouldn't really be changing the behaviour of the unicode dtype; only programs that access the data buffer directly and try to decode it would need to change. I assume this can happen in programs that do serialization and re-encoding of numpy string arrays at the C level (at the Python level you would be fine). Those programs would break, but only when they actually receive a string array that does not have the default utf32 encoding. I really don't like that a fully new dtype adds more junk and extra code paths to numpy, but it is probably too big a compatibility break to accept just to keep our code clean.
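To illustrate the buffer-level concern: code that reads the raw buffer of a 'U' array today can assume fixed-width UTF-32 in native byte order. A minimal sketch of that assumption (it is exactly the hard-coded codec choice that would break if the dtype could carry a different encoding):

```python
import sys
import numpy as np

a = np.array(['abc', 'défi'], dtype='U4')

# Today the 'U' dtype is UTF-32 in native byte order, padded with NUL
# code points to the fixed width; C-level consumers rely on exactly this.
codec = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'
decoded = a.tobytes().decode(codec)

# Strip the NUL padding to recover the concatenated strings.
print(decoded.replace('\x00', ''))  # 'abcdéfi'
```

Under the proposal, such code would keep working for default-encoded arrays but would silently misdecode an array whose dtype metadata specified, say, latin1, which is the compatibility break being weighed here.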