On Thu, Apr 20, 2017 at 12:27 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
>
> On 20.04.2017 20:53, Robert Kern wrote:
> > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor
> > <jtaylor.debian@googlemail.com>
> > wrote:
> >
> >> Do you have comments on how to go forward, in particular in regards to
> >> new dtype vs modify np.unicode?
> >
> > Can we restate the use cases explicitly? I feel like we ended up with
> > the current sub-optimal situation because we never really laid out the
> > use cases. We just felt like we needed bytestring and unicode dtypes,
> > more out of completionism than anything, and we made a bunch of
> > assumptions just to get each one done. I think there may be broad
> > agreement that many of those assumptions are "wrong", but it would be
> > good to reference that against concretely-stated use cases.
>
> We ended up in this situation because we did not take the opportunity to
> break compatibility when python3 support was added.

Oh, the root cause I'm thinking of long predates Python 3, or even numpy 1.0. There never was an explicitly fleshed out use case for unicode arrays other than "Python has unicode strings, so we should have a string dtype that supports it". Hence the "we only support UCS4" implementation; it's not like anyone *wants* UCS4 or interoperates with UCS4, but it does represent all possible Unicode strings. The Python 3 transition merely exacerbated the problem by making Unicode strings the primary string type to work with. I don't really want to paper over what the transition made worse without also addressing the root problem, which is worth solving.
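To make the UCS4 point concrete, here is a minimal sketch using only Python's standard codecs. It assumes that UTF-32 without a byte-order mark is a fair stand-in for a UCS4 layout (4 bytes per code point); the helper names `ucs4_nbytes` and `utf8_nbytes` are made up for illustration and are not numpy API.

```python
# Illustrative only: bytes needed per string under UCS4 (UTF-32) vs UTF-8.
# The helper names below are hypothetical, not part of numpy.

def ucs4_nbytes(s):
    # UTF-32 without a BOM uses exactly 4 bytes per code point.
    return len(s.encode("utf-32-le"))

def utf8_nbytes(s):
    # UTF-8 uses 1 to 4 bytes per code point, depending on the character.
    return len(s.encode("utf-8"))

samples = ["numpy", "日本語", "🙂"]

for s in samples:
    print(f"{s!r}: {len(s)} code points, "
          f"UCS4 {ucs4_nbytes(s)} bytes, UTF-8 {utf8_nbytes(s)} bytes")
```

For ASCII-heavy data the UCS4 layout costs four times the storage of UTF-8, while the price of UTF-8 is that the byte length of a string no longer follows directly from its character count.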

I will put this down as a marker use case: Support HDF5's fixed-width UTF-8 arrays.
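
As a rough sketch of what such a use case implies, the following packs a string into a fixed-width UTF-8 field whose width is counted in bytes, not characters. It assumes NUL padding (one of several padding conventions a file format might use) and truncation on a character boundary; `pack_utf8_fixed` and `unpack_utf8_fixed` are hypothetical helpers, not numpy or HDF5 API.

```python
# Illustrative only: fixed-width UTF-8 fields, width measured in bytes.
# Assumes NUL padding; both function names are hypothetical.

def pack_utf8_fixed(s, width):
    """Encode s as UTF-8 into exactly `width` bytes, NUL-padded."""
    raw = s.encode("utf-8")
    if len(raw) > width:
        # Cut at `width` bytes, then drop any trailing partial multi-byte
        # sequence so the field never contains invalid UTF-8.
        raw = raw[:width].decode("utf-8", errors="ignore").encode("utf-8")
    return raw.ljust(width, b"\x00")

def unpack_utf8_fixed(field):
    """Strip the NUL padding and decode the field back to str."""
    return field.rstrip(b"\x00").decode("utf-8")

field = pack_utf8_fixed("日本語", 4)   # only one 3-byte character fits
print(field, unpack_utf8_fixed(field))
```

Two consequences are visible even in this sketch: how many characters fit depends on the data, and a string that legitimately ends in NUL cannot be round-tripped under this padding convention.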

--
Robert Kern