On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin wrote:
> > I use numpy for giga-row arrays of short text strings, so memory and
> > performance issues are real.
> > As discussed in the previous parent thread, using the bytes dtype is
> > a problem because users of a text array want to do things like filtering
> > (`match_rows = text_array == 'match'`), printing, or other manipulations
> > in a natural way without having to continually use bytestring literals or
> > `.decode('ascii')` everywhere.  I tried converting a few packages while
> > leaving the arrays as bytestrings and it just ended up as a very big
> > mess.
> > From my perspective the goal here is to provide a pragmatic way to allow
> > numpy-based applications and end users to use python 3.  Something like
> > this proposal seems to be the right direction, maybe not pure and perfect
> > but a sensible step to get us there given the reality of scientific
> > computing.
> I don't really see how writing b'match' instead of 'match' is that big a
> deal.

It's a big deal because all your existing python 2 code suddenly breaks on
python 3, even after running 2to3.  Yes, you can backfix all the python 2
code and use bytestring literals everywhere, but that is very painful and
ugly.  More importantly it's very fiddly because *sometimes* you'll need to
use bytestring literals, and *sometimes* not, depending on the exact
dataset you've been handed.  That's basically a non-starter.
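
To make the "sometimes" concrete, here is a minimal sketch. The array names
and contents are made up for illustration; the point is only that the literal
you need depends on the dtype the data arrived in:

```python
import numpy as np

# Two hypothetical inputs holding the same text: one as delivered from a
# binary file ('S' dtype), one built from Python 3 str objects ('U' dtype).
from_binary = np.array([b"match", b"other"], dtype="S5")
from_text = np.array(["match", "other"], dtype="U5")

# The filtering code has to know which kind of array it was handed.
rows_binary = from_binary == b"match"   # needs a bytestring literal
rows_text = from_text == "match"        # needs an ordinary str literal

print(rows_binary, rows_text)
```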

As you say below, the only solution is a proper separation of bytes/unicode
where everything internally is unicode.  The problem is that the existing
4-byte unicode in numpy is a big performance / memory hit.  It's even
trickier because libraries will happily deliver a numpy structured array
with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to
then convert to 'U' since you need to remake the entire structured array.
With a one-byte unicode the goal would be an in-place update of 'S' to 's'.
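
As a rough illustration of both costs, here is a sketch with a made-up
structured array (the field names and values are hypothetical). It uses the
existing 'S' and 'U' dtypes only, since the one-byte 's' dtype is just a
proposal:

```python
import numpy as np

# Hypothetical structured array, as a library might return it after
# reading a binary dataset from disk.
data = np.zeros(3, dtype=[("name", "S8"), ("value", "f8")])
data["name"] = [b"alpha", b"beta", b"gamma"]

# Moving the field to 'U' changes the record size, so the whole structured
# array has to be rebuilt and copied field by field.
new_dtype = np.dtype([("name", "U8"), ("value", "f8")])
converted = np.empty(data.shape, dtype=new_dtype)
converted["name"] = np.char.decode(data["name"], "ascii")
converted["value"] = data["value"]

# Bytes used per string: 1 byte per character for 'S', 4 for 'U'.
print(data.dtype["name"].itemsize, converted.dtype["name"].itemsize)
```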

> And why are you needing to write .decode('ascii') everywhere?

>>> print("The first value is {}".format(bytestring_array[0]))

On Python 2 this gives "The first value is string_value", while on Python 3
this gives "The first value is b'string_value'".
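
A self-contained version of the same thing, for anyone who wants to try it
(the array contents are made up for illustration):

```python
import numpy as np

# Hypothetical stand-in for an 'S'-dtype array read from a binary file.
bytestring_array = np.array([b"string_value", b"other"], dtype="S12")

# On Python 3 the element is a bytes object, so formatting shows b'...'.
raw = "The first value is {}".format(bytestring_array[0])

# Decoding at the point of use restores the Python 2 style output, which
# is exactly the .decode('ascii') clutter I'd like to avoid.
decoded = "The first value is {}".format(bytestring_array[0].decode("ascii"))

print(raw)
print(decoded)
```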

> read and write in binary mode?
> I apologise if I'm wrong but I suspect that much of the difficulty in
> getting the bytes/unicode separation right is down to the fact that a lot
> of the code you're using (or attempting to support) hasn't yet been ported
> to a clean text model. When I started using Python 3 it took me quite a
> few failed attempts at understanding the text model before I got to the
> point where I understood how it is supposed to be used. The problem was
> that I had been conflating text and bytes in many places, and that's hard
> to disentangle. Having fixed most of those problems I now understand why
> it is such an improvement.
>
> In any case I don't see anything wrong with a more efficient dtype for
> representing text if the user can specify the encoding. The problem is
> that numpy arrays expose their underlying memory buffer. Allowing them to
> interact directly with text strings on the one side and binary files on
> the other breaches Python 3's very good text model unless the user can
> specify the encoding that is to be used. Or at least if there is to be a
> blessed encoding then make it unicode-capable utf-8 instead of legacy
> ascii/latin-1.
>
> Oscar
