On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:



On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
> On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
> <oscar.j.benjamin@gmail.com>wrote:
> > How significant are the performance issues? Does anyone really use numpy
> > for
> > this kind of text handling? If you really are operating on gigantic text
> > arrays of ascii characters then is it so bad to just use the bytes dtype
> > and
> > handle decoding/encoding at the boundaries? If you're not operating on
> > gigantic text arrays is there really a noticeable problem just using the
> > 'U'
> > dtype?
> >
>
> I use numpy for giga-row arrays of short text strings, so memory and
> performance issues are real.
>
> As discussed in the previous parent thread, using the bytes dtype is really
> a problem because users of a text array want to do things like filtering
> (`match_rows = text_array == 'match'`), printing, or other manipulations in
> a natural way without having to continually use bytestring literals or
> `.decode('ascii')` everywhere.  I tried converting a few packages while
> leaving the arrays as bytestrings and it just ended up as a very big mess.
>
> From my perspective the goal here is to provide a pragmatic way to allow
> numpy-based applications and end users to use python 3.  Something like
> this proposal seems to be the right direction, maybe not pure and perfect
> but a sensible step to get us there given the reality of scientific
> computing.

I don't really see how writing b'match' instead of 'match' is that big a deal.

It's a big deal because all your existing python 2 code suddenly breaks on python 3, even after running 2to3.  Yes, you can go back and retrofit all the python 2 code with bytestring literals everywhere, but that is very painful and ugly.  More importantly it's very fiddly, because *sometimes* you'll need to use bytestring literals and *sometimes* not, depending on the exact dataset you've been handed.  That's basically a non-starter.
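To make the fiddliness concrete, here's a small sketch (array contents are made up):

```python
import numpy as np

# A made-up array of short ASCII strings stored with the bytes ('S') dtype,
# as a binary reader would hand it to you
text_array = np.array([b'match', b'other', b'match'], dtype='S5')

# On Python 3 the natural comparison `text_array == 'match'` no longer
# finds anything, because a str literal does not compare equal to bytes
# elements.  Every comparison has to be written with a bytes literal:
match_rows = text_array == b'match'
print(match_rows)  # [ True False  True]

# ... and the same code breaks again if it is later handed a 'U' array,
# where only `== 'match'` works.  Which literal you need depends on the
# dataset, exactly as described above.
```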

As you say below, the only solution is a proper separation of bytes/unicode where everything internally is unicode.  The problem is that the existing 4-byte unicode in numpy is a big performance / memory hit.  It's even trickier because libraries will happily deliver a numpy structured array with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to then convert to 'U' since you need to remake the entire structured array.  With a one-byte unicode the goal would be an in-place update of 'S' to 's'.
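Here's a sketch of the structured-array conversion pain (field names and data are made up):

```python
import numpy as np

# A made-up structured array as a binary file reader might deliver it,
# with the text field stored as bytes ('S' dtype)
data = np.array([(1, b'alpha'), (2, b'beta')],
                dtype=[('id', 'i4'), ('name', 'S5')])

# Getting unicode text means rebuilding the whole structured array with a
# new dtype -- there is no in-place 'S' -> 'U' relabeling today, which is
# what a one-byte unicode 's' dtype would make possible:
converted = data.astype([('id', 'i4'), ('name', 'U5')])
print(converted['name'])  # ['alpha' 'beta']
```

Note that the `astype` call copies every field of every row just to change how one column is labeled, which is the overhead being complained about here.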
 
And why are you needing to write .decode('ascii') everywhere?

>>> print("The first value is {}".format(bytestring_array[0]))

On Python 2 this gives "The first value is string_value", while on Python 3 this gives "The first value is b'string_value'".
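A runnable version of that example (array contents made up to match the text):

```python
import numpy as np

# Made-up array reproducing the example above
bytestring_array = np.array([b'string_value'], dtype='S12')

# On Python 3 the bytes repr leaks into the formatted message:
print("The first value is {}".format(bytestring_array[0]))
# -> The first value is b'string_value'

# ... so every use site needs an explicit decode to get the Python 2 output:
print("The first value is {}".format(bytestring_array[0].decode('ascii')))
# -> The first value is string_value
```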

As Nathaniel has mentioned, this is a known problem with Python 3 and the developers are trying to come up with a solution. Python 3.4 solves some existing problems, but this one remains. It's not just numpy here, it's that python itself needs to provide some help.

Chuck