On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
How significant are the performance issues? Does anyone really use numpy for this kind of text handling? If you really are operating on gigantic text arrays of ASCII characters, then is it so bad to just use the bytes dtype and handle decoding/encoding at the boundaries? If you're not operating on gigantic text arrays, is there really a noticeable problem with just using the 'U' dtype?
I use numpy for giga-row arrays of short text strings, so memory and performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is really a problem because users of a text array want to do things like filtering (`match_rows = text_array == 'match'`), printing, or other manipulations in a natural way without having to continually use bytestring literals or `.decode('ascii')` everywhere. I tried converting a few packages while leaving the arrays as bytestrings and it just ended up as a very big mess.
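For instance, here is a rough sketch of the mismatch (a made-up array; the exact behaviour varies a bit across numpy versions):

    import numpy as np

    # Data as a binary reader typically delivers it: 'S' (bytes) dtype.
    text_array = np.array([b'match', b'other', b'match'])

    # The natural spelling silently finds nothing on Python 3, because a
    # bytes array never compares equal to a str (older numpy versions
    # return a bare False plus a FutureWarning rather than a boolean array).
    match_rows = text_array == 'match'

    # You must either reach for a bytestring literal ...
    match_rows = text_array == b'match'

    # ... or decode the whole array first.
    match_rows = np.char.decode(text_array, 'ascii') == 'match'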
From my perspective the goal here is to provide a pragmatic way to allow numpy-based applications and end users to use Python 3. Something like this proposal seems to be the right direction, maybe not pure and perfect, but a sensible step to get us there given the reality of scientific computing.

On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
I don't really see how writing b'match' instead of 'match' is that big a deal.
It's a big deal because all your existing Python 2 code suddenly breaks on Python 3, even after running 2to3. Yes, you can go back and fix the Python 2 code to use bytestring literals everywhere, but that is very painful and ugly. More importantly, it's very fiddly because *sometimes* you'll need to use bytestring literals and *sometimes* not, depending on the exact dataset you've been handed (sketched below). That's basically a non-starter.
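To make the fiddliness concrete, a hypothetical sketch of the same lookup against two datasets that arrive with different dtypes:

    import numpy as np

    names_s = np.array([b'alpha', b'beta'])  # e.g. loaded from a binary file
    names_u = np.array(['alpha', 'beta'])    # e.g. loaded from a text file

    mask = names_s == b'alpha'  # this dataset needs the b'' literal
    mask = names_u == 'alpha'   # this one needs the plain literal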
As you say below, the only solution is a proper separation of bytes/unicode where everything internally is unicode. The problem is that the existing 4-byte unicode in numpy is a big performance / memory hit. It's even trickier because libraries will happily deliver a numpy structured array with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to then convert to 'U' since you need to remake the entire structured array. With a one-byte unicode the goal would be an in-place update of 'S' to 's'.
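For example, the remake looks roughly like this today (invented field names; as far as I know astype does the per-field cast, but it still copies every row):

    import numpy as np

    # A structured array as a binary reader delivers it: an 'S' (bytes)
    # field alongside a numeric field.
    data = np.array([(b'src1', 1.0), (b'src2', 2.0)],
                    dtype=[('name', 'S8'), ('flux', 'f8')])

    # Converting the text field to 'U' means building a new dtype and
    # copying the entire array.
    data_u = data.astype([('name', 'U8'), ('flux', 'f8')])

    # With a one-byte unicode dtype 's', the hope is a cheap in-place
    # relabel of the 'S8' field to 's8' instead of a full copy.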
And why do you need to write .decode('ascii') everywhere?
print("The first value is {}".format(bytestring_array[0]))
On Python 2 this gives "The first value is string_value", while on Python 3 this gives "The first value is b'string_value'".
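Spelled out as a runnable snippet (array contents invented):

    import numpy as np

    bytestring_array = np.array([b'string_value'])
    print("The first value is {}".format(bytestring_array[0]))
    # Python 2: The first value is string_value
    # Python 3: The first value is b'string_value'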
As Nathaniel has mentioned, this is a known problem with Python 3 and the developers are trying to come up with a solution. Python 3.4 solves some existing problems, but this one remains. It's not just numpy here; it's that Python itself needs to provide some help.

Chuck