
On Mon, Jan 20, 2014 at 12:12 PM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
How significant are the performance issues? Does anyone really use numpy for this kind of text handling? If you really are operating on gigantic text arrays of ascii characters then is it so bad to just use the bytes dtype and handle decoding/encoding at the boundaries? If you're not operating on gigantic text arrays is there really a noticeable problem just using the 'U' dtype?
I use numpy for giga-row arrays of short text strings, so memory and performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is really a problem because users of a text array want to do things like filtering (`match_rows = text_array == 'match'`), printing, or other manipulations in a natural way without having to continually use bytestring literals or `.decode('ascii')` everywhere. I tried converting a few packages while leaving the arrays as bytestrings and it just ended up as a very big mess.
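For example (a minimal sketch; text_array stands in for a real dataset):

    import numpy as np

    text_array = np.array([b'match', b'other', b'match'], dtype='S5')

    # On python 3 a str never equals a bytestring, so this selects nothing
    # (depending on the numpy version you get False or an all-False array,
    # possibly with a FutureWarning about the elementwise comparison):
    match_rows = text_array == 'match'

    # The working forms are exactly the awkward ones described above:
    match_rows = text_array == b'match'             # bytestring literal
    match_rows = text_array.astype('U') == 'match'  # decode first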
From my perspective the goal here is to provide a pragmatic way to allow numpy-based applications and end users to use python 3. Something like this proposal seems to be the right direction, maybe not pure and perfect but a sensible step to get us there given the reality of scientific computing.
I don't really see how writing b'match' instead of 'match' is that big a deal.
It's a big deal because all your existing python 2 code suddenly breaks on python 3, even after running 2to3. Yes, you can go back and fix all the python 2 code to use bytestring literals everywhere, but that is very painful and ugly. More importantly it's very fiddly, because *sometimes* you'll need to use bytestring literals and *sometimes* not, depending on the exact dataset you've been handed. That's basically a non-starter.
As you say below, the only solution is a proper separation of bytes/unicode where everything internally is unicode. The problem is that the existing 4-byte unicode in numpy is a big performance / memory hit. It's even trickier because libraries will happily deliver a numpy structured array with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to then convert to 'U' since you need to remake the entire structured array. With a one-byte unicode the goal would be an in-place dtype update from 'S' to the proposed 's'.
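To illustrate the remaking problem, here's a minimal sketch (the dtype and values are made up for the example):

    import numpy as np

    # Hypothetical structured array as a binary reader might deliver it,
    # with the text stored in an 'S'-dtype field:
    data = np.array([(1, b'sirius'), (2, b'vega')],
                    dtype=[('id', 'i4'), ('name', 'S10')])

    # There is no in-place 'S' -> 'U' update: the entire array has to be
    # rebuilt with a new dtype, and the text field quadruples in size.
    converted = data.astype([('id', 'i4'), ('name', 'U10')])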
And why are you needing to write .decode('ascii') everywhere?
print("The first value is {}".format(bytestring_array[0]))
On Python 2 this gives "The first value is string_value", while on Python 3 this gives "The first value is b'string_value'".
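So to get the same output on both versions you end up writing the decode everywhere:

    print("The first value is {}".format(bytestring_array[0].decode('ascii')))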
Unfortunately (?) set_printoptions and set_string_function don't work with numpy scalars AFAICS. If they did, it would be possible to override the string representation. It does work for arrays. I didn't find the right formatter key for numpy.bytes_ on python 3.3, so now my interpreter can only print bytes:

    np.set_printoptions(formatter={'all': lambda x: x.decode('ascii', errors='ignore')})

Josef
If you really do just want to work with bytes in your own known encoding then why not just read and write in binary mode?
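That boundary can be kept explicit with something like this (a sketch, assuming a hypothetical file of fixed-width 10-byte records):

    import numpy as np

    # Read the records in binary mode, operate on them as raw bytes...
    with open('records.dat', 'rb') as f:
        arr = np.fromfile(f, dtype='S10')

    # ...and write them back out in binary mode.
    with open('records.dat', 'wb') as f:
        arr.tofile(f)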
I apologise if I'm wrong but I suspect that much of the difficulty in getting the bytes/unicode separation right is down to the fact that a lot of the code you're using (or attempting to support) hasn't yet been ported to a clean text model. When I started using Python 3 it took me quite a few failed attempts at understanding the text model before I got to the point where I understood how it is supposed to be used. The problem was that I had been conflating text and bytes in many places, and that's hard to disentangle. Having fixed most of those problems I now understand why it is such an improvement.
In any case I don't see anything wrong with a more efficient dtype for representing text, provided the user can specify the encoding. The problem is that numpy arrays expose their underlying memory buffer. Allowing them to interact directly with text strings on one side and binary files on the other breaches Python 3's very good text model, unless the user can specify the encoding to be used. Or, at the least, if there is to be a blessed encoding then make it the unicode-capable utf-8 rather than legacy ascii/latin-1.
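For instance, the existing np.char functions already let you make the encoding explicit at the boundary (a sketch with made-up utf-8 data):

    import numpy as np

    raw = np.array([b'caf\xc3\xa9', b'na\xc3\xafve'])  # utf-8 encoded bytes
    text = np.char.decode(raw, 'utf-8')    # decode explicitly on the way in
    raw2 = np.char.encode(text, 'utf-8')   # encode explicitly on the way out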
Oscar