
On Mon, Jan 20, 2014 at 12:12 PM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
How significant are the performance issues? Does anyone really use numpy for this kind of text handling? If you really are operating on gigantic text arrays of ascii characters then is it so bad to just use the bytes dtype and handle decoding/encoding at the boundaries? If you're not operating on gigantic text arrays is there really a noticeable problem just using the 'U' dtype?
I use numpy for giga-row arrays of short text strings, so memory and performance issues are real.
As discussed in the previous parent thread, using the bytes dtype is really a problem because users of a text array want to do things like filtering (`match_rows = text_array == 'match'`), printing, or other manipulations in a natural way without having to continually use bytestring literals or `.decode('ascii')` everywhere. I tried converting a few packages while leaving the arrays as bytestrings and it just ended up as a very big mess.
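For example (a minimal sketch; text_array stands in for a real dataset):

    import numpy as np

    text_array = np.array([b'match', b'other', b'match'], dtype='S5')

    # On python 3 a str never equals a bytestring, so this selects nothing
    # (depending on the numpy version you get False or an all-False array,
    # possibly with a FutureWarning about the elementwise comparison):
    match_rows = text_array == 'match'

    # The working forms are exactly the awkward ones described above:
    match_rows = text_array == b'match'             # bytestring literal
    match_rows = text_array.astype('U') == 'match'  # decode first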
From my perspective the goal here is to provide a pragmatic way to allow numpy-based applications and end users to use python 3. Something like this proposal seems to be the right direction, maybe not pure and perfect but a sensible step to get us there given the reality of scientific computing.
I don't really see how writing b'match' instead of 'match' is that big a deal.
It's a big deal because all your existing python 2 code suddenly breaks on python 3, even after running 2to3. Yes, you can go back and fix all the python 2 code to use bytestring literals everywhere, but that is very painful and ugly. More importantly it's very fiddly, because *sometimes* you'll need to use bytestring literals and *sometimes* not, depending on the exact dataset you've been handed. That's basically a non-starter.
As you say below, the only solution is a proper separation of bytes/unicode where everything internally is unicode. The problem is that the existing 4-byte unicode in numpy is a big performance / memory hit. It's even trickier because libraries will happily deliver a numpy structured array with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to then convert to 'U' since you need to remake the entire structured array. With a one-byte unicode the goal would be an in-place dtype update from 'S' to the proposed 's'.
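To illustrate the remaking problem, here's a minimal sketch (the dtype and values are made up for the example):

    import numpy as np

    # Hypothetical structured array as a binary reader might deliver it,
    # with the text stored in an 'S'-dtype field:
    data = np.array([(1, b'sirius'), (2, b'vega')],
                    dtype=[('id', 'i4'), ('name', 'S10')])

    # There is no in-place 'S' -> 'U' update: the entire array has to be
    # rebuilt with a new dtype, and the text field quadruples in size.
    converted = data.astype([('id', 'i4'), ('name', 'U10')])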
And why are you needing to write .decode('ascii') everywhere?
print("The first value is {}".format(bytestring_array[0]))
On Python 2 this gives "The first value is string_value", while on Python 3 this gives "The first value is b'string_value'".
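So to get the same output on both versions you end up writing the decode everywhere:

    print("The first value is {}".format(bytestring_array[0].decode('ascii')))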
Unfortunately (?) set_printoptions and set_string_function don't work with numpy scalars AFAICS. If they did, it would be possible to override the string representation. It does work for arrays. I didn't find the right formatter key for numpy.bytes_ on python 3.3, so now my interpreter can only print bytes:

    np.set_printoptions(formatter={'all': lambda x: x.decode('ascii', errors='ignore')})

Josef
If you really do just want to work with bytes in your own known encoding then why not just read and write in binary mode?
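That boundary can be kept explicit with something like this (a sketch, assuming a hypothetical file of fixed-width 10-byte records):

    import numpy as np

    # Read the records in binary mode, operate on them as raw bytes...
    with open('records.dat', 'rb') as f:
        arr = np.fromfile(f, dtype='S10')

    # ...and write them back out in binary mode.
    with open('records.dat', 'wb') as f:
        arr.tofile(f)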
I apologise if I'm wrong but I suspect that much of the difficulty in getting the bytes/unicode separation right is down to the fact that a lot of the code you're using (or attempting to support) hasn't yet been ported to a clean text model. When I started using Python 3 it took me quite a few failed attempts at understanding the text model before I got to the point where I understood how it is supposed to be used. The problem was that I had been conflating text and bytes in many places, and that's hard to disentangle. Having fixed most of those problems I now understand why it is such an improvement.
In any case I don't see anything wrong with a more efficient dtype for representing text, provided the user can specify the encoding. The problem is that numpy arrays expose their underlying memory buffer. Allowing them to interact directly with text strings on one side and binary files on the other breaches Python 3's very good text model, unless the user can specify the encoding to be used. Or, at the least, if there is to be a blessed encoding then make it the unicode-capable utf-8 rather than legacy ascii/latin-1.
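For instance, the existing np.char functions already let you make the encoding explicit at the boundary (a sketch with made-up utf-8 data):

    import numpy as np

    raw = np.array([b'caf\xc3\xa9', b'na\xc3\xafve'])  # utf-8 encoded bytes
    text = np.char.decode(raw, 'utf-8')    # decode explicitly on the way in
    raw2 = np.char.encode(text, 'utf-8')   # encode explicitly on the way out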
Oscar