[Numpy-discussion] A one-byte string dtype?

Oscar Benjamin oscar.j.benjamin at gmail.com
Mon Jan 20 10:40:42 EST 2014

On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
> On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com>wrote:
> > How significant are the performance issues? Does anyone really use numpy
> > for
> > this kind of text handling? If you really are operating on gigantic text
> > arrays of ascii characters then is it so bad to just use the bytes dtype
> > and
> > handle decoding/encoding at the boundaries? If you're not operating on
> > gigantic text arrays is there really a noticeable problem just using the
> > 'U'
> > dtype?
> >
> I use numpy for giga-row arrays of short text strings, so memory and
> performance issues are real.
> As discussed in the previous parent thread, using the bytes dtype is really
> a problem because users of a text array want to do things like filtering
> (`match_rows = text_array == 'match'`), printing, or other manipulations in
> a natural way without having to continually use bytestring literals or
> `.decode('ascii')` everywhere.  I tried converting a few packages while
> leaving the arrays as bytestrings and it just ended up as a very big mess.
> From my perspective the goal here is to provide a pragmatic way to allow
> numpy-based applications and end users to use python 3.  Something like
> this proposal seems to be the right direction, maybe not pure and perfect
> but a sensible step to get us there given the reality of scientific
> computing.

I don't really see how writing b'match' instead of 'match' is that big a deal.
And why do you need to write .decode('ascii') everywhere? If you really do
just want to work with bytes in a known encoding, why not just read and write
in binary mode?
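To make the point concrete, here is a minimal sketch of the comparison being
discussed, using a small illustrative array (the names and data are my own,
not from the thread). With the bytes ('S') dtype on Python 3, the filter works
as long as the literal is a bytes literal:

```python
import numpy as np

# Illustrative array of short ASCII strings stored compactly as bytes.
text_array = np.array([b'match', b'other', b'match'], dtype='S5')

# Element-wise filtering works, but the comparison value must itself
# be bytes on Python 3 -- comparing against the str 'match' does not
# match any element, which is the inconvenience being debated.
match_rows = text_array == b'match'
```

This is the trade-off in a nutshell: the 'S' dtype is memory-efficient, but
every string literal at the interface has to carry the b'' prefix.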

I apologise if I'm wrong, but I suspect that much of the difficulty in getting
the bytes/unicode separation right comes from the fact that a lot of the code
you're using (or attempting to support) hasn't yet been ported to a clean text
model. When I started using Python 3 it took me quite a few failed attempts
before I understood how the text model is supposed to be used. The problem was
that I had been conflating text and bytes in many places, and that is hard to
disentangle. Having fixed most of those problems I now understand why it is
such an improvement.

In any case, I don't see anything wrong with a more efficient dtype for
representing text, provided the user can specify the encoding. The problem is
that numpy arrays expose their underlying memory buffer. Allowing them to
interact directly with text strings on the one side and binary files on the
other breaches Python 3's very good text model unless the encoding to be used
is explicit. Or, if there is to be a single blessed encoding, at least make it
unicode-capable UTF-8 instead of legacy ASCII/Latin-1.
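One way to keep the encoding explicit today, as a sketch of "handle
decoding/encoding at the boundaries" (the array contents here are invented
for illustration), is to store compact bytes and convert with numpy.char:

```python
import numpy as np

# Compact storage: one byte per character in the 'S' (bytes) dtype.
raw = np.array([b'alpha', b'beta'], dtype='S5')

# Decode to a unicode ('U') array only where text semantics are needed;
# the encoding is named explicitly rather than assumed.
text = np.char.decode(raw, 'ascii')

# Encode back to bytes at the output boundary, again naming the encoding.
back = np.char.encode(text, 'ascii')
```

The cost is the temporary 'U' copy (four bytes per character), which is
exactly the overhead that motivates the one-byte string dtype proposal.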
