[Numpy-discussion] A one-byte string dtype?

Aldcroft, Thomas aldcroft at head.cfa.harvard.edu
Mon Jan 20 12:12:06 EST 2014

On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin
<oscar.j.benjamin at gmail.com> wrote:

> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
> > <oscar.j.benjamin at gmail.com> wrote:
> > > How significant are the performance issues? Does anyone really use
> > > numpy for this kind of text handling? If you really are operating on
> > > gigantic text arrays of ascii characters then is it so bad to just use
> > > the bytes dtype and handle decoding/encoding at the boundaries? If
> > > you're not operating on gigantic text arrays is there really a
> > > noticeable problem just using the 'U' dtype?
> >
> > I use numpy for giga-row arrays of short text strings, so memory and
> > performance issues are real.
> >
> > As discussed in the previous parent thread, using the bytes dtype is
> > really a problem because users of a text array want to do things like
> > filtering (`match_rows = text_array == 'match'`), printing, or other
> > manipulations in a natural way without having to continually use
> > bytestring literals or `.decode('ascii')` everywhere.  I tried converting
> > a few packages while leaving the arrays as bytestrings and it just ended
> > up as a very big mess.
> >
> > From my perspective the goal here is to provide a pragmatic way to allow
> > numpy-based applications and end users to use python 3.  Something like
> > this proposal seems to be the right direction, maybe not pure and perfect
> > but a sensible step to get us there given the reality of scientific
> > computing.
>
> I don't really see how writing b'match' instead of 'match' is that big a
> deal.

It's a big deal because all your existing Python 2 code suddenly breaks on
Python 3, even after running 2to3.  Yes, you can go back and fix all the
Python 2 code to use bytestring literals everywhere, but that is very
painful and ugly.  More importantly it's very fiddly because *sometimes*
you'll need to use bytestring literals, and *sometimes* not, depending on
the exact dataset you've been handed.  That's basically a non-starter.
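
To sketch the kind of shim this forces onto every comparison (the helper
name `match_rows` and the sample arrays are made up for illustration, not
taken from any package, and the data is assumed to be plain ASCII):

```python
import numpy as np


def match_rows(text_array, literal):
    """Hypothetical helper: compare an array against a str literal.

    On Python 3 an 'S' (bytes) array has to be compared against a bytes
    literal, while a 'U' (unicode) array takes a str literal.
    """
    if text_array.dtype.kind == 'S':
        # Assumes the data is plain ASCII.
        literal = literal.encode('ascii')
    return text_array == literal


from_binary_file = np.array([b'match', b'other'])  # 'S' dtype
from_text_source = np.array(['match', 'other'])    # 'U' dtype

print(match_rows(from_binary_file, 'match'))
print(match_rows(from_text_source, 'match'))
```

Both calls select the first row, but only because the helper checks the
dtype kind first; code that just writes `text_array == 'match'` has to know
which kind of array it was given.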

As you say below, the only solution is a proper separation of bytes/unicode
where everything internally is unicode.  The problem is that the existing
unicode dtype in numpy uses 4 bytes per character, which is a big
performance / memory hit.  It's even trickier because libraries will
happily deliver a numpy structured array with an 'S'-dtype field (from a
binary dataset on disk), and it's a pain to then convert to 'U' since you
need to remake the entire structured array.  With a one-byte unicode dtype
the goal would be an in-place update of 'S' to 's'.
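
A rough sketch of what "remaking the entire structured array" looks like
today, and of the size difference.  The field names, values, and the helper
`s_fields_to_u` are hypothetical, and the 'S' to 'U' cast assumes ASCII
data:

```python
import numpy as np

# A structured array as a library might hand it over from a binary file.
# Field names and values are illustrative only.
raw = np.array([(b'abc', 1.0), (b'de', 2.0)],
               dtype=[('name', 'S3'), ('value', 'f8')])


def s_fields_to_u(arr):
    """Return a copy of `arr` with every 'S' field turned into a 'U' field."""
    new_descr = []
    for name in arr.dtype.names:
        field_dtype = arr.dtype.fields[name][0]
        if field_dtype.kind == 'S':
            # Same number of characters, but 4 bytes each instead of 1.
            field_dtype = np.dtype('U{}'.format(field_dtype.itemsize))
        new_descr.append((name, field_dtype))
    # A whole new array has to be allocated; nothing is updated in place.
    out = np.empty(arr.shape, dtype=new_descr)
    for name in arr.dtype.names:
        out[name] = arr[name]  # the 'S' -> 'U' cast assumes ASCII data
    return out


text = s_fields_to_u(raw)
# Bytes per element of the 'name' field, before and after conversion.
print(raw.dtype['name'].itemsize, text.dtype['name'].itemsize)
```

The 'name' field grows from 3 to 12 bytes per row, and every other field
gets copied along the way.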

> And why are you needing to write .decode('ascii') everywhere?

>>> print("The first value is {}".format(bytestring_array[0]))

On Python 2 this gives "The first value is string_value", while on Python 3
this gives "The first value is b'string_value'".
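
A small sketch of the workaround on Python 3, assuming the array holds
ASCII data (the array contents are only an example):

```python
import numpy as np

bytestring_array = np.array([b'string_value'])

# Formatting the raw element on Python 3 includes the bytes repr,
# as described above.
raw = "The first value is {}".format(bytestring_array[0])

# Decoding first restores the Python 2 behaviour; assumes ASCII data.
decoded = "The first value is {}".format(bytestring_array[0].decode('ascii'))

print(raw)
print(decoded)
```

The decode call has to be repeated wherever an element of the array ends up
in user-facing text.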

> If you really do just want to work with bytes in your own known encoding
> then why not just read and write in binary mode?
>
> I apologise if I'm wrong but I suspect that much of the difficulty in
> getting the bytes/unicode separation right is down to the fact that a lot
> of the code you're using (or attempting to support) hasn't yet been ported
> to a clean text model. When I started using Python 3 it took me quite a
> few failed attempts at understanding the text model before I got to the
> point where I understood how it is supposed to be used. The problem was
> that I had been conflating text and bytes in many places, and that's hard
> to disentangle. Having fixed most of those problems I now understand why
> it is such an improvement.
>
> In any case I don't see anything wrong with a more efficient dtype for
> representing text if the user can specify the encoding. The problem is
> that numpy arrays expose their underlying memory buffer. Allowing them to
> interact directly with text strings on the one side and binary files on
> the other breaches Python 3's very good text model unless the user can
> specify the encoding that is to be used. Or at least if there is to be a
> blessed encoding then make it unicode-capable utf-8 instead of legacy
> ascii/latin-1.
>
> Oscar