[Numpy-discussion] A one-byte string dtype?

Charles R Harris charlesr.harris at gmail.com
Mon Jan 20 12:21:27 EST 2014


On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas <
aldcroft at head.cfa.harvard.edu> wrote:

>
> On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin <
> oscar.j.benjamin at gmail.com> wrote:
>
>> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
>> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
>> > <oscar.j.benjamin at gmail.com> wrote:
>> > > How significant are the performance issues? Does anyone really use
>> > > numpy for this kind of text handling? If you really are operating on
>> > > gigantic text arrays of ascii characters then is it so bad to just use
>> > > the bytes dtype and handle decoding/encoding at the boundaries? If
>> > > you're not operating on gigantic text arrays is there really a
>> > > noticeable problem just using the 'U' dtype?
>> > >
>> >
>> > I use numpy for giga-row arrays of short text strings, so memory and
>> > performance issues are real.
>> >
>> > As discussed in the previous parent thread, using the bytes dtype is
>> > really a problem because users of a text array want to do things like
>> > filtering (`match_rows = text_array == 'match'`), printing, or other
>> > manipulations in a natural way without having to continually use
>> > bytestring literals or `.decode('ascii')` everywhere.  I tried converting
>> > a few packages while leaving the arrays as bytestrings and it just ended
>> > up as a very big mess.
>> >
>> > From my perspective the goal here is to provide a pragmatic way to allow
>> > numpy-based applications and end users to use python 3.  Something like
>> > this proposal seems to be the right direction, maybe not pure and
>> > perfect but a sensible step to get us there given the reality of
>> > scientific computing.
>>
>> I don't really see how writing b'match' instead of 'match' is that big a
>> deal.
>>
>
> It's a big deal because all your existing python 2 code suddenly breaks on
> python 3, even after running 2to3.  Yes, you can backfix all the python 2
> code and use bytestring literals everywhere, but that is very painful and
> ugly.  More importantly it's very fiddly because *sometimes* you'll need to
> use bytestring literals, and *sometimes* not, depending on the exact
> dataset you've been handed.  That's basically a non-starter.
>
> As you say below, the only solution is a proper separation of
> bytes/unicode where everything internally is unicode.  The problem is that
> the existing 4-byte unicode in numpy is a big performance / memory hit.
> It's even trickier because libraries will happily deliver a numpy
> structured array with an 'S'-dtype field (from a binary dataset on disk),
> and it's a pain to then convert to 'U' since you need to remake the entire
> structured array.  With a one-byte unicode the goal would be an in-place
> update of 'S' to 's'.
>
>
>> And why are you needing to write .decode('ascii') everywhere?
>
>
> >>> print("The first value is {}".format(bytestring_array[0]))
>
> On Python 2 this gives "The first value is string_value", while on Python
> 3 this gives "The first value is b'string_value'".
>
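
The formatting difference described in the quoted message can be reproduced without numpy. The following is a minimal sketch for Python 3, using a plain `bytes` object as a stand-in for an element of `bytestring_array` (that substitution is an assumption made so the example needs only the standard library):

```python
# A plain bytes object standing in for one element of an 'S'-dtype array.
value = b"string_value"

# str.format() on bytes falls back to str(value), which on Python 3
# includes the b'...' wrapper.
raw = "The first value is {}".format(value)
print(raw)      # The first value is b'string_value'

# Decoding at the point of use gives the Python 2 style output.
decoded = "The first value is {}".format(value.decode("ascii"))
print(decoded)  # The first value is string_value
```

This is the `.decode('ascii')` call that ends up repeated wherever a bytes value reaches user-facing text.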

As Nathaniel has mentioned, this is a known problem with Python 3, and the
Python developers are trying to come up with a solution. Python 3.4 solves
some of the existing problems, but this one remains. This is not just a numpy
issue; Python itself needs to provide some help.

Chuck
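
The memory and comparison trade-offs raised in the quoted messages can be checked with a short sketch. It assumes Python 3 with numpy installed; the array contents and variable names are hypothetical and chosen only for illustration:

```python
import numpy as np

# The same three short strings stored as bytes ('S') and as unicode ('U').
as_bytes = np.array([b"match", b"other", b"match"], dtype="S5")
as_text = np.array(["match", "other", "match"], dtype="U5")

# 'S5' uses 1 byte per character, 'U5' uses 4 bytes per character.
print(as_bytes.itemsize, as_text.itemsize)  # 5 20

# Filtering needs a bytes literal for 'S' and a str literal for 'U'.
mask_bytes = as_bytes == b"match"
mask_text = as_text == "match"

# Converting 'S' to 'U' allocates a new array four times the size.
converted = as_bytes.astype("U5")
```

The fourfold difference in `itemsize` is the performance and memory hit mentioned above, and the `astype` call shows why converting an 'S' field to 'U' means allocating a new array instead of updating in place.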