[Numpy-discussion] A one-byte string dtype?

Mon Jan 20 15:13:08 EST 2014

On Mon, Jan 20, 2014 at 12:12 PM, Aldcroft, Thomas
<aldcroft at head.cfa.harvard.edu> wrote:
>
>
>
> On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com> wrote:
>>
>> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
>> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
>> > <oscar.j.benjamin at gmail.com> wrote:
>> > > How significant are the performance issues? Does anyone really use
>> > > numpy for this kind of text handling? If you really are operating on
>> > > gigantic text arrays of ascii characters then is it so bad to just use
>> > > the bytes dtype and handle decoding/encoding at the boundaries? If
>> > > you're not operating on gigantic text arrays is there really a
>> > > noticeable problem just using the 'U' dtype?
>> > >
>> >
>> > I use numpy for giga-row arrays of short text strings, so memory and
>> > performance issues are real.
>> >
>> > As discussed in the previous parent thread, using the bytes dtype is
>> > really a problem because users of a text array want to do things like
>> > filtering (`match_rows = text_array == 'match'`), printing, or other
>> > manipulations in a natural way without having to continually use
>> > bytestring literals or `.decode('ascii')` everywhere.  I tried converting
>> > a few packages while leaving the arrays as bytestrings and it just ended
>> > up as a very big mess.
>> >
>> > From my perspective the goal here is to provide a pragmatic way to allow
>> > numpy-based applications and end users to use python 3.  Something like
>> > this proposal seems to be the right direction, maybe not pure and perfect
>> > but a sensible step to get us there given the reality of scientific
>> > computing.
>>
>> I don't really see how writing b'match' instead of 'match' is that big a
>> deal.
>
>
> It's a big deal because all your existing python 2 code suddenly breaks on
> python 3, even after running 2to3.  Yes, you can backfix all the python 2
> code and use bytestring literals everywhere, but that is very painful and
> ugly.  More importantly it's very fiddly because *sometimes* you'll need to
> use bytestring literals, and *sometimes* not, depending on the exact dataset
> you've been handed.  That's basically a non-starter.
>
> As you say below, the only solution is a proper separation of bytes/unicode
> where everything internally is unicode.  The problem is that the existing
> 4-byte unicode in numpy is a big performance / memory hit.  It's even
> trickier because libraries will happily deliver a numpy structured array
> with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to
> then convert to 'U' since you need to remake the entire structured array.
> With a one-byte unicode the goal would be an in-place update of 'S' to 's'.
>
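To make the cost described above concrete, here is a minimal sketch of converting an 'S' field to 'U' in a structured array. The field names and sizes are made up for illustration and do not come from any real dataset. The only point is that each character grows from 1 byte to 4, so the array has to be rebuilt rather than updated in place.

```python
import numpy as np

# Hypothetical record layout standing in for data read from a binary file.
records = np.zeros(3, dtype=[("name", "S8"), ("value", "f8")])
records["name"] = [b"alpha", b"beta", b"gamma"]

# Build a new dtype in which every 'S' field becomes a 'U' field with the
# same number of characters.
new_descr = []
for field in records.dtype.names:
    field_dtype = records.dtype[field]
    if field_dtype.kind == "S":
        new_descr.append((field, "U{}".format(field_dtype.itemsize)))
    else:
        new_descr.append((field, field_dtype))

# A fresh array is allocated; every field is copied, and the bytes fields
# are decoded on the way.
converted = np.empty(records.shape, dtype=new_descr)
for field in records.dtype.names:
    if records.dtype[field].kind == "S":
        converted[field] = np.char.decode(records[field], "ascii")
    else:
        converted[field] = records[field]

print(records.dtype.itemsize, converted.dtype.itemsize)
```

The per-row size printed at the end shows the growth of the text field, which is the memory hit the paragraph above refers to.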
>>
>> And why are you needing to write .decode('ascii') everywhere?
>
>
> >>> print("The first value is {}".format(bytestring_array[0]))
>
> On Python 2 this gives "The first value is string_value", while on Python 3
> this gives "The first value is b'string_value'".

Unfortunately (?) set_printoptions and set_string_function don't work
with numpy scalars AFAICS. If they did, it would be possible to
override the string representation. They do work for arrays.
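
A small sketch of that difference, assuming a NumPy recent enough to have the `np.printoptions` context manager (used here only so the setting does not leak into the rest of the session). The `decode_ascii` helper is just an illustrative name.

```python
import numpy as np

arr = np.array([b"abc", b"de"], dtype="S3")

def decode_ascii(value):
    # Turn a bytes element into text, dropping anything that is not ASCII.
    return value.decode("ascii", errors="ignore")

with np.printoptions(formatter={"all": decode_ascii}):
    array_text = str(arr)      # array printing consults the formatter
    scalar_text = str(arr[0])  # the scalar still uses the plain bytes str()

print(array_text)
print(scalar_text)
```

The array output is decoded text, while the scalar keeps its bytes-style representation, which is the case that matters for `format()` calls on single elements.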

I didn't find the right key for numpy.bytes_ on python 3.3, so now my
interpreter can only print bytes:

np.set_printoptions(
    formatter={'all': lambda x: x.decode('ascii', errors="ignore")})
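
For reference, the `set_printoptions` documentation in current NumPy releases lists a 'numpystr' formatter key that covers numpy.bytes_ and numpy.str_. The sketch below assumes that key is available and has not been checked against the versions discussed in this thread. The array contents and the `decode_ascii` name are illustrative only.

```python
import numpy as np

names = np.array([b"alpha", b"beta"])
counts = np.array([1, 2, 3])

def decode_ascii(value):
    # 'numpystr' is used for both bytes and str elements, so only decode bytes.
    if isinstance(value, bytes):
        return value.decode("ascii", errors="ignore")
    return str(value)

with np.printoptions(formatter={"numpystr": decode_ascii}):
    names_text = str(names)    # string-like elements go through decode_ascii
    counts_text = str(counts)  # integer arrays keep their default formatting

print(names_text)
print(counts_text)
```

Restricting the formatter to string types avoids the side effect of the 'all' key, where every other dtype is also passed to a function that expects bytes.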

Josef

>
>>
>> If you really do just want to work with bytes in your own known encoding
>> then why not just read and write in binary mode?
>>
>> I apologise if I'm wrong but I suspect that much of the difficulty in
>> getting the bytes/unicode separation right is down to the fact that a lot
>> of the code you're using (or attempting to support) hasn't yet been ported
>> to a clean text model. When I started using Python 3 it took me quite a
>> few failed attempts at understanding the text model before I got to the
>> point where I understood how it is supposed to be used. The problem was
>> that I had been conflating text and bytes in many places, and that's hard
>> to disentangle. Having fixed most of those problems I now understand why
>> it is such an improvement.
>>
>> In any case I don't see anything wrong with a more efficient dtype for
>> representing text if the user can specify the encoding. The problem is
>> that numpy arrays expose their underlying memory buffer. Allowing them to
>> interact directly with text strings on the one side and binary files on
>> the other breaches Python 3's very good text model unless the user can
>> specify the encoding that is to be used. Or at least if there is to be a
>> blessed encoding then make it unicode-capable utf-8 instead of legacy
>> ascii/latin-1.
>>
>>
>> Oscar
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion