[Numpy-discussion] A one-byte string dtype?

Aldcroft, Thomas aldcroft at head.cfa.harvard.edu
Mon Jan 20 10:00:55 EST 2014

On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
<oscar.j.benjamin at gmail.com>wrote:

> On Fri, Jan 17, 2014 at 02:30:19PM -0800, Chris Barker wrote:
> > Folks,
> >
> > I've been blathering away on the related threads a lot -- sorry if it's too
> > much. It's gotten a bit tangled up, so I thought I'd start a new one to
> > address this one question (i.e. don't bring up genfromtext here):
> >
> > Would it be a good thing for numpy to have a one-byte-per-character string
> > type?
> If you mean a string type that can only hold latin-1 characters then I think
> that this is a step backwards.
>
> If you mean a dtype that holds bytes in a known, specifiable encoding and
> automatically decodes them to unicode strings when you call .item() and has a
> friendly repr() then that may be a good idea.
> So for example you could have dtype='S:utf-8', which would store strings
> encoded as utf-8, e.g.:
>
> >>> text = array(['foo', 'bar'], dtype='S:utf-8')
> >>> text
> array(['foo', 'bar'], dtype='|S3:utf-8')
> >>> print(text)
> ['foo', 'bar']
> >>> text[0]
> 'foo'
> >>> text.nbytes
> 6
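(Editorial sketch: the 'S:utf-8' dtype above is only a proposal and does not exist in numpy, but the behaviour it describes can be roughly approximated today with np.char.encode and np.char.decode, keeping utf-8 bytes in a plain 'S' array and converting at the boundaries:)

```python
import numpy as np

# 'S:utf-8' is hypothetical; as an approximation, store utf-8 bytes
# in a plain 'S' array and decode/encode explicitly at the boundaries.
text = np.char.encode(np.array(['foo', 'bar'], dtype='U'), 'utf-8')
assert text.dtype == np.dtype('S3')   # 3 bytes per element
assert text.nbytes == 6               # matches the example above

decoded = np.char.decode(text, 'utf-8')  # back to unicode strings
assert decoded[0] == 'foo'
```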
> > We did have that with the 'S' type in py2, but the changes in py3 have made
> > it not quite the right thing. And it appears that enough people use 'S' in
> > py3 to mean 'bytes', so that we can't change that now.
>
> It wasn't really the right thing before either. That's why Python 3 has
> changed all of this.
> > The only difference may be that 'S' currently auto-translates to a bytes
> > object, resulting in things like:
> >
> > np.array(['some text',], dtype='S')[0] == 'some text'
> >
> > yielding False on Py3. And you can't do all the usual text stuff with the
> > resulting bytes object, either. (And it probably used the default encoding
> > to generate the bytes, so will barf on some inputs, though that may be
> > unavoidable.) So you need to decode the bytes that are given back, and now
> > that I think about it, I have no idea what encoding you'd need to use in
> > the general case.
>
> You should let the user specify the encoding or otherwise require them to use
> the 'U' dtype.
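(The pitfall Chris describes can be demonstrated directly; this is an editorial sketch of the Python 3 behaviour:)

```python
import numpy as np

# On Python 3 the 'S' dtype yields bytes, so comparing an element
# against a str is always False rather than matching.
a = np.array(['some text'], dtype='S')
assert a[0] == b'some text'            # the element is a bytes object
assert not (a[0] == 'some text')       # str vs bytes never compare equal

# Text semantics come back only after an explicit decode:
assert a[0].decode('utf-8') == 'some text'
```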
> > So the correct solution is (particularly on py3) to use the 'U' (unicode)
> > dtype for text in numpy arrays.
>
> Absolutely. Embrace the Python 3 text model. Once you understand the how, what
> and why of it you'll see that it really is a good thing!
> > However, the 'U' dtype is 4 bytes per character, and that may be "too big"
> > for some use-cases. And there is a lot of text in scientific data sets that
> > is pure ascii, or at least some 1-byte-per-character encoding.
> >
> > So, in the spirit of having multiple numeric types that use different
> > amounts of memory, and can hold different ranges of values, a
> > one-byte-per-character dtype would be nice:
> >
> > (Note: this opens the door for a 2-byte-per-character (UCS-2) dtype too. I
> > personally don't think that's worth it, but maybe that's because I'm an
> > English speaker...)
>
> You could just use a 2-byte encoding with the S dtype, e.g.
> dtype='S:utf-16-le'.
> > It could use the 's' (lower-case s) type identifier.
> >
> > For passing to/from python built-in objects, it would:
> >
> > * Allow either Python bytes objects or Python unicode objects as input
> >      a) bytes objects would be passed through as-is
> >      b) unicode objects would be encoded as latin-1
> >
> > [note: I'm not entirely sure that bytes objects should be allowed, but it
> > would provide a nice efficiency in a fairly common case]
>
> I think it would be a bad idea to accept bytes here. There are good reasons
> that Python 3 creates a barrier between the two worlds of text and bytes.
> Allowing implicit mixing of bytes and text is a recipe for mojibake. The
> TypeErrors in Python 3 are used to guard against conceptual errors that lead
> to data corruption. Attempting to undermine that barrier in numpy would be a
> backward step.
> I apologise if this is misplaced, but there seems to be an attitude that
> scientific programming isn't really affected by the issues that have led to
> the Python 3 text model. I think that's ridiculous; data corruption is a
> problem in scientific programming just as it is anywhere else.
> > * It would create python unicode text objects, decoded as latin-1.
>
> Don't try to bless a particular encoding, and stop trying to pretend that it's
> possible to write a sensible system where end users don't need to worry about
> and specify the encoding of their data.
>
> > Could we have a way to specify another encoding? I'm not sure how that
> > would fit into the dtype system.
>
> If the encoding cannot be specified then the whole idea is misguided.
> > I've explained the latin-1 thing on other threads, but the short version is:
> >
> >  - It will work perfectly for ascii text
> >  - It will work perfectly for latin-1 text (natch)
> >  - It will never give you a UnicodeEncodeError regardless of what
> >    arbitrary bytes you pass in.
> >  - It will preserve those arbitrary bytes through an encoding/decoding
> >    operation.
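(Editorial note: the round-trip claim above is easy to verify; latin-1 maps every byte value 0-255 to a distinct code point, so decoding can never fail and re-encoding is lossless:)

```python
# latin-1 maps each of the 256 byte values to a distinct code point,
# so decode/encode is lossless for arbitrary bytes and never raises.
raw = bytes(range(256))
assert raw.decode('latin-1').encode('latin-1') == raw

# utf-8, by contrast, rejects some byte sequences:
try:
    b'\xff'.decode('utf-8')
except UnicodeDecodeError:
    pass
```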
> So what happens if I do:
>
> >>> with open('myutf-8-file.txt', 'rb') as fin:
> ...     text = numpy.fromfile(fin, dtype='s')
> >>> text[0] # Decodes as latin-1 leading to mojibake.
>
> I would propose that it's better to be able to do:
>
> >>> with open('myutf-8-file.txt', 'rb') as fin:
> ...     text = numpy.fromfile(fin, dtype='s:utf-8')
>
> There's really no way to get around the fact that users need to specify the
> encoding of their text files.
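(The mojibake Oscar warns about, in miniature; an editorial sketch:)

```python
# utf-8 bytes decoded as latin-1 succeed silently but produce the
# wrong characters -- the classic mojibake failure mode.
data = 'café'.encode('utf-8')              # b'caf\xc3\xa9'
assert data.decode('latin-1') == 'cafÃ©'   # wrong text, no error raised
assert data.decode('utf-8') == 'café'      # correct with the right encoding
```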
> > (It still wouldn't allow you to store arbitrary unicode -- but that's the
> > limitation of one byte per character...)
>
> You could if you use 'utf-8'. It would be one byte per char for text that only
> contains ascii characters. However it would still support every character that
> the unicode consortium can dream up.

> The only possible advantage here is as a memory optimisation (potentially
> having a speed impact too, although it could equally be a speed regression).
> Otherwise it just adds needless complexity to numpy and to the code that uses
> the new dtype, as well as limiting its ability to handle unicode.

> How significant are the performance issues? Does anyone really use numpy for
> this kind of text handling? If you really are operating on gigantic text
> arrays of ascii characters then is it so bad to just use the bytes dtype and
> handle decoding/encoding at the boundaries? If you're not operating on
> gigantic text arrays, is there really a noticeable problem just using the 'U'
> dtype?

I use numpy for giga-row arrays of short text strings, so memory and
performance issues are real.
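(Editorial sketch of the memory cost in question: numpy's 'U' dtype stores 4 bytes per character (UCS-4), so short strings cost 4x what a one-byte-per-character dtype would:)

```python
import numpy as np

# 'U' stores 4 bytes per character; 'S' stores 1 byte per character.
u = np.array(['abc'] * 1000, dtype='U3')
s = np.array([b'abc'] * 1000, dtype='S3')
assert u.nbytes == 12000   # 1000 elements x 3 chars x 4 bytes
assert s.nbytes == 3000    # 1000 elements x 3 chars x 1 byte
```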

As discussed in the previous parent thread, using the bytes dtype is really
a problem because users of a text array want to do things like filtering
(`match_rows = text_array == 'match'`), printing, or other manipulations in
a natural way without having to continually use bytestring literals or
`.decode('ascii')` everywhere.  I tried converting a few packages while
leaving the arrays as bytestrings and it just ended up as a very big mess.
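(The awkwardness described above, sketched; `text_array` is an illustrative name, not from any real package:)

```python
import numpy as np

# With a bytes dtype, every comparison needs a bytestring literal,
# or an explicit decode at the boundary.
text_array = np.array([b'match', b'other'], dtype='S5')

match_rows = text_array == b'match'            # works, but b'' everywhere
assert match_rows.tolist() == [True, False]

# Decoding first allows natural comparison against str:
decoded = np.char.decode(text_array, 'ascii')
assert (decoded == 'match').tolist() == [True, False]
```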

From my perspective the goal here is to provide a pragmatic way to allow
numpy-based applications and end users to use python 3.  Something like
this proposal seems to be the right direction, maybe not pure and perfect
but a sensible step to get us there given the reality of existing
scientific code.
- Tom

> Oscar
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion