[Numpy-discussion] A one-byte string dtype?

Oscar Benjamin oscar.j.benjamin at gmail.com
Tue Jan 21 09:43:31 EST 2014


On Tue, Jan 21, 2014 at 06:55:29AM -0700, Charles R Harris wrote:
>
> Well, that's open for discussion. The problem is to have something that is
> both compact (latin-1) and interoperates transparently with python 3
> strings (utf-8). A latin-1 type would be easier to implement and would
> probably be a better choice for something available in both python 2 and
> python 3, but unless the python 3 developers come up with something clever
> I don't  see how to make it behave transparently as a string in python 3.
> OTOH, it's not clear to me how to make utf-8 operate transparently with
> python 2 strings, especially as the unicode representation choices in
> python 2 are ucs-2 or ucs-4

On Python 2, unicode strings can operate transparently with byte strings:

$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.array([u'\xd5scar'], dtype='U')
>>> a
array([u'\xd5scar'], 
      dtype='<U5')
>>> a[0]
u'\xd5scar'
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print(a[0])  # Encodes as 'utf-8'
Õscar
>>> 'My name is %s' % a[0]  # Decodes as ASCII
u'My name is \xd5scar'
>>> print('My name is %s' % a[0])  # Encodes as UTF-8
My name is Õscar

This is no better or worse than the rest of the Py2 text model. So if the new
dtype always returns a unicode string under Py2, it should work (as well as the
Py2 text model ever does).

> and the python 3 work adding utf-16 and utf-8
> is unlikely to be backported. The problem may be unsolvable in a completely
> satisfactory way.

What do you mean by this? PEP 393 uses UCS-1/2/4, not utf-8/16/32, i.e. it
always uses a fixed-width encoding.
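
To illustrate the fixed-width point, here is a rough sketch that measures how
much a str grows per added character. It assumes CPython 3.3 or later and
relies on sys.getsizeof, which is an implementation detail; bytes_per_char is
just a name made up for this example.

```python
import sys


def bytes_per_char(ch, n=100):
    """Estimate storage per character for a string made only of `ch`.

    Hypothetical helper: relies on CPython's PEP 393 layout, where every
    character in a given string occupies the same number of bytes.
    """
    # Adding one character grows the object by exactly one code unit.
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


# ASCII, Latin-1, BMP and non-BMP characters respectively.
widths = [bytes_per_char(c) for c in ("a", "\xd5", "\u0151", "\U0001f600")]
print(widths)
```

The width is chosen per string from its largest code point, so a single
non-BMP character makes every character in that string take 4 bytes.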

You can just use the CPython C-API to create the unicode strings. The simplest
way is probably to use utf-8 internally and then call PyUnicode_DecodeUTF8 and
PyUnicode_EncodeUTF8 at the boundaries. This should work fine on Python 2.x
and 3.x. It obviates any need to think about pre-3.3 narrow and wide builds
and post-3.3 FSR formats.
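
A pure-Python sketch of the same idea (not the C implementation, and not an
existing numpy API): store utf-8 bytes in fixed-size slots and encode/decode
only when an element crosses the boundary. The class name Utf8Array, the
NUL padding and the itemsize parameter are all assumptions for illustration.

```python
class Utf8Array(object):
    """Hypothetical container: each element is UTF-8 in a NUL-padded slot."""

    def __init__(self, length, itemsize):
        self.itemsize = itemsize                    # bytes per element
        self._buf = bytearray(length * itemsize)    # zero-filled storage

    def __len__(self):
        return len(self._buf) // self.itemsize

    def _offset(self, i):
        # Reject out-of-range indices instead of silently slicing past the end.
        if not 0 <= i < len(self):
            raise IndexError("index out of range")
        return i * self.itemsize

    def __setitem__(self, i, text):
        start = self._offset(i)
        raw = text.encode("utf-8")                  # encode at the boundary
        if len(raw) > self.itemsize:
            raise ValueError("encoded string does not fit in the slot")
        self._buf[start:start + self.itemsize] = raw.ljust(self.itemsize, b"\0")

    def __getitem__(self, i):
        start = self._offset(i)
        raw = bytes(self._buf[start:start + self.itemsize])
        # Decoding creates a new str object, O(number of chars).
        return raw.rstrip(b"\0").decode("utf-8")


arr = Utf8Array(2, 8)
arr[0] = u"\xd5scar"   # 6 bytes in utf-8
arr[1] = u"Oscar"      # 5 bytes in utf-8
print(arr[0], arr[1])
```

One simplification to be aware of: stripping the padding means a string that
genuinely ends in NUL characters would not round-trip.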

Unlike with Python's str, there isn't much need to be able to efficiently slice
or index within the string array element. Indexing into the array to get the
string requires creating a new object, so you may as well just decode from
utf-8 at that point [it's big-O(num chars) either way]. There's no need to
constrain it to fixed-width encodings like the FSR, in which case utf-8 is
clearly the best choice as:

1) It covers the whole unicode spectrum.
2) It uses 1 byte per char for ASCII.
3) UTF-8 is a big optimisation target for CPython (so it's fast).
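
Points 1) and 2) are easy to check with a small sketch that compares encoded
sizes. The helper name encoded_sizes and the sample strings are made up for
this example; utf-32-le is used so that no BOM is counted.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes per codec, or None if unencodable."""
    sizes = {}
    for codec in ("latin-1", "utf-8", "utf-32-le"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            # The codec cannot represent at least one character in `text`.
            sizes[codec] = None
    return sizes


# Pure ASCII, a Latin-1 character, and a character outside Latin-1 (U+0150).
for sample in (u"Oscar", u"\xd5scar", u"\u0150scar"):
    print(repr(sample), encoded_sizes(sample))
```

ASCII text costs the same in utf-8 as in latin-1, a Latin-1 character such as
U+00D5 costs one extra byte, and U+0150 cannot be stored in latin-1 at all.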


Oscar


