[Numpy-discussion] One-byte string dtype: third time's the charm?

Sun Feb 22 14:29:25 EST 2015

On 22/02/15 19:21, Aldcroft, Thomas wrote:

> Problems like this are now showing up in the wild [3].  Workarounds are
> also showing up, like a way to easily convert from 'S' to 'U' within
> astropy Tables [4], but this is really not a desirable way to go.
> Gigabyte-sized string data arrays are not uncommon, so converting to
> UCS-4 is a real memory and performance hit.

Why UCS-4? Python's internal "flexible string representation" will
use ASCII for ASCII text.
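
As a rough illustration of that flexible representation, here is a minimal sketch, assuming CPython 3.3 or later, where sys.getsizeof reflects the PEP 393 storage. The helper name bytes_per_char is hypothetical and introduced only for this example; other Python implementations may report different numbers.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate the per-character storage cost of a str made only of `ch`.

    Hypothetical helper: comparing two lengths cancels the fixed object
    header, leaving only the per-character cost.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n


samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),         # U+00E9
    ("BMP", "\u20ac"),           # U+20AC
    ("astral", "\U0001f600"),    # outside the Basic Multilingual Plane
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On CPython this should report 1, 1, 2 and 4 bytes per character respectively: the width is chosen per string, from the widest character it contains.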

By PEP 393 an application should not assume an internal string 
representation at all:

https://www.python.org/dev/peps/pep-0393/

If the problem is a PEP 393 violation in NumPy's string or unicode dtype, we
shouldn't violate it even further by adding a latin-1 encoded string
dtype. We should let Python represent strings as it wants, and it will
not bloat.

I am -1 on adding latin-1 and +1 on making the unicode dtype PEP 393
compliant if it is not. And on Python 3 'U' and 'S' should just be synonyms.
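
For reference, the memory cost Thomas mentions can be seen from the item sizes of the two fixed-width dtypes. This is only a small sketch; the sample words are made up for the example.

```python
import numpy as np

words = ["quick", "brown", "fox"]

# 'S5': five bytes per element, one byte per character.
as_bytes = np.array(words, dtype="S5")

# 'U5': five UCS-4 code units per element, four bytes per character.
as_unicode = np.array(words, dtype="U5")

print(as_bytes.itemsize, as_bytes.nbytes)      # 5 15
print(as_unicode.itemsize, as_unicode.nbytes)  # 20 60
```

The 'U' array takes four times the memory of the 'S' array even though every character here is plain ASCII.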

You can also store an array of bytes with uint8. Then you can decode it 
however you like to make a Python string. If it is encoded as latin-1 
then decode it as latin-1:

In [1]: import numpy as np

In [2]: ascii_bytestr = "The quick brown fox jumps over the lazy dog".encode('latin-1')

In [3]: numpy_bytestr = np.array(memoryview(ascii_bytestr))

In [4]: numpy_bytestr.dtype, numpy_bytestr.shape
Out[4]: (dtype('uint8'), (43,))

In [5]: bytes(numpy_bytestr).decode('latin-1')
Out[5]: 'The quick brown fox jumps over the lazy dog'
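
A variation on the same idea, as a sketch: np.frombuffer wraps the existing bytes without copying them (the resulting array is read-only), and tobytes() goes back the other way. The variable names are only illustrative.

```python
import numpy as np

raw = "The quick brown fox jumps over the lazy dog".encode("latin-1")

# Wrap the bytes object as a uint8 array without copying the data.
# The array is read-only because bytes objects are immutable.
view = np.frombuffer(raw, dtype=np.uint8)

# Convert back to bytes and decode with the encoding used above.
text = view.tobytes().decode("latin-1")

print(view.dtype, view.shape, view.flags.writeable)
print(text)
```

If the array must be modified, calling view.copy() gives a writable uint8 array at the cost of one copy.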

Sturla