[Numpy-discussion] One-byte string dtype: third time's the charm?
Sturla Molden
sturla.molden at gmail.com
Sun Feb 22 14:29:25 EST 2015
On 22/02/15 19:21, Aldcroft, Thomas wrote:
> Problems like this are now showing up in the wild [3]. Workarounds are
> also showing up, like a way to easily convert from 'S' to 'U' within
> astropy Tables [4], but this is really not a desirable way to go.
> Gigabyte-sized string data arrays are not uncommon, so converting to
> UCS-4 is a real memory and performance hit.
Why UCS-4? Python's internal "flexible string representation" will
use ASCII for ASCII text.
By PEP 393 an application should not assume an internal string
representation at all:
https://www.python.org/dev/peps/pep-0393/
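A rough way to observe the flexible representation, assuming CPython 3.3 or later (other interpreters may store strings differently): measure how much sys.getsizeof grows when a string doubles in length. The helper name bytes_per_char is made up for this sketch and is not part of any library.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by differencing two string sizes.

    Subtracting the sizes cancels the fixed object header, leaving only
    the storage used by the n extra characters.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

print(bytes_per_char("a"))           # ASCII character
print(bytes_per_char("\xe9"))        # Latin-1 character (e with acute accent)
print(bytes_per_char("\u20ac"))      # BMP character (euro sign)
print(bytes_per_char("\U0001F600"))  # character outside the BMP
```

On CPython the width is chosen per string from its widest character, so only strings that actually contain non-BMP characters pay four bytes per character.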
If the problem is PEP 393 violation in NumPy string or unicode dtype, we
shouldn't violate it even further by adding a latin-1 encoded ascii
string. We should let Python represent strings as it wants, and it will
not bloat.
I am -1 on adding latin-1 and +1 on making the unicode dtype PEP 393
compliant if it is not. And on Python 3 'U' and 'S' should just be synonyms.
You can also store an array of bytes with uint8. Then you can decode it
however you like to make a Python string. If it is encoded as latin-1
then decode it as latin-1:
In [1]: import numpy as np
In [2]: ascii_bytestr = "The quick brown fox jumps over the lazy dog".encode('latin-1')
In [3]: numpy_bytestr = np.array(memoryview(ascii_bytestr))
In [4]: numpy_bytestr.dtype, numpy_bytestr.shape
Out[4]: (dtype('uint8'), (43,))
In [5]: bytes(numpy_bytestr).decode('latin-1')
Out[5]: 'The quick brown fox jumps over the lazy dog'
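The same idea can be stretched to many strings at once. What follows is only a sketch under assumptions not made in the message above: the strings are packed into a 2-D uint8 array, one row per string, padded with NUL bytes, and no string contains a NUL itself. The names words, raw and decode_row are illustrative.

```python
import numpy as np

words = ["alpha", "beta", "gamma"]
width = max(len(w) for w in words)

# One row per string, one byte per character, NUL-padded on the right.
raw = np.zeros((len(words), width), dtype=np.uint8)
for i, w in enumerate(words):
    encoded = w.encode('latin-1')
    raw[i, :len(encoded)] = np.frombuffer(encoded, dtype=np.uint8)

def decode_row(row):
    """Turn one padded uint8 row back into a Python str."""
    return row.tobytes().rstrip(b"\x00").decode('latin-1')

decoded = [decode_row(row) for row in raw]
print(decoded)

# Storage used by the byte array versus NumPy's 'U' dtype for the same data.
print(raw.nbytes, np.array(words).nbytes)
```

The last line prints the two storage sizes side by side; the 'U' array uses four bytes per character, the uint8 array one.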
Sturla