[Numpy-discussion] Bytes vs. Unicode in Python3

Fri Nov 27 05:17:15 EST 2009

On Friday 27 November 2009 10:47:53 Pauli Virtanen wrote:
> 1) For 'S' dtype, I believe we use Bytes for the raw data and the
>    interface.
> 
>    Maybe we want to introduce a separate "bytes" dtype that's an alias
>    for 'S'?

Yeah.  As regular strings in Python 3 are Unicode, I think that introducing a
separate "bytes" dtype would help with the transition.  Meanwhile, the
following should still work:

In [2]: s = np.array(['asa'], dtype="S10")

In [3]: s[0]
Out[3]: 'asa'  # will become b'asa' in Python 3

In [4]: s.dtype.itemsize
Out[4]: 10     # still 1 byte per character
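
A minimal sketch of the same check written for Python 3, assuming a NumPy
build that runs there. The input is given as a bytes literal so that no
implicit str-to-bytes conversion is relied upon:

```python
import numpy as np

# Assumes Python 3; b'asa' makes the byte-string input explicit.
s = np.array([b'asa'], dtype="S10")

item = s[0]
print(isinstance(item, bytes))   # the scalar behaves as a bytes object
print(item == b'asa')            # compares equal to the bytes literal
print(s.dtype.itemsize)          # 10 bytes per element, 1 byte per character
```

The point is only that the 'S' interface hands back bytes rather than a
Unicode string, while the storage layout stays the same.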

Also, I suppose that there will be issues with the current Unicode support in 
NumPy:

In [5]: u = np.array(['asa'], dtype="U10")

In [6]: u[0]
Out[6]: u'asa'  # will become 'asa' in Python 3

In [7]: u.dtype.itemsize
Out[7]: 40      # not sure about the size in Python 3

For example, if it is true that internal strings in Python 3 are Unicode UTF-8
(as René seems to suggest), I suppose that the internal conversions from the
2-byte or 4-byte representation (depending on how the Python interpreter has
been compiled) in the NumPy Unicode dtype to the new Python string would have
to be reworked (perhaps you have dealt with that already).
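
To make the width question concrete, here is a rough sketch in plain Python 3
(no NumPy) of a fixed-width, 4-bytes-per-character buffer, which is what an
itemsize of 40 for "U10" corresponds to, compared with the UTF-8 encoding of
the same text. The helper names are hypothetical and are not NumPy API:

```python
# Assumes Python 3. Helper names are illustrative only, not NumPy functions.

def to_fixed_ucs4(text, width):
    """Encode text as little-endian UCS-4, NUL-padded to `width` characters."""
    raw = text.encode("utf-32-le")
    if len(raw) > 4 * width:
        raise ValueError("text longer than field width")
    return raw.ljust(4 * width, b"\x00")

def from_fixed_ucs4(buf):
    """Decode a NUL-padded UCS-4 buffer back to a Python string."""
    return buf.decode("utf-32-le").rstrip("\x00")

buf = to_fixed_ucs4("asa", 10)
print(len(buf))                     # 40 bytes: 10 characters x 4 bytes
print(from_fixed_ucs4(buf))         # round-trips to 'asa'
print(len("asa".encode("utf-8")))   # 3 bytes in UTF-8 for the same text
```

Whatever the interpreter uses internally, some conversion like this has to
happen between the fixed-width array storage and the Python string object.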

Cheers,

-- 
Francesc Alted