[Numpy-discussion] Bytes vs. Unicode in Python3

Francesc Alted faltet at pytables.org
Fri Nov 27 07:49:21 EST 2009

On Friday 27 November 2009 13:23:10 René Dudfield wrote:
> >> I don't think they are internally UTF-8:
> >> http://docs.python.org/3.1/c-api/unicode.html
> >>
> >> """Python’s default builds use a 16-bit type for Py_UNICODE and store
> >> Unicode values internally as UCS2."""
> >
> > Ah!  No changes for that matter.  Much better then.
> Hello,
> in py3...
> >>> 'Hello\u0020World !'.encode()
> b'Hello World !'
> >>> "Äpfel".encode('utf-8')
> b'\xc3\x84pfel'
> >>> "Äpfel".encode()
> b'\xc3\x84pfel'
> The default encoding does appear to be utf-8 in py3.
> Although it is compiled with something different, and stores it as
> something different, that is UCS2 or UCS4.

OK.  The default encoding for Unicode is one thing, and how Python keeps 
Unicode internally is another.  Internally, Python 3 still uses UCS2 or UCS4, 
i.e. the same thing as in Python 2, so no worries here.
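
To make the distinction concrete, here is a minimal sketch using only the 
standard library.  The variable names are illustrative only.  It shows that the 
default encoding only comes into play when a str is converted to bytes; the 
length of the str itself is counted in code points and does not depend on any 
encoding.

```python
import sys

# "\u00c4pfel" is the same string as "Äpfel" from the quoted session.
text = "\u00c4pfel"

# The default encoding is what str.encode() uses when no argument is given.
default_encoding = sys.getdefaultencoding()

# Encoding produces bytes; the non-ASCII letter takes two bytes in UTF-8.
encoded = text.encode()

n_code_points = len(text)   # length of the str, in code points
n_bytes = len(encoded)      # length of the encoded form, in bytes

# Decoding with the same encoding recovers the original str.
round_trip = encoded.decode(default_encoding)
```

Here the str has 5 code points while its UTF-8 form has 6 bytes, which is 
independent of how the interpreter stores the str in memory.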

> I imagine dtype 'S' and 'U' need more clarification.  As it misses the
> concept of encodings it seems?  Currently, S appears to mean 8bit
> characters no encoding, and U appears to mean 16bit characters no
> encoding?  Or are some sort of default encodings assumed?

You only need an encoding if you are going to represent Unicode strings with 
other types (for example bytes).  Currently, NumPy can transparently convert 
native Python Unicode strings (UCS2 or UCS4) to and from its own Unicode type 
(always UCS4).  So, we don't have to worry here either.
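
A small sketch of what "always UCS4" means in practice, assuming NumPy is 
installed and imported as np (the variable names are illustrative only): each 
element of a 'U' array takes 4 bytes per code point, and no encoding is named 
anywhere in the round trip.

```python
import numpy as np

# "\u00c4pfel" is the same string as "Äpfel" from the quoted session.
text = "\u00c4pfel"

# Building an array from a Python str gives NumPy's Unicode dtype.
arr = np.array([text])

kind = arr.dtype.kind            # 'U' marks the Unicode dtype
itemsize = arr.dtype.itemsize    # bytes reserved per element
chars = itemsize // 4            # UCS4: 4 bytes per code point

# Converting back to a native Python str needs no encoding either.
back = str(arr[0])
```

For this 5-character string the element size is 20 bytes, i.e. 4 bytes for 
each of the 5 code points.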

> btw, in my numpy tree there is a unicode_() alias to str in py3, and
> to unicode in py2 (inside the compat.py file).  This helped us in many
> cases with compatible string code in the pygame port.  This allows you
> to create unicode strings on both platforms with the same code.

Correct.  But, in addition, we are going to need a new 'bytes' dtype for NumPy 
for Python 3, right?

Francesc Alted
