PEP 393 vs UTF-8 Everywhere

eryk sun eryksun at
Sat Jan 21 15:49:26 EST 2017

On Sat, Jan 21, 2017 at 8:21 PM, Pete Forman <petef4+usenet at> wrote:
> Marko Rauhamaa <marko at> writes:
>>> py> low = '\uDC37'
>> That should raise a SyntaxError exception.
> Quite. My point was that with older Python on a narrow build (Windows
> and Mac) you need to understand that you are using UTF-16 rather than
> Unicode. On a wide build or Python 3.3+ then all is rosy. (At this point
> I'm tempted to put in a winky emoji but that might push the internal
> representation into UCS-4.)

CPython allows lone surrogate codes in a str so that they can be used
with the "surrogateescape" and "surrogatepass" error handlers, which
are used for POSIX and Windows file-system encoding, respectively.
Maybe MicroPython goes about the file-system round-trip problem
differently, or maybe it just requires using bytes for file-system and
environment-variable names on POSIX and doesn't care about Windows.

"surrogateescape" allows 'decoding' arbitrary bytes:

    >>> b'\x81'.decode('ascii', 'surrogateescape')
    '\udc81'
    >>> '\udc81'.encode('ascii', 'surrogateescape')
    b'\x81'
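
As a slightly longer sketch of the same round trip (the sample bytes
are made up for illustration), each undecodable byte 0xNN is mapped to
the lone surrogate U+DCNN, and encoding with the same handler restores
the original bytes:

```python
# Arbitrary bytes, including values that are not valid ASCII.
raw = b'caf\xe9 \x81\xff'

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode('ascii', 'surrogateescape')
escaped = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode('ascii', 'surrogateescape')

# Without the handler, the escaped surrogates cannot be encoded.
try:
    text.encode('ascii')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(escaped, roundtrip == raw, strict_ok)
```

The point is only that the mapping is lossless in both directions; the
resulting str is not meant to be displayed or written out as text.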

This error handler is required by CPython on POSIX to handle arbitrary
bytes in file-system paths. For example, when running with LANG=C:

    >>> import os, sys
    >>> sys.getfilesystemencoding()
    >>> os.listdir(b'.')
    >>> os.listdir('.')
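
The same conversion is exposed directly by os.fsdecode() and
os.fsencode(). A minimal sketch, assuming a POSIX system and Python
3.6+ (for sys.getfilesystemencodeerrors()); the file name is
hypothetical and is never created on disk:

```python
import os
import sys

# The encoding and error handler CPython uses for file-system names.
encoding = sys.getfilesystemencoding()
errors = sys.getfilesystemencodeerrors()

# Hypothetical file name containing a byte that is invalid in ASCII
# and in UTF-8.
name_bytes = b'file-\x81.txt'

# bytes -> str -> bytes; on POSIX this is lossless for arbitrary bytes.
name_str = os.fsdecode(name_bytes)
back = os.fsencode(name_str)

print(encoding, errors, ascii(name_str), back == name_bytes)
```

On Windows the handler is "surrogatepass" instead, so this particular
byte string would not decode there.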

"surrogatepass" allows encoding surrogates:

    >>> '\udc81'.encode('utf-8', 'surrogatepass')
    b'\xed\xb2\x81'
    >>> b'\xed\xb2\x81'.decode('utf-8', 'surrogatepass')
    '\udc81'
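
For contrast, a small sketch of how the strict UTF-8 codec treats the
same data. Without "surrogatepass", both directions are rejected:

```python
lone = '\udc81'

# Strict UTF-8 refuses to encode a lone surrogate.
try:
    lone.encode('utf-8')
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# "surrogatepass" encodes it with the ordinary 3-byte pattern.
encoded = lone.encode('utf-8', 'surrogatepass')
decoded = encoded.decode('utf-8', 'surrogatepass')

# Strict UTF-8 also refuses to decode those bytes.
try:
    encoded.decode('utf-8')
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(encoded, decoded == lone, strict_encode_ok, strict_decode_ok)
```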

This error handler is used by CPython 3.6+ to encode Windows
file-system paths, which are sequences of 16-bit code units (UCS-2)
and so may contain lone surrogates, as WTF-8 (Wobbly). For example:

    >>> os.listdir('.')
    >>> os.listdir(b'.')
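
The conversion can be imitated on any platform with the codecs alone.
This is only an illustration of the idea, not CPython's actual code
path, and the 16-bit name is hypothetical:

```python
# Hypothetical UTF-16LE name: 'a' followed by the unpaired low
# surrogate U+DC81, as Windows would permit in a file name.
wide_name = b'a\x00\x81\xdc'

# Decode the 16-bit code units, letting the lone surrogate through.
name = wide_name.decode('utf-16-le', 'surrogatepass')

# Encode as UTF-8 with lone surrogates allowed (WTF-8).
wtf8 = name.encode('utf-8', 'surrogatepass')

# The reverse trip restores the original 16-bit code units.
restored = wtf8.decode('utf-8', 'surrogatepass').encode(
    'utf-16-le', 'surrogatepass')

print(ascii(name), wtf8, restored == wide_name)
```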
