PEP 393 vs UTF-8 Everywhere

Matt Ruffalo mruffalo at
Sat Jan 21 15:23:26 EST 2017

On 2017-01-21 10:50, Pete Forman wrote:
> Thanks for a very thorough reply, most useful. I'm going to pick you up
> on the above, though.
> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32. The rules for UTF-8 were tightened up in Unicode 4 and RFC
> 3629 (2003). There is CESU-8 if you really need a naive encoding of
> UTF-16 to UTF-8-alike.
> py> low = '\uDC37'
> is only meaningful on narrow builds pre Python 3.3 where the user must
> do extra to correctly handle characters outside the BMP.

Hi Pete-

Lone surrogate characters have a standardized use in Python, not just in
narrow builds of Python <= 3.2. Unpaired low surrogate characters
(U+DC80 through U+DCFF, via the "surrogateescape" error handler of
PEP 383) are used to store any bytes that couldn't be decoded with a
given character encoding scheme, for use in OS/filesystem interfaces
that use arbitrary byte strings:

Python 3.6.0 (default, Dec 23 2016, 08:25:24)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = 'héllo'
>>> b = s.encode('latin-1')
>>> b
b'h\xe9llo'
>>> from os import fsdecode, fsencode
>>> decoded = fsdecode(b)
>>> decoded
'h\udce9llo'
>>> fsencode(decoded)
b'h\xe9llo'
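
The session above depends on the filesystem encoding of the machine it
runs on. As a minimal sketch that does not depend on the locale, the
same mapping can be reproduced by naming the "surrogateescape" error
handler explicitly. The variable names here are illustrative only, and
UTF-8 is assumed as the target encoding:

```python
# Bytes that are valid Latin-1 but not valid UTF-8.
raw = 'héllo'.encode('latin-1')

# Strict UTF-8 decoding rejects the stray 0xE9 byte.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN (0x80..0xFF)
# to the lone low surrogate U+DCNN.
text = raw.decode('utf-8', errors='surrogateescape')
escaped = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]

# Encoding with the same handler restores the original bytes.
round_trip = text.encode('utf-8', errors='surrogateescape')

print(strict_ok, ascii(text), escaped, round_trip == raw)
```

os.fsdecode and os.fsencode apply this handler on POSIX systems, using
the filesystem encoding instead of a hard-coded 'utf-8'.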

This provides a mechanism for lossless round-trip decoding and encoding
of arbitrary byte strings which aren't valid under the user's locale.
This is absolutely necessary in POSIX systems in which filenames can
contain any sequence of bytes despite the user's locale, and is even
necessary in Windows, where filenames are stored as opaque
not-quite-UCS2 strings:

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> from pathlib import Path
>>> import os
>>> os.chdir(Path('~/Desktop').expanduser())
>>> filename = '\udcf9'
>>> with open(filename, 'w'): pass

>>> os.listdir('.')
['desktop.ini', '\udcf9']
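
A str holding a lone surrogate like the filename above is a valid Python
string, but it cannot be encoded as strict UTF-8, which is consistent
with the point that surrogates are forbidden there. A short sketch of
what the standard error handlers do with it (the variable names are
illustrative only):

```python
lone = '\udcf9'

# Strict UTF-8 refuses to encode a surrogate code point.
try:
    lone.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# surrogateescape turns U+DCF9 back into the single raw byte 0xF9.
as_raw_byte = lone.encode('utf-8', errors='surrogateescape')

# surrogatepass writes the surrogate as a three-byte sequence that is
# not valid UTF-8 but can be decoded again with the same handler.
as_three_bytes = lone.encode('utf-8', errors='surrogatepass')
back = as_three_bytes.decode('utf-8', errors='surrogatepass')

print(strict_failed, as_raw_byte, as_three_bytes, back == lone)
```

Code that writes such names to a log file or a socket has to pick one of
these handlers (or another such as 'backslashreplace') explicitly.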

