PEP 393 vs UTF-8 Everywhere
petef4+usenet at gmail.com
Sat Jan 21 10:50:40 EST 2017
Steve D'Aprano <steve+python at pearwood.info> writes:
> Another factor which I didn't see discussed anywhere is that Python
> strings treat surrogates as normal code points. I believe that would
> be troublesome for a UTF-8 implementation:
> py> '\uDC37'.encode('utf-8')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
> position 0: surrogates not allowed
> but of course with a UCS-2 or UTF-32 implementation it is trivial: you
> just treat the surrogate as another code point like any other.
Thanks for a very thorough reply, most useful. I'm going to pick you up
on the above, though.
Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
and UTF-32. The rules for UTF-8 were tightened up in Unicode 4 and RFC
3629 (2003). There is CESU-8 if you really need a naive, UTF-8-like
encoding of UTF-16 code units.
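
To make that concrete, here is a rough sketch, assuming CPython 3.4 or later (earlier 3.x releases let the utf-16 and utf-32 codecs encode lone surrogates). The helper names is_rejected and cesu8_encode are my own, not anything in the standard library, and the CESU-8 helper is only an illustration built on the 'surrogatepass' error handler.

```python
# Sketch only: assumes CPython 3.4+, where all three strict UTF codecs
# refuse to encode a lone surrogate.
lone = '\udc37'


def is_rejected(text, codec):
    """Return True if the strict codec refuses to encode text."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


rejected = {codec: is_rejected(lone, codec)
            for codec in ('utf-8', 'utf-16', 'utf-32')}

# The 'surrogatepass' error handler opts out of the check and emits the
# three-byte pattern a pre-2003 encoder would have produced.
raw = lone.encode('utf-8', 'surrogatepass')


def cesu8_encode(text):
    """Hypothetical helper: encode text the CESU-8 way.

    Each UTF-16 code unit, surrogates included, is written out as its own
    UTF-8-style byte sequence.
    """
    utf16 = text.encode('utf-16-be')
    units = (int.from_bytes(utf16[i:i + 2], 'big')
             for i in range(0, len(utf16), 2))
    return b''.join(chr(unit).encode('utf-8', 'surrogatepass')
                    for unit in units)


print(rejected)
print(raw)
print(cesu8_encode('\U0001F600'))    # six bytes, two encoded surrogates
print('\U0001F600'.encode('utf-8'))  # four bytes, real UTF-8
```

For characters inside the BMP the two encodings agree; they differ only for supplementary characters, where CESU-8 takes six bytes and UTF-8 takes four.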
py> low = '\uDC37'
is only meaningful on narrow builds before Python 3.3, where the user
must do extra work to correctly handle characters outside the BMP.
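
For anyone who never had to do it, the "extra work" on a narrow build amounted to the standard UTF-16 surrogate arithmetic. This is a small sketch of it; the function names combine_surrogates and split_code_point are mine, chosen for illustration.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid high/low surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(code_point):
    """Split a supplementary code point into its surrogate pair."""
    if not (0x10000 <= code_point <= 0x10FFFF):
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = split_code_point(0x1F600)
print(hex(high), hex(low))
print(hex(combine_surrogates(high, low)))

# On Python 3.3+ (PEP 393) a supplementary character is a single item.
print(len('\U0001F600'))
```

On a narrow build the same literal had a len() of 2, and indexing it gave you the two surrogates separately.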