Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

6 May 2009


      ...
I qualify with a). I believe I understand c) but, as explained in my
other post, I do not think your reason applies.  In fact, I think
concern for naming rights might suggest that you *not* reuse the name
for something different.  I would have to learn more about the existing
'surrogates' handler to judge Antoine's suggestion 'surrogates-pass'.
'Surrogates-escape' is pretty good for the new handler since, to my
understanding, it 'escapes' 'bad bytes' by prefixing them with bits that
push them to the surrogates plane.
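
To make the 'escaping' concrete, here is a minimal sketch. It assumes the
handler under the name it carries in released Python 3 versions,
"surrogateescape"; the file name is made up for illustration. Each
undecodable byte 0xNN comes back as the lone surrogate U+DCNN, and encoding
with the same handler restores the original byte.

```python
# Sketch only: "surrogateescape" is the released spelling of the PEP 383
# handler; the byte string below is a hypothetical example value.
raw = b"caf\xe9.txt"  # 0xE9 is Latin-1 'e acute', not valid UTF-8 here

# The bad byte 0xE9 is mapped to the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# Encoding with the same handler turns U+DCE9 back into byte 0xE9.
roundtrip = text.encode("utf-8", "surrogateescape")
print(roundtrip == raw)
```

Note that the escaped string cannot be encoded with the default strict
handler, so the bad bytes cannot leak out unnoticed.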

See issue 3672. In essence, in Python 2.5:

py> u"\ud800".encode("utf-8")
'\xed\xa0\x80'
py> '\xed\xa0\x80'.decode("utf-8")
u'\ud800'

In 3.1,

py> "\ud800".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed
py> "\ud800".encode("utf-8","surrogates")
b'\xed\xa0\x80'
py> b'\xed\xa0\x80'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2:
illegal encoding
py> b'\xed\xa0\x80'.decode("utf-8","surrogates")
'\ud800'
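
For anyone trying this on a released Python 3: as far as I know the handler
shown above as "surrogates" is spelled "surrogatepass" there. A small sketch
of the same round trip under that assumption:

```python
# Sketch only: assumes the handler name "surrogatepass" used by released
# Python 3 versions instead of the "surrogates" spelling in the session above.
lone = "\ud800"

# The strict default refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# With the handler, the surrogate is encoded as if it were a normal code point.
encoded = lone.encode("utf-8", "surrogatepass")

# The strict default also refuses to decode those three bytes.
try:
    encoded.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# With the handler, decoding gives back the lone surrogate.
decoded = encoded.decode("utf-8", "surrogatepass")
print(ascii(decoded), encoded, strict_encode_ok, strict_decode_ok)
```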

Regards,
Martin

"Martin v. Löwis"