[Python-Dev] PEP 383 update: utf8b is now the error handler

Stephen J. Turnbull stephen at xemacs.org
Tue May 5 15:09:25 CEST 2009


M.-A. Lemburg writes:
 > On 2009-05-03 19:39, Martin v. Löwis wrote:
 > >> If the error handler is supposed to be used for codecs other than utf-8,
 > >> perhaps it should be renamed something more generic, e.g. "surrogate-escape"?
 > > 
 > > Perhaps. However, utf-8b doesn't really have anything to do with utf-8 -
 > > it's an algorithm based on 16-bit or 32-bit code points.

I don't understand this phrasing.  The algorithm is only applicable to
ASCII-compatible octet streams.  It results in code points by a simple
displacement of octet -> octet + 0xDC00.  It cannot be used on (say)
UTF-32 to deal with embedded surrogates.
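
As a minimal sketch of that displacement (the helper names are
hypothetical, and I assume plain ASCII as the underlying codec, so that
every octet >= 0x80 counts as undecodable):

```python
# Sketch only: hypothetical helpers, not the PEP's implementation.
# Assumption: the underlying codec is ASCII, so any octet >= 0x80 is
# treated as undecodable and displaced by 0xDC00.
OFFSET = 0xDC00


def escape_octets(data: bytes) -> str:
    """Map ASCII octets to themselves and every other octet b to U+DC00 + b."""
    return "".join(chr(b) if b < 0x80 else chr(OFFSET + b) for b in data)


def unescape_octets(text: str) -> bytes:
    """Invert escape_octets; reject code points it could not have produced."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif 0xDC80 <= cp <= 0xDCFF:
            out.append(cp - OFFSET)
        else:
            raise ValueError("cannot map U+%04X back to an octet" % cp)
    return bytes(out)
```

Octets 0x80..0xFF land on the lone low surrogates U+DC80..U+DCFF, so the
inverse is a simple subtraction.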

Certainly, the computation requires (at least) 16 bit numbers, but the
input must be restricted to a stream of 8-bit code points, while the
output is 16- or 32-bit code points.

 > Please use a more descriptive name [than "utf-8b"] for the handler
 > which does not cause confusion with an existing codec.

But please don't use "surrogate-escape" or (as in the current PEP)
"python-escape"; it's not an escaping (quotation) mechanism.
"surrogate-replace", "surrogate-substitute", or "surrogate-translate"
would be better names.

