[Python-Dev] PEP 383 update: utf8b is now the error handler

Terry Reedy tjreedy at udel.edu
Thu May 7 05:48:38 CEST 2009


Martin v. Löwis wrote:
>>> Are you serious?
>> Are you? ;-?  You are the one naming a codec-agnostic error handler (if
>> I understand correctly, and correct me if I do not) after a particular
>> codec, and denying that that could cause confusion.  See other message.
> 
> I can only repeat what I said before: I call it

What, specifically, is 'it'?

> utf8b because that's
> the established name for the algorithm

Which algorithm?

> it implements.

Again, what is 'it'?

As *I* read the sentence above, it is not true.

I went to the site you referred to as the source of your reasoning, and 
specifically to
http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/utf_8b.c

The algorithm called utf-8b *IS* utf-8 with the addition or replacement 
(of an error return) of essentially one line in each direction:

# encode
if 0xDC00 <= codepoint <= 0xDCFF:
     byte = codepoint - 0xDC00 #encode

Note: for security reasons, you are raising the lower limit to 
0xDC80. The comment at the top of utf_8b.c suggests that this is what 
the limit should be, and should have been in the file, with the other 
half of that surrogate area treated as an error, along with the other 
surrogate area.

# decode
# utf8_invalid() stands in for the decoder's own validity check
if 0x80 <= byte <= 0xFF and utf8_invalid(byte):
     codepoint = byte + 0xDC00 # decode
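
To make the two one-line changes concrete, here is a minimal sketch of 
just the byte <-> lone-surrogate mapping, not of a full codec.  The names 
escape_byte and unescape_codepoint are made up for illustration, and the 
lower limit is assumed to be 0xDC80 as in the PEP rather than 0xDC00 as 
in the snippet above.

```python
# Hypothetical helper names; this sketches only the mapping, not a codec.
SURROGATE_BASE = 0xDC00
LOW, HIGH = 0xDC80, 0xDCFF  # assumed PEP 383 range (lower limit 0xDC80)


def escape_byte(byte):
    """Decode direction: map an undecodable byte to a lone surrogate."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are escaped")
    return SURROGATE_BASE + byte


def unescape_codepoint(codepoint):
    """Encode direction: map an escaping surrogate back to its byte."""
    if not LOW <= codepoint <= HIGH:
        raise ValueError("not an escaped byte")
    return codepoint - SURROGATE_BASE


# Every escapable byte survives the round trip.
for b in range(0x80, 0x100):
    assert unescape_codepoint(escape_byte(b)) == b
```

Nothing in the mapping itself refers to UTF-8; only the decision about 
which bytes count as invalid belongs to the codec.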

> That algorithm was originally designed with UTF-8 in mind (and only
> meant to be applied for UTF-8), however, it remains the same algorithm
> even though PEP 383 widens its application.

The error handler designed with utf-8 in mind has no name in the encode 
direction and is called "utf_8b_decoder_invalid_bytes" in the decode 
direction.  By your reasoning, *that* should be its name in Python.  The 
encoding error handler would then be named analogously 
"utf_8b_encoder_invalid_codepoints".  Even these, to me, would be better 
than confusingly giving them the same name as the codec.
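
The codec-agnostic point can be illustrated with a short sketch.  It 
assumes a Python 3 interpreter in which the PEP 383 handler is registered 
as "surrogateescape" (the name it eventually shipped under, not the 
"utf8b" name debated here); the sample bytes are arbitrary.

```python
raw = b"abc\xff"  # 0xFF is invalid in both ASCII and UTF-8

# The same error handler is passed to two different codecs.
for codec in ("ascii", "utf-8"):
    text = raw.decode(codec, errors="surrogateescape")
    # The undecodable byte becomes the lone surrogate U+DCFF.
    assert text == "abc\udcff"
    # Encoding with the same handler restores the original bytes.
    assert text.encode(codec, errors="surrogateescape") == raw
```

The handler behaves identically under either codec, which is why naming 
it after one of them invites confusion.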

Terry Jan Reedy


