[Python-Dev] PEP 383 update: utf8b is now the error handler

Wed May 6 08:06:07 CEST 2009

"Martin v. Löwis" writes:

 > Done: the Python-Version header already clarifies that point.

Ah, OK.  I wish my day job required reading more PEPs so I'd be more
familiar with these formalities. :-)

 > > Second, I suggest "surrogate-replace" as the name of the error handler
 > > rather than "utf8b".
 > 
 > I think this is bike-shedding.

I don't personally care (I already was aware of UTF-8B), but there are
plenty of others who do.  I think that's a good name to make
Marc-Andre and Terry happier.  You have to fix the existing uses of
the obsolete "python-escape", anyway.

 > It's a security risk. If U+DCXX would map to \xXX, then somebody could
 > embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets
 > sanitized, nobody would expect that this will actually access ../

The odds that anybody will actually take notice of U+002E U+002E
U+002F in a string are sufficiently small that any number of exploits
have already been based on it.  I agree that there is some additional
risk from this if people make the check for "../" before they prepend
"\udc2e\udc2e\udc2f", but I think that risk is very small compared to
the pain of having an error handler whose raison d'etre is to not raise
exceptions go ahead and raise them anyway.
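
For readers who want to see the behavior under debate, here is a minimal
sketch. It assumes a Python 3 release in which PEP 383 shipped; there the
handler is spelled 'surrogateescape' rather than 'utf8b'. The sample
strings are made up for illustration. Undecodable bytes round-trip through
U+DC80..U+DCFF, while U+DC2E (which would map to an ASCII byte) is
rejected on encoding:

```python
# Assumption: Python 3.1+, where the PEP 383 handler is named "surrogateescape".
raw = b"caf\xe9"  # illustrative bytes that are not valid UTF-8

# Decoding smuggles the bad byte 0xE9 through as the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")
round_trip = text.encode("utf-8", "surrogateescape")

# U+DC2E U+DC2E U+DC2F would correspond to b"../" if ASCII were mapped too.
sneaky = "\udc2e\udc2e\udc2f"
try:
    sneaky.encode("utf-8", "surrogateescape")
except UnicodeEncodeError:
    rejected = True   # the handler re-raises for surrogates below U+DC80
else:
    rejected = False
```

Whether raising here is the right trade-off is exactly the point of
disagreement above; the sketch only shows what the handler does.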

See also my reply to Lino Mastrodomenico.  Again, an option is good
enough for my purposes as long as interfaces for os.listdir() and the
like support setting the error handler (cf. Zooko's proposal), but I
think the option should be available.

 > I tried to understand "surrogate", and it was explained to me that
 > "surrogate" is something that stands for something - but then I
 > would argue that the two subsequent codes form a surrogate - they
 > stand for something else. The individual surrogate code (in Unicode
 > terminology) doesn't stand for anything. So don't you agree that
 > it is the Unicode terminology that is in error, not the PEP?

Plausibly so.  Keep making comments like that and nobody will ever let
you off the hook for being a non-native speaker!

However, "surrogate" in English is typically used in situations that
are too complex to be covered by simply "substitution."  I've always
read "surrogate" as "alternative form of encoding", and "surrogate
code point" as "code point in that alternative form of encoding",
where it's an alternative to code-point-is-scalar-value.  I think
probably the authors of the terminology just made the best of a bad
situation; I can't think of a better single word for this.
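
To make the terminology concrete, here is a small sketch in Python 3
(the choice of U+1F600 is arbitrary). A pair of surrogate code points in
UTF-16 stands for one scalar value, while a lone surrogate stands for
nothing and is refused by the strict UTF-8 codec:

```python
import struct

ch = "\U0001F600"  # one scalar value outside the BMP, chosen arbitrarily

# In UTF-16 the scalar value is carried by two 16-bit surrogate code units.
high, low = struct.unpack("<2H", ch.encode("utf-16-le"))

# Recombine the pair using the standard UTF-16 arithmetic.
scalar = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# A single surrogate code point is not a scalar value, so strict UTF-8
# refuses to encode it.
try:
    "\ud83d".encode("utf-8")
except UnicodeEncodeError:
    lone_surrogate_encodable = False
else:
    lone_surrogate_encodable = True
```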

 > No. The specification puts no requirements on applications whatsoever.
 > So if you propose to use MUST NOT in the RFC 2119 sense, I strongly
 > disagree.

I do propose that.

But you're writing the PEP, so this battle will have to be deferred.
Eventually Python will have to take a stand on Unicode conformance,
but it's not urgent yet.

 > > 3.  In the discussion, the transition from the example of alternative
 > >     use of 'python-escape' to discussion of the error handler
 > >     interface extension is a bit abrupt.  I suggest rewriting as:
 > > 
 > >     """The extension to the encode error handler interface proposed by
 > >     this PEP is necessary to implement the 'utf8b' error handler,
 > >     because there are required byte sequences which cannot be
 > >     generated from replacement Unicode.  However, the encode error
 > >     handler interface presently requires replacement Unicode to be
 > >     provided in lieu of the non-encodable Unicode from the source
 > >     string.  Then it promptly encodes that replacement Unicode.  In
 > >     some error handlers, such as the 'utf8b' proposed here, it is also
 > >     simpler and more efficient for the error handler to provide a
 > >     pre-encoded replacement byte string, rather than forcing it to
 > >     calculate Unicode from which the encoder would create the
 > >     desired bytes."""
 > 
 > Unfortunately, I failed to understand where you want this text to
 > go. What paragraphs should I remove, or (if none), after which
 > paragraph should I insert this text?

Sorry!  I suggest substituting the paragraph above for the paragraph
which begins "The encode error handler interface presently requires..."
at line 129.
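
As an illustration of the interface extension that paragraph describes,
here is a rough sketch of an encode error handler that hands back
pre-encoded bytes instead of replacement Unicode. It assumes a Python 3
release in which the extension was adopted (3.1 or later); the handler
name "demo-bytes-escape" and the function bytes_escape are hypothetical,
invented for this example, and this is not the PEP's own implementation:

```python
import codecs

def bytes_escape(exc):
    """Hypothetical handler: map U+DC80..U+DCFF back to bytes 0x80..0xFF."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        if not 0xDC80 <= cp <= 0xDCFF:
            raise exc  # not an escaped byte; let the error propagate
        out.append(cp - 0xDC00)
    # Returning bytes (not str) asks the encoder to copy them verbatim
    # and resume encoding at exc.end.
    return bytes(out), exc.end

# The name is made up for this sketch.
codecs.register_error("demo-bytes-escape", bytes_escape)

encoded = "caf\udce9".encode("utf-8", "demo-bytes-escape")
```

No replacement str could make the UTF-8 encoder emit the lone byte 0xE9,
which is why the handler has to be allowed to return bytes.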

I think I forgot to do this before:  "I hereby dedicate all text
I suggest for inclusion in the PEP to the public domain."