[Python-Dev] PEP 383 update: utf8b is now the error handler

MRAB google at mrabarnett.plus.com
Tue May 5 17:25:46 CEST 2009


Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
> 
>  > I've updated the PEP accordingly.
> 
> I have three substantive comments.  First, although consequences for
> Python 3 byte interfaces (i.e., "none") are explicitly stated, as far as
> I can see this PEP could apply to Python 2 as well.  I don't think
> it's intended that way.  Either way, I think you should clarify that
> point.
> 
> Second, I suggest "surrogate-replace" as the name of the error handler
> rather than "utf8b".  (Elsewhere I've suggested others, but I think
> this is the best of the bunch.)
> 
+1

> Third, it is not clear to me why non-decodable ASCII should be an
> error.  There are plenty of low surrogates for the purpose.  Is there
> another technical reason?  Stupid or not, Shift-JIS- and Big5-encoded
> file systems are quite common in Asia still (including non-rewritable
> media).  I think surrogate-replacement of ASCII should at least be an
> option.
> 
> I don't think "people shouldn't be using non-ASCII-compatible
> encodings for locale encodings" is a sufficient rationale for a hard
> error here.  I mean, of course they *should* be using UTF-8.  Maybe
> Python 3.1 should just go ahead and error on any other encoding on
> POSIX platforms? <wink>
> 
I don't see why the error handler couldn't in principle be used with
encodings other than UTF-8, although in that case all of the low
surrogates should be open to use.
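
To illustrate, here is a minimal sketch. It assumes the handler as it
later shipped in Python 3.1 under the name 'surrogateescape' (not the
'utf8b' name discussed in this thread), and the helper name
roundtrip() is made up for the example. The same undecodable byte
round-trips through both the UTF-8 and the ASCII codec:

```python
def roundtrip(raw, encoding):
    """Decode raw bytes with surrogate escaping, then encode them back.

    Hypothetical helper for illustration only; it relies on the
    'surrogateescape' handler available since Python 3.1.
    """
    text = raw.decode(encoding, "surrogateescape")
    return text, text.encode(encoding, "surrogateescape")


raw = b"caf\xe9.txt"  # 0xE9 is valid Latin-1, but invalid here as UTF-8 or ASCII

for encoding in ("utf-8", "ascii"):
    text, back = roundtrip(raw, encoding)
    # The bad byte 0xE9 becomes the lone low surrogate U+DCE9 ...
    assert text == "caf\udce9.txt"
    # ... and encoding restores the original bytes exactly.
    assert back == raw
```

Note that the shipped handler only uses U+DC80..U+DCFF and still
refuses to escape bytes below 128, so it does not open up the whole
low-surrogate range.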

> I have a number of nitpicking comments and technical clarifications on
> the PEP.  Rationale is in footnotes.  There were also a few typos I
> noticed.
> 
> 1.  There is no such thing as a "half-surrogate" in Unicode.  "Lone
>     surrogate" is clear enough.  Or for somewhat fancier English,
>     "isolated surrogate" or "non-syntactic surrogate".  To emphasize
>     that Python codecs will only produce them in contexts where a
>     Unicode character or high surrogate (for UTF-16 Python) is
>     syntactically required, "isolated low surrogate" or "isolated
>     trailing surrogate" might be good.[1]
> 
> 2.  The specification should state, and the discussion emphasize, that
>     strings which were produced by surrogate replacement *must not* be
>     used in data interchange with systems that do not specifically
>     accept such strings, and that this is the responsibility of the
>     application.[2]
> 
>     Rather than saying that "dealing with such conflicts is out of
>     scope of this PEP", I would say
> 
>     """Dealing with such conflicts is the responsibility of the
>     application.  Since this PEP's mechanism produces valid Unicode
>     where possible, and produces *invalid* code points only via the
>     error handler, one strategy is for the application to validate all
>     other sources of strings as Unicode conforming.  There may be
>     other useful application-specific strategies, as well."""
> 
> 3.  In the discussion, the transition from the example of alternative
>     use of 'python-escape' to discussion of the error handler
>     interface extension is a bit abrupt.  I suggest rewriting as:
> 
>     """The extension to the encode error handler interface proposed by
>     this PEP is necessary to implement the 'utf8b' error handler,
>     because there are required byte sequences which cannot be
>     generated from replacement Unicode.  However, the encode error
>     handler interface presently requires replacement Unicode to be
>     provided in lieu of the non-encodable Unicode from the source
>     string.  Then it promptly encodes that replacement Unicode.  In
>     some error handlers, such as the 'utf8b' proposed here, it is also
>     simpler and more efficient for the error handler to provide a
>     pre-encoded replacement byte string, rather than forcing it to
>     calculate Unicode from which the encoder would create the
>     desired bytes."""
> 
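The encode-side extension described in point 3 can be sketched as
follows. This is only an illustration, not the PEP's implementation:
the handler name "utf8b-sketch" and the function utf8b_sketch() are
invented here, and the sketch assumes a Python 3 in which an encode
error handler may return bytes as its replacement.

```python
import codecs


def utf8b_sketch(exc):
    """Illustrative handler (hypothetical name), not the PEP's own code."""
    bad = exc.object[exc.start:exc.end]
    if isinstance(exc, UnicodeDecodeError):
        if any(b < 128 for b in bad):
            raise exc  # undecodable ASCII stays a hard error, as in the PEP
        # Map each undecodable byte 0x80..0xFF to U+DC80..U+DCFF.
        return "".join(chr(0xDC00 + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        if not all(0xDC80 <= ord(ch) <= 0xDCFF for ch in bad):
            raise exc  # not produced by the decode side above
        # Return pre-encoded bytes; the encoder copies them to the output
        # instead of encoding replacement Unicode.
        return bytes(ord(ch) - 0xDC00 for ch in bad), exc.end
    raise exc


codecs.register_error("utf8b-sketch", utf8b_sketch)

raw = b"abc\xff\xfe.txt"
text = raw.decode("utf-8", "utf8b-sketch")
assert text == "abc\udcff\udcfe.txt"
assert text.encode("utf-8", "utf8b-sketch") == raw
```

Without the bytes return value there would be no replacement Unicode
the handler could supply that the UTF-8 encoder would turn into a
lone 0xFF byte.
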
> Typos (line references are to pep-0383.txt svn r72332):
> 
> l.  86: "Byte-orientied" -> "Byte-oriented"
> l.  98, 118, 124, 127, 132, 136: "python-escape" -> "utf8b"
> l. 130: "provide" -> "provided"
> l. 134: "calculating" -> "calculate"
> 
> 
> Footnotes: 
> [1] Unicode 5.0 uses the terms "high-half" and "low-half" at least
>     once, in section 16.6, but the context is such that I take it to
>     refer to "half of the surrogate area".  Section 3.8 doesn't use
>     these, instead noting that "leading" and "trailing" are sometimes
>     used instead of "high" and "low".  Better to avoid the word "half"
>     in PEP 383, I think.
> 
"Leading" and "trailing" simply state the order, not the set ("high" or
"low"), so they are not good terms to use.

> [2] Since this error handler is going to be the default for POSIX I/O,
>     of course people are going to mostly ignore that restriction.  The
>     point is, passing such strings to systems that don't expect them
>     is a bug, and the PEP should make it clear that it's the app's
>     bug, not the other system's.  On the other hand, using those
>     strings in a context of consenting adults (and I do mean
>     double-opt-in here) is perfectly acceptable.  I'm specifically
>     thinking of use in the Tahoe protocol discussed by Zooko
>     O'Whielacronx; it may not be usable there for backward
>     compatibility reasons, but "Unicode conformance" is not an issue
>     in principle.
> 
>     This does imply that programs that take advantage of the error
>     handler specified in this PEP are on their own if they accept data
>     from any sources that are not known to be Unicode-conforming.
>     OTOH, as far as I can see if other sources are known to be
>     Unicode-conforming, it's reasonably (but not perfectly) safe to combine
>     them with strings from this PEP (and of course use either 'utf8b'
>     or 'strict', as appropriate, when passing data out of Python).
> 
Should there be a function or method to check for conformance and
lone surrogates?
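
Something along these lines might do as a starting point. It is a
rough sketch with made-up names (lone_surrogates() and
is_interchange_safe()), and it assumes a build where a str is a
sequence of code points (a wide build, or Python 3.3 and later), so
that any surrogate found in a string is necessarily a lone one. It
checks for surrogates only, not for full Unicode conformance.

```python
def lone_surrogates(s):
    """Return the indexes of surrogate code points (U+D800..U+DFFF) in s.

    Hypothetical helper; assumes str holds code points, not UTF-16 units.
    """
    return [i for i, ch in enumerate(s) if 0xD800 <= ord(ch) <= 0xDFFF]


def is_interchange_safe(s):
    """Return True if s encodes to strict UTF-8, i.e. has no surrogates."""
    try:
        s.encode("utf-8")  # the default 'strict' handler rejects surrogates
    except UnicodeEncodeError:
        return False
    return True


assert lone_surrogates("abc\udcff.txt") == [3]
assert lone_surrogates("plain") == []
assert not is_interchange_safe("abc\udcff.txt")
assert is_interchange_safe("caf\u00e9")
```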

