[Python-Dev] PEP 383 update: utf8b is now the error handler

Tue May 5 16:57:36 CEST 2009

"Martin v. Löwis" writes:

 > I've updated the PEP accordingly.

I have three substantive comments.  First, although consequences for
Python 3 byte interfaces (i.e., "none") are explicitly stated, as far as
I can see this PEP could apply to Python 2 as well.  I don't think
it's intended that way.  Either way, I think you should clarify that
point.

Second, I suggest "surrogate-replace" as the name of the error handler
rather than "utf8b".  (Elsewhere I've suggested others, but I think
this is the best of the bunch.)

Third, it is not clear to me why non-decodable ASCII should be an
error.  There are plenty of low surrogates for the purpose.  Is there
another technical reason?  Stupid or not, Shift-JIS- and Big5-encoded
file systems are still quite common in Asia (including on
non-rewritable media).  I think surrogate-replacement of ASCII should
at least be an option.

I don't think "people shouldn't be using non-ASCII-compatible
encodings for locale encodings" is a sufficient rationale for a hard
error here.  I mean, of course they *should* be using UTF-8.  Maybe
Python 3.1 should just go ahead and error on any other encoding on
POSIX platforms? <wink>

I have a number of nitpicking comments and technical clarifications on
the PEP.  Rationale is in footnotes.  There were also a few typos I
noticed.

1.  There is no such thing as a "half-surrogate" in Unicode.  "Lone
    surrogate" is clear enough.  Or for somewhat fancier English,
    "isolated surrogate" or "non-syntactic surrogate".  To emphasize
    that Python codecs will only produce them in contexts where a
    Unicode character or high surrogate (for UTF-16 Python) is
    syntactically required, "isolated low surrogate" or "isolated
    trailing surrogate" might be good.[1]

2.  The specification should state, and the discussion emphasize, that
    strings which were produced by surrogate replacement *must not* be
    used in data interchange with systems that do not specifically
    accept such strings, and that this is the responsibility of the
    application.[2]

    Rather than saying that "dealing with such conflicts is out of
    scope of this PEP", I would say

    """Dealing with such conflicts is the responsibility of the
    application.  Since this PEP's mechanism produces valid Unicode
    where possible, and produces *invalid* code points only via the
    error handler, one strategy is for the application to validate all
    other sources of strings as Unicode-conforming.  There may be
    other useful application-specific strategies, as well."""

3.  In the discussion, the transition from the example of alternative
    use of 'python-escape' to discussion of the error handler
    interface extension is a bit abrupt.  I suggest rewriting as:

    """The extension to the encode error handler interface proposed by
    this PEP is necessary to implement the 'utf8b' error handler,
    because there are required byte sequences which cannot be
    generated from replacement Unicode.  However, the encode error
    handler interface presently requires replacement Unicode to be
    provided in lieu of the non-encodable Unicode from the source
    string.  Then it promptly encodes that replacement Unicode.  In
    some error handlers, such as the 'utf8b' proposed here, it is also
    simpler and more efficient for the error handler to provide a
    pre-encoded replacement byte string, rather than forcing it to
    calculate Unicode from which the encoder would create the
    desired bytes."""
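
To make the interface point in item 3 concrete, here is a minimal
sketch in Python 3.  It assumes an interpreter in which an encode
error handler may return bytes as well as str, which is the extension
the PEP proposes.  The handler name 'demo-escaped-bytes' and the
function bytes_for_escaped are hypothetical; this illustrates the
idea and is not the PEP's implementation.

```python
import codecs


def bytes_for_escaped(exc):
    """Hypothetical encode-side handler that answers with bytes, not str."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        # Only U+DC80..U+DCFF are taken to stand for an escaped byte 0x80..0xFF.
        if not 0xDC80 <= code <= 0xDCFF:
            raise exc
        out.append(code - 0xDC00)
    # Returning bytes: the encoder copies them into the output unchanged,
    # so no intermediate "replacement Unicode" has to be invented.
    return bytes(out), exc.end


codecs.register_error("demo-escaped-bytes", bytes_for_escaped)

# A file name whose original byte 0xE9 was not decodable as UTF-8.
name = "caf\udce9.txt"
raw = name.encode("utf-8", "demo-escaped-bytes")
```

With a str-only interface the handler could not express the byte 0xE9
at all, since no Unicode string encodes to that single byte in UTF-8.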

Typos (line references are to pep-0383.txt svn r72332):

l.  86: "Byte-orientied" -> "Byte-oriented"
l.  98, 118, 124, 127, 132, 136: "python-escape" -> "utf8b"
l. 130: "provide" -> "provided"
l. 134: "calculating" -> "calculate"

Footnotes: 
[1] Unicode 5.0 uses the terms "high-half" and "low-half" at least
    once, in section 16.6, but the context is such that I take it to
    refer to "half of the surrogate area".  Section 3.8 doesn't use
    these, instead noting that "leading" and "trailing" are sometimes
    used instead of "high" and "low".  Better to avoid the word "half"
    in PEP 383, I think.

[2] Since this error handler is going to be the default for POSIX I/O,
    of course people are going to mostly ignore that restriction.  The
    point is, passing such strings to systems that don't expect them
    is a bug, and the PEP should make it clear that it's the app's
    bug, not the other system's.  On the other hand, using those
    strings in a context of consenting adults (and I do mean
    double-opt-in here) is perfectly acceptable.  I'm specifically
    thinking of use in the Tahoe protocol discussed by Zooko
    O'Whielacronx; it may not be usable there for backward
    compatibility reasons, but "Unicode conformance" is not an issue
    in principle.

    This does imply that programs that take advantage of the error
    handler specified in this PEP are on their own if they accept data
    from any sources that are not known to be Unicode-conforming.
    OTOH, as far as I can see, if other sources are known to be
    Unicode-conforming, it's reasonably (but not perfectly) safe to combine
    them with strings from this PEP (and of course use either 'utf8b'
    or 'strict', as appropriate, when passing data out of Python).
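
One way an application might enforce this at its boundary is sketched
below.  It relies on a strict UTF-8 encode rejecting lone surrogates,
as it does in Python 3.  The names has_lone_surrogates and export are
hypothetical, and other application-specific checks may suit better.

```python
def has_lone_surrogates(text):
    """Return True if text cannot be encoded as strict UTF-8."""
    try:
        text.encode("utf-8", "strict")
    except UnicodeEncodeError:
        return True
    return False


def export(text):
    """Hypothetical boundary function for data leaving the application."""
    if has_lone_surrogates(text):
        # The peer has not opted in to escaped bytes, so refuse here
        # rather than hand it a non-conforming string.
        raise ValueError("string carries lone surrogates; not for interchange")
    return text.encode("utf-8")
```

A peer that has explicitly opted in would bypass export and encode
with the error handler instead.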