[Python-Dev] PEP 383 update: utf8b is now the error handler
Stephen J. Turnbull
stephen at xemacs.org
Tue May 5 16:57:36 CEST 2009
"Martin v. Löwis" writes:
> I've updated the PEP accordingly.
I have three substantive comments. First, although consequences for
Python 3 byte interfaces (i.e., "none") are explicitly stated, as far as
I can see this PEP could apply to Python 2 as well. I don't think
it's intended that way. Either way, I think you should clarify that
point.
Second, I suggest "surrogate-replace" as the name of the error handler
rather than "utf8b". (Elsewhere I've suggested others, but I think
this is the best of the bunch.)
Third, it is not clear to me why non-decodable ASCII should be an
error. There are plenty of low surrogates for the purpose. Is there
another technical reason? Stupid or not, Shift-JIS- and Big5-encoded
file systems are quite common in Asia still (including non-rewritable
media). I think surrogate-replacement of ASCII should at least be an
option.
I don't think "people shouldn't be using non-ASCII-compatible
encodings for locale encodings" is a sufficient rationale for a hard
error here. I mean, of course they *should* be using UTF-8. Maybe
Python 3.1 should just go ahead and error on any other encoding on
POSIX platforms? <wink>
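For concreteness, here is a minimal sketch of the behavior under discussion. It assumes Python 3.1 or later, where the handler called "utf8b" in this thread was eventually registered under the name "surrogateescape". The sample byte strings and the helper name ascii_byte_rejected are mine, for illustration only.

```python
# Assumption: Python 3.1+, where the PEP 383 handler is named "surrogateescape".

# A Latin-1 encoded name that is not valid UTF-8.
raw = b"caf\xe9"

# The undecodable byte 0xE9 becomes the lone low surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
roundtrip = name.encode("utf-8", "surrogateescape")


def ascii_byte_rejected():
    """Return True if an undecodable byte below 0x80 is a hard error.

    The trailing b"b" is a truncated UTF-16 code unit, so it cannot be
    decoded, and because it is below 0x80 the handler declines to escape it.
    """
    try:
        b"a\x00b".decode("utf-16-le", "surrogateescape")
    except UnicodeDecodeError:
        return True
    return False
```

The second half is the case my third comment is about: with an encoding that is not ASCII-compatible, an undecodable byte in the ASCII range has no escape at all.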
I have a number of nitpicking comments and technical clarifications on
the PEP. Rationale is in footnotes. There were also a few typos I
noticed.
1. There is no such thing as a "half-surrogate" in Unicode. "Lone
surrogate" is clear enough. Or for somewhat fancier English,
"isolated surrogate" or "non-syntactic surrogate". To emphasize
that Python codecs will only produce them in contexts where a
Unicode character or high surrogate (for UTF-16 Python) is
syntactically required, "isolated low surrogate" or "isolated
trailing surrogate" might be good.[1]
2. The specification should state, and the discussion emphasize, that
strings which were produced by surrogate replacement *must not* be
used in data interchange with systems that do not specifically
accept such strings, and that this is the responsibility of the
application.[2]
Rather than saying that "dealing with such conflicts is out of
scope of this PEP", I would say
"""Dealing with such conflicts is the responsibility of the
application. Since this PEP's mechanism produces valid Unicode
where possible, and produces *invalid* code points only via the
error handler, one strategy is for the application to validate all
other sources of strings as Unicode conforming. There may be
other useful application-specific strategies, as well."""
3. In the discussion, the transition from the example of alternative
use of 'python-escape' to discussion of the error handler
interface extension is a bit abrupt. I suggest rewriting as:
"""The extension to the encode error handler interface proposed by
this PEP is necessary to implement the 'utf8b' error handler,
because there are required byte sequences which cannot be
generated from replacement Unicode. However, the encode error
handler interface presently requires replacement Unicode to be
provided in lieu of the non-encodable Unicode from the source
string. Then it promptly encodes that replacement Unicode. In
some error handlers, such as the 'utf8b' proposed here, it is also
simpler and more efficient for the error handler to provide a
pre-encoded replacement byte string, rather than forcing it to
calculate Unicode from which the encoder would create the
desired bytes."""
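To illustrate point 3, here is a rough sketch of an encode error handler that returns pre-encoded bytes instead of replacement Unicode. It assumes a Python 3 release in which codecs.register_error accepts a bytes replacement from an encode handler, which is the extension described above. The handler name "demo-escape" and the function demo_escape are hypothetical; this is not the PEP's implementation.

```python
import codecs


def demo_escape(exc):
    """Hypothetical handler: turn U+DC80..U+DCFF back into raw bytes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        if not 0xDC80 <= cp <= 0xDCFF:
            # Not produced by surrogate replacement; keep it an error.
            raise exc
        out.append(cp - 0xDC00)
    # Return bytes rather than str; the encoder copies them to the output.
    return bytes(out), exc.end


codecs.register_error("demo-escape", demo_escape)

encoded = "caf\udce9".encode("utf-8", "demo-escape")
```

No replacement Unicode string could make the UTF-8 encoder emit the single byte 0xE9, which is why the handler has to hand back bytes directly.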
Typos (line references are to pep-0383.txt svn r72332):
l. 86: "Byte-orientied" -> "Byte-oriented"
l. 98, 118, 124, 127, 132, 136: "python-escape" -> "utf8b"
l. 130: "provide" -> "provided"
l. 134: "calculating" -> "calculate"
Footnotes:
[1] Unicode 5.0 uses the terms "high-half" and "low-half" at least
once, in section 16.6, but the context is such that I take it to
refer to "half of the surrogate area". Section 3.8 doesn't use
these, instead noting that "leading" and "trailing" are sometimes
used instead of "high" and "low". Better to avoid the word "half"
in PEP 383, I think.
[2] Since this error handler is going to be the default for POSIX I/O,
of course people are going to mostly ignore that restriction. The
point is, passing such strings to systems that don't expect them
is a bug, and the PEP should make it clear that it's the app's
bug, not the other system's. On the other hand, using those
strings in a context of consenting adults (and I do mean
double-opt-in here) is perfectly acceptable. I'm specifically
thinking of use in the Tahoe protocol discussed by Zooko
O'Whielacronx; it may not be usable there for backward
compatibility reasons, but "Unicode conformance" is not an issue
in principle.
This does imply that programs that take advantage of the error
handler specified in this PEP are on their own if they accept data
from any sources that are not known to be Unicode-conforming.
OTOH, as far as I can see if other sources are known to be Unicode
conformant, it's reasonably (but not perfectly) safe to combine
them with strings from this PEP (and of course use either 'utf8b'
or 'strict', as appropriate, when passing data out of Python).
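One way an application might police that boundary is sketched below. The function names are hypothetical, and the check relies only on the strict UTF-8 codec in Python 3 refusing to encode surrogate code points.

```python
def carries_escaped_bytes(s):
    """True if s contains a code point in U+DC80..U+DCFF.

    That is the range used for bytes smuggled in by surrogate replacement.
    """
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in s)


def is_interchange_safe(s):
    """True if s encodes under strict UTF-8, i.e. has no lone surrogates."""
    try:
        s.encode("utf-8", "strict")
    except UnicodeEncodeError:
        return False
    return True
```

An application would call something like is_interchange_safe before handing a string to a system that has not opted in, and fall back to the escaping handler only among consenting adults.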