[Python-Dev] PEP 383 update: utf8b is now the error handler
MRAB
google at mrabarnett.plus.com
Tue May 5 17:25:46 CEST 2009
Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
>
> > I've updated the PEP accordingly.
>
> I have three substantive comments. First, although consequences for
> Python 3 byte interfaces (i.e., "none") are explicitly stated, as far as
> I can see this PEP could apply to Python 2 as well. I don't think
> it's intended that way. Either way, I think you should clarify that
> point.
>
> Second, I suggest "surrogate-replace" as the name of the error handler
> rather than "utf8b". (Elsewhere I've suggested others, but I think
> this is the best of the bunch.)
>
+1
> Third, it is not clear to me why non-decodable ASCII should be an
> error. There are plenty of low surrogates for the purpose. Is there
> another technical reason? Stupid or not, Shift-JIS- and Big5-encoded
> file systems are quite common in Asia still (including non-rewritable
> media). I think surrogate-replacement of ASCII should at least be an
> option.
>
> I don't think "people shouldn't be using non-ASCII-compatible
> encodings for locale encodings" is a sufficient rationale for a hard
> error here. I mean, of course they *should* be using UTF-8. Maybe
> Python 3.1 should just go ahead and error on any other encoding on
> POSIX platforms? <wink>
>
I don't see why the error handler couldn't in principle be used with
encodings other than UTF-8, although in that case all of the low
surrogates should be open to use.
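A rough sketch of what I mean, not a statement of how the PEP will end up.
It assumes the handler is reachable under the name 'surrogateescape' (the
name it carries in CPython 3.1 and later; this thread calls it 'utf8b').
The helper name roundtrip and the sample file name are made up for
illustration.

```python
# Assumption: the handler is registered as 'surrogateescape'
# (CPython 3.1+); the thread refers to it as 'utf8b'.

def roundtrip(raw, encoding):
    """Decode raw bytes with the handler, then encode them back."""
    text = raw.decode(encoding, "surrogateescape")
    back = text.encode(encoding, "surrogateescape")
    return text, back

raw = b"caf\xe9.txt"  # Latin-1 bytes; 0xE9 is invalid here in UTF-8 and ASCII

for encoding in ("utf-8", "ascii"):
    text, back = roundtrip(raw, encoding)
    # The undecodable byte 0xE9 is carried as the lone low surrogate U+DCE9.
    assert text == "caf\udce9.txt"
    # Encoding with the same handler restores the original bytes.
    assert back == raw
```

In this sketch only bytes 0x80..0xFF are escaped (to U+DC80..U+DCFF), so
it does not touch the non-decodable ASCII case raised above.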
> I have a number of nitpicking comments and technical clarifications on
> the PEP. Rationale is in footnotes. There were also a few typos I
> noticed.
>
> 1. There is no such thing as a "half-surrogate" in Unicode. "Lone
> surrogate" is clear enough. Or for somewhat fancier English,
> "isolated surrogate" or "non-syntactic surrogate". To emphasize
> that Python codecs will only produce them in contexts where a
> Unicode character or high surrogate (for UTF-16 Python) is
> syntactically required, "isolated low surrogate" or "isolated
> trailing surrogate" might be good.[1]
>
> 2. The specification should state, and the discussion emphasize, that
> strings which were produced by surrogate replacement *must not* be
> used in data interchange with systems that do not specifically
> accept such strings, and that this is the responsibility of the
> application.[2]
>
> Rather than saying that "dealing with such conflicts is out of
> scope of this PEP", I would say
>
> """Dealing with such conflicts is the responsibility of the
> application. Since this PEP's mechanism produces valid Unicode
> where possible, and produces *invalid* code points only via the
> error handler, one strategy is for the application to validate all
> other sources of strings as Unicode conforming. There may be
> other useful application-specific strategies, as well."""
>
> 3. In the discussion, the transition from the example of alternative
> use of 'python-escape' to discussion of the error handler
> interface extension is a bit abrupt. I suggest rewriting as:
>
> """The extension to the encode error handler interface proposed by
> this PEP is necessary to implement the 'utf8b' error handler,
> because there are required byte sequences which cannot be
> generated from replacement Unicode. However, the encode error
> handler interface presently requires replacement Unicode to be
> provided in lieu of the non-encodable Unicode from the source
> string. Then it promptly encodes that replacement Unicode. In
> some error handlers, such as the 'utf8b' proposed here, it is also
> simpler and more efficient for the error handler to provide a
> pre-encoded replacement byte string, rather than forcing it to
> calculate Unicode from which the encoder would create the
> desired bytes."""
>
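To make the quoted point concrete, here is a minimal sketch of an encode
error handler that hands back pre-encoded bytes instead of replacement
Unicode. It relies on codecs.register_error and on the Python 3 rule that
an encode handler may return bytes as the replacement. The function name
escape_to_bytes and the handler name 'demo-escape' are hypothetical, and
this is not the PEP's actual implementation.

```python
import codecs

def escape_to_bytes(exc):
    """Hypothetical handler: map U+DC80..U+DCFF back to bytes 0x80..0xFF."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            out.append(code - 0xDC00)  # recover the original byte value
        else:
            raise exc  # anything else stays a hard error
    # Returning bytes lets the encoder copy them straight into the output.
    return bytes(out), exc.end

# 'demo-escape' is an illustrative name, not one proposed in the PEP.
codecs.register_error("demo-escape", escape_to_bytes)

encoded = "caf\udce9".encode("utf-8", "demo-escape")
assert encoded == b"caf\xe9"
```

No replacement Unicode could make the UTF-8 encoder emit a bare 0xE9
byte, which is why the handler has to supply the bytes itself.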
> Typos (line references are to pep-0383.txt svn r72332):
>
> l. 86: "Byte-orientied" -> "Byte-oriented"
> l. 98, 118, 124, 127, 132, 136: "python-escape" -> "utf8b"
> l. 130: "provide" -> "provided"
> l. 134: "calculating" -> "calculate"
>
>
> Footnotes:
> [1] Unicode 5.0 uses the terms "high-half" and "low-half" at least
> once, in section 16.6, but the context is such that I take it to
> refer to "half of the surrogate area". Section 3.8 doesn't use
> these, instead noting that "leading" and "trailing" are sometimes
> used instead of "high" and "low". Better to avoid the word "half"
> in PEP 383, I think.
>
"Leading" and "trailing" simply state the order, not the set ("high" or
"low"), so they are not good terms to use.
> [2] Since this error handler is going to be the default for POSIX I/O,
> of course people are going to mostly ignore that restriction. The
> point is, passing such strings to systems that don't expect them
> is a bug, and the PEP should make it clear that it's the app's
> bug, not the other system's. On the other hand, using those
> strings in a context of consenting adults (and I do mean
> double-opt-in here) is perfectly acceptable. I'm specifically
> thinking of use in the Tahoe protocol discussed by Zooko
> O'Whielacronx; it may not be usable there for backward
> compatibility reasons, but "Unicode conformance" is not an issue
> in principle.
>
> This does imply that programs that take advantage of the error
> handler specified in this PEP are on their own if they accept data
> from any sources that are not known to be Unicode-conforming.
> OTOH, as far as I can see if other sources are known to be Unicode
> conformant, it's reasonably (but not perfectly) safe to combine
> them with strings from this PEP (and of course use either 'utf8b'
> or 'strict', as appropriate, when passing data out of Python).
>
Should there be a function or method to check for conformance and
lone surrogates?
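Something along these lines, perhaps. This is only a sketch: the names
find_lone_surrogates and is_interchange_safe are invented here, and it
assumes a str in which every code point is stored individually (as in
CPython 3.3 and later), so any surrogate found is by definition a lone
one. On a UTF-16 build, properly paired surrogates would have to be
skipped. It also checks only for surrogates, not full Unicode conformance.

```python
def find_lone_surrogates(text):
    """Return the indexes of surrogate code points (U+D800..U+DFFF)."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]

def is_interchange_safe(text):
    """True if the string encodes as strict UTF-8 (no lone surrogates)."""
    try:
        text.encode("utf-8", "strict")
    except UnicodeEncodeError:
        return False
    return True

# A name produced by surrogate replacement versus an ordinary one.
assert find_lone_surrogates("caf\udce9.txt") == [3]
assert find_lone_surrogates("caf\u00e9.txt") == []
assert not is_interchange_safe("caf\udce9.txt")
assert is_interchange_safe("caf\u00e9.txt")
```

An application could run such a check before passing strings to a system
that has not opted in to receiving them.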