[Python-ideas] Support WHATWG versions of legacy encodings

Stephen J. Turnbull turnbull.stephen.fw at u.tsukuba.ac.jp
Mon Jan 22 01:43:37 EST 2018

I don't expect to change your mind about the "right" way to deal with
this, but this is a more explicit description of what those of us who
advocate error handlers are thinking about.  It may be useful in
writing your PEP (PEPs describe rejected counterproposals and
amendments along with adopted proposals and rationale in either case).

Rob Speer writes:

 > > The question to my mind is whether or not this "latin1replace" handler,
 > > in conjunction with existing codecs, will do the same thing as the
 > > WHATWG codecs. If I have understood you correctly, I think it will. Have
 > > I missed something?
 > It won't do the same thing, and neither will the "chaining coders"
 > proposal.

The "chaining coders" proposal isn't well-enough specified to be sure.

However, for practical purposes you may think of a Python *codec* as a
"whole array" decoder/encoder, and an *error handler* as a "token-by-
token" decoder/encoder.  The distinction in type is for efficiency, of
course.  Codecs can't be "chained" (I think, but I didn't think very
hard), but handlers can, in the sense that each handler can handle
some input values and delegate anything it can't deal with to the next
handler in the chain (under the hood, handler implementations are just
Python functions with a particular signature, so this is just "loop
until non-None").

 > It's easy to miss details like this in all the counterproposals.

I see no reason why a 'whatwgreplace' error handler with the logic

    # I am assuming decoding, and single-byte encodings.  Encoding
    # with 'html' error mode would insert "&#%d;" % ord(char).
    # Multibyte is a little harder.

    # ASCII bytes never error except maybe in UTF-16, UTF-32, Shift JIS
    # and Big5.
    assert the_byte >= 0x80
    # Handle C1 control characters.
    if the_byte < 0xA0:
        append_to_output(chr(the_byte))
    # Handle extended repertoire with a dict.
    # This condition will depend on the particular codec.
    elif the_byte in additional_code_points:
        append_to_output(additional_code_points[the_byte])
    # Implement WHATWG error modes.
    elif whatwg_error_mode is replacement:
        append_to_output("\uFFFD")
    else:  # 'fatal' error mode
        raise the_error

doesn't have the effect you want.  This can be done in pure Python.
(Note: The actions in the pseudocode are not accurate.  IIRC real
handlers take a UnicodeError as argument, and return a tuple of the
text to append to output and number of input tokens to skip, or
return None to indicate an unhandled error, rather than doing the
appending and raising themselves.)
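
For concreteness, here is a minimal sketch of such a handler written
against the real `codecs.register_error` API.  The handler name
`whatwg_c1_sketch`, the factory `make_whatwg_handler`, and its
`additional_code_points` and `fatal` parameters are hypothetical, chosen
only for illustration.  In CPython the handler returns the replacement
text together with the input position at which decoding resumes, and it
signals an unhandled error by raising, so the sketch follows that
convention.

```python
import codecs


def make_whatwg_handler(additional_code_points=None, fatal=False):
    """Build a decode error handler following the pseudocode's logic.

    additional_code_points: hypothetical per-codec dict mapping a byte
    value to its replacement string (empty by default).
    fatal: if True, mimic the WHATWG 'fatal' mode and re-raise;
    otherwise mimic 'replacement' mode and emit U+FFFD.
    """
    extra = additional_code_points or {}

    def handler(exc):
        # This sketch covers decoding of single-byte encodings only.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        byte = exc.object[exc.start]
        if 0x80 <= byte < 0xA0:
            # C1 range: map to the same-numbered control character.
            replacement = chr(byte)
        elif byte in extra:
            replacement = extra[byte]
        elif not fatal:
            replacement = "\ufffd"
        else:
            raise exc
        # Resume decoding immediately after the offending byte.
        return replacement, exc.start + 1

    return handler


codecs.register_error("whatwg_c1_sketch", make_whatwg_handler())

# 0x81 is unassigned in Python's cp1252; 0xE9, 0x93 and 0x94 are assigned.
decoded = b"caf\xe9 \x81 \x93ok\x94".decode("cp1252", errors="whatwg_c1_sketch")
```

The assigned bytes are decoded by the codec itself; only the unassigned
byte 0x81 reaches the handler.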

The main objection to doing it this way would be efficiency.  To be
honest, I personally don't think that's an important objection, since
this handler would be invoked frequently only if the source text is
badly broken.  (Remember, you'll already be greatly expanding the
repertoire of at least ASCII and ISO 8859-1 by promoting to
windows-1252.)  And it would surely be "fast enough" if written in C.
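
To illustrate the point about invocation frequency, the following
sketch counts how often a handler is called.  The handler name
`counting_replace_sketch` and the test data are made up for
illustration; the sketch only shows that the codec handles the bulk of
the input and the Python-level handler runs once per undecodable byte.

```python
import codecs

# Records the input position of every error the handler is asked about.
calls = []


def counting_replace(exc):
    """Replace each undecodable byte with U+FFFD and log the call."""
    calls.append(exc.start)
    return "\ufffd", exc.end


codecs.register_error("counting_replace_sketch", counting_replace)

# 20,000 decodable bytes plus two bytes that Python's cp1252 leaves
# unassigned (0x81 and 0x8D).
data = b"x" * 10_000 + b"\x81" + b"y" * 10_000 + b"\x8d"
text = data.decode("cp1252", errors="counting_replace_sketch")

# The handler ran twice, not 20,002 times.
print(len(text), len(calls))
```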

Caveat: I'm not sure I agree with MAL about windows-1255.  I think
it's arguable that the WHAT-WG index is a better approximation to
reality, and I'd like to hear Hebrew speakers argue about that (I'm
not one).

 > The difference between WHATWG encodings and the ones in Python is,
 > in all but one case, *only* in the C1 control character range (0x80
 > to 0x9F),

Also in Japanese, where "corporate characters" have been added
(frequently twice, preventing round-tripping ... yuck) to the JIS
standard.  I haven't checked the Chinese and Korean tables for similar
damage, but they're not quite as wacky about this stuff as the JISC
is, so they're probably OK (and of course Big5 was "corporate" from
the get-go).

 > a range of Unicode characters that has historically evaded
 > standardization because they never had a clear purpose even before
 > Unicode.  Filling in all the gaps with Latin-1

That's wrong, as you explain:

 > [Eg, in Greek, some code points] are simply unassigned. Other
 > software sometimes maps them to the Private Use Area, but this is
 > not standardized at all, and it seems clear that Python should
 > handle them with its usual error handler for unassigned
 > bytes. (Which is one of the reasons not to replace the error
 > handler with something different: we still need the error handler.)

The logic above handles all this.  As mentioned, a stdlib error
handler ('strict', 'replace', or 'xmlcharrefreplace' for WHAT-WG
conformance, or 'surrogateescape' for the Pythonic equivalent of
mapping to the private area) could be chained if desired, and the
defaults could be changed and the names aliased to the WHAT-WG terms.

This could be automated with a factory function that takes a list of
predefined handlers and composes them, although that would add another
layer of inefficiency (the composition would presumably be done in a
loop, and possibly using try although I think the error handler
convention is to return the text to insert if handled, and None if the
error can't be handled).
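
A minimal sketch of such a factory follows.  The names
`compose_handlers`, `c1_passthrough`, `replacement_fallback`, and the
registered handler names are hypothetical.  The partial handlers use the
"return None if unhandled" convention described above; because CPython
itself expects a registered handler to return a tuple or raise, the
composed handler raises the original error when every partial handler
declines.

```python
import codecs


def compose_handlers(*partial_handlers):
    """Return one error handler that tries each partial handler in order.

    Each partial handler returns (replacement, resume_position) if it
    handles the error, or None to delegate to the next one.
    """
    def composed(exc):
        for handler in partial_handlers:
            result = handler(exc)
            if result is not None:
                return result
        # Nobody handled it: behave like 'strict'.
        raise exc

    return composed


def c1_passthrough(exc):
    """Handle only undecodable bytes in the C1 range 0x80..0x9F."""
    if isinstance(exc, UnicodeDecodeError):
        byte = exc.object[exc.start]
        if 0x80 <= byte < 0xA0:
            return chr(byte), exc.start + 1
    return None


def replacement_fallback(exc):
    """Handle any remaining decode error with U+FFFD."""
    if isinstance(exc, UnicodeDecodeError):
        return "\ufffd", exc.end
    return None


codecs.register_error(
    "c1_then_replace_sketch",
    compose_handlers(c1_passthrough, replacement_fallback),
)
codecs.register_error("c1_then_strict_sketch", compose_handlers(c1_passthrough))

c1_result = b"\x81".decode("cp1252", errors="c1_then_replace_sketch")
fallback_result = b"\xe9".decode("ascii", errors="c1_then_replace_sketch")
```

With `c1_then_strict_sketch`, a non-C1 byte such as `b"\xe9"` decoded as
ASCII still raises `UnicodeDecodeError`, since no partial handler in
that chain accepts it.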

