Re: [Python-ideas] Support WHATWG versions of legacy encodings

Jan. 22, 2018

I don't really understand what you're doing when you take a fragment of my
sentence where I explain a wrong understanding of WHATWG encodings, and say
"that's wrong, as you explain". I know it's wrong. That's what I was saying.

You quoted the part where I said "Filling in all the gaps with Latin-1",
cut out the part where I said "is wrong", and replied with "that's wrong".
I guess I'm glad we're in agreement, but this has been a strange bit of
discourse.

In this pseudocode that implements a "whatwg_error_mode", can you describe
what the Python code to call it would look like? Does every call to .encode
and .decode now have a "whatwg_error_mode" parameter, in addition to the
"errors" parameter? Or are there twice as many possible strings you could
pass as the "errors" parameter, so you can have "replace",
"replace-whatwg", "surrogateescape", "surrogateescape-whatwg", etc?

My objection here isn't efficiency, it's adding confusing extra options to
.encode() and .decode() that aren't relevant in most cases.

I'd like to limit this proposal to single-byte encodings, addressing the
discrepancies in the C1 characters and possibly that Hebrew vowel point. If
there are differences in the JIS encodings, that is a can of worms I'd like
to not open at the moment.

-- Rob Speer

On Mon, 22 Jan 2018 at 01:43 Stephen J. Turnbull <
turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
...
I don't expect to change your mind about the "right" way to deal with
this, but this is a more explicit description of what those of us who
advocate error handlers are thinking about.  It may be useful in
writing your PEP (PEPs describe rejected counterproposals and
amendments along with adopted proposals and rationale in either case).
Rob Speer writes:
...
...
The question to my mind is whether or not this "latin1replace" handler,
in conjunction with existing codecs, will do the same thing as the
WHATWG codecs. If I have understood you correctly, I think it will.
Have I missed something?
It won't do the same thing, and neither will the "chaining coders"
proposal.
The "chaining coders" proposal isn't well-enough specified to be sure.
However, for practical purposes you may think of a Python *codec* as a
"whole array" decoder/encoder, and an *error handler* as a "token-by-
token" decoder/encoder.  The distinction in type is for efficiency, of
course.  Codecs can't be "chained" (I think, but I didn't think very
hard), but handlers can, in the sense that each handler can handle
some input values and delegate anything it can't deal with to the next
handler in the chain (under the hood handler implementations are just
Python functions with a particular signature, so this is just "loop
until non-None").
...
It's easy to miss details like this in all the counterproposals.
I see no reason why a 'whatwgreplace' error handler with the logic
    # I am assuming decoding, and single-byte encodings.  Encoding
    # with 'html' error mode would insert format("&#%d;", ord(unicode)).
    # Multibyte is a little harder.
    # ASCII bytes never error except maybe in UTF16, UTF32, Shift JIS
    # and Big5.
    assert the_byte >= 0x80
    # Handle C1 control characters.
    if the_byte < 0xA0:
        append_to_output(chr(the_byte))
    # Handle extended repertoire with a dict.
    # This condition will depend on the particular codec.
    elif the_byte in additional_code_points:
        append_to_output(additional_code_points[the_byte])
    # Implement WHATWG error modes.
    elif whatwg_error_mode is replacement:
        append_to_output("\uFFFD")
    else:
        raise
doesn't have the effect you want.  This can be done in pure Python.
(Note: The actions in the pseudocode are not accurate.  IIRC real
handlers take a UnicodeError as argument, and return a tuple of the
text to append to output and number of input tokens to skip, or
return None to indicate an unhandled error, rather than doing the
appending and raising themselves.)
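
As a rough illustration of that note, here is a minimal sketch of the
same logic written against the real handler signature: a function that
receives the UnicodeError and either returns a (replacement, resume
position) tuple or raises. The handler names "whatwg-sketch-strict" and
"whatwg-sketch-replace", the factory make_whatwg_handler, and the empty
ADDITIONAL_CODE_POINTS table are assumptions made up for this example;
they are not part of Python or of any proposal in this thread.

```python
import codecs

# Hypothetical per-codec table of extra byte -> str mappings.
# Left empty here: this sketch only covers the C1 range.
ADDITIONAL_CODE_POINTS = {}


def make_whatwg_handler(fallback="strict"):
    """Build a decoding error handler; `fallback` names a stdlib handler."""
    fallback_handler = codecs.lookup_error(fallback)

    def handler(exc):
        # Only decoding of single-byte encodings is considered here.
        if not isinstance(exc, UnicodeDecodeError):
            return fallback_handler(exc)
        the_byte = exc.object[exc.start]
        # C1 control characters: map the byte to the same code point.
        if 0x80 <= the_byte < 0xA0:
            return chr(the_byte), exc.start + 1
        # Extended repertoire, if the codec has any.
        if the_byte in ADDITIONAL_CODE_POINTS:
            return ADDITIONAL_CODE_POINTS[the_byte], exc.start + 1
        # Anything else goes to the chosen stdlib handler.
        return fallback_handler(exc)

    return handler


codecs.register_error("whatwg-sketch-strict", make_whatwg_handler("strict"))
codecs.register_error("whatwg-sketch-replace", make_whatwg_handler("replace"))

# Byte 0x81 is unassigned in Python's cp1252, so the handler is consulted.
print(repr(b"caf\xe9\x81".decode("cp1252", "whatwg-sketch-replace")))
```

In this sketch the calling code passes one more string to the existing
"errors" parameter, e.g. data.decode("cp1252", "whatwg-sketch-replace"),
rather than a separate parameter.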
The main objection to doing it this way would be efficiency.  To be
honest, I personally don't think that's an important objection since
this handler is frequently invoked only if the source text is badly
broken.  (Remember, you'll already be greatly expanding the repertoire
of at least ASCII and ISO 8859/1 by promoting to windows-1252.)  And
it would surely be "fast enough" if written in C.
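
A small sketch of why the cost should be modest: the codec decodes the
valid bytes itself, and the handler is called only for the bytes that
fail. The handler name "counting-replace-sketch" and the `calls` list
are hypothetical, introduced only to count invocations.

```python
import codecs

calls = []


def counting_replace(exc):
    # Record where the codec gave up, then behave like 'replace'.
    calls.append(exc.start)
    return "\ufffd", exc.end


codecs.register_error("counting-replace-sketch", counting_replace)

# Many valid bytes, one byte (0x81) that cp1252 leaves unassigned.
data = b"plain ASCII, then caf\xe9, then one bad byte: \x81"
text = data.decode("cp1252", "counting-replace-sketch")
print(len(data), "bytes decoded;", len(calls), "handler call(s)")
```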
Caveat: I'm not sure I agree with MAL about windows-1255.  I think
it's arguable that the WHAT-WG index is a better approximation to
reality, and I'd like to hear Hebrew speakers argue about that (I'm
not one).
...
The difference between WHATWG encodings and the ones in Python is,
in all but one case, *only* in the C1 control character range (0x80
to 0x9F),
Also in Japanese, where "corporate characters" have been added
(frequently twice, preventing round-tripping ... yuck) to the JIS
standard.  I haven't checked the Chinese and Korean tables for similar
damage, but they're not quite as wacky about this stuff as the JISC
is, so they're probably OK (and of course Big5 was "corporate" from
the get-go).
...
a range of Unicode characters that has historically evaded
standardization because they never had a clear purpose even before
Unicode.  Filling in all the gaps with Latin-1
That's wrong, as you explain:
...
[Eg, in Greek, some code points] are simply unassigned. Other
software sometimes maps them to the Private Use Area, but this is
not standardized at all, and it seems clear that Python should
handle them with its usual error handler for unassigned
bytes. (Which is one of the reasons not to replace the error
handler with something different: we still need the error handler.)
The logic above handles all this.  As mentioned, a stdlib error
handler ('strict', 'replace', or 'xmlcharrefreplace' for WHAT-WG
conformance, or 'surrogateescape' for the Pythonic equivalent of
mapping to the private area) could be chained if desired, and the
defaults could be changed and the names aliased to the WHAT-WG terms.
This could be automated with a factory function that takes a list of
predefined handlers and composes them, although that would add another
layer of inefficiency (the composition would presumably be done in a
loop, and possibly using try although I think the error handler
convention is to return the text to insert if handled, and None if the
error can't be handled).
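
A minimal sketch of such a factory, under one loud assumption: the
sub-handlers here follow the "return None if unhandled" convention
described above, which is a convention of this sketch only. A handler
actually registered with codecs.register_error must return a tuple or
raise, so the composed handler re-raises when every sub-handler
declines. All names (compose_handlers, c1_passthrough,
replace_everything, and the registered handler names) are hypothetical.

```python
import codecs


def compose_handlers(*handlers):
    """Return one error handler that tries each sub-handler in order."""

    def composed(exc):
        for handler in handlers:
            result = handler(exc)
            if result is not None:  # this sub-handler dealt with it
                return result
        raise exc  # nobody handled it: behave like 'strict'

    return composed


def c1_passthrough(exc):
    # Handle only C1 bytes (0x80-0x9F) while decoding; decline the rest.
    if isinstance(exc, UnicodeDecodeError):
        the_byte = exc.object[exc.start]
        if 0x80 <= the_byte < 0xA0:
            return chr(the_byte), exc.start + 1
    return None


def replace_everything(exc):
    # Replace any undecodable input with U+FFFD; decline encoding errors.
    if isinstance(exc, UnicodeDecodeError):
        return "\ufffd", exc.end
    return None


codecs.register_error(
    "c1-then-replace-sketch",
    compose_handlers(c1_passthrough, replace_everything),
)
codecs.register_error("c1-only-sketch", compose_handlers(c1_passthrough))

print(repr(b"\x81".decode("cp1252", "c1-then-replace-sketch")))
```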
Steve