[Python-ideas] Support WHATWG versions of legacy encodings

Rob Speer rspeer at luminoso.com
Mon Jan 22 14:32:04 EST 2018


I don't really understand what you're doing when you take a fragment of my
sentence where I explain a wrong understanding of WHATWG encodings, and say
"that's wrong, as you explain". I know it's wrong. That's what I was saying.

You quoted the part where I said "Filling in all the gaps with Latin-1",
cut out the part where I said "is wrong", and replied with "that's wrong".
I guess I'm glad we're in agreement, but this has been a strange bit of
discourse.

In this pseudocode that implements a "whatwg_error_mode", can you describe
what the Python code to call it would look like? Does every call to .encode
and .decode now have a "whatwg_error_mode" parameter, in addition to the
"errors" parameter? Or are there twice as many possible strings you could
pass as the "errors" parameter, so you can have "replace",
"replace-whatwg", "surrogateescape", "surrogateescape-whatwg", etc?

My objection here isn't efficiency, it's adding confusing extra options to
.encode() and .decode() that aren't relevant in most cases.
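
To make the question concrete, here is a minimal sketch of what the calling
code might look like if the C1 logic were an ordinary registered error
handler. The handler name "c1-passthrough" and the function c1_passthrough
are hypothetical, chosen only for illustration; nothing in this thread
defines them. The sketch covers decoding with single-byte codecs only.

```python
import codecs


def c1_passthrough(exc):
    """Hypothetical decode-only handler (an assumption, not a stdlib feature).

    Maps undecodable bytes in 0x80-0x9F to the C1 control characters
    U+0080-U+009F and re-raises every other error unchanged.
    """
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= b < 0xA0 for b in bad):
            # Return the replacement text and the position to resume from.
            return "".join(chr(b) for b in bad), exc.end
    raise exc


# The name is arbitrary; it is what callers pass as errors=...
codecs.register_error("c1-passthrough", c1_passthrough)

# 0x81 is unassigned in Python's cp1252; 0xE9, 0x93 and 0x94 are assigned.
data = b"caf\xe9 \x81 \x93quoted\x94"
text = data.decode("cp1252", errors="c1-passthrough")
```

Because "errors" holds a single name, pairing this behaviour with "replace"
or "surrogateescape" would need a separately registered name for each
combination, which is the doubling of options asked about above.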

I'd like to limit this proposal to single-byte encodings, addressing the
discrepancies in the C1 characters and possibly that Hebrew vowel point. If
there are differences in the JIS encodings, that is a can of worms I'd like
to not open at the moment.

-- Rob Speer

On Mon, 22 Jan 2018 at 01:43 Stephen J. Turnbull <
turnbull.stephen.fw at u.tsukuba.ac.jp> wrote:

> I don't expect to change your mind about the "right" way to deal with
> this, but this is a more explicit description of what those of us who
> advocate error handlers are thinking about.  It may be useful in
> writing your PEP (PEPs describe rejected counterproposals and
> amendments along with adopted proposals and rationale in either case).
>
> Rob Speer writes:
>
>  > > The question to my mind is whether or not this "latin1replace"
>  > > handler, in conjunction with existing codecs, will do the same
>  > > thing as the WHATWG codecs. If I have understood you correctly, I
>  > > think it will. Have I missed something?
>  >
>  > It won't do the same thing, and neither will the "chaining coders"
>  > proposal.
>
> The "chaining coders" proposal isn't well-enough specified to be sure.
>
> However, for practical purposes you may think of a Python *codec* as a
> "whole array" decoder/encoder, and an *error handler* as a "token-by-
> token" decoder/encoder.  The distinction in type is for efficiency, of
> course.  Codecs can't be "chained" (I think, but I didn't think very
> hard), but handlers can, in the sense that each handler can handle
> some input values and delegate anything it can't deal with to the next
> handler in the chain (under the hood, handler implementations are just
> Python functions with a particular signature, so this is just "loop
> until non-None").
>
>  > It's easy to miss details like this in all the counterproposals.
>
> I see no reason why a 'whatwgreplace' error handler with the logic
>
>     # I am assuming decoding, and single-byte encodings.  Encoding
>     # with 'html' error mode would insert format("&#%d;", ord(unicode)).
>     # Multibyte is a little harder.
>
>     # ASCII bytes never error except maybe in UTF16, UTF32, Shift JIS
>     # and Big5.
>     assert the_byte >= 0x80
>     # Handle C1 control characters.
>     if the_byte < 0xA0:
>         append_to_output(chr(the_byte))
>     # Handle extended repertoire with a dict.
>     # This condition will depend on the particular codec.
>     elif the_byte in additional_code_points:
>         append_to_output(additional_code_points[the_byte])
>     # Implement WHATWG error modes.
>     elif whatwg_error_mode is replacement:
>         append_to_output("\uFFFD")
>     else:
>         raise
>
> doesn't have the effect you want.  This can be done in pure Python.
> (Note: The actions in the pseudocode are not accurate.  IIRC real
> handlers take a UnicodeError as argument, and return a tuple of the
> text to append to output and number of input tokens to skip, or
> return None to indicate an unhandled error, rather than doing the
> appending and raising themselves.)
>
> The main objection to doing it this way would be efficiency.  To be
> honest, I personally don't think that's an important objection since
> this handler is invoked frequently only if the source text is badly
> broken.  (Remember, you'll already be greatly expanding the repertoire
> of at least ASCII and ISO 8859/1 by promoting to windows-1252.)  And
> it would surely be "fast enough" if written in C.
>
> Caveat: I'm not sure I agree with MAL about windows-1255.  I think
> it's arguable that the WHAT-WG index is a better approximation to
> reality, and I'd like to hear Hebrew speakers argue about that (I'm
> not one).
>
>  > The difference between WHATWG encodings and the ones in Python is,
>  > in all but one case, *only* in the C1 control character range (0x80
>  > to 0x9F),
>
> Also in Japanese, where "corporate characters" have been added
> (frequently twice, preventing round-tripping ... yuck) to the JIS
> standard.  I haven't checked the Chinese and Korean tables for similar
> damage, but they're not quite as wacky about this stuff as the JISC
> is, so they're probably OK (and of course Big5 was "corporate" from
> the get-go).
>
>  > a range of Unicode characters that has historically evaded
>  > standardization because they never had a clear purpose even before
>  > Unicode.  Filling in all the gaps with Latin-1
>
> That's wrong, as you explain:
>
>  > [Eg, in Greek, some code points] are simply unassigned. Other
>  > software sometimes maps them to the Private Use Area, but this is
>  > not standardized at all, and it seems clear that Python should
>  > handle them with its usual error handler for unassigned
>  > bytes. (Which is one of the reasons not to replace the error
>  > handler with something different: we still need the error handler.)
>
> The logic above handles all this.  As mentioned, a stdlib error
> handler ('strict', 'replace', or 'xmlcharrefreplace' for WHAT-WG
> conformance, or 'surrogateescape' for the Pythonic equivalent of
> mapping to the private area) could be chained if desired, and the
> defaults could be changed and the names aliased to the WHAT-WG terms.
>
> This could be automated with a factory function that takes a list of
> predefined handlers and composes them, although that would add another
> layer of inefficiency (the composition would presumably be done in a
> loop, and possibly using try although I think the error handler
> convention is to return the text to insert if handled, and None if the
> error can't be handled).
>
> Steve
>
>
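
The factory function described in the last quoted paragraph could be
sketched roughly as follows. Everything named here (compose_handlers,
c1_controls, and the registered names "c1-then-replace" and "c1-then-strict")
is hypothetical. One assumption to flag: returning None to decline is a
convention local to this sketch. Handlers registered with
codecs.register_error must return a (text, position) tuple or raise, so the
composed handler always ends by delegating to a stdlib handler.

```python
import codecs


def c1_controls(exc):
    """Partial handler: return (text, resume_pos) for C1 bytes, else None."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= b < 0xA0 for b in bad):
            return "".join(map(chr, bad)), exc.end
    return None  # decline; let the next handler in the chain try


def compose_handlers(*partial_handlers, fallback="strict"):
    """Build one error handler from partial handlers plus a stdlib fallback."""
    final = codecs.lookup_error(fallback)

    def handler(exc):
        # "Loop until non-None": the first partial handler that answers wins.
        for partial in partial_handlers:
            result = partial(exc)
            if result is not None:
                return result
        # Nobody handled it: the stdlib handler returns a tuple or raises.
        return final(exc)

    return handler


codecs.register_error(
    "c1-then-replace", compose_handlers(c1_controls, fallback="replace")
)
codecs.register_error("c1-then-strict", compose_handlers(c1_controls))

# In Python's cp1253, 0x81 (C1 range) and 0xD2 are both unassigned.
greek = b"\x81 \xd2"
replaced = greek.decode("cp1253", errors="c1-then-replace")
```

With "c1-then-strict", the same input raises UnicodeDecodeError at 0xD2, so
unassigned bytes outside the C1 range still reach the usual error handling.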