[Python-ideas] Support WHATWG versions of legacy encodings
rspeer at luminoso.com
Mon Jan 22 14:32:04 EST 2018
I don't really understand what you're doing when you take a fragment of my
sentence where I explain a wrong understanding of WHATWG encodings, and say
"that's wrong, as you explain". I know it's wrong. That's what I was saying.
You quoted the part where I said "Filling in all the gaps with Latin-1",
cut out the part where I said "is wrong", and replied with "that's wrong".
I guess I'm glad we're in agreement, but this has been a strange bit of
conversation.
In this pseudocode that implements a "whatwg_error_mode", can you describe
what the Python code to call it would look like? Does every call to .encode
and .decode now have a "whatwg_error_mode" parameter, in addition to the
"errors" parameter? Or are there twice as many possible strings you could
pass as the "errors" parameter, so you can have "replace",
"replace-whatwg", "surrogateescape", "surrogateescape-whatwg", etc?
My objection here isn't efficiency, it's adding confusing extra options to
.encode() and .decode() that aren't relevant in most cases.
I'd like to limit this proposal to single-byte encodings, addressing the
discrepancies in the C1 characters and possibly that Hebrew vowel point. If
there are differences in the JIS encodings, that is a can of worms I'd like
to not open at the moment.
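
To make the scope of "single-byte encodings" concrete, here is a small sketch that lists the bytes a codec refuses to decode. The helper name `unassigned_bytes` is made up for illustration; it relies only on `bytes.decode` with the default strict error handling.

```python
def unassigned_bytes(encoding):
    """Return the byte values that `encoding` rejects in strict mode."""
    missing = []
    for value in range(0x100):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


# Show which bytes Python's windows-1252 table leaves undefined.
print([hex(b) for b in unassigned_bytes("windows-1252")])
```

Running the same helper over the other single-byte codecs is one way to see which gaps fall in the C1 range and which, as in the Greek example, do not.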
-- Rob Speer
On Mon, 22 Jan 2018 at 01:43 Stephen J. Turnbull <
turnbull.stephen.fw at u.tsukuba.ac.jp> wrote:
> I don't expect to change your mind about the "right" way to deal with
> this, but this is a more explicit description of what those of us who
> advocate error handlers are thinking about. It may be useful in
> writing your PEP (PEPs describe rejected counterproposals and
> amendments along with adopted proposals and rationale in either case).
> Rob Speer writes:
> > > The question to my mind is whether or not this "latin1replace"
> > > handler, in conjunction with existing codecs, will do the same thing as
> > > the WHATWG codecs. If I have understood you correctly, I think it will.
> > > Have I missed something?
> > It won't do the same thing, and neither will the "chaining coders"
> > proposal.
> The "chaining coders" proposal isn't well-enough specified to be sure.
> However, for practical purposes you may think of a Python *codec* as a
> "whole array" decoder/encoder, and an *error handler* as a "token-by-
> token" decoder/encoder. The distinction in type is for efficiency, of
> course. Codecs can't be "chained" (I think, but I didn't think very
> hard), but handlers can, in the sense that each handler can handle
> some input values and delegate anything it can't deal with to the next
> handler in the chain (under the hood handler implementations are just
> Python functions with a particular signature, so this is just "loop
> until non-None").
> > It's easy to miss details like this in all the counterproposals.
> I see no reason why a 'whatwgreplace' error handler with the logic
>     # I am assuming decoding, and single-byte encodings. Encoding
>     # with 'html' error mode would insert format("&#%d;", ord(unicode)).
>     # Multibyte is a little harder.
>
>     # ASCII bytes never error except maybe in UTF16, UTF32, Shift JIS
>     # and Big5.
>     assert the_byte >= 0x80
>     # Handle C1 control characters.
>     if the_byte < 0xA0:
>         append_to_output(chr(the_byte))
>     # Handle extended repertoire with a dict.
>     # This condition will depend on the particular codec.
>     elif the_byte in additional_code_points:
>         append_to_output(additional_code_points[the_byte])
>     # Implement WHATWG error modes.
>     elif whatwg_error_mode is replacement:
>         append_to_output("\uFFFD")
> doesn't have the effect you want. This can be done in pure Python.
> (Note: The actions in the pseudocode are not accurate. IIRC real
> handlers take a UnicodeError as argument, and return a tuple of the
> text to append to output and number of input tokens to skip, or
> return None to indicate an unhandled error, rather than doing the
> appending and raising themselves.)
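
For comparison with the pseudocode, here is a minimal sketch of the C1 branch written against the real `codecs.register_error` interface. The handler name `whatwg-c1` and the function `whatwg_c1` are hypothetical. As far as I know, a registered handler receives the `UnicodeError` instance and must either return a tuple of (replacement text, position at which to resume) or raise; returning None is not accepted, so this sketch re-raises whatever it cannot handle.

```python
import codecs


def whatwg_c1(exc):
    """Hypothetical handler: decode an unassigned C1 byte as the same code point."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= b < 0xA0 for b in bad):
            # Return the replacement text and the position to resume at.
            return "".join(chr(b) for b in bad), exc.end
    # Anything else is not ours to handle, so re-raise the original error.
    raise exc


codecs.register_error("whatwg-c1", whatwg_c1)

# The handler is selected through the existing `errors` parameter.
print(ascii(b"caf\xe9 \x81".decode("windows-1252", errors="whatwg-c1")))
```

Bytes outside the C1 range that a codec leaves unassigned still raise `UnicodeDecodeError` under this handler, which keeps the usual strict behaviour for them.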
> The main objection to doing it this way would be efficiency. To be
> honest, I personally don't think that's an important objection since
> this handler would be invoked frequently only if the source text is badly
> broken. (Remember, you'll already be greatly expanding the repertoire
> of at least ASCII and ISO 8859/1 by promoting to windows-1252.) And
> it would surely be "fast enough" if written in C.
> Caveat: I'm not sure I agree with MAL about windows-1255. I think
> it's arguable that the WHAT-WG index is a better approximation to
> reality, and I'd like to hear Hebrew speakers argue about that (I'm
> not one).
> > The difference between WHATWG encodings and the ones in Python is,
> > in all but one case, *only* in the C1 control character range (0x80
> > to 0x9F),
> Also in Japanese, where "corporate characters" have been added
> (frequently twice, preventing round-tripping ... yuck) to the JIS
> standard. I haven't checked the Chinese and Korean tables for similar
> damage, but they're not quite as wacky about this stuff as the JISC
> is, so they're probably OK (and of course Big5 was "corporate" from
> the get-go).
> > a range of Unicode characters that has historically evaded
> > standardization because they never had a clear purpose even before
> > Unicode. Filling in all the gaps with Latin-1
> That's wrong, as you explain:
> > [Eg, in Greek, some code points] are simply unassigned. Other
> > software sometimes maps them to the Private Use Area, but this is
> > not standardized at all, and it seems clear that Python should
> > handle them with its usual error handler for unassigned
> > bytes. (Which is one of the reasons not to replace the error
> > handler with something different: we still need the error handler.)
> The logic above handles all this. As mentioned, a stdlib error
> handler ('strict', 'replace', or 'xmlcharrefreplace' for WHAT-WG
> conformance, or 'surrogateescape' for the Pythonic equivalent of
> mapping to the private area) could be chained if desired, and the
> defaults could be changed and the names aliased to the WHAT-WG terms.
> This could be automated with a factory function that takes a list of
> predefined handlers and composes them, although that would add another
> layer of inefficiency (the composition would presumably be done in a
> loop, and possibly using try although I think the error handler
> convention is to return the text to insert if handled, and None if the
> error can't be handled).
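
The factory described above could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing stdlib facility: `chain_handlers`, `c1_passthrough`, and the registered name `whatwg-chain` are all hypothetical, and the "return None to decline" convention applies only between links inside the chain. The composed function that is actually registered always returns a tuple or raises.

```python
import codecs


def chain_handlers(*handlers):
    """Compose handlers; each returns (text, resume_pos) or None to decline."""
    def chained(exc):
        for handler in handlers:
            result = handler(exc)
            if result is not None:
                return result
        # No link in the chain handled the error.
        raise exc
    return chained


def c1_passthrough(exc):
    """Hypothetical link: map unassigned C1 bytes to the same code points."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= b < 0xA0 for b in bad):
            return "".join(map(chr, bad)), exc.end
    return None


# The stdlib 'replace' handler always returns a tuple, so it ends the chain.
codecs.register_error(
    "whatwg-chain",
    chain_handlers(c1_passthrough, codecs.lookup_error("replace")),
)

print(ascii(b"a\x81\xe9".decode("ascii", errors="whatwg-chain")))
```

Putting `codecs.lookup_error("strict")` last instead would give the fatal behaviour, since that handler simply raises the error it is given.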