[Python-ideas] Support WHATWG versions of legacy encodings
Stephen J. Turnbull
turnbull.stephen.fw at u.tsukuba.ac.jp
Mon Jan 22 02:39:18 EST 2018
> I think his point is that the WHATWG standard is the one that
> governs HTML and therefore HTML that uses these encodings
> (including the C1 characters) are conformant to *that* standard,
I don't think that is a tenable interpretation of this standard.
The WHAT-WG standard encoding for HTML is UTF-8. This is what the
standard says:

    Authors must use the UTF-8 encoding and must use the ASCII
    case-insensitive "utf-8" label to identify it.

    New protocols and formats, as well as existing formats deployed in
    new contexts, must use the UTF-8 encoding exclusively. If these
    protocols and formats need to expose the encoding’s name or label,
    they must expose it as "utf-8".

Non-UTF-8 *documents* do not conform. There's nothing anywhere that
says you may use other encodings, with the single exception of implied
permission when encoding form input to send to the server (and that's
not even HTML!). Even there you're encouraged to use UTF-8.
The rest of the standard provides for how *processes* should handle
encodings in purported HTML documents that fail the requirement to
encode in UTF-8. That doesn't mean such documents conform; it simply
*gives permission* to a conformant process to try to deal with them,
and rules for doing that.
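
To make "rules for doing that" concrete, here is a minimal sketch of the
kind of lenient decoding under discussion in this thread. It assumes the
only difference that matters is the five bytes Python's cp1252 codec
leaves undefined, which the WHAT-WG windows-1252 index maps to the C1
control with the same number. The names UNDEFINED_IN_CP1252 and
decode_whatwg_1252 are mine, purely for illustration; nothing like them
exists in the stdlib.

```python
# Bytes that Python's strict cp1252 codec refuses to decode.
UNDEFINED_IN_CP1252 = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def decode_whatwg_1252(data: bytes) -> str:
    """Decode bytes as windows-1252, passing undefined bytes through
    as the C1 control character with the same code point number."""
    out = []
    for b in data:
        if b in UNDEFINED_IN_CP1252:
            # Hypothetical WHAT-WG-style fallback: byte 0x81 -> U+0081.
            out.append(chr(b))
        else:
            out.append(bytes([b]).decode("cp1252"))
    return "".join(out)


# The strict stdlib codec raises on the same input.
try:
    b"\x81".decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Note that this only illustrates what a process may do with such input;
it says nothing about whether the input document conforms.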
Yes, it's true that WHAT-WG processing probably would have saved
Nathaniel some aggravation with his manipulations of HTML. It's
equally likely that errors='surrogateescape' would do so, and do a better
job on encodings like Hebrew that leave code points in graphic regions
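
A small sketch of what I mean by errors='surrogateescape', using
ISO-8859-8 as the Hebrew example. I'm assuming byte 0xA1 as the
undefined position (it is undefined in Python's iso8859_8 codec); the
sample bytes are made up for illustration.

```python
# 0xE0 is HEBREW LETTER ALEF in ISO-8859-8; 0xA1 is not assigned.
raw = b"\xe0\xa1"

# Strict decoding rejects the unassigned byte.
try:
    raw.decode("iso8859_8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the bad byte through as a lone surrogate
# (U+DC00 + byte value), so nothing is lost or silently reinterpreted.
text = raw.decode("iso8859_8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("iso8859_8", errors="surrogateescape")
```

The point is that the undecodable byte stays visibly "not a character"
in the str, yet the round trip is lossless.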
I take this to mean that when I take an EUC-JP HTML document and
move it from my legacy document tree to my new Django static resource
collection, I *must* transcode it to UTF-8.
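
For what it's worth, the transcoding step itself is trivial in Python. A
minimal sketch, where transcode_to_utf8 is a hypothetical helper name and
the input is assumed to be well-formed EUC-JP:

```python
def transcode_to_utf8(data: bytes, source_encoding: str = "euc_jp") -> bytes:
    """Re-encode a legacy-encoded document as UTF-8.

    Decoding is strict on purpose: a byte sequence that is not valid in
    the source encoding raises UnicodeDecodeError rather than being
    silently mangled.
    """
    return data.decode(source_encoding).encode("utf-8")


# Illustrative input: a short Japanese string encoded as EUC-JP.
legacy = "日本語".encode("euc_jp")
converted = transcode_to_utf8(legacy)
```

Any charset label inside the document (a meta element, say) would also
have to be changed to "utf-8"; this sketch doesn't touch that.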