[Python-ideas] Support WHATWG versions of legacy encodings

Stephen J. Turnbull turnbull.stephen.fw at u.tsukuba.ac.jp
Mon Jan 22 02:39:18 EST 2018


Random832 writes:

 > I think his point is that the WHATWG standard is the one that
 > governs HTML and therefore HTML that uses these encodings
 > (including the C1 characters) are conformant to *that* standard,

I don't think that is a tenable interpretation of this standard.
The WHATWG standard encoding for HTML is UTF-8.  This is what
https://encoding.spec.whatwg.org/#names-and-labels says:

    Authors must use the UTF-8 encoding and must use the ASCII
    case-insensitive "utf-8" label to identify it.

    New protocols and formats, as well as existing formats deployed in
    new contexts[1], must use the UTF-8 encoding exclusively. If these
    protocols and formats need to expose the encoding’s name or label,
    they must expose it as "utf-8".

Non-UTF-8 *documents* do not conform.  Nothing in the standard says
you may use other encodings, with the single exception of an implied
permission when encoding form input to send to the server (and that's
not even HTML!).  Even there you're encouraged to use UTF-8.

The rest of the standard provides for how *processes* should handle
encodings in purported HTML documents that fail the requirement to
encode in UTF-8.  That doesn't mean such documents conform; it simply
*gives permission* to a conformant process to try to deal with them,
and rules for doing that.

Yes, it's true that WHATWG processing probably would have saved
Nathaniel some aggravation with his manipulations of HTML.  It's
equally likely that errors='surrogateescape' would have done so, and
it would do a better job on encodings like the Hebrew ones, which
leave code points in the graphic regions undefined.
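
To make the surrogateescape point concrete, here is a minimal sketch
using only the stdlib codecs.  The helper names are mine, purely for
illustration.  Note that this does not produce the same text as the
WHATWG decoders (which, for example, map windows-1252 byte 0x81 to
U+0081); it only shows that undefined bytes survive a round trip.

```python
# Sketch: bytes with no mapping in a legacy codec survive a
# decode/encode round trip when errors='surrogateescape' is used.
# decode_lossless/encode_lossless are hypothetical helper names.

def decode_lossless(raw: bytes, encoding: str) -> str:
    """Decode raw; undecodable bytes become lone surrogates U+DC80..U+DCFF."""
    return raw.decode(encoding, errors="surrogateescape")


def encode_lossless(text: str, encoding: str) -> bytes:
    """Inverse of decode_lossless: lone surrogates turn back into bytes."""
    return text.encode(encoding, errors="surrogateescape")


# 0xE0 is HEBREW LETTER ALEF in ISO 8859-8; 0xBF is unassigned there.
raw = b"\xe0\xbf"

# A strict decode refuses the unassigned byte.
try:
    raw.decode("iso8859_8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

text = decode_lossless(raw, "iso8859_8")
restored = encode_lossless(text, "iso8859_8")

# The same approach covers the unassigned C1-range bytes in cp1252.
c1_text = decode_lossless(b"\x81", "cp1252")
```

Here `restored` equals `raw` byte for byte, so a process that only
shuffles markup around never has to know what the odd bytes meant.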


Footnotes: 
[1]  I take this to mean that when I take an EUC-JP HTML document and
move it from my legacy document tree to my new Django static resource
collection, I *must* transcode it to UTF-8.
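
For what it's worth, the transcoding itself is a one-liner in Python.
A minimal sketch, assuming the document really is valid EUC-JP; the
function name is made up for illustration:

```python
# Sketch: transcode a legacy document's bytes to UTF-8.
# transcode_to_utf8 is a hypothetical helper, not a stdlib function.

def transcode_to_utf8(raw: bytes, source_encoding: str = "euc_jp") -> bytes:
    """Decode strictly from the legacy encoding, then re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


legacy = "日本語".encode("euc_jp")  # stand-in for the legacy file's bytes
utf8 = transcode_to_utf8(legacy)
```

Any encoding declaration inside the document would also need to be
changed to "utf-8"; the sketch does not touch the markup.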


