[Python-ideas] Support WHATWG versions of legacy encodings
Stephen J. Turnbull
turnbull.stephen.fw at u.tsukuba.ac.jp
Fri Jan 12 02:05:08 EST 2018
Executive summary: we already do.
Nathaniel suggests we should conform to the WHAT-WG standard. But
AFAGCT, there is no such thing as "WHATWG versions of legacy
encodings". The document at https://encoding.spec.whatwg.org/ has the
following normative specifications (capitalized words are presumed to
have RFC 2119 semantics, but the document doesn't say):
1. New content MUST be encoded in UTF-8, which may or may not be
tagged with "charset=utf-8" according to context.
2. Error-handling is draconian. On decoding, errors MUST be fatal or
(default) replacement of *all* uninterpretable text segments with
U+FFFD REPLACEMENT CHARACTER. On encoding (i.e., of form input),
errors MUST be (default) fatal or 'html' (ie, &#128169;-encoding).
Developers SHOULD construct processes to negotiate for UTF-8
instead of using 'html'.
3. "Legacy" (ie, IANA registered) encoding names MUST be interpreted
via a specified map to actual encodings. (I believe Subject:
refers to a garbled interpretation of this requirement.)
Note that "WHATWG codecs" don't help with this at all! There
won't be labels for them in documents! You see charset="us-ascii"
or charset="shift_jis", which correspond to existing Python codecs.
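
As a quick check of that last point, here is a minimal sketch using only the stdlib codecs module. It shows that the labels mentioned above already resolve to Python codecs. Note the assumption built into it: Python's own alias table is not the WHATWG label map, so the codec selected for a given label may differ from what the standard prescribes.

```python
import codecs

# Labels as they might appear in charset="..." declarations.
labels = ["us-ascii", "shift_jis", "utf-8"]

# codecs.lookup() normalizes the label and returns a CodecInfo object
# whose .name attribute is Python's canonical name for the codec.
resolved = {label: codecs.lookup(label).name for label in labels}

for label, name in resolved.items():
    print(f"{label!r} -> {name!r}")
```

An unknown label makes codecs.lookup() raise LookupError, so a caller can fall back to another policy at that point.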
What a Python process needs to do to conform:
1. Specify 'utf-8' as the codec.
2. Use 'strict', 'replace', or 'xmlcharrefreplace' in error handling,
conforming to the restrictions for decoding and encoding.
3. Use https://pypi.python.org/pypi/webencodings for the mapping of
"charset" or "encoding" labels to codecs.
What we might want to do in the stdlib to make conformance easier:
2. Nothing. (Caveat: I have not checked that Python error handlers
are 100% equivalent to the algorithms in the WHAT-WG encoding
standard. I believe they are, or very close.)
3. Add the webencodings module to the stdlib. Add codecs if any are
missing. (I haven't correlated the lists of codecs, but Python's
is quite complete.)
I think adding webencodings is a very plausible step. Maintenance
and future development should be minimal since it's a very well-
specified, complete, and self-contained standard.
If somebody wants to do hacks not in the WHAT-WG encoding standard to
improve "readability" of decoded broken HTML, I think they're on their
own. I'm -1 on adding hacks for unassigned code points or worse
breakage to the stdlib. Such hacks belong in frameworks etc., or as
standalone modules on PyPI.
 OK, my Google-fu may be lacking today.
 Although &#128169; decoding is part of HTML (and has been for a
long time), I'm sure that Python HTML-processing modules already
handle that. It definitely doesn't belong in the codecs or error
handlers, which handle the decoding of encoded text to "unencoded"
text (or "internal encoding" if you prefer). "&#128169;" as
characters means PILE OF POO regardless of whether the text is encoded
in ASCII or EBCDIC or written in graphite and carried on a physical
RFC 1149 network -- it's a higher-level construct.
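
To illustrate the layering, here is a small sketch using only the stdlib. The codec step turns bytes into text and leaves the character reference untouched; a separate markup-level step (html.unescape here, standing in for whatever an HTML-processing module does) resolves it. The choice of cp037 as the EBCDIC codec is just an example.

```python
import html
import unicodedata

reference = "&#128169;"

# Codec layer: the same reference survives a round trip through ASCII
# and through an EBCDIC code page as plain characters.
from_ascii = reference.encode("ascii").decode("ascii")
from_ebcdic = reference.encode("cp037").decode("cp037")

# Markup layer: only here is the reference turned into a character.
char = html.unescape(from_ebcdic)
name = unicodedata.name(char)

print(from_ascii == from_ebcdic, name)
```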