Re: [Python-ideas] Support WHATWG versions of legacy encodings

On Jan 11, 2018 4:05 AM, "Antoine Pitrou" <solipsis@pitrou.net> wrote:

> Define "widely used". If web-XXX is a superset of windows-XXX, then perhaps web-XXX is "used" in the sense of "used to decode valid windows-XXX data" (but windows-XXX could be used just as well to decode the same data). The question is rather: how often does web-XXX mojibake happen? We're well into the 2010s now, and you'd hope that mojibake doesn't happen as often as it used to in, e.g., 1998.

I'm not an expert here or anything, but from what we've been hearing it sounds like it must be used by all standards-compliant HTML parsers. I don't *like* the standard much, but I don't think that the stdlib should refuse to handle standards-compliant HTML, or to help users handle standards-compliant HTML correctly, just because the HTML standard has unfortunate things in it. We're not going to convince them to change the standard or anything. And this whole thread started with someone saying that their mojibake-fixing library is having trouble because of this, so clearly mojibake does still exist.

Does it help if we reframe it not as whatwg being "wrong" about windows-1252, but rather that there is this encoding web-1252, and, thanks to an interesting quirk of history, in HTML documents the byte sequence b'<meta charset="windows-1252">' indicates a file using this encoding? In fact the mapping between byte sequences and character sets here is so arbitrary that in standards-compliant HTML, the byte sequences b'<meta charset="ascii">', b'<meta charset="iso-8859-1">', and b'<meta charset="latin1">' *also* indicate that the file is encoded using web-1252. (See: https://encoding.spec.whatwg.org/#names-and-labels)

-n
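The label quirk is easy to see with the stdlib alone: Python resolves the label 'latin1' to ISO-8859-1, while the WHATWG spec resolves that same label (along with 'ascii' and 'iso-8859-1') to windows-1252, so the same bytes decode differently. A minimal sketch:

```python
# Byte 0x93 is a "smart quote" (U+201C) in windows-1252, but an
# unprintable C1 control character in ISO-8859-1 (Python's 'latin1').
data = b"\x93quoted\x94"

as_latin1 = data.decode("latin1")         # Python's reading of the label
as_web1252 = data.decode("windows-1252")  # WHATWG's reading of the same label

print(repr(as_latin1))   # '\x93quoted\x94' -- invisible C1 controls
print(repr(as_web1252))  # '\u201cquoted\u201d' -- curly quotes, as the author intended
```

A page labeled charset="latin1" containing those bytes renders as curly quotes in every standards-compliant browser, which is the behavior the WHATWG mapping codifies.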

On Thu, 11 Jan 2018 05:18:43 -0800 Nathaniel Smith <njs@pobox.com> wrote:
This is true. The other question is what the bar is for admitting new encodings into the standard library. I don't know much about the history of past practices there, so I will happily leave the decision to other people such as Marc-André.

Regards, Antoine.

Executive summary: we already do.

Nathaniel suggests we should conform to the WHAT-WG standard. But AFAGCT[1], there is no such thing as "WHATWG versions of legacy encodings". The document at https://encoding.spec.whatwg.org/ has the following normative specifications (capitalized words are presumed to have RFC 2119 semantics, but the document doesn't say):

1. New content MUST be encoded in UTF-8, which may or may not be tagged with "charset=utf-8" according to context.

2. Error handling is draconian. On decoding, errors MUST be fatal or (default) replacement of *all* uninterpretable text segments with U+FFFD REPLACEMENT CHARACTER. On encoding (i.e., of form input), errors MUST be (default) fatal or 'html' (i.e., 💩-encoding). Developers SHOULD construct processes to negotiate for UTF-8 instead of using 'html'.

3. "Legacy" (i.e., IANA-registered) encoding names MUST be interpreted via a specified map to actual encodings. (I believe the Subject: line refers to a garbled interpretation of this requirement.)

Note that "WHATWG codecs" don't help with this at all! There won't be labels for them in documents! You see charset="us-ascii" or charset="shift_jis", which correspond to existing Python codecs.

What a Python process needs to do to conform:

1. Specify 'utf-8' as the codec.

2. Use 'strict', 'replace', or 'xmlcharrefreplace' in error handling, conforming to the restrictions for decoding and encoding.

3. Use https://pypi.python.org/pypi/webencodings for the mapping of "charset" or "encoding" labels to codecs.

What we might want to do in the stdlib to make conformance easier:

1. Nothing.

2. Nothing. (Caveat: I have not checked that Python error handlers are 100% equivalent to the algorithms in the WHAT-WG encoding standard. I believe they are, or very close.)

3. Add the webencodings module to the stdlib. Add codecs if any are missing. (I haven't correlated the lists of codecs, but Python's is quite complete.)

I think adding webencodings is a very plausible step. Maintenance and future development should be minimal, since it's a very well-specified, complete, and self-contained standard.

If somebody wants to do hacks not in the WHAT-WG encoding standard to improve "readability" of decoded broken HTML, I think they're on their own.[2] I'm -1 on adding hacks for unassigned code points or worse breakage to the stdlib. Such hacks belong in frameworks etc., or as standalone modules on PyPI.

Footnotes:
[1] OK, my Google-fu may be lacking today.
[2] Although 💩 decoding is part of HTML (and has been for a long time), I'm sure that Python HTML-processing modules already handle that. It definitely doesn't belong in the codecs or error handlers, which handle the decoding of encoded text to "unencoded" text (or "internal encoding" if you prefer). "💩" as characters means PILE OF POO regardless of whether the text is encoded in ASCII or EBCDIC or written in graphite and carried on a physical RFC 1149 network -- it's a higher-level construct.

participants (3)

- Antoine Pitrou
- Nathaniel Smith
- Stephen J. Turnbull