[Python-ideas] Support WHATWG versions of legacy encodings

Thu Jan 11 08:18:43 EST 2018

On Jan 11, 2018 4:05 AM, "Antoine Pitrou" <solipsis at pitrou.net> wrote:

> Define "widely used".  If web-XXX is a superset of windows-XXX, then
> perhaps web-XXX is "used" in the sense of "used to decode valid
> windows-XXX data" (but windows-XXX could be used just as well to
> decode the same data).  The question is rather: how often does web-XXX
> mojibake happen?  We're well in the 2010s now and you'd hope that
> mojibake doesn't happen as often as it used to in, e.g., 1998.

I'm not an expert here or anything, but from what we've been hearing it
sounds like it must be used by all standard-compliant HTML parsers. I don't
*like* the standard much, but I don't think that the stdlib should refuse
to handle standard-compliant HTML, or help users handle standard-compliant
HTML correctly, just because the HTML standard has unfortunate things in
it. We're not going to convince them to change the standard or anything.
And this whole thread started with someone saying that their mojibake-fixing
library is having trouble because of this, so clearly mojibake does still
exist.

Does it help if we reframe it as not that whatwg is "wrong" about
windows-1252, but rather that there is this encoding web-1252, and thanks
to an interesting quirk of history, in HTML documents the byte sequence
b'<meta charset="windows-1252">' indicates a file using this encoding? In
fact the mapping between byte sequences and character sets here is so
arbitrary that in standards-compliant HTML, the byte sequences b'<meta
charset="ascii">', b'<meta charset="iso-8859-1">', and b'<meta
charset="latin1">' *also* indicate that the file is encoded using web-1252.
(See: https://encoding.spec.whatwg.org/#names-and-labels)
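
To make the reframing concrete, here is a rough sketch of what "web-1252"
could look like on top of the existing cp1252 codec. Everything named here
(WEB_1252_LABELS, is_web_1252_label, decode_web_1252) is hypothetical, the
label table is only the handful of labels mentioned above rather than the
full list in the spec, and as far as I know nothing like this exists in the
stdlib today. It assumes the only difference from Python's cp1252 is the five
bytes that codec leaves undefined.

```python
# Hypothetical names throughout; this is a sketch, not a stdlib API.
# A few of the labels the WHATWG spec maps to its windows-1252
# (deliberately not the full list).
WEB_1252_LABELS = {"ascii", "iso-8859-1", "latin1", "windows-1252"}


def is_web_1252_label(label: str) -> bool:
    """Return True if an HTML charset label selects web-1252 (partial table)."""
    return label.strip().lower() in WEB_1252_LABELS


def decode_web_1252(data: bytes) -> str:
    """Decode as cp1252, passing undefined bytes through as C1 controls."""
    chars = []
    for byte in data:
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Python's cp1252 leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
            # undefined; WHATWG maps each to the code point with the
            # same number (U+0081 and so on).
            chars.append(chr(byte))
    return "".join(chars)
```

With this, b'\x81'.decode('cp1252') still raises UnicodeDecodeError, while
decode_web_1252(b'\x81') returns the single character U+0081. Going byte by
byte is slow; a real codec would use a charmap table instead.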

-n