[Python-ideas] Support WHATWG versions of legacy encodings

Steven D'Aprano steve at pearwood.info
Thu Jan 18 22:39:07 EST 2018

On Wed, Jan 10, 2018 at 07:13:39PM +0000, Rob Speer wrote:

> Having a pip installable library as the _only_ way to use these encodings
> is the status quo that I am very familiar with. It's awkward. To use a
> package that registers new codecs, you have to import something from that
> package, even if you never call anything from what you imported, and that
> makes flake8 complain. The idea that an encoding name may or may not be
> registered, based on what has been imported, breaks our intuition about
> reading Python code and is very hard to statically analyze.

Breaks whose intuition?

You don't speak for me on that matter -- while I don't like modules 
which operate by side-effect on import, I know that they are possible. 
In the stdlib, we have rlcompleter which operates like that. Whether 
such a design is good or bad (I think bad), nevertheless registering 
codecs by side-effect at import time should be an obvious possibility 
to any reasonably experienced developer.
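
For readers who have not seen the pattern, here is a minimal sketch of a 
module that registers a codec purely as a side effect of being imported. 
The codec name "x-demo-latin1" and the helper _search are hypothetical, 
made up for this illustration; the codec simply delegates to the stdlib 
latin-1 codec and is not one of the WHATWG encodings.

```python
import codecs

# Hypothetical codec name, used only for this illustration.
_NAME = "x-demo-latin1"


def _search(name):
    # codecs.lookup() normalizes the requested name before calling search
    # functions, and the exact normalization differs between Python
    # versions, so compare in a hyphen/underscore-insensitive way.
    if name.replace("-", "_") != _NAME.replace("-", "_"):
        return None
    latin1 = codecs.lookup("latin-1")
    return codecs.CodecInfo(
        name=_NAME,
        encode=latin1.encode,
        decode=latin1.decode,
    )


# Module-level side effect: this runs as soon as the module is imported.
codecs.register(_search)
```

Once a module like this has been imported anywhere in the process, 
b"...".decode("x-demo-latin1") works everywhere, even though nothing 
from the module is ever called by name. That is presumably the unused 
import that flake8 complains about.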

But regardless, I don't think that "the existing codec library has a 
poor API, and flake8 complains about it" is a good reason for adding the 
codecs to the stdlib. We don't necessarily add functionality to the 
stdlib just because existing third-party solutions are awkward to use.

Having said that, I'm not actually against adding this; if anything, I 
lean slightly towards "add". Still, I think the case for adding is 
unclear, and needs a PEP to discuss the issues fully. I think we've 
come to a consensus on the following question:

- Should we change the behaviour of the existing codecs to match
  the WHATWG encodings? No.

but there are others that do not have a consensus:

- Are existing stdlib solutions satisfactory to meet the WHATWG
  standard?

- If not, should the WHATWG encodings be added to the stdlib?

- If so, should they be built-in codecs, or should we import
  a library to register them?

- Or use the error handler mechanism?

- If codecs, should we offer both encode and decode support,
  or just decoding?

- What about the Unicode best-fit encodings?
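
To make the error handler option concrete, here is a rough sketch, 
not a proposal for the actual API. It assumes (as I understand the 
WHATWG windows-1252 table) that the bytes which the stdlib cp1252 codec 
leaves undefined, such as 0x81, should decode to the C1 control 
character with the same value. The handler name "c1-fallback" and the 
function c1_fallback are invented for this example.

```python
import codecs


def c1_fallback(exc):
    # Hypothetical handler: only decoding errors are handled here,
    # anything else is re-raised unchanged.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Map each undecodable byte to the code point with the same value,
    # then tell the decoder to resume right after the bad bytes.
    return "".join(chr(b) for b in bad), exc.end


codecs.register_error("c1-fallback", c1_fallback)

# The stdlib cp1252 codec rejects 0x81 under errors="strict"; with the
# handler installed the byte passes through as U+0081.
decoded = b"\x80\x81".decode("cp1252", errors="c1-fallback")
```

Note that this only covers decoding. Encoding U+0081 back to the byte 
0x81 would need the handler to deal with UnicodeEncodeError as well, 
which bears on the encode-versus-decode question above.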

Regarding that first undecided question, I'm particularly interested to 
see your response to Stephen Turnbull's statements here:


> I disagree with calling the WHATWG encodings that are implemented in every
> Web browser "non-standard". WHATWG may not have a typical origin story as a
> standards organization, but it _is_ the standards organization for the Web.

I wonder what the W3C would say about that last statement.

> I'm really not interested in best-fit mappings that turn infinity into "8"
> and square roots into "v". Making weird mappings like that sounds like a
> job for the "unidecode" library, not the stdlib.

Frankly, the idea that browsers should ignore the HTML's declared 
encoding in favour of some other hybrid encoding which never existed 
outside of broken web pages in order to be called "standards compliant" 
seems weird if not broken to me. Possibly even more weird than mapping ∞ 
to 8 and √ to v.

(I really wish the Unicode Consortium would do a better job of 
explaining the reasoning behind some of their more unintuitive or flat 
out strange-seeming decisions. But that's a rant for another day.)

I know that web browsers aren't quite the same as programming languages, 
and "Practicality beats purity", but still, "In the face of ambiguity, 
resist the temptation to guess". The WHATWG standard strikes me as "Do 
What You Guess I Mean".

