[Python-ideas] Support WHATWG versions of legacy encodings
steve at pearwood.info
Thu Jan 18 22:39:07 EST 2018
On Wed, Jan 10, 2018 at 07:13:39PM +0000, Rob Speer wrote:
> Having a pip installable library as the _only_ way to use these encodings
> is the status quo that I am very familiar with. It's awkward. To use a
> package that registers new codecs, you have to import something from that
> package, even if you never call anything from what you imported, and that
> makes flake8 complain. The idea that an encoding name may or may not be
> registered, based on what has been imported, breaks our intuition about
> reading Python code and is very hard to statically analyze.
Breaks whose intuition?
You don't speak for me on that matter -- while I don't like modules
which operate by side-effect on import, I know that they are possible.
In the stdlib, we have rlcompleter which operates like that. Whether
such a design is good or bad (I think bad), nevertheless registering
codecs by side-effect at import time should be an obvious possibility
to any reasonably experienced developer.
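As a minimal sketch of what "registering codecs by side-effect at import time" looks like in practice, the snippet below uses the real `codecs.register()` hook. The encoding name `demoalias` and the search function are hypothetical and chosen only for illustration; they are not the third-party package under discussion. In a real package the `codecs.register()` call would sit at module level, so that merely importing the module makes the name available.

```python
import codecs


def _demo_search(name):
    """Codec search function; `name` arrives already lower-cased."""
    # "demoalias" is a made-up name that simply reuses the Latin-1 codec.
    if name == "demoalias":
        return codecs.lookup("latin-1")
    return None  # let other search functions handle everything else


# This call is the import-time side effect: once it has run, the new
# encoding name works everywhere in the process.
codecs.register(_demo_search)

decoded = b"caf\xe9".decode("demoalias")
print(decoded)
```

Code that only needs the side effect ends up importing the module without using any name from it, which is the pattern that linters such as flake8 report as an unused import.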
But regardless, I don't think that "the existing codec library has a
poor API, and flake8 complains about it" is a good reason for adding the
codecs to the stdlib. We don't necessarily add functionality to the
stdlib just because existing third-party solutions are awkward to use.
Having said that, I'm not actually against adding this; if anything, I
lean slightly towards "add". I think the case for adding is unclear, and
needs a PEP to discuss the issues fully. I think we've come to a
consensus on the following question:
- Should we change the behaviour of the existing codecs to match
  the WHATWG encodings? No.
but there are others that do not have a consensus:
- Are existing stdlib solutions satisfactory to meet the WHATWG standard?
- If not, should the WHATWG encodings be added to the stdlib?
- If so, should they be built-in codecs, or should we import
  a library to register them?
- Or use the error handler mechanism?
- If codecs, should we offer both encode and decode support,
  or just decoding?
- What about the Unicode best-fit encodings?
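
To make the "error handler mechanism" option concrete, here is a rough sketch using the real `codecs.register_error()` hook. It assumes the behaviour wanted is the WHATWG windows-1252 treatment of the five bytes that Python's `cp1252` codec leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), namely mapping each one to the C1 control character with the same number. The handler name `demo_c1_passthrough` is hypothetical, and this covers decoding only.

```python
import codecs


def c1_passthrough(exc):
    """Map undecodable bytes to the code points with the same value."""
    if isinstance(exc, UnicodeDecodeError):
        bad_bytes = exc.object[exc.start:exc.end]
        replacement = "".join(chr(b) for b in bad_bytes)
        # Return the replacement text and the position to resume decoding at.
        return replacement, exc.end
    # Encoding errors are out of scope for this sketch.
    raise exc


# "demo_c1_passthrough" is a made-up handler name for illustration.
codecs.register_error("demo_c1_passthrough", c1_passthrough)

data = b"caf\xe9 \x81"

# The built-in codec rejects byte 0x81 under the default strict handler.
try:
    data.decode("cp1252")
except UnicodeDecodeError:
    strict_failed = True
else:
    strict_failed = False

# With the custom handler, 0x81 decodes to U+0081 instead of failing.
text = data.decode("cp1252", errors="demo_c1_passthrough")
print(strict_failed, ascii(text))
```

One trade-off of this approach is that the caller has to pass `errors=` at every call site, whereas a separately registered codec would be selected by the encoding name alone.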
Regarding that first undecided question, I'm particularly interested to
see your response to Stephen Turnbull's statements here:
> I disagree with calling the WHATWG encodings that are implemented in every
> Web browser "non-standard". WHATWG may not have a typical origin story as a
> standards organization, but it _is_ the standards organization for the Web.
I wonder what the W3C would say about that last statement.
> I'm really not interested in best-fit mappings that turn infinity into "8"
> and square roots into "v". Making weird mappings like that sounds like a
> job for the "unidecode" library, not the stdlib.
Frankly, the idea that browsers should ignore the HTML's declared
encoding in favour of some other hybrid encoding which never existed
outside of broken web pages in order to be called "standards compliant"
seems weird if not broken to me. Possibly even more weird than mapping ∞
to 8 and √ to v.
(I really wish the Unicode Consortium would do a better job of
explaining the reasoning behind some of their more unintuitive or flat
out strange-seeming decisions. But that's a rant for another day.)
I know that web browsers aren't quite the same as programming languages,
and "Practicality beats purity", but still, "In the face of ambiguity,
resist the temptation to guess". The WHATWG standard strikes me as "Do
What You Guess I Mean".