[Python-ideas] Support WHATWG versions of legacy encodings
Steven D'Aprano
steve at pearwood.info
Fri Feb 2 01:52:39 EST 2018
On Thu, Feb 01, 2018 at 10:20:00AM +0100, M.-A. Lemburg wrote:
> In general, we have only added new encodings when there was an encoding
> missing which a lot of people were actively using. We asked for
> official documentation defining the mappings, references showing
> usage and IANA or similar standard names to use for the encoding
> itself and its aliases.
[...]
> Now the OP comes proposing to add a whole set of encodings which
> only differ slightly from our existing ones. Backing is their
> use and definition by WHATWG, a consortium of browser vendors
> who are interested in showing web pages to users in a consistent
> way.
That gives us a defined mapping, references showing usage, but (alas)
not standard names, due to the WHATWG's (foolish and arrogantly
obnoxious, in my opinion) decision to re-use the standard names for the
non-standard usages.
Two out of three seems like a reasonable start to me.
But one thing we haven't really discussed is: why is this an issue for
Python? Everything I've seen so far suggests that these standards are
only for browsers and/or web scrapers. That seems fairly niche to me. If
you're writing a browser in Python, surely it isn't too much to ask that
you import a set of codecs from a third party library?
If I've missed something, please say so.
> We also have the naming issue, since WHATWG chose to use
> the same names as the standard mappings. Anything we'd
> define will neither match WHATWG nor any other encoding
> standard name, so we'd be creating a new set of encoding
> names - which is really not what the world is after,
> including WHATWG itself.
I hear you, but I think this is a comparatively minor objection. I don't
think it is a major problem for usability if we were to call these
encodings "spam-whatwg" instead of "spam". It isn't difficult for
browser authors to write:
    encoding = get_document_encoding()
    if config.USE_WHATWG_ENCODINGS:
        encoding += '-whatwg'
or otherwise look the encodings up in a mapping. We could even provide
that mapping in the codecs module:
    encoding = codecs.whatwg_mapping.get(encoding, encoding)
So the naming issue shouldn't be more than a minor nuisance, and one we
can entirely place in the lap of the WHATWG for misusing standard names.
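To make the idea concrete, here is a minimal sketch of such a lookup. The `whatwg_mapping` name, the `-whatwg` suffix and the specific entries are all hypothetical: nothing like this exists in the stdlib today, and the exact set of mappings would need to follow the WHATWG Encoding Standard's label tables.

```python
# Hypothetical mapping from standard encoding labels to WHATWG-variant
# codec names. Neither this dict nor the "-whatwg" codecs exist today;
# this only illustrates the lookup a browser author would do.
whatwg_mapping = {
    "windows-1252": "windows-1252-whatwg",
    "iso-8859-1": "windows-1252-whatwg",  # WHATWG treats this label as cp1252
    "koi8-r": "koi8-r-whatwg",
}

def resolve_encoding(name, use_whatwg=False):
    """Return the codec name to use for a document's declared encoding."""
    if use_whatwg:
        # Fall back to the standard name when no WHATWG variant exists.
        return whatwg_mapping.get(name, name)
    return name
```

With that in place, browser code only ever deals in standard labels and opts into the WHATWG variants at one well-defined point.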
Documentation-wise, I'd argue for placing these in a separate
sub-section of the codecs docs, with a strong notice that they should
only be used for decoding web documents and not for creating new
documents (except for testing purposes).
> People would start creating encoded text using these new
> encoding names, resulting in even more mojibake out there
> instead of fixing the errors in the data and using Unicode
> or UTF-8 for interchange.
We can't stop people from doing that: so long as the encodings exist as
a third-party package, people who really insist on creating such
abominable documents can do so. Just as they currently can accidentally
create mojibake in their own documents by misunderstanding encodings, or
as they can create new documents using legacy encodings like MacRoman
instead of UTF-8 like they should.
(And very occasionally, they might even have a good reason for doing so
-- while we can and should *discourage* such uses, we cannot and should
not expect to prohibit them.)
If it were my decision, I'd have these codecs raise a warning (not an
error) when used for encoding. But I guess some people will consider
that either going too far or not far enough :-)
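For what it's worth, nothing new would be needed in the interpreter to do that: the existing `codecs` machinery already supports wrapping a codec so that encoding emits a warning. The helper name and codec name below are my invention for this sketch; it simply wraps `cp1252` to show the mechanism.

```python
import codecs
import warnings

def register_decode_preferred(base_name, new_name):
    """Register a wrapper codec whose encoder works but emits a warning.

    A sketch of the idea only; no such codec or helper exists in the
    stdlib.
    """
    base = codecs.lookup(base_name)

    def encode(input, errors="strict"):
        warnings.warn(
            "%s is intended for decoding legacy web content, "
            "not for producing new documents" % new_name,
            UserWarning,
            stacklevel=2,
        )
        return base.encode(input, errors)

    info = codecs.CodecInfo(encode=encode, decode=base.decode, name=new_name)
    codecs.register(lambda name: info if name == new_name else None)

# Hypothetical codec name; underscores because codec lookup normalises
# names that way before calling search functions.
register_decode_preferred("cp1252", "cp1252_whatwg_demo")
```

Decoding through the wrapper stays silent; only `str.encode()` triggers the warning.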
> As I mentioned before, we could disable encoding in the new
> mappings to resolve this concern, but the OP wasn't interested
> in such an approach. As alternative approach we proposed error
> handlers, which are the normal technology to use when dealing
> with encoding errors. Again, the OP wasn't interested.
Be fair: it isn't that the OP (Rob Speer) merely isn't interested, he
does make some reasonable arguments that error handlers are the wrong
solution. He's convinced me that an error handler isn't the right way to
do this.
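For reference, the error-handler approach under discussion would look roughly like this. WHATWG's windows-1252 maps the five bytes that Python's cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) to the C1 control characters with the same ordinals, which a decode error handler can emulate; the handler name "c1fallback" is mine, not anything registered by default.

```python
import codecs

def c1_control_fallback(exc):
    """On an undecodable byte, substitute the C1 control character with
    the same ordinal, as the WHATWG windows-1252 table does."""
    if isinstance(exc, UnicodeDecodeError):
        byte = exc.object[exc.start]
        return (chr(byte), exc.start + 1)
    raise exc

# Handler name is hypothetical, chosen for this sketch.
codecs.register_error("c1fallback", c1_control_fallback)

# b"\x81" is undefined in Python's cp1252 but decodes to U+0081 here.
text = b"abc\x81def".decode("cp1252", errors="c1fallback")
```

The catch, as Rob Speer points out, is that every call site has to remember to pass the handler, and it silently changes behaviour for bytes that genuinely are errors in other charmap codecs.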
He *hasn't* convinced me that the stdlib needs to solve this problem,
but if it does, I think some new encodings are the right way to do it.
> Please also note that once we start adding, say
> "whatwg-<original name>" encodings (or rather decodings :-),
> going for the simple charmap encodings first, someone
> will eventually also request addition of the more complex
> Asian encodings which WHATWG defines. Maintaining these
> is hard, since they require writing C code for performance
> reasons and to keep the mapping tables small.
YAGNI -- we can deal with that when and if it gets requested. This is
not the camel's nose: adding a handful of 8-bit WHATWG encodings does
not oblige us to add more.
[...]
> There are quite a few downsides to consider
Indeed -- this isn't a "no-brainer". That's why I'm still hoping to see
a fair and balanced PEP.
> and since the OP
> is not interested in going for a compromise as described above,
> I don't see a way forward.
Status quo wins a stalemate. Sometimes that's better than a broken
solution that won't satisfy anyone.
--
Steve