[Python-ideas] Support WHATWG versions of legacy encodings

Thu Feb 1 04:20:00 EST 2018

On 01.02.2018 00:40, Chris Angelico wrote:
> On Thu, Feb 1, 2018 at 10:15 AM, Chris Barker <chris.barker at noaa.gov> wrote:
>> I still have no idea why there is such resistance to this -- yes, it's a
>> fairly small benefit over a package on PyPI, but there is also virtually no
>> downside.
> 
> I don't understand it either. Aside from maybe bikeshedding the *name*
> of the encoding, this seems like a pretty straightforward addition.

I guess many of you are not aware of how we have treated such encoding
additions in the past 1.5 decades.

In general, we have only added new encodings when there was an encoding
missing which a lot of people were actively using. We asked for
official documentation defining the mappings, references showing
usage and IANA or similar standard names to use for the encoding
itself and its aliases.

In recent years, we have had only very few such requests, mainly because
the set we have in Python is already fairly complete.

Now the OP proposes adding a whole set of encodings which
differ only slightly from our existing ones. The backing for these
is their use and definition by WHATWG, a consortium of browser
vendors who are interested in showing web pages to users in a
consistent way.

WHATWG decided to simply override the standard names for
encodings with new mappings under their control. Again, their
motivation is clear: browsers get documents whose advertised
encodings don't always match the standard ones, so they have
to make some choices on how to display those documents. The easiest
way for them is to define all special cases in a set of new mappings
for each standard encoding name.
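
To make "slightly different" concrete, here is a minimal sketch for the
windows-1252 case. It only probes Python's own cp1252 codec for bytes it
refuses to decode. As far as I understand the WHATWG index, those same
bytes are mapped there to the C1 control characters with the same
code point value. The helper name strict_decode_fails is made up for
this illustration.

```python
def strict_decode_fails(byte_value):
    """Return True if Python's cp1252 codec rejects this single byte."""
    try:
        bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        return True
    return False

# Collect every byte value that has no mapping in Python's cp1252 table.
rejected = [b for b in range(256) if strict_decode_fails(b)]
print([hex(b) for b in rejected])
```

All other byte values decode identically in both variants, which is why
the difference only shows up with data that is already slightly broken.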

This is all fine, but it's also a very limited use case: that
of wanting to display web pages in a browser. It's certainly
needed for applications implementing browser interfaces and
probably also for ones which do web scraping, but otherwise,
the need should rarely arise.

What WHATWG uses as workarounds may also not necessarily be
what actual users would like to have. Such workarounds are
always trade-offs and they can change over time - which WHATWG
addresses by making the encodings "living standards". They
are a solution, but not a one-size-fits-all way of dealing with
broken data.

We also have the naming issue, since WHATWG chose to use
the same names as the standard mappings. Anything we'd
define will neither match WHATWG nor any other encoding
standard name, so we'd be creating a new set of encoding
names - which is really not what the world is after,
including WHATWG itself.

People would start creating encoded text using these new
encoding names, resulting in even more mojibake out there
instead of fixing the errors in the data and using Unicode
or UTF-8 for interchange.

As I mentioned before, we could disable encoding in the new
mappings to resolve this concern, but the OP wasn't interested
in such an approach. As an alternative approach, we proposed error
handlers, which are the normal technology to use when dealing
with encoding errors. Again, the OP wasn't interested.
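
For illustration, the error handler approach could look roughly like
the following sketch. The handler name "c1-passthrough" and the function
c1_passthrough are hypothetical, not anything that exists in the stdlib;
only codecs.register_error() and the exception attributes are the real
API. The handler passes undecodable bytes through as code points of the
same value and refuses to take part in encoding.

```python
import codecs

def c1_passthrough(exc):
    """Hypothetical handler: map undecodable bytes to same-valued code points."""
    if not isinstance(exc, UnicodeDecodeError):
        # Only help with decoding; encoding errors stay errors.
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume.
    return "".join(chr(b) for b in bad), exc.end

codecs.register_error("c1-passthrough", c1_passthrough)

# 0xE9 and 0x80 are defined in cp1252; 0x81 is not and goes to the handler.
text = b"caf\xe9 \x81 \x80".decode("cp1252", errors="c1-passthrough")
```

The standard cp1252 codec stays untouched, so no new encoding name is
introduced and encoding with cp1252 remains as strict as before.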

Please also note that once we start adding, say
"whatwg-<original name>" encodings (or rather decodings :-),
going for the simple charmap encodings first, someone
will eventually also request addition of the more complex
Asian encodings which WHATWG defines. Maintaining these
is hard, since they require writing C code for performance
reasons and to keep the mapping tables small.
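
For the simple charmap case, a decode-only codec can be sketched in a
few lines of pure Python. Everything named here is an assumption made
for illustration: the codec name "whatwg-windows-1252", the table built
by filling the gaps in cp1252 with same-valued code points, and the
choice to raise UnicodeError on encoding. It is not a proposal for an
actual stdlib module.

```python
import codecs

def _build_table():
    """Build a 256-character decoding table: cp1252 with gaps filled."""
    chars = []
    for b in range(256):
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(b))  # assumed mapping for undefined bytes
    return "".join(chars)

DECODING_TABLE = _build_table()

def _decode(data, errors="strict"):
    # charmap_decode returns a (text, bytes_consumed) tuple.
    return codecs.charmap_decode(data, errors, DECODING_TABLE)

def _encode(text, errors="strict"):
    # Decode-only: refuse to produce new data in this variant.
    raise UnicodeError("whatwg-windows-1252 is decode-only")

def _search(name):
    # Lookup names are normalized differently across Python versions,
    # so treat hyphens and underscores alike.
    if name.replace("-", "_") == "whatwg_windows_1252":
        return codecs.CodecInfo(_encode, _decode, name="whatwg-windows-1252")
    return None

codecs.register(_search)

decoded = b"\x80\x81".decode("whatwg-windows-1252")
```

The multi-byte Asian encodings cannot be handled with a flat
256-entry table like this, which is where the maintenance cost comes in.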

I probably forgot a few aspects, but the above is how I would
summarize the discussion from the perspective of the people
who have dealt with such discussions in the past.

There are quite a few downsides to consider and since the OP
is not interested in going for a compromise as described above,
I don't see a way forward.

-- 
Marc-Andre Lemburg
eGenix.com
