[Python-ideas] Support WHATWG versions of legacy encodings

M.-A. Lemburg mal at egenix.com
Wed Jan 31 12:41:04 EST 2018


On 31.01.2018 17:36, Guido van Rossum wrote:
> On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka <storchaka at gmail.com> wrote:
> 
>     19.01.18 05:51, Guido van Rossum wrote:
> 
>         Can someone explain to me why this is such a controversial issue?
> 
>         It seems reasonable to me to add new encodings to the stdlib
>         that do the roundtripping requested in the first message of the
>         thread. As long as they have new names that seems to fall under
>         "practicality beats purity". (Modifying existing encodings seems
>         wrong -- did the feature request somehow transmogrify into that?)
> 
> 
>     In any case you need to change your code. If we add a new error handler,
>     you need to change the decoding code to use this error handler:
> 
>         text = data.decode(encoding, 'whatwgreplace')
> 
>     If we add new encodings, you need to maintain an alias table that maps
>     standard encoding names to the corresponding WHATWG encoding names:
> 
>         aliases = {'windows_1252': 'windows-1252-whatwg',
>                    'windows_1251': 'windows-1251-whatwg',
>                    'utf_8': 'utf-8-whatwg', # utf-8 + surrogatepass
>                    ...
>                   }
>         ...
>         text = data.decode(aliases.get(normalize_encoding(encoding), encoding))
> 
>     I don't see an advantage of the second approach for the end user.
>     And of course it is more costly for maintainers, because we will
>     need to implement around 20 new encodings, and it adds a cognitive
>     burden for new Python users, who would now have more tables of
>     encodings in the documentation.
> 
> 
> Hm. As a user, unless I run into problems with a specific encoding, I
> never care about how many encodings we have, so I don't see how adding
> extra encodings bothers those users who have no need for them.
> 
> There's a reason to prefer new encoding names (maybe augmented with
> an alias table) over a new error handler: there are lots of places where
> encodings are passed around via text files, Internet protocols, RPC
> calls, layers and layers of function calls. Many of these treat the
> encoding as a string, not as a (string, errorhandler) pair. So there may
> be situations where there is no way in a given API to preserve the need
> for using a special error handler, while the API would not have a
> problem preserving just the encoding name.

I already mentioned several reasons why I don't believe it's a good
idea to add these encodings to the stdlib as opposed to keeping
them on PyPI for those who need them, so I won't repeat them.

One detail I did not mention is that these encodings do not have
standard names.

WHATWG uses the same names as the original encodings from which
they derive. That makes sense for their intended purpose of
interpreting data coming from web servers, essentially in a
decoding-only way, but those names cannot be used in Python,
since our encodings follow the Unicode standard and don't
generate mojibake when encoding.
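
To make the difference concrete, here is a minimal sketch using
windows-1252 as the example. Python's cp1252 codec leaves five bytes
undefined, whereas the WHATWG index maps each of them to the C1
control character with the same code point. The helper names
(strict_decode_fails, whatwg_style_decode) are made up for this
illustration and are not part of any existing API.

```python
# Bytes that Python's cp1252 codec treats as undefined.
UNDEFINED_IN_CP1252 = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def strict_decode_fails(byte_value):
    """Return True if Python's cp1252 codec refuses to decode this byte."""
    try:
        bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        return True
    return False


def whatwg_style_decode(data):
    """Hypothetical helper: decode like cp1252, but map the undefined
    bytes to the C1 control with the same code point, as the WHATWG
    windows-1252 index does."""
    chars = []
    for b in data:
        if b in UNDEFINED_IN_CP1252:
            chars.append(chr(b))
        else:
            chars.append(bytes([b]).decode("cp1252"))
    return "".join(chars)


def cp1252_can_encode(text):
    """Return True if Python's cp1252 codec can encode the text."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True
```

With these definitions, whatwg_style_decode(b"\x81") yields U+0081,
while cp1252_can_encode("\x81") is false: text decoded the WHATWG way
cannot be encoded back using Python's codec of the same name.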

Whatever name would be used in the stdlib would be compatible
with neither WHATWG nor IANA. No other tool outside
Python would be able to interpret the encoded data using
those names.
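
For those who do need such a codec, a third-party package can
register it under a name of its own choosing. The following is only a
sketch of how that might look for windows-1252: the codec name
"windows_1252_whatwg" is an invented, Python-only label (exactly the
kind of non-standard name discussed above), and the table is derived
from the existing cp1252 codec rather than from the WHATWG index
files.

```python
import codecs

# Invented name for illustration; it is known to neither WHATWG nor IANA.
HYPOTHETICAL_NAME = "windows_1252_whatwg"


def _build_decoding_table():
    """Build a 256-character table: cp1252 where defined, otherwise the
    C1 control character with the same code point as the byte."""
    chars = []
    for b in range(256):
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(b))
    return "".join(chars)


_DECODING_TABLE = _build_decoding_table()
_ENCODING_TABLE = codecs.charmap_build(_DECODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, _DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, _ENCODING_TABLE)


def _search(name):
    # Normalize here as well, since older Python versions pass hyphens
    # through to the search function unchanged.
    if name.replace("-", "_").replace(" ", "_") == HYPOTHETICAL_NAME:
        return codecs.CodecInfo(
            name=HYPOTHETICAL_NAME, encode=_encode, decode=_decode
        )
    return None


codecs.register(_search)

# Only the encoding name has to be passed around; it round-trips all bytes.
text = b"\x93hi\x94\x81".decode("windows-1252-whatwg")
data = text.encode("windows-1252-whatwg")
```

The name works anywhere Python accepts an encoding string, but the
data it labels is still only interpretable under that name by code
that has registered the same codec.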

Given all those issues, I don't see what the benefit would
be of adding these encodings to the stdlib over leaving them on
PyPI for the special use case of reading broken web server
data.

-- 
Marc-Andre Lemburg
eGenix.com