[Python-ideas] Support WHATWG versions of legacy encodings

Wed Jan 31 12:48:29 EST 2018

31.01.18 18:36, Guido van Rossum wrote:
> On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka
> <storchaka at gmail.com> wrote:
> 
>     19.01.18 05:51, Guido van Rossum wrote:
> 
>         Can someone explain to me why this is such a controversial issue?
> 
>         It seems reasonable to me to add new encodings to the stdlib
>         that do the roundtripping requested in the first message of the
>         thread. As long as they have new names that seems to fall under
>         "practicality beats purity". (Modifying existing encodings seems
>         wrong -- did the feature request somehow transmogrify into that?)
> 
> 
>     In any case you need to change your code. If we add a new error
>     handler, you need to change the decoding code to use it:
> 
>          text = data.decode(encoding, 'whatwgreplace')
> 
>     If we add new encodings, you need to maintain an alias table that
>     maps standard encoding names to the corresponding WHATWG names:
> 
>          aliases = {'windows_1252': 'windows-1252-whatwg',
>                     'windows_1251': 'windows-1251-whatwg',
>                     'utf_8': 'utf-8-whatwg', # utf-8 + surrogatepass
>                     ...
>                    }
>          ...
>          text = data.decode(
>              aliases.get(normalize_encoding(encoding), encoding))
> 
>     I don't see an advantage of the second approach for the end user.
>     And of course it is more costly for maintainers, because we will
>     need to implement around 20 new encodings, and it adds a cognitive
>     burden for new Python users, who now have more tables of encodings
>     in the documentation.
> 
> 
> Hm. As a user, unless I run into problems with a specific encoding, I 
> never care about how many encodings we have, so I don't see how adding 
> extra encodings bothers those users who have no need for them.

The codecs module documentation contains several tables of encodings:
standard encodings, Python-specific text encodings, binary transforms
and text transforms (a single one). This will add yet another large
table. A user learning Python will need to learn how these encodings
differ from the others and how to use them correctly. A new user
doesn't know what matters to them and what they can ignore until they
need it (or how to tell that they need it).

> There's a reason to prefer new encoding names (maybe augmented with 
> alias table) over a new error handler: there are lots of places where 
> encodings are passed around via text files, Internet protocols, RPC 
> calls, layers and layers of function calls. Many of these treat the 
> encoding as a string, not as a (string, errorhandler) pair. So there may 
> be situations where there is no way in a given API to preserve the need 
> for using a special error handler, while the API would not have a 
> problem preserving just the encoding name.

The encoding name that is passed around differs from the name of the new
Python encoding. It is just 'windows-1252', not 'windows-1252-whatwg'.
If we just change the existing encoding, this can break other code that
expects the standard 'windows-1252'. Thus every time you need
'windows-1252-whatwg' instead of the 'windows-1252' passed with the
text, you need to map encoding names. How does this differ from using a
special error handler?
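
To make the comparison concrete, here is a minimal sketch of the extra
step each approach needs at the call site. The '-whatwg' codec names and
the 'whatwgreplace' handler are hypothetical (neither exists in the
stdlib today), so the sketch only computes the arguments that would be
passed to bytes.decode() and does not decode anything.

```python
from encodings import normalize_encoding

# Hypothetical names: no such codecs or error handler exist in the stdlib.
WHATWG_ALIASES = {
    'windows_1252': 'windows-1252-whatwg',
    'windows_1251': 'windows-1251-whatwg',
}


def decode_args_new_encodings(label):
    """Approach 1: map the received label to a separate WHATWG codec name."""
    key = normalize_encoding(label).lower()
    return (WHATWG_ALIASES.get(key, label),)


def decode_args_error_handler(label):
    """Approach 2: keep the received label and add an error handler."""
    return (label, 'whatwgreplace')
```

Either way the caller receives plain 'windows-1252' and has to add one
step before calling data.decode(*args).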

Another problem is that we actually need two error handlers. WHATWG
specifies two behaviors for unmapped codes outside the C0-C1 range:
replacing with a special character, or raising an error. These
correspond to the standard Python handlers 'replace' and 'strict'. Thus
we need to either add two new error handlers, 'whatwgreplace' and
'whatwgstrict', or add *two* sets of new encodings (more than 70
encodings in total!).
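
As a rough illustration of the two-handler variant, the following
sketch registers both handlers with codecs.register_error(). It rests
on several assumptions: it covers decoding only, it targets single-byte
codecs, and the rule "pass undecodable bytes in the C0/C1 ranges through
as the code point with the same value" is my reading of the behavior
discussed in this thread, not a complete implementation of the WHATWG
specification. The helper name _c1_passthrough is made up for the
example.

```python
import codecs


def _c1_passthrough(exc):
    """Map undecodable C0/C1 bytes to same-valued code points, else None."""
    bad = exc.object[exc.start:exc.end]
    if all(b < 0x20 or 0x80 <= b <= 0x9F for b in bad):
        return ''.join(map(chr, bad)), exc.end
    return None


def whatwg_replace(exc):
    # Decoding only: anything else is re-raised unchanged.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # Outside C0/C1, behave like the standard 'replace' handler.
    return _c1_passthrough(exc) or ('\ufffd', exc.end)


def whatwg_strict(exc):
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    result = _c1_passthrough(exc)
    if result is None:
        # Outside C0/C1, behave like the standard 'strict' handler.
        raise exc
    return result


codecs.register_error('whatwgreplace', whatwg_replace)
codecs.register_error('whatwgstrict', whatwg_strict)

# Byte 0x81 is unmapped in Python's cp1252 and falls in the C1 range.
text = b'caf\x81'.decode('windows-1252', 'whatwgreplace')
```

With these registered, the encoding name stays the standard one and only
the errors argument changes between the two behaviors.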