[Python-ideas] Support WHATWG versions of legacy encodings

Guido van Rossum guido at python.org
Wed Jan 31 14:23:42 EST 2018

OK, I am no longer interested in this topic. If you can't reach agreement,
so be it, and then the status quo prevails. I am going to mute this thread.
There's no need to explain to me why I am wrong.

On Wed, Jan 31, 2018 at 9:48 AM, Serhiy Storchaka <storchaka at gmail.com> wrote:

> 31.01.18 18:36, Guido van Rossum wrote:
>> On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka <storchaka at gmail.com> wrote:
>>     19.01.18 05:51, Guido van Rossum wrote:
>>         Can someone explain to me why this is such a controversial issue?
>>         It seems reasonable to me to add new encodings to the stdlib
>>         that do the roundtripping requested in the first message of the
>>         thread. As long as they have new names that seems to fall under
>>         "practicality beats purity". (Modifying existing encodings seems
>>         wrong -- did the feature request somehow transmogrify into that?)
>>     In any case you need to change your code. If we add a new error
>>     handler, you need to change the decoding code to use this error handler:
>>          text = data.decode(encoding, 'whatwgreplace')
>>     If we add new encodings, you need to maintain an alias table that maps
>>     standard encoding names to the corresponding WHATWG encoding names:
>>          aliases = {'windows_1252': 'windows-1252-whatwg',
>>                     'windows_1251': 'windows-1251-whatwg',
>>                     'utf_8': 'utf-8-whatwg', # utf-8 + surrogatepass
>>                     ...
>>                    }
>>          ...
>>          text = data.decode(
>>              aliases.get(normalize_encoding(encoding), encoding))
>>     I don't see an advantage of the second approach for the end user.
>>     And of course it is more costly for maintainers, because we will
>>     need to implement around 20 new encodings, and it adds a cognitive
>>     burden for new Python users, who will now have more tables of
>>     encodings in the documentation.
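
The alias-table lookup described above can be made concrete. This is only a
sketch: the '-whatwg' codec names are the hypothetical ones from the message
(no such codecs ship with Python), and whatwg_name() is a helper name invented
here. It only maps names and does not decode anything.

```python
from encodings import normalize_encoding

# Hypothetical target names from the message above; these codecs are
# not part of the standard library.
ALIASES = {
    'windows_1252': 'windows-1252-whatwg',
    'windows_1251': 'windows-1251-whatwg',
}

def whatwg_name(encoding):
    """Return the WHATWG variant name for an encoding label, if one is listed."""
    # normalize_encoding() turns runs of punctuation into '_' but keeps case,
    # so lowercase the result before the lookup.
    key = normalize_encoding(encoding).lower()
    return ALIASES.get(key, encoding)
```

A caller would then write data.decode(whatwg_name(label)), which only works
once codecs with those names have actually been registered.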
>> Hm. As a user, unless I run into problems with a specific encoding, I
>> never care about how many encodings we have, so I don't see how adding
>> extra encodings bothers those users who have no need for them.
> The codecs module documentation contains several tables of encodings:
> standard encodings, Python-specific text encodings, binary transforms and
> text transforms (a single one). This will add yet another large table. A
> user who is learning Python will need to learn how these encodings differ
> from the other encodings and how to use them correctly. A new user doesn't
> know what is important for them and what they can ignore until they need it
> (and how to know that they need it).
>> There's a reason to prefer new encoding names (maybe augmented with alias
>> table) over a new error handler: there are lots of places where encodings
>> are passed around via text files, Internet protocols, RPC calls, layers and
>> layers of function calls. Many of these treat the encoding as a string, not
>> as a (string, errorhandler) pair. So there may be situations where there is
>> no way in a given API to preserve the need for using a special error
>> handler, while the API would not have a problem preserving just the
>> encoding name.
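
To illustrate the "new encoding name" option, here is a minimal sketch of how a
separately named codec could be registered so that a plain string is enough to
select it. It assumes the WHATWG behavior for windows-1252 amounts to mapping
each byte that Python's cp1252 leaves undefined to the code point with the same
value; the name 'windows-1252-whatwg' comes from the thread, and the helper
names are made up for this example.

```python
import codecs
from encodings import cp1252

# Start from the stdlib table and fill each undefined slot ('\ufffe')
# with the code point equal to the byte value.
_DECODING_TABLE = ''.join(
    chr(i) if ch == '\ufffe' else ch
    for i, ch in enumerate(cp1252.decoding_table)
)
_ENCODING_TABLE = codecs.charmap_build(_DECODING_TABLE)

def _decode(data, errors='strict'):
    return codecs.charmap_decode(data, errors, _DECODING_TABLE)

def _encode(text, errors='strict'):
    return codecs.charmap_encode(text, errors, _ENCODING_TABLE)

def _search(name):
    # Older Python versions pass the name with hyphens, newer ones with
    # underscores, so accept both spellings.
    if name.replace('-', '_') == 'windows_1252_whatwg':
        return codecs.CodecInfo(
            name='windows-1252-whatwg', encode=_encode, decode=_decode)
    return None

codecs.register(_search)
```

With this in place, b'\x81'.decode('windows-1252-whatwg') succeeds without any
errors argument, which is the property the paragraph above relies on.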
> The encoding name that is passed around differs from the name of the new
> Python encoding. It is just 'windows-1252', not 'windows-1252-whatwg'. If we
> just change the existing encoding, this can break other code that expects
> the standard 'windows-1252'. Thus every time you need 'windows-1252-whatwg'
> instead of the 'windows-1252' passed with the text, you need to map encoding
> names. How does this differ from using a special error handler?
> One more problem is that we actually need two error handlers. WHATWG
> specifies two behaviors for unmapped codes outside of the C0-C1 range:
> replacing with a special character, or raising an error. These correspond
> to the standard Python handlers 'replace' and 'strict'. Thus we need to add
> either two new error handlers, 'whatwgreplace' and 'whatwgstrict', or *two*
> sets of new encodings (more than 70 encodings in total!).
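
A rough sketch of what the two error handlers might look like, using the
existing codecs.register_error() hook. The handler names come from the message
above; their exact behavior here is an assumption: undecodable bytes in the C1
range (0x80-0x9F) pass through as the code point with the same value, and any
other undecodable byte is either replaced with U+FFFD or reported as an error.
Only decoding is handled.

```python
import codecs

def _c1_passthrough(exc):
    """Return the C1 pass-through text for the failing bytes, or None."""
    bad = exc.object[exc.start:exc.end]
    if all(0x80 <= b <= 0x9F for b in bad):
        return ''.join(map(chr, bad))
    return None

def whatwg_replace(exc):
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    text = _c1_passthrough(exc)
    # Outside C1: substitute U+FFFD, like the standard 'replace' handler.
    return (text if text is not None else '\ufffd', exc.end)

def whatwg_strict(exc):
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    text = _c1_passthrough(exc)
    if text is None:
        # Outside C1: fail, like the standard 'strict' handler.
        raise exc
    return (text, exc.end)

codecs.register_error('whatwgreplace', whatwg_replace)
codecs.register_error('whatwgstrict', whatwg_strict)
```

For example, in Python's cp1253 the byte 0x81 (inside C1) and the byte 0xAA
(outside C1) are both undefined, so data.decode('cp1253', 'whatwgreplace')
keeps the first as U+0081 and turns the second into U+FFFD, while
'whatwgstrict' raises UnicodeDecodeError on the second.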

--Guido van Rossum (python.org/~guido)