[Python-ideas] Support WHATWG versions of legacy encodings
Guido van Rossum
guido at python.org
Wed Jan 31 14:23:42 EST 2018
OK, I am no longer interested in this topic. If you can't reach agreement,
so be it, and then the status quo prevails. I am going to mute this thread.
There's no need to explain to me why I am wrong.
On Wed, Jan 31, 2018 at 9:48 AM, Serhiy Storchaka <storchaka at gmail.com>
wrote:
> 31.01.18 18:36, Guido van Rossum wrote:
>
> On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka <storchaka at gmail.com>
>> wrote:
>>
>> 19.01.18 05:51, Guido van Rossum wrote:
>>
>> Can someone explain to me why this is such a controversial issue?
>>
>> It seems reasonable to me to add new encodings to the stdlib
>> that do the roundtripping requested in the first message of the
>> thread. As long as they have new names that seems to fall under
>> "practicality beats purity". (Modifying existing encodings seems
>> wrong -- did the feature request somehow transmogrify into that?)
>>
>>
>> In any case you need to change your code. If we add a new error handler,
>> you need to change the decoding code to use this error handler:
>>
>> text = data.decode(encoding, 'whatwgreplace')
>>
>> If we add new encodings, you need to maintain an alias table that maps
>> standard encoding names to the corresponding WHATWG encoding names:
>>
>> aliases = {'windows_1252': 'windows-1252-whatwg',
>>            'windows_1251': 'windows-1251-whatwg',
>>            'utf_8': 'utf-8-whatwg',  # utf-8 + surrogatepass
>>            ...
>>           }
>> ...
>> text = data.decode(aliases.get(normalize_encoding(encoding),
>>                                encoding))
>>
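For illustration only, the "new encodings" approach might look roughly like the sketch below. It builds a hypothetical codec named `windows_1252_whatwg_sketch` (not a stdlib codec, and not any implementation proposed in this thread) by filling the bytes that Python's cp1252 leaves undefined, such as 0x81, with the code point of the same value. That gives the byte-for-byte round trip requested in the first message of the thread. The alias table contents and the helper names are assumptions.

```python
import codecs
from encodings import cp1252, normalize_encoding

SKETCH_NAME = 'windows_1252_whatwg_sketch'  # hypothetical name, not in the stdlib

# cp1252.decoding_table marks undefined bytes with U+FFFE; map each of those
# to the code point with the same value as the byte (0x81 -> U+0081, ...).
_decoding_table = ''.join(
    chr(byte) if char == '\ufffe' else char
    for byte, char in enumerate(cp1252.decoding_table)
)
_encoding_table = codecs.charmap_build(_decoding_table)


def _decode(data, errors='strict'):
    return codecs.charmap_decode(data, errors, _decoding_table)


def _encode(text, errors='strict'):
    return codecs.charmap_encode(text, errors, _encoding_table)


def _search(name):
    # The normalization applied by codecs.lookup() varies between Python
    # versions, so normalize again before comparing.
    if normalize_encoding(name).lower() == SKETCH_NAME:
        return codecs.CodecInfo(_encode, _decode, name=SKETCH_NAME)
    return None


codecs.register(_search)

# Alias table from standard names to the sketch codec (assumed contents).
ALIASES = {'windows_1252': SKETCH_NAME, 'cp1252': SKETCH_NAME}


def decode_whatwg(data, encoding):
    """Decode data, swapping in the sketch codec when an alias exists."""
    key = normalize_encoding(encoding).lower()
    return data.decode(ALIASES.get(key, encoding))


every_byte = bytes(range(256))
text = decode_whatwg(every_byte, 'windows-1252')
round_trips = text.encode(SKETCH_NAME) == every_byte
print(round_trips)
```

Note that the caller still passes the standard label 'windows-1252'; the alias lookup is what selects the new codec.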
>> I don't see an advantage of the second approach for the end user.
>> And of course it is more costly for maintainers, because we will
>> need to implement around 20 new encodings, and it adds a cognitive
>> burden for new Python users, who will now have more tables of
>> encodings in the documentation.
>>
>>
>> Hm. As a user, unless I run into problems with a specific encoding, I
>> never care about how many encodings we have, so I don't see how adding
>> extra encodings bothers those users who have no need for them.
>>
>
> The codecs module documentation contains several tables of encodings:
> standard encodings, Python-specific text encodings, binary transforms and
> text transforms (a single one). This will add one more large table. A user
> who is learning Python will need to learn how these encodings differ from
> the other encodings and how to use them correctly. A new user doesn't know
> what is important for him and what he can ignore until he needs it (or
> how to tell that he needs it).
>
> There's a reason to prefer new encoding names (maybe augmented with alias
>> table) over a new error handler: there are lots of places where encodings
>> are passed around via text files, Internet protocols, RPC calls, layers and
>> layers of function calls. Many of these treat the encoding as a string, not
>> as a (string, errorhandler) pair. So there may be situations where there is
>> no way in a given API to preserve the need for using a special error
>> handler, while the API would not have a problem preserving just the
>> encoding name.
>>
>
> The encoding name that is passed differs from the name of the new Python
> encoding. It is just 'windows-1252', not 'windows-1252-whatwg'. If we just
> change the existing encoding, this can break other code that expects the
> standard 'windows-1252'. Thus every time you need 'windows-1252-whatwg'
> instead of the 'windows-1252' passed with the text, you need to map encoding
> names. How does this differ from using a special error handler?
>
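To make the comparison concrete, here is a minimal sketch of the same mapping step when the extra behavior lives in an error handler rather than in a codec name. It uses only the built-in 'surrogatepass' handler, following the "utf-8 + surrogatepass" pairing from the alias table quoted earlier; the table name and the helper function are hypothetical.

```python
from encodings import normalize_encoding

# Maps a normalized standard label to the error handler to use with it.
# Assumed contents; only the utf-8 pairing comes from the thread.
ERRORS_FOR_LABEL = {'utf_8': 'surrogatepass'}


def decode_with_label(data, label):
    """Decode data using the handler registered for its encoding label."""
    key = normalize_encoding(label).lower()
    return data.decode(label, ERRORS_FOR_LABEL.get(key, 'strict'))


# A lone surrogate (U+D800) in UTF-8-style bytes, rejected by plain 'strict'.
lone_surrogate = b'\xed\xa0\x80'
decoded = decode_with_label(lone_surrogate, 'UTF-8')
print(decoded == '\ud800')
```

The lookup is the same size as an alias table of codec names. What changes is that the calling code must be able to pass the second argument through, which is the limitation raised in the quoted paragraph above.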
> One more problem is that we actually need two error handlers. WHATWG
> specifies two behaviors for unmapped codes outside of the C0-C1 range:
> replacing with a special character, or raising an error. These correspond to
> the standard Python handlers 'replace' and 'strict'. Thus we need to either
> add two new error handlers, 'whatwgreplace' and 'whatwgstrict', or add *two*
> sets of new encodings (more than 70 encodings in total!).
>
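As a rough sketch of the error-handler option, the two handlers could share one factory that wraps the built-in 'replace' and 'strict' handlers. The handler names below carry a `_sketch` suffix to mark them as hypothetical. The rule used here (an unmapped byte in 0x80-0x9F decodes to the code point of the same value, and anything else falls through to the wrapped handler) is an assumption based on the description above, not a reading of the WHATWG specification.

```python
import codecs


def make_whatwg_handler(fallback_name):
    """Return a decode error handler that passes C1-range bytes through."""
    fallback = codecs.lookup_error(fallback_name)

    def handler(exc):
        if isinstance(exc, UnicodeDecodeError) and exc.end - exc.start == 1:
            byte = exc.object[exc.start]
            if 0x80 <= byte <= 0x9F:
                # Map the unmapped byte to the same-valued code point.
                return chr(byte), exc.end
        # Everything else keeps the wrapped handler's behavior.
        return fallback(exc)

    return handler


codecs.register_error('whatwgreplace_sketch', make_whatwg_handler('replace'))
codecs.register_error('whatwgstrict_sketch', make_whatwg_handler('strict'))

# 0x81 is undefined in Python's cp1252, so the handler is consulted for it.
passed_through = b'a\x81b'.decode('windows-1252', 'whatwgstrict_sketch')
print(passed_through == 'a\x81b')
```

With this shape the existing codecs stay untouched, but every call site has to name the handler explicitly.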
--
--Guido van Rossum (python.org/~guido)