<div dir="ltr">OK, I am no longer interested in this topic. If you can't reach agreement, so be it, and then the status quo prevails. I am going to mute this thread. There's no need to explain to me why I am wrong.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jan 31, 2018 at 9:48 AM, Serhiy Storchaka <span dir="ltr"><<a href="mailto:storchaka@gmail.com" target="_blank">storchaka@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">31.01.18 18:36, Guido van Rossum wrote:<div><div class="h5"><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka <<a href="mailto:storchaka@gmail.com" target="_blank">storchaka@gmail.com</a>> wrote:<br>

<br>

&nbsp; &nbsp; 19.01.18 05:51, Guido van Rossum wrote:<br>

<br>

&nbsp; &nbsp; &nbsp; &nbsp; Can someone explain to me why this is such a controversial issue?<br>

<br>

&nbsp; &nbsp; &nbsp; &nbsp; It seems reasonable to me to add new encodings to the stdlib<br>

&nbsp; &nbsp; &nbsp; &nbsp; that do the roundtripping requested in the first message of the<br>

&nbsp; &nbsp; &nbsp; &nbsp; thread. As long as they have new names that seems to fall under<br>

&nbsp; &nbsp; &nbsp; &nbsp; "practicality beats purity". (Modifying existing encodings seems<br>

&nbsp; &nbsp; &nbsp; &nbsp; wrong -- did the feature request somehow transmogrify into that?)<br>

<br>

<br>

&nbsp; &nbsp; In any case you need to change your code. If we add a new error<br>

&nbsp; &nbsp; handler, you need to change the decoding code to use it:<br>

<br>

&nbsp; &nbsp; &nbsp; &nbsp; text = data.decode(encoding, 'whatwgreplace')<br>

<br>

&nbsp; &nbsp; If we add new encodings, you need to maintain an alias table that maps<br>

&nbsp; &nbsp; standard encoding names to the corresponding WHATWG encoding names:<br>

<br>

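&nbsp; &nbsp; &nbsp; &nbsp; aliases = {'windows_1252': 'windows-1252-whatwg',<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'windows_1251': 'windows-1251-whatwg',<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;'utf_8': 'utf-8-whatwg', # utf-8 + surrogatepass<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;...<br>

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }<br>

&nbsp; &nbsp; &nbsp; &nbsp; ...<br>

&nbsp; &nbsp; &nbsp; &nbsp; text = data.decode(aliases.get(normalize_encoding(encoding), encoding))<br>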

<br>

&nbsp; &nbsp; I don't see an advantage of the second approach for the end user.<br>

&nbsp; &nbsp; And of course it is more costly for maintainers, because we will<br>

&nbsp; &nbsp; need to implement around 20 new encodings, and it adds a cognitive<br>

&nbsp; &nbsp; burden for new Python users, who now have more tables of encodings<br>

&nbsp; &nbsp; in the documentation.<br>

<br>

<br>

Hm. As a user, unless I run into problems with a specific encoding, I never care about how many encodings we have, so I don't see how adding extra encodings bothers those users who have no need for them.<br>

</blockquote>

<br></div></div>

The codecs module documentation already contains several tables of encodings: standard encodings, Python-specific text encodings, binary transforms and text transforms (a single one). This will add yet another large table. Users learning Python will need to learn how these encodings differ from the other encodings and how to use them correctly. A new user doesn't know what is important and what can be ignored until it is needed (or how to tell when it is needed).<span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

There's a reason to prefer new encoding names (maybe augmented with alias table) over a new error handler: there are lots of places where encodings are passed around via text files, Internet protocols, RPC calls, layers and layers of function calls. Many of these treat the encoding as a string, not as a (string, errorhandler) pair. So there may be situations where there is no way in a given API to preserve the need for using a special error handler, while the API would not have a problem preserving just the encoding name.<br>

</blockquote>

<br></span>

The encoding name that gets passed around differs from the name of the new Python encoding: it is just 'windows-1252', not 'windows-1252-whatwg'. If we simply change the existing encoding, this can break other code that expects the standard 'windows-1252'. Thus every time you need 'windows-1252-whatwg' instead of the 'windows-1252' passed along with the text, you need to map encoding names. How does this differ from using a special error handler?<br>
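<br>
To make the comparison concrete, here is a minimal sketch of the lookup step under both approaches. The table contents and the helper names pick_codec and pick_errors are illustrative assumptions, and neither the '-whatwg' codecs nor the 'whatwgreplace' handler exist in the stdlib today.<br>
<pre>
```python
import encodings

# Hypothetical alias table for the "new encodings" approach.
# The '-whatwg' codec names are assumptions, not existing codecs.
WHATWG_CODECS = {
    'windows_1252': 'windows-1252-whatwg',
    'windows_1251': 'windows-1251-whatwg',
}

# Hypothetical set of labels for the "new error handler" approach.
WHATWG_LABELS = frozenset(WHATWG_CODECS)


def _key(label):
    # normalize_encoding() turns 'windows-1252' into 'windows_1252';
    # it does not fold case, so lower() is applied as well.
    return encodings.normalize_encoding(label).lower()


def pick_codec(label):
    """New-encodings approach: map the received label to a codec name."""
    return WHATWG_CODECS.get(_key(label), label)


def pick_errors(label, default='strict'):
    """Error-handler approach: map the received label to a handler name."""
    return 'whatwgreplace' if _key(label) in WHATWG_LABELS else default
```
</pre>
Either way the caller consults a table keyed by the label that arrived with the text; only the value looked up differs.<br>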

<br>

Another problem is that we actually need two error handlers. WHATWG specifies two behaviors for unmapped codes outside of the C0-C1 range: replacing with a special character, or raising an error. These correspond to the standard Python handlers 'replace' and 'strict'. Thus we need to add either two new error handlers, 'whatwgreplace' and 'whatwgstrict', or *two* sets of new encodings (more than 70 encodings in total!).<div class="HOEnZb"><div class="h5"><br>
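<br>
A minimal sketch of such a handler pair, built on codecs.register_error. It assumes the WHATWG rule can be approximated as follows: an undecodable byte in the C1 range 0x80-0x9F becomes the control character with the same code point, and any other undecodable byte is either replaced with U+FFFD or rejected. Only the two handler names come from this message; the helper _make_handler and the rest of the implementation are an illustration, and it covers decoding only.<br>
<pre>
```python
import codecs

C1_RANGE = range(0x80, 0xA0)  # bytes 0x80..0x9F


def _make_handler(strict):
    """Build a decode error handler (hypothetical helper for this sketch)."""
    def handler(exc):
        # This sketch only covers decoding errors.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        chars = []
        for byte in exc.object[exc.start:exc.end]:
            if byte in C1_RANGE:
                chars.append(chr(byte))   # pass through as a C1 control
            elif strict:
                raise exc                 # behave like 'strict'
            else:
                chars.append('\ufffd')    # behave like 'replace'
        return ''.join(chars), exc.end    # resume after the bad bytes
    return handler


codecs.register_error('whatwgreplace', _make_handler(strict=False))
codecs.register_error('whatwgstrict', _make_handler(strict=True))

# cp1252 leaves byte 0x81 undefined; 0x93 is an ordinary mapped byte.
text = b'\x81\x93'.decode('cp1252', 'whatwgreplace')
```
</pre>
The two handlers differ only in how they treat undecodable bytes outside the C1 range, which mirrors the 'replace' versus 'strict' split described above.<br>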

<br>

______________________________<wbr>_________________<br>

Python-ideas mailing list<br>

<a href="mailto:Python-ideas@python.org" target="_blank">Python-ideas@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/python-ideas" rel="noreferrer" target="_blank">https://mail.python.org/mailma<wbr>n/listinfo/python-ideas</a><br>

Code of Conduct: <a href="http://python.org/psf/codeofconduct/" rel="noreferrer" target="_blank">http://python.org/psf/codeofco<wbr>nduct/</a><br>

</div></div></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">--Guido van Rossum (<a href="http://python.org/~guido" target="_blank">python.org/~guido</a>)</div>

</div>