
On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg <mal@egenix.com> wrote:
> On 19.01.2018 05:38, Nathaniel Smith wrote:
>> On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum <guido@python.org> wrote:
>>> Can someone explain to me why this is such a controversial issue?
>>> It seems reasonable to me to add new encodings to the stdlib that do the roundtripping requested in the first message of the thread. As long as they have new names, that seems to fall under "practicality beats purity".
> I guess practicality versus purity is always controversial :-)
> There are a few issues here:
>
> * WHATWG encodings are mostly for decoding content in order to show it in the browser, accepting broken encoding data.
And sometimes Python apps that pull data from the web.
> Python already has support for this by using one of the available error handlers, or adding new ones to suit the needs.
This seems cumbersome though.
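For concreteness, the existing machinery being referred to does already round-trip arbitrary bytes, e.g. via the stdlib's built-in surrogateescape handler (a minimal sketch):

```python
# The stdlib's surrogateescape handler smuggles un-decodable bytes
# through as lone surrogates, so decode -> encode round-trips exactly.
raw = b'caf\xe9 \x81\x90'  # cp1252 text plus two bytes cp1252 leaves undefined
text = raw.decode('cp1252', errors='surrogateescape')
assert text == 'caf\xe9 \udc81\udc90'
assert text.encode('cp1252', errors='surrogateescape') == raw
```

The catch is that the surrogates are not the C1 control characters the WHATWG spec prescribes, so the decoded text differs from what a browser would produce for the same input.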
> If we'd add the encodings, people will start creating more broken data, since this is what the WHATWG codecs output when encoding Unicode.
That's FUD. Only apps that specifically use the new WHATWG encodings would be able to consume that data. And surely the practice of web browsers will have a much bigger effect than Python's choice.
> As discussed, this could be addressed by making the WHATWG codecs decode-only.
But that would defeat the point of roundtripping, right?
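To make the round-tripping point concrete, here is a sketch of a custom error handler (the name `c1passthrough` is made up for illustration) that passes cp1252's five undefined bytes through as the corresponding C1 control characters, as WHATWG's windows-1252 does, in both directions:

```python
import codecs

def c1_passthrough(exc):
    # Decode direction: turn each unmapped byte (0x81, 0x8D, 0x8F,
    # 0x90, 0x9D in cp1252) into the C1 control char of the same value.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(b) for b in bad), exc.end
    # Encode direction: map those C1 controls back to single bytes.
    if isinstance(exc, UnicodeEncodeError):
        chars = exc.object[exc.start:exc.end]
        if all(ord(c) < 0x100 for c in chars):
            return bytes(ord(c) for c in chars), exc.end
    raise exc

codecs.register_error('c1passthrough', c1_passthrough)

raw = b'\x80\x81\x9d'
text = raw.decode('cp1252', 'c1passthrough')   # '\u20ac\x81\x9d'
assert text.encode('cp1252', 'c1passthrough') == raw
```

This round-trips, but only for code that opts in by naming the handler on every decode and encode call, which is presumably what "cumbersome" refers to above.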
> * The use case seems limited to implementing browsers or headless implementations working like browsers.
>
> That's not really general enough to warrant adding lots of new codecs to the stdlib. A PyPI package is better suited for this.
Perhaps, but such a package already exists and its author (who surely has read a lot of bug reports from its users) says that this is cumbersome.
> * The WHATWG codecs do not only cover simple mapping codecs, but also many multi-byte ones for e.g. Asian languages.
>
> I doubt that we'd want to maintain such codecs in the stdlib, since this will increase the download sizes of the installers and also require people knowledgeable about these variants to work on them and fix any issues.
Really? Why is adding a bunch of codecs so much effort? Surely the translation tables contain data that compresses well? And surely we don't need a separate dedicated piece of C code for each new codec?
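On the "no dedicated C code" point: for the single-byte WHATWG codecs at least, a variant really is just a 256-entry table fed to the existing charmap machinery. A sketch (assuming WHATWG's windows-1252, which fills cp1252's five undefined slots with C1 controls):

```python
import codecs
import encodings.cp1252 as cp1252

# Copy cp1252's decode table and fill the five undefined positions
# with the corresponding C1 control characters, per WHATWG windows-1252.
table = list(cp1252.decoding_table)
for b in (0x81, 0x8D, 0x8F, 0x90, 0x9D):
    table[b] = chr(b)
table = ''.join(table)

text, consumed = codecs.charmap_decode(b'\x80\x81', 'strict', table)
assert text == '\u20ac\x81' and consumed == 2
```

Encoding support would additionally need the matching encoding table (the stdlib encodings modules build theirs with `codecs.charmap_build`), but the data-not-code argument stands for this family; the multi-byte East Asian variants are the harder maintenance case.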
> Overall, I think either pointing people to error handlers or perhaps adding a new one specifically for the case of dealing with control character mappings would provide a better maintenance / usefulness ratio than adding lots of new legacy codecs to the stdlib.
Wouldn't error handlers be much slower? And to me it seems a new error handler is a much *bigger* deal than some new encodings -- error handlers must work for *all* encodings.
> BTW: WHATWG pushes for always using UTF-8 as far as I can tell from their website.
As does Python. But apparently it will take decades more to get there.
>>> (Modifying existing encodings seems wrong -- did the feature request somehow transmogrify into that?)
>> Someone did discover that Microsoft's current implementations of the windows-* encodings match the WHAT-WG spec, rather than the Unicode spec that Microsoft originally wrote.
> No, MS implements something called "best fit encodings" and these are different from what WHATWG uses.
>
> Unlike the WHATWG encodings, these are documented as vendor encodings on the Unicode site, which is what we normally use as reference for our stdlib codecs.
>
> However, whether these are actually a good idea is open to discussion as well, since they sometimes go a bit far with "best fit", e.g. mapping the infinity symbol to 8.
>
> Again, using the error handlers we have for dealing with situations which require non-standard encoding behavior is the better approach:
>
> https://docs.python.org/3.7/library/codecs.html#error-handlers
>
> Adding new ones is possible as well.
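As a quick check on the best-fit point: Python's own cp1252 codec applies no such mapping; encoding the infinity sign simply fails unless an error handler makes the substitution policy explicit (a minimal illustration):

```python
# Python's cp1252 does not "best fit" U+221E (infinity) to the digit 8;
# a strict encode raises UnicodeEncodeError instead.
try:
    '\u221e'.encode('cp1252')
    best_fit_applied = True
except UnicodeEncodeError:
    best_fit_applied = False

assert best_fit_applied is False
# Any substitution has to be opted into via an error handler:
assert '\u221e'.encode('cp1252', 'replace') == b'?'
```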
>> So there is some argument that Python's existing encodings are simply out of date, and changing them would be a bugfix. (And standards aside, it is surely going to be somewhat error-prone if Python's windows-1252 doesn't match everyone else's implementations of windows-1252.) But yeah, AFAICT the original requesters would be happy either way; they just want it available under some name.
> The encodings are not out of date. I don't know where you got that impression from.
> The Windows API WideCharToMultiByte which was quoted in the discussion:
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>
> unfortunately uses the above-mentioned best fit encodings, but this can and should be switched off by specifying the WC_NO_BEST_FIT_CHARS flag for anything that requires validation or needs to be interoperable:
>
> """
> For strings that require validation, such as file, resource, and user names, the application should always use the WC_NO_BEST_FIT_CHARS flag. This flag prevents the function from mapping characters to characters that appear similar but have very different semantics. In some cases, the semantic change can be extreme. For example, the symbol for "∞" (infinity) maps to 8 (eight) in some code pages.
> """
> --
> Marc-Andre Lemburg
> eGenix.com
>
> Professional Python Services directly from the Experts (#1, Jan 19 2018)
> Python Projects, Coaching and Consulting ... http://www.egenix.com/
> Python Database Interfaces ... http://products.egenix.com/
> Plone/Zope Database Interfaces ... http://zope.egenix.com/
> ::: We implement business ideas - efficiently in both time and costs :::
>
> eGenix.com Software, Skills and Services GmbH, Pastor-Loeh-Str. 48,
> D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg.
> Registered at Amtsgericht Duesseldorf: HRB 46611
> http://www.egenix.com/company/contact/
> http://www.malemburg.com/
--
--Guido van Rossum (python.org/~guido)