[Python-ideas] Support WHATWG versions of legacy encodings

M.-A. Lemburg mal at egenix.com
Fri Jan 19 13:13:56 EST 2018

On 19.01.2018 18:12, Rob Speer wrote:
> Error handlers are quite orthogonal to this problem. If you try to solve
> this problem with an error handler, you will have a different problem.
> Suppose you made "c1-control-passthrough" or whatever into an error
> handler, similar to "replace" or "ignore", and then you encounter an
> unassigned character that's *not* in the range 0x80 to 0x9f. (Many
> encodings have these.) Do you replace it? Do you ignore it? You don't
> know because you just replaced the error handler with something that's
> not about error handling.

It depends on what you want to achieve. You may want to fail,
assign a code point from a private area or use a surrogate
escape approach. Based on the context it may also make sense
to escape the input data using a different syntax, e.g.
XML escapes, backslash notations, HTML numeric entities, etc.
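
For illustration, most of these options map onto error handlers that
already ship with Python. The sketch below is not from the original
message; the sample bytes are made up, and it relies on byte 0x81 being
unassigned in Python's cp1252 codec.

```python
# Byte 0x81 is unassigned in Python's cp1252 codec.
raw = b"caf\xe9 \x81"

# Fail: the default "strict" handler raises UnicodeDecodeError.
try:
    raw.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Surrogate escape: the undecodable byte becomes the lone surrogate U+DC81
# and is turned back into the same byte when encoding.
escaped = raw.decode("cp1252", "surrogateescape")
roundtrip = escaped.encode("cp1252", "surrogateescape")

# Backslash notation for the offending byte (decoding support: Python 3.5+).
backslashed = raw.decode("cp1252", "backslashreplace")

# XML / HTML numeric character references, available when encoding.
xml_escaped = "price: 5 \u20ac".encode("ascii", "xmlcharrefreplace")

print(strict_failed, ascii(escaped), roundtrip, backslashed, xml_escaped)
```

A private-use-area mapping has no built-in handler; it would need a
custom one registered with codecs.register_error().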

You could also add a "latin1replace" error handler which
simply passes through everything that's undefined as-is.
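
A minimal sketch of what such a handler might look like for decoding,
assuming "pass through as-is" means mapping each undefined byte to the
code point with the same value, as Latin-1 does. The function body is
an illustration, not an existing stdlib handler.

```python
import codecs


def latin1replace(exc):
    """Hypothetical handler: decode undefined bytes as if they were Latin-1."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Return the replacement text and the position to resume decoding at.
        return bad.decode("latin-1"), exc.end
    # Encoding and translating errors are not handled in this sketch.
    raise exc


codecs.register_error("latin1replace", latin1replace)

# 0x80 is defined in cp1252 (EURO SIGN); 0x81 is not and is passed through.
text = b"\x80 \x81".decode("cp1252", "latin1replace")
print(ascii(text))
```

Because the handler is only called for bytes the codec cannot map, the
defined cp1252 characters still decode normally.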

The Unicode error handlers are pretty flexible when it comes
to providing a solution:


You can even have the handler "patch" an encoding, since
it also gets the encoding name as input.
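
As a rough sketch of that idea, the handler can inspect the "encoding"
attribute of the exception it receives and only step in for selected
codecs. The handler name and the choice of "ascii" below are
assumptions made for illustration.

```python
import codecs


def patch_ascii_only(exc):
    """Hypothetical handler: pass high bytes through, but only for 'ascii'."""
    if isinstance(exc, UnicodeDecodeError) and exc.encoding == "ascii":
        return exc.object[exc.start:exc.end].decode("latin-1"), exc.end
    # Any other codec keeps its strict behaviour.
    raise exc


codecs.register_error("patch-ascii-only", patch_ascii_only)

patched = b"caf\xe9".decode("ascii", "patch-ascii-only")

try:
    b"caf\xe9".decode("utf-8", "patch-ascii-only")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Caveat: the name is whatever the codec implementation reports. The
# charmap-based stdlib codecs such as cp1252 report "charmap" here.
try:
    b"\x81".decode("cp1252")
    reported = None
except UnicodeDecodeError as err:
    reported = err.encoding

print(patched, utf8_failed, reported)
```

A handler meant to patch one particular charmap codec would therefore
need some other way to tell those codecs apart.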

You could probably create an error handler which rolls
most of the WHATWG workarounds into a single "whatwg" handler.

> I will also repeat that having these encodings (in both directions) will
> provide more ways for Python to *reduce* the amount of mojibake that
> exists. If acknowledging that mojibake exists offends your sense of
> purity, and you'd rather just destroy all mojibake at the source...
> that's great, and please get back to me after you've fixed Microsoft Excel.

I acknowledge that we have different views on this :-)

Note that I'm not saying that the encodings are a bad idea,
or should not be used.

I just don't want to have people start using "web-1252" as
encoding simply because they are writing out text for
a web application - they should use "utf-8" instead.

The extra hurdle to pip-install a package for this feels
like the right way to turn this into a more conscious
decision and who knows... perhaps it'll even help fix Excel
once they have decided on including Python as scripting


> I hope to make a pull request shortly that implements these mappings as
> new encodings that work just like the other ones.
> On Fri, 19 Jan 2018 at 11:54 M.-A. Lemburg <mal at egenix.com> wrote:
>     On 19.01.2018 17:20, Guido van Rossum wrote:
>     > On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg <mal at egenix.com> wrote:
>     >
>     >     On 19.01.2018 05:38, Nathaniel Smith wrote:
>     >     > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum <guido at python.org> wrote:
>     >     >> Can someone explain to me why this is such a controversial
>     >     >> issue?
>     >     >
>     >     > I guess practicality versus purity is always controversial :-)
>     >     >
>     >     >> It seems reasonable to me to add new encodings to the stdlib that do the
>     >     >> roundtripping requested in the first message of the thread. As long as they
>     >     >> have new names that seems to fall under "practicality beats purity".
>     >
>     >     There are a few issues here:
>     >
>     >     * WHATWG encodings are mostly for decoding content in order to
>     >       show it in the browser, accepting broken encoding data.
>     >
>     >
>     > And sometimes Python apps that pull data from the web.
>     >  
>     >
>     >       Python already has support for this by using one of the available
>     >       error handlers, or adding new ones to suit the needs.
>     >
>     >
>     > This seems cumbersome though.
>     Why is that ?
>     Python 3 uses such error handlers for most of the I/O that's done
>     with the OS already and for very similar reasons: dealing with
>     broken data or broken configurations.
>     >       If we'd add the encodings, people will start creating more
>     >       broken data, since this is what the WHATWG codecs output
>     >       when encoding Unicode.
>     >
>     >
>     > That's FUD. Only apps that specifically use the new WHATWG encodings
>     > would be able to consume that data. And surely the practice of web
>     > browsers will have a much bigger effect than Python's choice.
>     It's not FUD. I don't think we ought to encourage having
>     Python create more broken data. The purpose of the WHATWG
>     encodings is to help browsers deal with decoding broken
>     data in a uniform way. It's not to generate more such data.
>     That may be regarded as a purist's view, but it also has a very
>     practical meaning. The output of the codecs will only be readable
>     by browsers implementing the WHATWG encodings. Other tools
>     receiving the data will run into the same decoding problems.
>     Once you have Unicode, it's better to stay there and use
>     UTF-8 for encoding to avoid any such issues.
>     >       As discussed, this could be addressed by making the WHATWG
>     >       codecs decode-only.
>     >
>     >
>     > But that would defeat the point of roundtripping, right?
>     Yes, intentionally. Once you have Unicode, the data should
>     be encoded correctly back into UTF-8 or whatever legacy encoding
>     is needed, fixing any issues while in Unicode.
>     As always, it's better to explicitly address such problems than
>     to simply punt on them and write back broken data.
>     >     * The use case seems limited to implementing browsers or headless
>     >       implementations working like browsers.
>     >
>     >       That's not really general enough to warrant adding lots of
>     >       new codecs to the stdlib. A PyPI package is better suited
>     >       for this.
>     >
>     >
>     > Perhaps, but such a package already exists and its author (who surely
>     > has read a lot of bug reports from its users) says that this is cumbersome.
>     The only critique I read was that registering the codecs
>     is not explicit enough, but that's really only a nit, since
>     you can easily have the codec package expose a register
>     function which you then call explicitly in the code using
>     the codecs.
>     >     * The WHATWG codecs do not only cover simple mapping codecs,
>     >       but also many multi-byte ones for e.g. Asian languages.
>     >
>     >       I doubt that we'd want to maintain such codecs in the stdlib,
>     >       since this will increase the download sizes of the installers
>     >       and also require people knowledgeable about these variants
>     >       to work on them and fix any issues.
>     >
>     >
>     > Really? Why is adding a bunch of codecs so much effort? Surely the
>     > translation tables contain data that compresses well? And surely we
>     > don't need a separate dedicated piece of C code for each new codec?
>     For the simple charmap style codecs that's true. Not so for the
>     Asian ones and the latter also do require dedicated C code (see
>     Modules/cjkcodecs).
>     >     Overall, I think either pointing people to error handlers
>     >     or perhaps adding a new one specifically for the case of
>     >     dealing with control character mappings would provide a better
>     >     maintenance / usefulness ratio than adding lots of new
>     >     legacy codecs to the stdlib.
>     >
>     >
>     > Wouldn't error handlers be much slower? And to me it seems a new error
>     > handler is a much *bigger* deal than some new encodings -- error
>     > handlers must work for *all* encodings.
>     Error handlers have a standard interface and so they will work
>     for all codecs. Some codecs limit the number of handlers that
>     can be used, but most accept all registered handlers.
>     If a handler is too slow in Python, it can be coded in C for
>     speed.
>     >     BTW: WHATWG pushes for always using UTF-8 as far as I can tell
>     >     from their website.
>     >
>     >
>     > As does Python. But apparently it will take decades more to get there.
>     Yes indeed, so let's not add even more confusion by adding more
>     variants of the legacy encodings.
>     --
>     Marc-Andre Lemburg
>     eGenix.com

Marc-Andre Lemburg

Professional Python Services directly from the Experts (#1, Jan 19 2018)
>>> Python Projects, Coaching and Consulting ...  http://www.egenix.com/
>>> Python Database Interfaces ...           http://products.egenix.com/
>>> Plone/Zope Database Interfaces ...           http://zope.egenix.com/

::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
