[Python-ideas] Support WHATWG versions of legacy encodings

Guido van Rossum guido at python.org
Fri Jan 19 12:24:27 EST 2018


OK, I will tune out this conversation. It is clearly not going anywhere.

On Fri, Jan 19, 2018 at 9:12 AM, Rob Speer <rspeer at luminoso.com> wrote:

> Error handlers are quite orthogonal to this problem. If you try to solve
> this problem with an error handler, you will have a different problem.
>
> Suppose you made "c1-control-passthrough" or whatever into an error
> handler, similar to "replace" or "ignore", and then you encounter an
> unassigned character that's *not* in the range 0x80 to 0x9f. (Many
> encodings have these.) Do you replace it? Do you ignore it? You don't know
> because you just replaced the error handler with something that's not about
> error handling.
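
To make the objection above concrete, here is a minimal sketch of what such a handler could look like. The handler name is the placeholder used in the message above, and the function `c1_control_passthrough` is an assumption made up for this illustration, not an existing or proposed stdlib API. It passes undecodable bytes in the range 0x80 to 0x9f through as the matching C1 control characters; for any other undecodable byte it has nothing sensible to do, so this sketch simply re-raises.

```python
import codecs


def c1_control_passthrough(exc):
    """Hypothetical handler: map undecodable bytes 0x80-0x9f to U+0080-U+009F."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    if all(0x80 <= b <= 0x9F for b in bad):
        # Resume decoding right after the bytes that were passed through.
        return "".join(chr(b) for b in bad), exc.end
    # Outside the C1 range there is no policy left to fall back on:
    # the errors= slot is already taken by this handler.
    raise exc


codecs.register_error("c1-control-passthrough", c1_control_passthrough)

# Byte 0x81 is unassigned in Python's cp1252 and is passed through.
passed_through = b"caf\xe9 \x81".decode("cp1252", errors="c1-control-passthrough")

# Byte 0xd2 is unassigned in cp1253 but lies outside 0x80-0x9f.
try:
    b"\xd2".decode("cp1253", errors="c1-control-passthrough")
    still_fails = False
except UnicodeDecodeError:
    still_fails = True
```

Because a call accepts only one `errors=` value, choosing this handler means giving up "replace" or "ignore" for the remaining unassigned bytes, which is the conflict described above.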
>
> I will also repeat that having these encodings (in both directions) will
> provide more ways for Python to *reduce* the amount of mojibake that
> exists. If acknowledging that mojibake exists offends your sense of purity,
> and you'd rather just destroy all mojibake at the source... that's great,
> and please get back to me after you've fixed Microsoft Excel.
>
> I hope to make a pull request shortly that implements these mappings as
> new encodings that work just like the other ones.
>
> On Fri, 19 Jan 2018 at 11:54 M.-A. Lemburg <mal at egenix.com> wrote:
>
>> On 19.01.2018 17:20, Guido van Rossum wrote:
>> > On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg <mal at egenix.com> wrote:
>> >
>> >     On 19.01.2018 05:38, Nathaniel Smith wrote:
>> >     > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum <guido at python.org> wrote:
>> >     >> Can someone explain to me why this is such a controversial issue?
>> >     >
>> >     > I guess practicality versus purity is always controversial :-)
>> >     >
>> >     >> It seems reasonable to me to add new encodings to the stdlib that
>> >     >> do the roundtripping requested in the first message of the thread.
>> >     >> As long as they have new names that seems to fall under
>> >     >> "practicality beats purity".
>> >
>> >     There are a few issues here:
>> >
>> >     * WHATWG encodings are mostly for decoding content in order to
>> >       show it in the browser, accepting broken encoding data.
>> >
>> >
>> > And sometimes Python apps that pull data from the web.
>> >
>> >
>> >       Python already has support for this by using one of the available
>> >       error handlers, or adding new ones to suit the needs.
>> >
>> >
>> > This seems cumbersome though.
>>
>> Why is that?
>>
>> Python 3 uses such error handlers for most of the I/O that's done
>> with the OS already and for very similar reasons: dealing with
>> broken data or broken configurations.
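
As a small illustration of the kind of handler meant here, the sketch below uses the built-in `surrogateescape` error handler, which Python 3 applies to file names and similar OS data on POSIX systems. The sample bytes are made up for the example.

```python
# Latin-1 encoded bytes that are not valid UTF-8 (0xe9 is a stray lead byte).
raw = b"caf\xe9.txt"

# The undecodable byte is smuggled into the str as a lone surrogate...
name = raw.decode("utf-8", errors="surrogateescape")

# ...and turned back into the original byte when encoding with the same handler.
restored = name.encode("utf-8", errors="surrogateescape")
```

The handler keeps the data round-trippable without declaring a new encoding, but the intermediate string contains lone surrogates and cannot be encoded as strict UTF-8.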
>>
>> >       If we'd add the encodings, people will start creating more
>> >       broken data, since this is what the WHATWG codecs output
>> >       when encoding Unicode.
>> >
>> >
>> > That's FUD. Only apps that specifically use the new WHATWG encodings
>> > would be able to consume that data. And surely the practice of web
>> > browsers will have a much bigger effect than Python's choice.
>>
>> It's not FUD. I don't think we ought to encourage having
>> Python create more broken data. The purpose of the WHATWG
>> encodings is to help browsers deal with decoding broken
>> data in a uniform way. It's not to generate more such data.
>>
>> That may be regarded as a purist's view, but it also has a very
>> practical meaning. The output of the codecs will only be readable
>> by browsers implementing the WHATWG encodings. Other tools
>> receiving the data will run into the same decoding problems.
>>
>> Once you have Unicode, it's better to stay there and use
>> UTF-8 for encoding to avoid any such issues.
>>
>> >       As discussed, this could be addressed by making the WHATWG
>> >       codecs decode-only.
>> >
>> >
>> > But that would defeat the point of roundtripping, right?
>>
>> Yes, intentionally. Once you have Unicode, the data should
>> be encoded correctly back into UTF-8 or whatever legacy encoding
>> is needed, fixing any issues while in Unicode.
>>
>> As always, it's better to explicitly address such problems than
>> to simply punt on them and write back broken data.
>>
>> >     * The use case seems limited to implementing browsers or headless
>> >       implementations working like browsers.
>> >
>> >       That's not really general enough to warrant adding lots of
>> >       new codecs to the stdlib. A PyPI package is better suited
>> >       for this.
>> >
>> >
>> > Perhaps, but such a package already exists and its author (who surely
>> > has read a lot of bug reports from its users) says that this is
>> > cumbersome.
>>
>> The only critique I read was that registering the codecs
>> is not explicit enough, but that's really only a nit, since
>> you can easily have the codec package expose a register
>> function which you then call explicitly in the code using
>> the codecs.
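
A rough sketch of such an explicit register function follows. The codec name `web-1252-sketch` and all helper names are hypothetical and chosen only for this illustration; they are not the API of any existing package. The table is derived from Python's own cp1252 codec, with the unassigned bytes filled in by the code point of the same value, which is an assumption about what a WHATWG-style variant would do for this one encoding.

```python
import codecs

# Hypothetical codec name, used only in this sketch.
CODEC_NAME = "web-1252-sketch"


def _build_decoding_table():
    """Start from cp1252 and map each unassigned byte to the same code point."""
    chars = []
    for byte in range(256):
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(byte))
    return "".join(chars)


_DECODING_TABLE = _build_decoding_table()
_ENCODING_TABLE = codecs.charmap_build(_DECODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, _DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, _ENCODING_TABLE)


def _search(name):
    # Lookup names arrive normalized; compare with hyphens folded to underscores.
    if name.replace("-", "_") == CODEC_NAME.replace("-", "_"):
        return codecs.CodecInfo(name=CODEC_NAME, encode=_encode, decode=_decode)
    return None


def register():
    """Make the codec available; nothing is registered until this is called."""
    codecs.register(_search)


register()
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode(CODEC_NAME).encode(CODEC_NAME)
```

With this layout, importing the package has no side effects on codec lookup; the calling code opts in by invoking `register()` itself.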
>>
>> >     * The WHATWG codecs do not only cover simple mapping codecs,
>> >       but also many multi-byte ones for e.g. Asian languages.
>> >
>> >       I doubt that we'd want to maintain such codecs in the stdlib,
>> >       since this will increase the download sizes of the installers
>> >       and also require people knowledgeable about these variants
>> >       to work on them and fix any issues.
>> >
>> >
>> > Really? Why is adding a bunch of codecs so much effort? Surely the
>> > translation tables contain data that compresses well? And surely we
>> > don't need a separate dedicated piece of C code for each new codec?
>>
>> For the simple charmap style codecs that's true. Not so for the
>> Asian ones and the latter also do require dedicated C code (see
>> Modules/cjkcodecs).
>>
>> >     Overall, I think either pointing people to error handlers
>> >     or perhaps adding a new one specifically for the case of
>> >     dealing with control character mappings would provide a better
>> >     maintenance / usefulness ratio than adding lots of new
>> >     legacy codecs to the stdlib.
>> >
>> >
>> > Wouldn't error handlers be much slower? And to me it seems a new error
>> > handler is a much *bigger* deal than some new encodings -- error
>> > handlers must work for *all* encodings.
>>
>> Error handlers have a standard interface and so they will work
>> for all codecs. Some codecs limit the number of handlers that
>> can be used, but most accept all registered handlers.
>>
>> If a handler is too slow in Python, it can be coded in C for
>> speed.
>>
>> >     BTW: WHATWG pushes for always using UTF-8 as far as I can tell
>> >     from their website.
>> >
>> >
>> > As does Python. But apparently it will take decades more to get there.
>>
>> Yes indeed, so let's not add even more confusion by adding more
>> variants of the legacy encodings.
>>
>> --
>> Marc-Andre Lemburg
>> eGenix.com
>>
>> Professional Python Services directly from the Experts (#1, Jan 19 2018)
>> >>> Python Projects, Coaching and Consulting ...  http://www.egenix.com/
>> >>> Python Database Interfaces ...           http://products.egenix.com/
>> >>> Plone/Zope Database Interfaces ...           http://zope.egenix.com/
>> ________________________________________________________________________
>>
>> ::: We implement business ideas - efficiently in both time and costs :::
>>
>>    eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
>>     D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
>>            Registered at Amtsgericht Duesseldorf: HRB 46611
>>                http://www.egenix.com/company/contact/
>>                       http://www.malemburg.com/
>>
>


-- 
--Guido van Rossum (python.org/~guido)