[Python-ideas] Support WHATWG versions of legacy encodings

Stephen J. Turnbull turnbull.stephen.fw at u.tsukuba.ac.jp
Thu Jan 18 11:04:41 EST 2018

Nathaniel Smith writes:

 > It's also nice to be able to parse some HTML data, make a few changes
 > in memory, and then serialize it back to HTML. Having this crash on
 > random documents is rather irritating, esp. if these documents are
 > standards-compliant HTML as in this case.

This example doesn't make sense to me.  Why would *conformant* HTML
crash the codec?  Unless you're saying the source is non-conformant
and *lied* about the encoding?  Then errors=surrogateescape should do
what you want here, no?  If not, new codecs won't help you---the
"crash" is somewhere else.
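
A minimal sketch of the round trip described above, assuming a
hypothetical document that is labeled windows-1252 but contains the
byte 0x81, which Python's cp1252 codec leaves undefined.  The sample
bytes and the variable names are illustrative, not taken from the
thread:

```python
# Hypothetical input: labeled windows-1252, but contains byte 0x81,
# which Python's cp1252 codec treats as undefined.
raw = b"<p>caf\xe9 \x81</p>"

# Strict decoding fails on the undefined byte.
try:
    raw.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the undecodable byte through as a lone
# surrogate (0x81 becomes U+DC81) instead of raising.
text = raw.decode("cp1252", errors="surrogateescape")

# Make a few changes in memory.
edited = text.replace("<p>", "<div>").replace("</p>", "</div>")

# Encoding with the same handler restores the original byte.
out = edited.encode("cp1252", errors="surrogateescape")

print(strict_ok)
print(out)
```

The escaped byte survives the edit untouched, so the serialized output
differs from the input only where the text was deliberately changed.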

Similarly for Soni's use case of control characters for formatting in
an IRC client.  If they're C0, then AFAICT all of the ASCII-compatible
codecs do pass all of those through.[1]  If they're C1, then you've got
big trouble because the multibyte encodings will either error due to a
malformed character or produce an unintended character (except for
UTF-8, where you can encode the character in UTF-8).  The windows-*
encodings are quite inconsistent about the graphics they put in C1
space as well as where they leave holes, so this is not just
application-specific, it's even encoding-specific behavior.
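
The C0 and C1 behavior can be checked directly.  This is a sketch
rather than an exhaustive survey: the list of codecs and the two probe
bytes (0x81 and 0x98) are my own choices for illustration, and the
helper try_decode is hypothetical:

```python
# C0 controls (0x00-0x1F): check that a sample of ASCII-compatible
# codecs decodes them to the same code points as ASCII does.
c0 = bytes(range(0x20))
codecs_to_check = ["ascii", "latin-1", "cp1252", "utf-8",
                   "euc_jp", "shift_jis"]
c0_passthrough = {
    name: c0.decode(name) == c0.decode("ascii")
    for name in codecs_to_check
}


def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Bytes in the C1 range (0x80-0x9F): the result depends on the codec.
c1_table = {
    (byte, enc): try_decode(bytes([byte]), enc)
    for byte in (0x81, 0x98)
    for enc in ("latin-1", "cp1251", "cp1252", "utf-8")
}

# UTF-8 can carry a C1 control, but only as a two-byte sequence.
nel_utf8 = "\x85".encode("utf-8")

print(c0_passthrough)
for key, value in c1_table.items():
    print(hex(key[0]), key[1], repr(value))
```

With these probes, latin-1 yields C1 controls for both bytes, cp1251
has a graphic at 0x81 and a hole at 0x98, cp1252 has the hole at 0x81
and a graphic at 0x98, and UTF-8 rejects both as lone bytes.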

The more examples of claimed use cases I see, the more I think most of
them are already addressed more safely by Python's existing
mechanisms, and the less I see a real need for this in the stdlib,
with the single exception that WHATWG may be a better authority to
follow than Microsoft for windows-* codecs.

[1]  I don't like that much; I'd rather restrict passthrough to the ones
that have universally accepted semantics, such as CR, LF, HT, ESC, BEL,
and FF.
But passthrough is traditional there, a few more are in somewhat
common use, and I'm not crazy enough to break backward compatibility.
