[Python-ideas] Support WHATWG versions of legacy encodings
Stephen J. Turnbull
turnbull.stephen.fw at u.tsukuba.ac.jp
Thu Jan 18 11:04:41 EST 2018
Nathaniel Smith writes:
> It's also nice to be able to parse some HTML data, make a few changes
> in memory, and then serialize it back to HTML. Having this crash on
> random documents is rather irritating, esp. if these documents are
> standards-compliant HTML as in this case.
This example doesn't make sense to me. Why would *conformant* HTML
crash the codec? Unless you're saying the source is non-conformant
and *lied* about the encoding? Then errors=surrogateescape should do
what you want here, no? If not, new codecs won't help you---the
"crash" is somewhere else.
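A minimal sketch of the round trip being suggested, assuming a document that claims to be cp1252 but contains a byte (0x81) that Python's cp1252 codec leaves unassigned. The sample bytes and variable names are made up for illustration:

```python
# Hypothetical input: labeled cp1252, but 0x81 has no mapping in Python's cp1252.
raw = b"caf\xe9 \x81 done"

# Strict decoding raises on the unmapped byte.
try:
    raw.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the undecodable byte through as a lone surrogate.
text = raw.decode("cp1252", errors="surrogateescape")

# Edit in memory, leaving the escaped byte alone.
edited = text.replace("done", "edited")

# Encoding with the same handler restores the original byte unchanged.
roundtrip = edited.encode("cp1252", errors="surrogateescape")
print(strict_ok, roundtrip)
```

The escaped byte only survives if the same error handler is used on the way out; encoding `edited` with the default strict handler would raise instead.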
The same goes for Soni's use case of control characters for formatting
in an IRC client. If they're C0, then AFAICT all of the ASCII-compatible
codecs do pass all of those through.[1] If they're C1, then you've got
big trouble because the multibyte encodings will either error due to a
malformed character or produce an unintended character (except for
UTF-8, which can encode C1 characters as two-byte sequences). The windows-*
encodings are quite inconsistent about the graphics they put in C1
space as well as where they leave holes, so this is not just
application-specific, it's even encoding-specific behavior.
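The C0/C1 difference can be checked against Python's own codecs. This is only a sketch: the helper names `c0_passthrough` and `c1_holes` are invented here, and the three codecs are an arbitrary sample, not a survey.

```python
C0 = range(0x00, 0x20)
C1 = range(0x80, 0xA0)

def c0_passthrough(codec):
    """True if every C0 byte decodes to the code point of the same value."""
    return all(bytes([b]).decode(codec) == chr(b) for b in C0)

def c1_holes(codec):
    """Byte values in 0x80-0x9F that the codec refuses to decode."""
    holes = []
    for b in C1:
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            holes.append(b)
    return holes

for codec in ("latin-1", "cp1251", "cp1252"):
    print(codec, c0_passthrough(codec), [hex(b) for b in c1_holes(codec)])

# The same C1-range byte is a different graphic in different windows-* codecs.
print(repr(b"\x80".decode("cp1251")), repr(b"\x80".decode("cp1252")))

# UTF-8 can carry a real C1 control, but only as a two-byte sequence.
print("\x85".encode("utf-8"))
```

The holes differ from one windows-* codec to the next, which is the encoding-specific behavior described above.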
The more examples of claimed use cases I see, the more I think most of
them are already addressed more safely by Python's existing
mechanisms, and the less I see a real need for this in the stdlib,
with the single exception that WHATWG may be a better authority to
follow than Microsoft for windows-* codecs.
Footnotes:
[1] I don't like that much, I'd rather restrict to the ones that have
universally accepted semantics including CR, LF, HT, ESC, BEL, and FF.
But passthrough is traditional there, a few more are in somewhat
common use, and I'm not crazy enough to break backward compatibility.