[Python-ideas] Support WHATWG versions of legacy encodings

Random832 random832 at fastmail.com
Thu Jan 18 12:32:42 EST 2018

On Thu, Jan 18, 2018, at 11:04, Stephen J. Turnbull wrote:
> Nathaniel Smith writes:
>  > It's also nice to be able to parse some HTML data, make a few changes
>  > in memory, and then serialize it back to HTML. Having this crash on
>  > random documents is rather irritating, esp. if these documents are
>  > standards-compliant HTML as in this case.
> This example doesn't make sense to me.  Why would *conformant* HTML
> crash the codec?  Unless you're saying the source is non-conformant
> and *lied* about the encoding?

I think his point is that the WHATWG standard is the one that governs HTML. HTML that uses these encodings (including the C1 characters) is therefore conformant to *that* standard, regardless of its status with regard to anything published by Unicode. The new encodings (whatever they are called), including the round-trip of b'\x81' as \u0081, are the ones that an HTML document identifies when it declares that it uses windows-1252, so such a declaration is not a lie.
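
To make the round-trip concrete, here is a rough sketch of the difference. Python's existing "cp1252" codec rejects the five bytes that Microsoft's table leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D), whereas the WHATWG definition of windows-1252 maps each of them to the C1 control character with the same number. The helper names decode_whatwg_1252 and encode_whatwg_1252 are hypothetical, made up for this illustration; they are not an existing or proposed stdlib API, and a real codec would not work byte by byte like this.

```python
# Bytes that Python's "cp1252" codec treats as undefined. Under the WHATWG
# definition of windows-1252 they map to U+0081, U+008D, U+008F, U+0090
# and U+009D respectively.
UNDEFINED_IN_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def decode_whatwg_1252(data: bytes) -> str:
    """Hypothetical helper: cp1252, with undefined bytes passed through as C1."""
    chars = []
    for byte in data:
        if byte in UNDEFINED_IN_CP1252:
            chars.append(chr(byte))  # e.g. 0x81 -> U+0081
        else:
            chars.append(bytes([byte]).decode("cp1252"))
    return "".join(chars)


def encode_whatwg_1252(text: str) -> bytes:
    """Hypothetical helper: inverse of decode_whatwg_1252."""
    out = bytearray()
    for ch in text:
        if ord(ch) in UNDEFINED_IN_CP1252:
            out.append(ord(ch))  # e.g. U+0081 -> 0x81
        else:
            out += ch.encode("cp1252")  # still raises for unmappable characters
    return bytes(out)


# The stdlib codec refuses the byte outright.
try:
    b"\x81".decode("cp1252")
    stdlib_rejects = False
except UnicodeDecodeError:
    stdlib_rejects = True

# The sketch above round-trips it instead.
round_tripped = encode_whatwg_1252(decode_whatwg_1252(b"\x81"))
```

With this, a document labelled windows-1252 that happens to contain b'\x81' can be decoded, modified, and written back out without an exception, which is the use case Nathaniel described.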
