[Python-ideas] Support WHATWG versions of legacy encodings

Fri Jan 12 10:23:01 EST 2018

On Fri, Jan 12, 2018, at 03:10, Stephen J. Turnbull wrote:
>  > Other than that, all the differences are adding the fall-throughs in the
>  > range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte
>  > b'\xff' is undefined, and it remains undefined in WHATWG's mapping.
> 
> I really do not want those fall-throughs to control characters in the
> stdlib, since they have no textual interpretation in any standard
> encoding.  My interpretation is "you're under attack, shutter the
> windows and call the cops".  If people want to use codecs
> incorporating them, they should have to import them separately in the
> context of a defensive framework that deals with them at a higher
> level.

There are plenty of standard encodings that do have actual representations of the control characters. It's not clear why you consider it more dangerous for the "windows-1252" encoding to be able to return '\x81' for b'\x81' than for "latin-1" to do the same, or for "utf-8" to return it for b'\xc2\x81'. These characters exist. Supporting them in encodings that contain them in the real world, regardless of what was submitted to the Unicode Consortium, doesn't add any new attack surface.
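
To make the comparison concrete, here is a quick sketch of how the stdlib codecs treat that character, as far as I know: "latin-1" and "utf-8" both hand back U+0081, while "windows-1252" rejects the byte as undefined. The variable names are just for illustration.

```python
# Decode the C1 control character U+0081 through three stdlib codecs.
latin1_result = b'\x81'.decode('latin-1')    # single byte maps straight to U+0081
utf8_result = b'\xc2\x81'.decode('utf-8')    # two-byte UTF-8 form of U+0081

# The stdlib windows-1252 table leaves 0x81 undefined, so decoding fails.
try:
    cp1252_result = b'\x81'.decode('windows-1252')
except UnicodeDecodeError:
    cp1252_result = None

print(repr(latin1_result), repr(utf8_result), repr(cp1252_result))
```

So the control character is already reachable through two of the most common codecs; the only difference with windows-1252 is that the same byte raises an error instead.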