
On Fri, Jan 12, 2018, at 03:10, Stephen J. Turnbull wrote:
Other than that, all the differences are adding the fall-throughs in the range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte b'\xff' is undefined, and it remains undefined in WHATWG's mapping.
I really do not want those fall-throughs to control characters in the stdlib, since they have no textual interpretation in any standard encoding. My interpretation is "you're under attack, shutter the windows and call the cops". If people want to use codecs incorporating them, they should have to import them separately in the context of a defensive framework that deals with them at a higher level.
There are plenty of standard encodings that do have actual representations of the control characters. It's not clear why you consider it more dangerous for the "windows-1252" encoding to be able to return '\x81' for b'\x81' than for "latin-1" to do the same, or for "utf-8" to return it for b'\xc2\x81'. These characters exist. Supporting them in encodings that contain them in the real world, regardless of what was submitted to the Unicode Consortium, doesn't add any new attack surface.
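To make the comparison concrete, here is a small sketch of what the stdlib codecs do with these bytes, as far as I can tell from current CPython. The helper name try_decode is mine, purely for illustration; the codec names are the existing stdlib aliases.

```python
def try_decode(data: bytes, encoding: str) -> str:
    """Return the repr of the decoded text, or the exception name on failure."""
    try:
        return repr(data.decode(encoding))
    except UnicodeDecodeError as exc:
        return type(exc).__name__


results = {
    # latin-1 maps every byte to the code point with the same number,
    # so b'\x81' decodes to the C1 control character U+0081.
    "latin-1 0x81": try_decode(b"\x81", "latin-1"),
    # utf-8 reaches the same character through a two-byte sequence.
    "utf-8 0xc2 0x81": try_decode(b"\xc2\x81", "utf-8"),
    # The stdlib windows-1252 codec leaves 0x81 undefined and raises;
    # the WHATWG mapping would fall through to U+0081 here instead.
    "windows-1252 0x81": try_decode(b"\x81", "windows-1252"),
    # 0xff in windows-1255 is undefined in the stdlib codec, and per the
    # discussion above it stays undefined in WHATWG's mapping too.
    "windows-1255 0xff": try_decode(b"\xff", "windows-1255"),
}

for label, outcome in results.items():
    print(f"{label:20} -> {outcome}")
```

The first two cases already hand back '\x81' today, which is the point: the character is reachable through existing stdlib codecs, and only the windows-125x codecs refuse it.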