Re: [Python-ideas] Support WHATWG versions of legacy encodings

Jan. 12, 2018


      On Fri, Jan 12, 2018, at 03:10, Stephen J. Turnbull wrote:
...
...
Other than that, all the differences are adding the fall-throughs in the
range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte
b'\xff' is undefined, and it remains undefined in WHATWG's mapping.
I really do not want those fall-throughs to control characters in the
stdlib, since they have no textual interpretation in any standard
encoding.  My interpretation is "you're under attack, shutter the
windows and call the cops".  If people want to use codecs
incorporating them, they should have to import them separately in the
context of a defensive framework that deals with them at a higher
level.
There are plenty of standard encodings that do have actual representations of the control characters. It's not clear why you consider it more dangerous for the "windows-1252" encoding to be able to return '\x81' for b'\x81' than for "latin-1" to do the same, or for "utf-8" to return it for b'\xc2\x81'. These characters exist. Supporting them in encodings that contain them in the real world, regardless of what was submitted to the Unicode Consortium, doesn't add any new attack surface.
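
To make the comparison concrete, here is a rough sketch of how the codecs behave today, plus one way the fall-through could be emulated without touching the stdlib codec tables. The helper names (try_decode, c1_fallthrough) and the error-handler name "c1-fallthrough" are made up for this illustration; they are not an existing API, and this is not necessarily how a WHATWG codec would be implemented.

```python
import codecs

def try_decode(data: bytes, encoding: str, errors: str = "strict"):
    """Return the decoded string, or None if the bytes are undefined."""
    try:
        return data.decode(encoding, errors)
    except UnicodeDecodeError:
        return None

# Current behaviour: latin-1 and utf-8 both produce U+0081,
# while the stdlib windows-1252 codec treats byte 0x81 as undefined.
from_latin1 = try_decode(b"\x81", "latin-1")
from_utf8 = try_decode(b"\xc2\x81", "utf-8")
from_cp1252 = try_decode(b"\x81", "windows-1252")

def c1_fallthrough(exc):
    """Hypothetical error handler: map an undefined byte in 0x80-0x9F
    to the code point with the same number; re-raise anything else."""
    if isinstance(exc, UnicodeDecodeError):
        byte = exc.object[exc.start]
        if 0x80 <= byte <= 0x9F:
            return chr(byte), exc.start + 1
    raise exc

# The handler name is an assumption chosen for this sketch.
codecs.register_error("c1-fallthrough", c1_fallthrough)

with_fallthrough = try_decode(b"\x81", "windows-1252", "c1-fallthrough")
```

With the handler installed, decoding b'\x81' as windows-1252 gives the same '\x81' that latin-1 already gives, which is the whole of the behaviour change under discussion for that byte.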

Random832