
On 2018-01-12 06:10 AM, Stephen J. Turnbull wrote:
Rob Speer writes:
There is one more difference I have found between Python's encodings and WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down what the Unicode Consortium has to say about this.
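A quick way to check what a given Python build does with these bytes is to probe the codec directly. This is only a sketch: `decode_or_none` is a helper name made up for illustration, and the result for b'\xca' depends on the codec tables shipped with the interpreter, so no output is claimed here.

```python
def decode_or_none(byte: bytes, encoding: str):
    """Return the decoded character, or None if the codec leaves the byte undefined."""
    try:
        return byte.decode(encoding)
    except UnicodeDecodeError:
        return None


# Probe the bytes mentioned in this thread, plus a byte that is certainly defined.
for raw in (b"\xca", b"\xe0", b"\xff"):
    print(raw, repr(decode_or_none(raw, "cp1255")))
```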
In the past Microsoft has changed windows-125x coded character sets in Windows without updating the IANA registry. It's not clear to me how to deal with these nonstandards. I suspect that Microsoft will follow WHAT-WG in this in the end.
Given that in practice Windows encodings are nonstandards not even followed by their defining authority, it seems reasonable to me that Python could update to following WHAT-WG, as long as the result is a superset of the current codec (and the change lands in a 3.x release, not a 3.x.y release). At least the way the encoding standard is presented, they're pretty good at this, and likely more reliable going forward than Microsoft itself is on the legacy encodings.
Other than that, all the differences are adding the fall-throughs in the range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte b'\xff' is undefined, and it remains undefined in WHATWG's mapping.
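For concreteness, the fall-through behaviour could be approximated today as an opt-in error handler rather than as a change to the codecs themselves. The sketch below is an assumption about how one might do it, not how WHATWG or any existing package implements it; the handler name `c1_fallthrough` is invented for this example.

```python
import codecs


def c1_fallthrough(exc):
    """Hypothetical handler: map undecodable bytes 0x80-0x9F to U+0080-U+009F."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= b <= 0x9F for b in bad):
            # Pass the byte through as the C1 control with the same number.
            return "".join(chr(b) for b in bad), exc.end
    # Anything else (e.g. an undefined byte outside 0x80-0x9F) stays an error.
    raise exc


codecs.register_error("c1_fallthrough", c1_fallthrough)

# 0x81 is undefined in Python's cp1252, so the handler is consulted.
print(repr(b"\x81".decode("cp1252", errors="c1_fallthrough")))
```

Because the handler has to be registered and named explicitly, code that never asks for it keeps the strict behaviour.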
I really do not want those fall-throughs to control characters in the stdlib, since they have no textual interpretation in any standard encoding. My interpretation is "you're under attack, shutter the windows and call the cops". If people want to use codecs incorporating them, they should have to import them separately in the context of a defensive framework that deals with them at a higher level.
This is surprising to me because I always took those encodings to have those fallbacks. It's pretty wild to think someone wouldn't want them.
Probably there's no harm in a browser that does visual presentation, but in other contexts where there is text mixed with control codes we cannot predict what will happen, since there is no standard interpretation in common (cross-platform) use AFAIK. And even in visual representation, out-of-channel codes can be problematic. I once crashed a Prime minicomputer by forwarding some ASCII art tuned for a VT-220 back to its author, who had stolen the very nice Prime console terminal and was using it for email. Hilarity ensued (for me; all my deadlines were weeks off). Programs are generally more robust today, but in most cases it would be a lot safer to use xmlcharrefreplace or backslashreplace, or surrogateescape to ensure that paranoid Unicode processes would reject it. Especially since there are real hostiles out there.
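As a rough illustration of those error handlers (a sketch only; the sample bytes are made up), note that backslashreplace and surrogateescape can be used when decoding, while xmlcharrefreplace applies only when encoding:

```python
raw = b"caf\x81"  # 0x81 is undefined in Python's cp1252

# Make the stray byte visible as text instead of passing a control code through.
escaped = raw.decode("cp1252", errors="backslashreplace")
print(repr(escaped))

# Smuggle the byte as a lone surrogate, so that strict consumers reject it.
smuggled = raw.decode("cp1252", errors="surrogateescape")
print(repr(smuggled))

try:
    smuggled.encode("utf-8")
except UnicodeEncodeError as err:
    print("strict UTF-8 encoder refuses it:", err.reason)

# The original bytes are still recoverable by whoever knows to ask.
roundtrip = smuggled.encode("cp1252", errors="surrogateescape")
print(roundtrip == raw)

# xmlcharrefreplace is encode-only: unencodable text becomes a character reference.
print("\u05ba".encode("ascii", errors="xmlcharrefreplace"))
```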