[Python-ideas] Support WHATWG versions of legacy encodings
fakedme+py at gmail.com
Fri Jan 12 05:01:03 EST 2018
On 2018-01-12 06:10 AM, Stephen J. Turnbull wrote:
> Rob Speer writes:
> > There is one more difference I have found between Python's encodings and
> > WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it
> > maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down
> > what the Unicode Consortium has to say about this.
> In the past Microsoft has changed windows-125x coded character sets in
> Windows without updating the IANA registry. It's not clear to me how
> to deal with these nonstandards. I suspect that Microsoft will follow
> WHAT-WG in this in the end.
> Given that in practice Windows encodings are nonstandards not even
> followed by their defining authority, it seems reasonable to me that
> Python could update to following WHAT-WG, as long as it's a superset
> of the current codec (in a 3.x release, not a 3.x.y release); at least
> the way the encoding standard is presented they're pretty good at
> this, and likely more reliable going forward than Microsoft itself is
> on the legacy encodings.
> > Other than that, all the differences are adding the fall-throughs in the
> > range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte
> > b'\xff' is undefined, and it remains undefined in WHATWG's mapping.
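For anyone following along, here is a rough sketch of the two behaviours being contrasted. The names `probe` and `decode_with_c1_fallthrough` are made up for this illustration; neither is a stdlib or WHATWG API. The fall-through helper only approximates the WHATWG rule, and only for single-byte encodings. It does not add WHATWG-only mappings such as the cp1255 b'\xca' case mentioned earlier.

```python
def probe(byte_value, encoding="cp1255"):
    """Return the decoded character, or None if the codec leaves the byte undefined."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        return None


def decode_with_c1_fallthrough(data, encoding="cp1255"):
    """Decode a single-byte encoding, mapping undefined bytes in 0x80-0x9F
    to the C1 control character with the same number (hypothetical helper)."""
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode(encoding))
        except UnicodeDecodeError:
            if 0x80 <= b <= 0x9F:
                out.append(chr(b))  # fall through to U+0080..U+009F
            else:
                raise  # bytes outside the C1 range stay undefined
    return "".join(out)


# Report what the local Python build does with the bytes under discussion.
for b in (0xCA, 0xFF, 0x81):
    ch = probe(b)
    print(hex(b), "undefined" if ch is None else "U+%04X" % ord(ch))
```

The result for each byte depends on the codec tables shipped with the Python version in use, so the loop prints what it finds rather than assuming an answer.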
> I really do not want those fall-throughs to control characters in the
> stdlib, since they have no textual interpretation in any standard
> encoding. My interpretation is "you're under attack, shutter the
> windows and call the cops". If people want to use codecs
> incorporating them, they should have to import them separately in the
> context of a defensive framework that deals with them at a higher
This is surprising to me because I always took those encodings to have
It's pretty wild to think someone wouldn't want them.
> Probably there's no harm in a browser that does visual presentation,
> but in other contexts where there is text mixed with control codes we
> cannot predict what will happen since there is no standard
> interpretation in common (cross-platform) use AFAIK. And even in
> visual representation, out-of-channel codes can be problematic. I
> once crashed a Prime minicomputer by forwarding some ASCII art tuned
> for a VT-220 back to its author, who had stolen the very nice Prime
> console terminal and was using it for email. Hilarity ensued (for
> me, all my deadlines were weeks off). Programs are generally more
> robust today, but in most cases it would be a lot safer to use
> xmlcharrefreplace or backslashreplace, or surrogateescape to ensure
> that paranoid Unicode processes would reject it. Especially since
> there are real hostiles out there.
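As a small illustration of the error handlers named above, using only the standard `str.encode` and `bytes.decode` error arguments. The sample strings are made up, and cp1252 is just a convenient example codec: xmlcharrefreplace and backslashreplace apply when encoding, while surrogateescape is shown here on the decoding side.

```python
# U+0085 is a C1 control with no cp1252 representation.
text = "caf\u00e9 \u0085 end"

# Encoding: replace the unencodable control with a visible escape.
as_xml = text.encode("cp1252", "xmlcharrefreplace")
as_backslash = text.encode("cp1252", "backslashreplace")
print(as_xml)
print(as_backslash)

# Decoding: 0x81 is undefined in Python's cp1252, so surrogateescape
# smuggles it through as a lone surrogate instead of a control character.
raw = b"abc\x81def"
escaped = raw.decode("cp1252", "surrogateescape")
print(ascii(escaped))

# A strict consumer refuses the lone surrogate.
try:
    escaped.encode("utf-8")
    rejected = False
except UnicodeEncodeError:
    rejected = True
print("strict UTF-8 rejects it:", rejected)
```

The point of the last step is that the escaped byte cannot quietly turn into valid UTF-8 output; a strict encoder fails loudly instead.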