[Python-ideas] Support WHATWG versions of legacy encodings

Fri Jan 12 03:10:29 EST 2018

Rob Speer writes:

 > There is one more difference I have found between Python's encodings and
 > WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it
 > maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down
 > what the Unicode Consortium has to say about this.

In the past, Microsoft has changed windows-125x coded character sets
in Windows without updating the IANA registry.  It's not clear to me
how to deal with these nonstandards.  I suspect that Microsoft will
follow WHATWG on this in the end.

Given that in practice Windows encodings are nonstandards not even
followed by their defining authority, it seems reasonable to me that
Python could update to follow WHATWG, as long as the result is a
superset of the current codec (and the change lands in a 3.x release,
not a 3.x.y release).  At least the way the Encoding Standard is
presented, WHATWG is pretty good at this, and is likely to be more
reliable going forward than Microsoft itself is on the legacy
encodings.

 > Other than that, all the differences are adding the fall-throughs in the
 > range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte
 > b'\xff' is undefined, and it remains undefined in WHATWG's mapping.

I really do not want those fall-throughs to control characters in the
stdlib, since they have no textual interpretation in any standard
encoding.  My interpretation is "you're under attack, shutter the
windows and call the cops".  If people want to use codecs
incorporating them, they should have to import them separately in the
context of a defensive framework that deals with them at a higher
level.
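
To make "import them separately" concrete, here is a rough sketch of
how an application could opt in to the fall-through behavior without
any change to the stdlib codecs.  The handler name "c1_fallthrough"
and the helper decode_lenient are hypothetical names invented for
this illustration, not existing or proposed APIs.  The sketch relies
on 0x81 being undefined in Python's cp1252, and on 0xFF being
undefined in cp1255 as noted in the quoted text.

```python
import codecs


def c1_fallthrough(exc):
    """Hypothetical decode error handler (not part of the stdlib).

    Bytes in the range 0x80-0x9F that the codec leaves undefined are
    passed through as the C1 control characters U+0080-U+009F.  Any
    other undefined byte is still an error.
    """
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= b <= 0x9F for b in bad):
            # Return the replacement text and the position to resume at.
            return "".join(chr(b) for b in bad), exc.end
    raise exc


# Registration is explicit: nothing changes for code that does not
# ask for this handler by name.
codecs.register_error("c1_fallthrough", c1_fallthrough)


def decode_lenient(data, encoding="cp1252"):
    """Decode with the opt-in fall-through handler (hypothetical helper)."""
    return data.decode(encoding, errors="c1_fallthrough")
```

With this in place, decode_lenient(b"\x81") returns the control
character U+0081, while a plain b"\x81".decode("cp1252") still raises
UnicodeDecodeError.  The point of the design is that the caller has to
name the handler, so the control characters cannot appear by accident.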

Probably there's no harm in a browser that does visual presentation,
but in other contexts where there is text mixed with control codes we
cannot predict what will happen since there is no standard
interpretation in common (cross-platform) use AFAIK.  And even in
visual representation, out-of-channel codes can be problematic.  I
once crashed a Prime minicomputer by forwarding some ASCII art tuned
for a VT-220 back to its author, who had stolen the very nice Prime
console terminal and was using it for email.  Hilarity ensued (for
me, all my deadlines were weeks off).  Programs are generally more
robust today, but in most cases it would be a lot safer to use
xmlcharrefreplace or backslashreplace, or surrogateescape to ensure
that paranoid Unicode processes would reject the result.  Especially
since there are real hostiles out there.
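
For illustration, a small sketch of what those error handlers do with
an undefined byte.  It assumes 0x81 is undefined in Python's cp1252;
the function is_clean_unicode is a hypothetical helper standing in
for a "paranoid" consumer.  Note that xmlcharrefreplace applies only
when encoding, so it is shown on the output side.

```python
raw = b"caf\x81"  # 0x81 is undefined in Python's cp1252

# backslashreplace keeps the offending byte visible as plain text.
visible = raw.decode("cp1252", errors="backslashreplace")

# surrogateescape maps the byte to a lone surrogate (U+DC81).
smuggled = raw.decode("cp1252", errors="surrogateescape")


def is_clean_unicode(text):
    """Hypothetical strict check: lone surrogates cannot be encoded
    as UTF-8, so text carrying escaped bytes is rejected here."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# xmlcharrefreplace works on encoding only: a character the target
# encoding cannot represent becomes a numeric character reference.
escaped = "caf\x81".encode("ascii", errors="xmlcharrefreplace")
```

Here is_clean_unicode(visible) is true and is_clean_unicode(smuggled)
is false, which is the rejection behavior described above.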