[Python-ideas] Support WHATWG versions of legacy encodings

Soni L. fakedme+py at gmail.com
Fri Jan 12 05:01:03 EST 2018



On 2018-01-12 06:10 AM, Stephen J. Turnbull wrote:
> Rob Speer writes:
>
>   > There is one more difference I have found between Python's encodings and
>   > WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it
>   > maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down
>   > what the Unicode Consortium has to say about this.
>
> In the past Microsoft has changed windows-125x coded character sets in
> Windows without updating the IANA registry.  It's not clear to me how
> to deal with these nonstandards.  I suspect that Microsoft will follow
> WHAT-WG in this in the end.
>
> Given that in practice Windows encodings are nonstandards not even
> followed by their defining authority, it seems reasonable to me that
> Python could update to following WHAT-WG, as long as it's a superset
> of the current codec (in a 3.x release, not a 3.x.y release); at least
> the way the encoding standard is presented they're pretty good at
> this, and likely more reliable going forward than Microsoft itself is
> on the legacy encodings.
>
>   > Other than that, all the differences are adding the fall-throughs in the
>   > range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte
>   > b'\xff' is undefined, and it remains undefined in WHATWG's mapping.
>
> I really do not want those fall-throughs to control characters in the
> stdlib, since they have no textual interpretation in any standard
> encoding.  My interpretation is "you're under attack, shutter the
> windows and call the cops".  If people want to use codecs
> incorporating them, they should have to import them separately in the
> context of a defensive framework that deals with them at a higher
> level.

This is surprising to me because I always took those encodings to have 
those fallbacks.

It's pretty wild to think someone wouldn't want them.
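
For reference, the fall-throughs under discussion can be approximated
without touching the stdlib codecs. This is only a rough sketch: the
handler name "c1_fallthrough" and the helper decode_whatwg_like() are
made up for illustration, and it covers just the 0x80-0x9F
fall-throughs, not other WHATWG differences such as cp1255's b'\xca'.

```python
import codecs


def c1_fallthrough(exc):
    """Hypothetical error handler: map undecodable bytes 0x80-0x9F
    to the C1 control characters U+0080-U+009F."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    if all(0x80 <= b <= 0x9F for b in bad):
        # Fall through: the byte value becomes the code point.
        return "".join(chr(b) for b in bad), exc.end
    # Anything else (e.g. b'\xff' in windows-1255) stays an error.
    raise exc


codecs.register_error("c1_fallthrough", c1_fallthrough)


def decode_whatwg_like(data, encoding="cp1252"):
    """Decode with a stdlib codec, adding only the C1 fall-throughs."""
    return data.decode(encoding, errors="c1_fallthrough")
```

Because it is an opt-in error handler rather than a change to the
codec, callers have to ask for it explicitly, which is roughly the
separation Stephen is asking for.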

>
> Probably there's no harm in a browser that does visual presentation,
> but in other contexts where there is text mixed with control codes we
> cannot predict what will happen since there is no standard
> interpretation in common (cross-platform) use AFAIK.  And even in
> visual representation, out-of-channel codes can be problematic.  I
> once crashed a Prime minicomputer by forwarding some ASCII art tuned
> for a VT-220 back to its author, who had stolen the very nice Prime
> console terminal and was using it for email.  Hilarity ensued (for
> me, all my deadlines were weeks off).  Programs are generally more
> robust today, but in most cases it would be a lot safer to use
> xmlcharrefreplace or backslashreplace, or surrogateescape to ensure
> that paranoid Unicode processes would reject it.  Especially since
> there are real hostiles out there.
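
The error handlers named in the quoted paragraph can be tried out as
follows. This is a small sketch, assuming cp1252 as the codec (where
byte 0x81 is undefined in Python's mapping); the helper
is_strict_utf8_encodable() is a made-up name for illustration.

```python
data = b"abc\x81def"  # 0x81 has no mapping in Python's cp1252

# backslashreplace keeps the offending byte visible as an escape.
escaped = data.decode("cp1252", errors="backslashreplace")

# surrogateescape smuggles the byte through as a lone surrogate.
smuggled = data.decode("cp1252", errors="surrogateescape")


def is_strict_utf8_encodable(text):
    """Return True if a strict UTF-8 encoder accepts the text."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# xmlcharrefreplace applies when encoding, not decoding.
xml_safe = "\u05ba".encode("ascii", errors="xmlcharrefreplace")
```

Here the surrogateescape result is rejected by a strict UTF-8 encoder,
while the backslashreplace result passes through as plain ASCII text.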