[Python-ideas] Support WHATWG versions of legacy encodings

Random832 random832 at fastmail.com
Thu Jan 11 15:15:34 EST 2018

On Thu, Jan 11, 2018, at 14:55, Rob Speer wrote:
> There is one more difference I have found between Python's encodings and
> WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it
> maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down
> what the Unicode Consortium has to say about this.

It appears in the best fit mapping (with a comment suggesting it is unclear which vowel point it is actually meant to be), but not in the normal mapping.

> Other than that, all the differences are adding the fall-throughs in the
> range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte
> b'\xff' is undefined, and it remains undefined in WHATWG's mapping.

This is, for the record, also consistent with the results of my test program: 0xCA is treated as a perfectly ordinary mapping that goes to U+05BA, whereas 0xFF returns an error. In permissive mode, 0xFF maps to U+F896.
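
For comparison on the Python side, here is a rough sketch (not the test program mentioned above, which ran against the Windows API) that probes single bytes in Python's cp1255 codec. The names probe and c1_fallthrough are made up for illustration; the handler only approximates the WHATWG fall-through behaviour for undefined bytes in 0x80-0x9F, and it does nothing for 0xCA, whose result depends on the mapping table shipped with your Python version.

```python
import codecs

def c1_fallthrough(exc):
    """Hypothetical error handler: map an undefined byte in 0x80-0x9F
    to the same-numbered C1 control character; re-raise otherwise."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if bad and all(0x80 <= b <= 0x9F for b in bad):
            return "".join(chr(b) for b in bad), exc.end
    raise exc

codecs.register_error("c1_fallthrough", c1_fallthrough)

def probe(byte, encoding="cp1255", errors="strict"):
    """Return the decoded character, or None if the byte is undefined."""
    try:
        return bytes([byte]).decode(encoding, errors)
    except UnicodeDecodeError:
        return None

# Print strict and fall-through results side by side for a few bytes.
for b in (0x81, 0xCA, 0xE0, 0xFF):
    print(hex(b), repr(probe(b)), repr(probe(b, errors="c1_fallthrough")))
```

With this handler, 0x81 decodes to U+0081 while 0xFF still fails, which matches the WHATWG behaviour described in the quoted text.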

The 0xCA to U+05BA mapping appears (with no glyph, though) in the code chart Microsoft published with https://www.microsoft.com/typography/unicode/cscp.htm, but not in the corresponding mapping list. It also does not appear in https://msdn.microsoft.com/en-us/library/cc195057.aspx.
