[Python-ideas] Support WHATWG versions of legacy encodings

Stephan Houben stephanh42 at gmail.com
Thu Jan 11 05:49:33 EST 2018


On 11 Jan 2018 10:56, "Serhiy Storchaka" <storchaka at gmail.com> wrote:

09.01.18 23:15, Rob Speer wrote:

>
>
> For the sake of discussion, let's call this encoding "web-1252". WHATWG
> calls it "windows-1252",


I'd suggest naming it
"whatwg-windows-1252"

and in general

"whatwg-" + whatwgs_name_of_encoding

Stephan



> but notice that it's subtly different from Python's "windows-1252" encoding.
> Python's windows-1252 has bytes that are undefined:
>
>
>  >>> b'\x90'.decode('windows-1252')
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0:
> character maps to <undefined>
>
> In web-1252, the bytes that are undefined according to windows-1252 map to
> the control characters in those positions in iso-8859-1 -- that is, the
> Unicode codepoints with the same number as the byte. In web-1252, b'\x90'
> would decode as '\u0090'.
>
> This may seem like a silly encoding that encourages doing horrible things
> with text. That's pretty much the case. But there's a reason every Web
> browser implements it:
>
> - It's compatible with windows-1252
> - Any sequence of bytes can be round-tripped through it without losing
> information
>
> It's not just this one encoding. WHATWG's encoding standard (
> https://encoding.spec.whatwg.org/)
> contains modified versions of windows-1250 through windows-1258 and
> windows-874.
>

The way to solve this issue in Python is to use an error handler. The
"surrogateescape" error handler is specially designed for lossless
reversible decoding. It maps every unassigned byte in the range 0x80-0xff
to a single character in the range U+dc80-U+dcff. This allows you to
distinguish correctly decoded characters from the escaped bytes, perform
character by character processing of the decoded text, and encode the
result back with the same encoding.

>>> b'\x90\x91\x92\x93'.decode('windows-1252', 'surrogateescape')
'\udc90‘’“'
>>> '\udc90‘’“'.encode('windows-1252', 'surrogateescape')
b'\x90\x91\x92\x93'
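A minimal sketch of the "distinguish and process" step described above. The sample bytes and the variable names are made up for illustration; only the standard "surrogateescape" handler is used.

```python
# Decode with surrogateescape: the undefined byte 0x90 becomes U+DC90,
# while 0x93 decodes normally to a left double quotation mark.
text = b'caf\x90\x93'.decode('windows-1252', 'surrogateescape')

# Escaped bytes are exactly the characters in U+DC80..U+DCFF, so they can
# be told apart from correctly decoded characters.
escaped = [(i, ord(ch) - 0xDC00) for i, ch in enumerate(text)
           if '\udc80' <= ch <= '\udcff']

# Character-by-character processing leaves the escapes untouched...
processed = text.upper()

# ...and encoding with the same codec and handler restores the raw byte.
roundtrip = processed.encode('windows-1252', 'surrogateescape')
```

Here `escaped` records the position and original value of each undecodable byte, which is the information a custom mapping to ordinary characters would lose.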

If you want to map unassigned bytes to other characters, you should just
create a new error handler. There are caveats, though: such characters
cannot be distinguished from correctly decoded characters.
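As a rough sketch of what such a handler could look like: the handler name "latin1-fallback" and the function name are hypothetical, chosen only for this example. It maps each undecodable byte to the code point with the same number, which is the behaviour Rob describes for "web-1252". `codecs.register_error` is the standard registration call; note that the registration is global to the process.

```python
import codecs

def latin1_fallback(exc):
    """Hypothetical handler: treat unmapped bytes/characters as Latin-1."""
    bad = exc.object[exc.start:exc.end]
    if isinstance(exc, UnicodeDecodeError):
        # Undefined byte 0xNN becomes the code point U+00NN.
        return bad.decode('latin-1'), exc.end
    if isinstance(exc, UnicodeEncodeError):
        try:
            # Encode handlers may return bytes; U+00NN goes back to byte 0xNN.
            return bad.encode('latin-1'), exc.end
        except UnicodeEncodeError:
            # Characters above U+00FF have no such fallback.
            raise exc from None
    raise exc

codecs.register_error('latin1-fallback', latin1_fallback)

decoded = b'\x90\x91'.decode('windows-1252', 'latin1-fallback')
encoded = decoded.encode('windows-1252', 'latin1-fallback')

# Every byte value should survive a round trip under this assumption.
all_bytes = bytes(range(256))
restored = (all_bytes.decode('windows-1252', 'latin1-fallback')
                     .encode('windows-1252', 'latin1-fallback'))
```

The caveat applies here: after decoding, '\x90' is an ordinary C1 control character, so nothing marks it as having come from an undefined byte.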

The same problem exists with the UTF-8 encoding. WHATWG allows encoding and
decoding surrogate characters in the range U+d800-U+dfff. This is contrary
to the Unicode Standard and raises an error by default in Python. But you
can allow encoding and decoding of surrogate characters by explicitly
specifying the "surrogatepass" error handler.
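A short illustration of the default behaviour versus "surrogatepass", using a lone surrogate picked arbitrarily for the example (variable names are illustrative only):

```python
lone = '\ud800'  # an unpaired high surrogate

# The default (strict) handler refuses to encode it.
try:
    lone.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# With surrogatepass the surrogate is encoded as a three-byte sequence...
raw = lone.encode('utf-8', 'surrogatepass')

# ...which strict decoding rejects in turn,
try:
    raw.decode('utf-8')
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# while surrogatepass decodes it back to the same character.
back = raw.decode('utf-8', 'surrogatepass')
```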

