[Python-ideas] Support WHATWG versions of legacy encodings

Serhiy Storchaka storchaka at gmail.com
Thu Jan 11 04:55:38 EST 2018


09.01.18 23:15, Rob Speer wrote:
> There is an encoding with no name of its own. It's supported by every 
> current web browser and standardized by WHATWG. It's so prevalent that 
> if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you 
> will get this encoding _instead_. It is probably the second or third 
> most common text encoding in the world. And Python doesn't quite support it.
> 
> You can see the character table for this encoding at:
> https://encoding.spec.whatwg.org/index-windows-1252.txt
> 
> For the sake of discussion, let's call this encoding "web-1252". WHATWG 
> calls it "windows-1252", but notice that it's subtly different from 
> Python's "windows-1252" encoding. Python's windows-1252 has bytes that 
> are undefined:
> 
>  >>> b'\x90'.decode('windows-1252')
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 
> 0: character maps to <undefined>
> 
> In web-1252, the bytes that are undefined according to windows-1252 map 
> to the control characters in those positions in iso-8859-1 -- that is, 
> the Unicode codepoints with the same number as the byte. In web-1252, 
> b'\x90' would decode as '\u0090'.
> 
> This may seem like a silly encoding that encourages doing horrible 
> things with text. That's pretty much the case. But there's a reason 
> every Web browser implements it:
> 
> - It's compatible with windows-1252
> - Any sequence of bytes can be round-tripped through it without losing 
> information
> 
> It's not just this one encoding. WHATWG's encoding standard 
> (https://encoding.spec.whatwg.org/) 
> contains modified versions of windows-1250 through windows-1258 and 
> windows-874.

The way to solve this in Python is to use an error handler. The 
"surrogateescape" error handler is specifically designed for lossless, 
reversible decoding. It maps every undecodable byte in the range 
0x80-0xff to a single character in the range U+dc80-U+dcff. This allows 
you to distinguish correctly decoded characters from the escaped bytes, 
perform character-by-character processing of the decoded text, and 
encode the result back with the same encoding.

 >>> b'\x90\x91\x92\x93'.decode('windows-1252', 'surrogateescape')
'\udc90‘’“'
 >>> '\udc90‘’“'.encode('windows-1252', 'surrogateescape')
b'\x90\x91\x92\x93'
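
As a rough sketch of the "distinguish and process" part (the sample bytes 
and variable names here are made up for illustration), the escaped bytes 
can be picked out by their code point range, and ordinary string 
operations leave them untouched:

```python
# Sample input: valid windows-1252 text plus one unassigned byte (0x90).
raw = b'caf\xe9 \x90 \x93ok\x94'
text = raw.decode('windows-1252', 'surrogateescape')

# Escaped bytes are exactly the characters in U+DC80..U+DCFF;
# subtracting 0xDC00 recovers the original byte value.
escaped = [hex(ord(ch) - 0xdc00) for ch in text if 0xdc80 <= ord(ch) <= 0xdcff]

# Character-by-character processing does not disturb the escapes.
processed = text.upper()

# Encoding with the same codec and handler restores the raw byte.
encoded = processed.encode('windows-1252', 'surrogateescape')

print(escaped)   # ['0x90']
print(encoded)   # b'CAF\xc9 \x90 \x93OK\x94'
```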

If you want to map unassigned bytes to other characters, you can create 
a new error handler and register it with codecs.register_error(). There 
is a caveat: the resulting characters cannot be distinguished from 
correctly decoded characters.
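
A minimal sketch of such a handler, assuming you want the behaviour 
described in the quoted message (each unassigned byte becomes the code 
point with the same number). The handler name 'c1fallback' and the 
function name are hypothetical, not something Python ships:

```python
import codecs

def c1_fallback(exc):
    """Hypothetical handler: undecodable byte N <-> code point U+00NN."""
    bad = exc.object[exc.start:exc.end]
    if isinstance(exc, UnicodeDecodeError):
        # Decoding: replace each offending byte with the same-numbered char.
        return ''.join(chr(b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Encoding: only C1 control characters are turned back into bytes.
        if all(0x80 <= ord(ch) <= 0x9f for ch in bad):
            return bytes(ord(ch) for ch in bad), exc.end
    raise exc

# The registered name is an arbitrary choice for this example.
codecs.register_error('c1fallback', c1_fallback)

decoded = b'\x90\x91\x92\x93'.decode('windows-1252', 'c1fallback')
encoded = decoded.encode('windows-1252', 'c1fallback')

print(ascii(decoded))  # '\x90\u2018\u2019\u201c'
print(encoded)         # b'\x90\x91\x92\x93'
```

Note that after decoding, '\x90' is an ordinary U+0090 character, so 
nothing marks it as having come from an unassigned byte.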

The same problem exists with the UTF-8 encoding. WHATWG allows encoding 
and decoding surrogate characters in the range U+d800-U+dfff. This is 
contrary to the Unicode Standard and raises an error by default in 
Python. But you can allow encoding and decoding of surrogate characters 
by explicitly specifying the "surrogatepass" error handler.
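
For illustration, a short sketch with a lone surrogate (the variable 
names are arbitrary):

```python
lone = '\ud800'  # a lone surrogate, not valid in well-formed UTF-8

# The default (strict) handler refuses to encode it.
strict_failed = False
try:
    lone.encode('utf-8')
except UnicodeEncodeError:
    strict_failed = True

# "surrogatepass" encodes it with the usual three-byte pattern...
raw = lone.encode('utf-8', 'surrogatepass')

# ...and decodes that byte sequence back to the same code point.
back = raw.decode('utf-8', 'surrogatepass')

print(strict_failed)  # True
print(raw)            # b'\xed\xa0\x80'
print(back == lone)   # True
```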


