[Python-ideas] Support WHATWG versions of legacy encodings

Ivan Pozdeev vano at mail.mipt.ru
Tue Jan 9 16:51:55 EST 2018


First of all, many thanks for such an excellently written letter. It was a 
real pleasure to read.

On 10.01.2018 0:15, Rob Speer wrote:
> Hi! I joined this list because I'm interested in filling a gap in 
> Python's standard library, relating to text encodings.
>
> There is an encoding with no name of its own. It's supported by every 
> current web browser and standardized by WHATWG. It's so prevalent that 
> if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you 
> will get this encoding _instead_. It is probably the second or third 
> most common text encoding in the world. And Python doesn't quite 
> support it.
>
> You can see the character table for this encoding at:
> https://encoding.spec.whatwg.org/index-windows-1252.txt
>
> For the sake of discussion, let's call this encoding "web-1252". 
> WHATWG calls it "windows-1252", but notice that it's subtly different 
> from Python's "windows-1252" encoding. Python's windows-1252 has bytes 
> that are undefined:
>
> >>> b'\x90'.decode('windows-1252')
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 
> 0: character maps to <undefined>
>
> In web-1252, the bytes that are undefined according to windows-1252 
> map to the control characters in those positions in iso-8859-1 -- that 
> is, the Unicode codepoints with the same number as the byte. In 
> web-1252, b'\x90' would decode as '\u0090'.
According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does 
the same:

     "According to the information on Microsoft's and the Unicode 
Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; 
however, the Windows API |MultiByteToWideChar 
<http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx>| 
maps these to the corresponding C1 control codes 
<https://en.wikipedia.org/wiki/C0_and_C1_control_codes>."

And in the case of ISO-8859-1, the same treatment of the otherwise 
unused code points is even sanctioned by the standard 
( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ):

     "*ISO-8859-1* is the IANA 
<https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority> 
preferred name for this standard when supplemented with the C0 and C1 
control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes> 
from ISO/IEC 6429 <https://en.wikipedia.org/wiki/ISO/IEC_6429>"

And sure enough, these C1 control codes are exactly the Unicode code 
points with the same numbers 
( https://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29 ).

Since Windows is pretty much the reference implementation for 
"windows-xxxx" encodings, it even makes sense to alter the existing 
encodings rather than add new ones.
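
For what it's worth, the C1 fallback can already be expressed today with 
a custom error handler. A minimal sketch (mine; nothing like it ships in 
the stdlib or ftfy, and the handler name "c1fallback" is made up):

import codecs

# Map bytes that Python's windows-1252 leaves undefined to the Unicode
# code point with the same number, as MultiByteToWideChar reportedly does.
def c1_fallback(exc):
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(b) for b in bad), exc.end
    raise exc

codecs.register_error('c1fallback', c1_fallback)

print(b'\x90'.decode('windows-1252', errors='c1fallback'))  # '\x90' (U+0090)
print(b'\x90'.decode('latin-1'))   # also '\x90': Python's latin-1 already
                                   # covers the whole C1 range

Of course this only helps with decoding; a proper codec (or a changed 
table) would cover encoding as well.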

>
> This may seem like a silly encoding that encourages doing horrible 
> things with text. That's pretty much the case. But there's a reason 
> every Web browser implements it:
>
> - It's compatible with windows-1252
> - Any sequence of bytes can be round-tripped through it without losing 
> information
>
> It's not just this one encoding. WHATWG's encoding standard 
> (https://encoding.spec.whatwg.org/) contains modified versions of 
> windows-1250 through windows-1258 and windows-874.
>
> Support for these encodings matters to me, in part, because I maintain 
> a Unicode data-cleaning library, "ftfy". One thing it does is to 
> detect and undo encoding/decoding errors that cause mojibake, as long 
> as they're detectable and reversible. Looking at real-world examples 
> of text that has been damaged by mojibake, it's clear that lots of 
> text is transferred through what I'm calling the "web-1252" encoding, 
> in a way that's incompatible with Python's "windows-1252".
>
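
To make the mojibake point concrete (my own example, not one of ftfy's): 
the UTF-8 encoding of the Cyrillic capital letter A (U+0410) is 
b'\xd0\x90'. A browser told to decode those bytes as "windows-1252" 
produces 'Ð' followed by U+0090, while Python's windows-1252 refuses:

>>> '\u0410'.encode('utf-8')
b'\xd0\x90'
>>> b'\xd0\x90'.decode('windows-1252')
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 
1: character maps to <undefined>
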
> In order to be able to work with and fix this kind of text, ftfy 
> registers new codecs -- and I implemented this even before I knew that 
> they were standardized in Web browsers. When ftfy is imported, you can 
> decode text as "sloppy-windows-1252" (the name I chose for this 
> encoding), for example.
>
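
Registering such a codec, by the way, needs nothing beyond the charmap 
machinery the stdlib's own single-byte codecs are built on. A minimal 
sketch (mine, not ftfy's actual code; the codec name "web-1252" is only 
for illustration):

import codecs

# Start from Python's windows-1252 table and fill the five undefined
# positions with the code point equal to the byte value -- per the
# description above, that is what the WHATWG table amounts to here.
base = [bytes([i]).decode('windows-1252', errors='replace')
        for i in range(256)]
decoding_table = ''.join(chr(i) if ch == '\ufffd' else ch
                         for i, ch in enumerate(base))
# charmap_build is the same (undocumented) helper that the generated
# encodings/cp1252.py module uses to derive its encoding table.
encoding_table = codecs.charmap_build(decoding_table)

def _encode(input, errors='strict'):
    return codecs.charmap_encode(input, errors, encoding_table)

def _decode(input, errors='strict'):
    return codecs.charmap_decode(input, errors, decoding_table)

def _search(name):
    if name == 'web-1252':
        return codecs.CodecInfo(_encode, _decode, name='web-1252')
    return None

codecs.register(_search)

print(b'\x90\x80'.decode('web-1252'))      # '\x90€' instead of an exception
print('\u0090\u20ac'.encode('web-1252'))   # b'\x90\x80' -- full round-trip
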
> ftfy can tell people a sequence of steps that they can use in the 
> future to fix text that's like the text they provided. Very often, 
> these steps require the sloppy-windows-1252 or sloppy-windows-1251 
> encoding, which means the steps only work with ftfy imported, even for 
> people who are not using the features of ftfy.
>
> Support for these encodings also seems highly relevant to people who 
> use Python for web scraping, as it would be desirable to maximize 
> compatibility with what a Web browser would do.
>
> This really seems like it belongs in the standard library instead of 
> being an incidental feature of my library. I know that code in the 
> standard library has "one foot in the grave". I _want_ these legacy 
> encodings to have one foot in the grave. But some of them are 
> extremely common, and Python code should be able to deal with them.
>
> Adding these encodings to Python would be straightforward to 
> implement. Does this require a PEP, a pull request, or further discussion?
>

-- 
Regards,
Ivan
