[Python-ideas] Support WHATWG versions of legacy encodings
Ivan Pozdeev
vano at mail.mipt.ru
Tue Jan 9 16:51:55 EST 2018
First of all, many thanks for such an excellently written letter. It was
a real pleasure to read.
On 10.01.2018 0:15, Rob Speer wrote:
> Hi! I joined this list because I'm interested in filling a gap in
> Python's standard library, relating to text encodings.
>
> There is an encoding with no name of its own. It's supported by every
> current web browser and standardized by WHATWG. It's so prevalent that
> if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you
> will get this encoding _instead_. It is probably the second or third
> most common text encoding in the world. And Python doesn't quite
> support it.
>
> You can see the character table for this encoding at:
> https://encoding.spec.whatwg.org/index-windows-1252.txt
>
> For the sake of discussion, let's call this encoding "web-1252".
> WHATWG calls it "windows-1252", but notice that it's subtly different
> from Python's "windows-1252" encoding. Python's windows-1252 has bytes
> that are undefined:
>
> >>> b'\x90'.decode('windows-1252')
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position
> 0: character maps to <undefined>
>
> In web-1252, the bytes that are undefined according to windows-1252
> map to the control characters in those positions in iso-8859-1 -- that
> is, the Unicode codepoints with the same number as the byte. In
> web-1252, b'\x90' would decode as '\u0090'.
According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does
the same:
"According to the information on Microsoft's and the Unicode
Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused;
however, the Windows API MultiByteToWideChar
(http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx)
maps these to the corresponding C1 control codes
(https://en.wikipedia.org/wiki/C0_and_C1_control_codes)."
And ISO-8859-1 handles its unused code points the same way, even by the
standard itself ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ):
"ISO-8859-1 is the IANA
(https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority)
preferred name for this standard when supplemented with the C0 and C1
control codes (https://en.wikipedia.org/wiki/C0_and_C1_control_codes)
from ISO/IEC 6429 (https://en.wikipedia.org/wiki/ISO/IEC_6429)."
And sure enough, these "C1 control codes" are also the corresponding
Unicode code points (
https://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29 ).
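This is exactly what Python's built-in 'latin-1' codec already does:
every byte value decodes to the Unicode code point with the same number,
so any byte sequence round-trips losslessly.

```python
# Python's 'latin-1' follows the IANA definition: byte value == code point.
data = bytes(range(256))
text = data.decode('latin-1')
assert text == ''.join(chr(i) for i in range(256))
assert text.encode('latin-1') == data  # lossless round-trip
```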
Since Windows is pretty much the reference implementation for
"windows-xxxx" encodings, it even makes sense to alter the existing
encodings rather than add new ones.
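For what it's worth, such a codec is easy to register today with the
stdlib codecs machinery. A minimal sketch (the codec name "web-1252" is
this thread's working name, not a standard alias; codecs.charmap_build
is undocumented, but it is what the stdlib's own encodings modules use):

```python
import codecs

# Build a decoding table: Python's windows-1252, with the five undefined
# positions filled by the identity mapping (the C1 control characters).
cells = []
for i in range(256):
    try:
        cells.append(bytes([i]).decode('windows-1252'))
    except UnicodeDecodeError:
        cells.append(chr(i))
decoding_table = ''.join(cells)
encoding_table = codecs.charmap_build(decoding_table)

def _search(name):
    if name not in ('web-1252', 'web_1252'):
        return None
    return codecs.CodecInfo(
        name='web-1252',
        encode=lambda s, errors='strict':
            codecs.charmap_encode(s, errors, encoding_table),
        decode=lambda b, errors='strict':
            codecs.charmap_decode(b, errors, decoding_table),
    )

codecs.register(_search)

assert b'\x90'.decode('web-1252') == '\u0090'
assert '\u201c'.encode('web-1252') == b'\x93'
```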
>
> This may seem like a silly encoding that encourages doing horrible
> things with text. That's pretty much the case. But there's a reason
> every Web browser implements it:
>
> - It's compatible with windows-1252
> - Any sequence of bytes can be round-tripped through it without losing
> information
>
> It's not just this one encoding. WHATWG's encoding standard
> (https://encoding.spec.whatwg.org/) contains modified versions of
> windows-1250 through windows-1258 and windows-874.
>
> Support for these encodings matters to me, in part, because I maintain
> a Unicode data-cleaning library, "ftfy". One thing it does is to
> detect and undo encoding/decoding errors that cause mojibake, as long
> as they're detectable and reversible. Looking at real-world examples
> of text that has been damaged by mojibake, it's clear that lots of
> text is transferred through what I'm calling the "web-1252" encoding,
> in a way that's incompatible with Python's "windows-1252".
>
> In order to be able to work with and fix this kind of text, ftfy
> registers new codecs -- and I implemented this even before I knew that
> they were standardized in Web browsers. When ftfy is imported, you can
> decode text as "sloppy-windows-1252" (the name I chose for this
> encoding), for example.
>
> ftfy can tell people a sequence of steps that they can use in the
> future to fix text that's like the text they provided. Very often,
> these steps require the sloppy-windows-1252 or sloppy-windows-1251
> encoding, which means the steps only work with ftfy imported, even for
> people who are not using the features of ftfy.
>
> Support for these encodings also seems highly relevant to people who
> use Python for web scraping, as it would be desirable to maximize
> compatibility with what a Web browser would do.
>
> This really seems like it belongs in the standard library instead of
> being an incidental feature of my library. I know that code in the
> standard library has "one foot in the grave". I _want_ these legacy
> encodings to have one foot in the grave. But some of them are
> extremely common, and Python code should be able to deal with them.
>
> Adding these encodings to Python would be straightforward to
> implement. Does this require a PEP, a pull request, or further discussion?
>
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
--
Regards,
Ivan