[Python-ideas] Support WHATWG versions of legacy encodings
Rob Speer
rspeer at luminoso.com
Tue Jan 9 16:15:23 EST 2018
Hi! I joined this list because I'm interested in filling a gap in Python's
standard library, relating to text encodings.
There is an encoding with no name of its own. It's supported by every
current web browser and standardized by WHATWG. It's so prevalent that if
you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will
get this encoding _instead_. It is probably the second or third most common
text encoding in the world. And Python doesn't quite support it.
You can see the character table for this encoding at:
https://encoding.spec.whatwg.org/index-windows-1252.txt
For the sake of discussion, let's call this encoding "web-1252". WHATWG
calls it "windows-1252", but notice that it's subtly different from
Python's "windows-1252" encoding. Python's windows-1252 has bytes that are
undefined:
>>> b'\x90'.decode('windows-1252')
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0: character maps to <undefined>
In web-1252, the bytes that are undefined according to windows-1252 map to
the control characters in those positions in iso-8859-1 -- that is, the
Unicode codepoints with the same number as the byte. In web-1252, b'\x90'
would decode as '\u0090'.
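To make that rule concrete, here is a minimal sketch of the decoding behavior described above, built only on Python's existing "windows-1252" and the fallback to the byte's own codepoint. The function name decode_web_1252 is made up for illustration; it is not an existing API, and a real codec would use a precomputed table rather than a per-byte try/except.

```python
def decode_web_1252(data: bytes) -> str:
    """Decode bytes as "web-1252" (hypothetical helper, not a stdlib API).

    Each byte is decoded with Python's strict windows-1252 codec; bytes that
    codec leaves undefined fall back to the codepoint with the same number.
    """
    chars = []
    for byte in data:
        try:
            chars.append(bytes([byte]).decode("windows-1252"))
        except UnicodeDecodeError:
            # Undefined in Python's windows-1252: map to the same-numbered
            # codepoint, as iso-8859-1 would.
            chars.append(chr(byte))
    return "".join(chars)


# 0x80 is defined in windows-1252 (the euro sign); 0x90 is not.
assert decode_web_1252(b"\x80") == "\u20ac"
assert decode_web_1252(b"\x90") == "\u0090"
```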
This may seem like a silly encoding that encourages doing horrible things
with text. That's pretty much the case. But there are reasons every Web
browser implements it:
- It's compatible with windows-1252
- Any sequence of bytes can be round-tripped through it without losing
information
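The round-trip property in the second point can be checked directly. The following is a sketch under the assumption that "web-1252" is exactly Python's windows-1252 plus the same-numbered fallback for undefined bytes; the table and function names are hypothetical.

```python
def build_web_1252_tables():
    """Build decode/encode tables for the sketch (hypothetical helper)."""
    decode_table = []
    for byte in range(256):
        try:
            decode_table.append(bytes([byte]).decode("windows-1252"))
        except UnicodeDecodeError:
            # Fill the gap with the codepoint of the same number.
            decode_table.append(chr(byte))
    encode_table = {char: byte for byte, char in enumerate(decode_table)}
    return decode_table, encode_table


DECODE_TABLE, ENCODE_TABLE = build_web_1252_tables()


def web_1252_decode_sketch(data: bytes) -> str:
    return "".join(DECODE_TABLE[byte] for byte in data)


def web_1252_encode_sketch(text: str) -> bytes:
    return bytes(ENCODE_TABLE[char] for char in text)


# Every possible byte value survives decode followed by encode.
all_bytes = bytes(range(256))
assert web_1252_encode_sketch(web_1252_decode_sketch(all_bytes)) == all_bytes

# The bytes that needed the fallback are the ones strict windows-1252 rejects.
FALLBACK_BYTES = [b for b in range(0x80, 0xA0) if DECODE_TABLE[b] == chr(b)]
```

With Python's strict "windows-1252", the same check fails as soon as the input contains one of the bytes listed in FALLBACK_BYTES.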
It's not just this one encoding. WHATWG's encoding standard
(https://encoding.spec.whatwg.org/) contains modified versions of
windows-1250 through windows-1258 and windows-874.
Support for these encodings matters to me, in part, because I maintain a
Unicode data-cleaning library, "ftfy". One thing it does is to detect and
undo encoding/decoding errors that cause mojibake, as long as they're
detectable and reversible. Looking at real-world examples of text that has
been damaged by mojibake, it's clear that lots of text is transferred
through what I'm calling the "web-1252" encoding, in a way that's
incompatible with Python's "windows-1252".
In order to be able to work with and fix this kind of text, ftfy registers
new codecs -- and I implemented this even before I knew that they were
standardized in Web browsers. When ftfy is imported, you can decode text as
"sloppy-windows-1252" (the name I chose for this encoding), for example.
ftfy can tell people a sequence of steps that they can use in the future to
fix text that's like the text they provided. Very often, these steps
require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which
means the steps only work with ftfy imported, even for people who are not
using the features of ftfy.
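For readers who haven't registered a codec before, this is roughly what it involves. The sketch below is not ftfy's actual code; it uses the standard codecs.register hook with a made-up codec name, "web-1252-sketch", and builds its table from Python's existing windows-1252 as an assumption about what the encoding should contain.

```python
import codecs

CODEC_NAME = "web-1252-sketch"  # hypothetical name, chosen for this example


def _build_decoding_table() -> str:
    """Return a 256-character string: index = byte value, value = character."""
    chars = []
    for byte in range(256):
        try:
            chars.append(bytes([byte]).decode("windows-1252"))
        except UnicodeDecodeError:
            chars.append(chr(byte))  # fill gaps with the same-numbered codepoint
    return "".join(chars)


DECODING_TABLE = _build_decoding_table()
# charmap_build produces the reverse mapping, as the stdlib charmap codecs do.
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, DECODING_TABLE)


def _search(name):
    """Codec search function: return CodecInfo for our name, else None."""
    # Depending on the Python version, the lookup name may arrive with
    # hyphens or with underscores, so normalize before comparing.
    if name.replace("-", "_") == CODEC_NAME.replace("-", "_"):
        return codecs.CodecInfo(name=CODEC_NAME, encode=_encode, decode=_decode)
    return None


codecs.register(_search)

assert b"\x90".decode(CODEC_NAME) == "\u0090"
assert "\u20ac\u0090".encode(CODEC_NAME) == b"\x80\x90"
```

Once the search function is registered, the name works anywhere an encoding name is accepted, which is why the steps only work after the registering module has been imported.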
Support for these encodings also seems highly relevant to people who use
Python for web scraping, as it would be desirable to maximize compatibility
with what a Web browser would do.
This really seems like it belongs in the standard library instead of being
an incidental feature of my library. I know that code in the standard
library has "one foot in the grave". I _want_ these legacy encodings to
have one foot in the grave. But some of them are extremely common, and
Python code should be able to deal with them.
Adding these encodings to Python would be straightforward to implement.
Does this require a PEP, a pull request, or further discussion?