This may seem like a silly encoding that encourages doing horrible things with text. That's pretty much the case. But there's a reason every Web browser implements it:

- It's compatible with windows-1252

- Any sequence of bytes can be round-tripped through it without losing information

It's not just this one encoding. WHATWG's encoding standard (https://encoding.spec.whatwg.org/) contains modified versions of windows-1250 through windows-1258 and windows-874.

Support for these encodings matters to me, in part, because I maintain a Unicode data-cleaning library, "ftfy". One thing it does is to detect and undo encoding/decoding errors that cause mojibake, as long as they're detectible and reversible. Looking at real-world examples of text that has been damaged by mojibake, it's clear that lots of text is transferred through what I'm calling the "web-1252" encoding, in a way that's incompatible with Python's "windows-1252".

In order to be able to work with and fix this kind of text, ftfy registers new codecs -- and I implemented this even before I knew that they were standardized in Web browsers. When ftfy is imported, you can decode text as "sloppy-windows-1252" (the name I chose for this encoding), for example.

ftfy can tell people a sequence of steps that they can use in the future to fix text that's like the text they provided. Very often, these steps require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which means the steps only work with ftfy imported, even for people who are not using the features of ftfy.

Support for these encodings also seems highly relevant to people who use Python for web scraping, as it would be desirable to maximize compatibility with what a Web browser would do.

This really seems like it belongs in the standard library instead of being an incidental feature of my library. I know that code in the standard library has "one foot in the grave". I _want_ these legacy encodings to have one foot in the grave. But some of them are extremely common, and Python code should be able to deal with them.

Adding these encodings to Python would be straightforward to implement. Does this require a PEP, a pull request, or further discussion?

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/