By now, I agree that I should implement these codecs in a package. I accept that I've established the use case, but haven't sufficiently established why it belongs in Python itself.

The package can easily be ftfy -- although I should point out that what's in ftfy at the moment isn't quite right! "ftfy.bad_codecs" implements the "fall back on Latin-1" idea that many people here have intuitively suggested, because I implemented it based only on the evidence of text I had seen; I didn't know at the time that there was an actual standard involved. The result differs subtly from what Web browsers do in cases outside the C1 range. But of course I can work on re-implementing the encodings correctly based on what I've learned.
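
To make the "fall back on Latin-1" idea concrete, here is a minimal sketch for the Windows-1252 case only, using nothing but the standard library. The handler name "latin1_fallback" and the function name decode_sloppy_1252 are made up for illustration; this is not the actual ftfy.bad_codecs code, just the general technique. Note that the five bytes Python's cp1252 codec leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) all sit inside the C1 range, so this sketch does not exhibit the subtle differences mentioned above.

```python
import codecs


def _latin1_fallback(exc):
    """Error handler: decode the offending bytes as Latin-1 instead of failing."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch only covers decoding; re-raise anything else.
        raise exc
    bad_bytes = exc.object[exc.start:exc.end]
    # Latin-1 maps every byte 0xNN to the code point U+00NN, so it never fails.
    return bad_bytes.decode("latin-1"), exc.end


# "latin1_fallback" is a hypothetical name chosen for this example.
codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_sloppy_1252(data: bytes) -> str:
    """Decode as Windows-1252, mapping undefined bytes to C1 control characters."""
    return data.decode("cp1252", errors="latin1_fallback")
```

With the strict standard codec, b"\x81".decode("cp1252") raises UnicodeDecodeError; decode_sloppy_1252(b"\x81") returns the control character U+0081 instead, while bytes that cp1252 does define (such as 0x93 and 0x94, the curly quotes) decode as usual.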

I think it would be best if these encodings were actually implemented in the "webencodings" package, or in a package that both ftfy and webencodings could use. I have certainly encountered cases in web scraping where, because webencodings doesn't use the same Windows-1252 as the actual web does, I have had to decode the text even more incorrectly using Latin-1 and _then_ run it through ftfy -- in effect, adding a layer of mojibake so I can fix two layers of mojibake. That's kind of absurd and it's why I thought this belonged in Python itself. But I'll talk to the webencodings author instead.

On Tue, 6 Feb 2018 at 05:12 Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Nick Coghlan writes:

 > Personally, I think a See Also note pointing to ftfy in the "codecs"
 > module documentation would be quite a reasonable outcome of the thread

Yes please.  The more I hear about purported use cases (with the
exception of Nathaniel's "don't crash when I manipulate the DOM" case,
which is best handled by errors='surrogateescape'), the less I see
anything "standard" about them.
