
By now, it sounds right to me that I should implement these codecs in a package. I accept that I've established the use case but haven't sufficiently established why it belongs in Python itself.

The package can easily be ftfy -- although I should point out that what's in ftfy at the moment isn't quite right! "ftfy.bad_codecs" implements the "fall back on Latin-1" idea that many people here have intuitively suggested. I implemented it based only on the evidence of text I saw; I didn't know at the time that there was an actual standard involved. The result differs subtly from what Web browsers do in cases outside the C1 range. But of course I can work on re-implementing the encodings correctly based on what I've learned.

I think it would be best if these encodings were actually implemented in the "webencodings" package, or in a package that both ftfy and webencodings could use. I have certainly encountered cases in web scraping where, because webencodings doesn't use the same Windows-1252 as the actual web does, I have had to decode the text even more incorrectly using Latin-1 and _then_ run it through ftfy -- in effect, adding a layer of mojibake so I can fix two layers of mojibake. That's kind of absurd, and it's why I thought this belonged in Python itself. But I'll talk to the webencodings author instead.

On Tue, 6 Feb 2018 at 05:12 Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
> Nick Coghlan writes:
>
> > Personally, I think a See Also note pointing to ftfy in the "codecs" module documentation would be quite a reasonable outcome of the thread
>
> Yes please. The more I hear about purported use cases (with the exception of Nathaniel's "don't crash when I manipulate the DOM" case, which is best handled by errors='surrogateescape'), the less I see anything "standard" about them.
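
To make the Windows-1252 mismatch described at the top of this message concrete, here is a minimal sketch of the "fall back on Latin-1" idea for that one encoding. The function name decode_web_1252 is hypothetical; it is not the API of ftfy or webencodings. For Windows-1252 specifically, falling back to the same-numbered C1 control character should agree with what browsers do for the five bytes Python's cp1252 leaves undefined; as noted above, other code pages can differ, so this is not a general recipe.

```python
def decode_web_1252(data: bytes) -> str:
    """Hypothetical helper: decode as cp1252, but map the bytes that
    Python's cp1252 codec leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D)
    to the code point with the same number, as Latin-1 would."""
    chars = []
    for byte in data:
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in Python's cp1252: fall back to the C1 control.
            chars.append(chr(byte))
    return "".join(chars)


sample = b"\x93quoted\x94 \x81"

# Python's strict cp1252 refuses the byte 0x81 outright.
try:
    sample.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Latin-1 never fails, but turns the curly quotes into C1 controls,
# which is the extra layer of mojibake mentioned above.
as_latin1 = sample.decode("latin-1")

# The fallback decoder keeps the curly quotes and passes 0x81 through.
as_web = decode_web_1252(sample)
```

Decoding byte by byte is slow and is only meant to show the behaviour; a real codec would use a lookup table.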