
On 05.02.2018 04:01, Nick Coghlan wrote:
On 2 February 2018 at 16:52, Steven D'Aprano <steve@pearwood.info> wrote:
If it were my decision, I'd have these codecs raise a warning (not an error) when used for encoding. But I guess some people will consider that either going too far or not far enough :-)
Rob pointed out that one of the main use cases for these codecs is when going "Oh, this was decoded with a WHATWG encoding, which isn't right, so I need to re-encode it with that encoding, and then decode it with the right encoding". So encoding is very much part of the usage model: it's needed when you've received the data over a Unicode based interface rather than a binary one.
So the use case for encoding into WHATWG is to undo the WHATWG mappings by then decoding using the standard mappings and using an error handler to deal with decoding issues? This strikes me as a rather unrealistic use case, esp. since it's likely that the original decoding was also done in Python, so the much more intuitive approach to fix this problem would be to not use WHATWG encodings for the initial decoding in the first place.
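To make the round trip under discussion concrete, here is a minimal sketch using only the standard library. It is not the ftfy codecs and not a proposed stdlib API: the error handler name `c1_passthrough` is hypothetical, and it only approximates WHATWG windows-1252 behaviour by passing through the five bytes that Python's `cp1252` leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) as the matching C1 control code points.

```python
import codecs


def c1_passthrough(exc):
    """Hypothetical error handler: treat values cp1252 leaves undefined
    as latin-1, which is roughly what the WHATWG mapping does."""
    chunk = exc.object[exc.start:exc.end]
    if isinstance(exc, UnicodeDecodeError):
        # chunk is bytes; latin-1 maps every byte to the same code point.
        return chunk.decode("latin-1"), exc.end
    if isinstance(exc, UnicodeEncodeError):
        try:
            # chunk is str; encode error handlers may return bytes directly.
            return chunk.encode("latin-1"), exc.end
        except UnicodeEncodeError:
            # Not a passthrough character: report the original error.
            raise exc from None
    raise exc


codecs.register_error("c1_passthrough", c1_passthrough)

original = "Á"                    # U+00C1
raw = original.encode("utf-8")    # two bytes: 0xC3 0x81

# Strict cp1252 cannot decode byte 0x81 at all.
try:
    raw.decode("cp1252")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# A WHATWG-style decode succeeds but yields mojibake.
mojibake = raw.decode("cp1252", errors="c1_passthrough")

# The repair: re-encode with the same lenient mapping, then decode correctly.
repaired = mojibake.encode("cp1252", errors="c1_passthrough").decode("utf-8")
```

The point of the sketch is that the repair step needs the encoding direction as well: strict `cp1252` cannot encode U+0081, so text that arrived already decoded over a Unicode interface cannot be turned back into the original bytes without the lenient mapping.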
So I think the *use case* for the WHATWG encodings has been pretty well established. What hasn't been established is whether our answer to "How do I handle the WHATWG encodings?" is going to be:
* "Here they are in the standard library (for 3.8+)!"; or
* "These are available as part of the 'ftfy' library on PyPI, which also helps fix various other problems in decoded text"
Personally, I think a See Also note pointing to ftfy in the "codecs" module documentation would be quite a reasonable outcome of the thread - when it comes to consuming arbitrary data from the internet and cleaning up decoding issues, ftfy's data introspection based approach is likely to be far easier to start with than characterising the common errors for specific data sources and applying them individually, and if you're already using ftfy to figure out which fixes are needed, then it shouldn't be a big deal to keep it around for the more relaxed codecs that it provides.
I think we've been going around in circles long enough.
Let's leave things as they are and perhaps add a section to the codecs documentation, as you suggest, on where to find other encodings which a user might want to use and tools to help with fixing encoding or decoding errors.
Here's a random list from PyPI with some packages:
https://pypi.python.org/pypi/ebcdic/
https://pypi.python.org/pypi/latexcodec/
https://pypi.python.org/pypi/mysql-latin1-codec/
https://pypi.python.org/pypi/cbmcodecs/
Perhaps also fun variants such as:
https://pypi.python.org/pypi/emoji-encoding/
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Experts (#1, Feb 05 2018)