Re: [Python-ideas] Support WHATWG versions of legacy encodings

Feb. 5, 2018

On 05.02.2018 04:01, Nick Coghlan wrote:
...
On 2 February 2018 at 16:52, Steven D'Aprano <steve@pearwood.info> wrote:
...
If it were my decision, I'd have these codecs raise a warning (not an
error) when used for encoding. But I guess some people will consider
that either going too far or not far enough :-)

Rob pointed out that one of the main use cases for these codecs is
when going "Oh, this was decoded with a WHATWG encoding, which isn't
right, so I need to re-encode it with that encoding, and then decode
it with the right encoding". So encoding is very much part of the
usage model: it's needed when you've received the data over a Unicode
based interface rather than a binary one.

So the use case for encoding into WHATWG is to undo the WHATWG
mappings by then decoding using the standard mappings and using
an error handler to deal with decoding issues?

This strikes me as a rather unrealistic use case, esp. since
it's likely that the original decoding was also done in Python,
so the much more intuitive approach to fix this problem would
be to not use WHATWG encodings for the initial decoding in the first
place.
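
To make the round trip under discussion concrete, here is a minimal
sketch using only stdlib codecs. The helper name undo_wrong_decoding is
made up for illustration, and "cp1252" stands in for the WHATWG
windows-1252 variant, which the standard library does not provide. The
second half shows where the stdlib mapping falls short: WHATWG maps byte
0x81 to U+0081, while Python's cp1252 leaves it undefined.

```python
def undo_wrong_decoding(text, wrong_encoding, right_encoding):
    """Re-encode with the codec that was wrongly used, then decode properly."""
    return text.encode(wrong_encoding).decode(right_encoding)


# "café" stored as UTF-8 but decoded as windows-1252 arrives as "cafÃ©".
garbled = "café".encode("utf-8").decode("cp1252")
repaired = undo_wrong_decoding(garbled, "cp1252", "utf-8")

# "Á" is b"\xc3\x81" in UTF-8. A WHATWG decoder turns that into
# U+00C3 U+0081, but Python's cp1252 has no mapping for U+0081,
# so the re-encoding step raises instead of restoring the bytes.
whatwg_garbled = "\u00c3\u0081"
try:
    undo_wrong_decoding(whatwg_garbled, "cp1252", "utf-8")
    stdlib_can_undo = True
except UnicodeEncodeError:
    stdlib_can_undo = False
```

The first case works because every character involved has a cp1252
mapping; the second fails only because of the five byte values that the
stdlib codec leaves undefined.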
...
So I think the *use case* for the WHATWG encodings has been pretty
well established. What hasn't been established is whether our answer
to "How do I handle the WHATWG encodings?" is going to be:
* "Here they are in the standard library (for 3.8+)!"; or
* "These are available as part of the 'ftfy' library on PyPI, which
also helps fix various other problems in decoded text"
Personally, I think a See Also note pointing to ftfy in the "codecs"
module documentation would be quite a reasonable outcome of the thread
- when it comes to consuming arbitrary data from the internet and
cleaning up decoding issues, ftfy's data introspection based approach
is likely to be far easier to start with than characterising the
common errors for specific data sources and applying them
individually, and if you're already using ftfy to figure out which
fixes are needed, then it shouldn't be a big deal to keep it around
for the more relaxed codecs that it provides.

I think we've been going around in circles long enough.
Let's leave things as they are and perhaps add a section to the codecs
documentation, as you suggest, on where to find other encodings which
a user might want to use and tools to help with fixing encoding or
decoding errors.

Here's a random list from PyPI with some packages:
https://pypi.python.org/pypi/ebcdic/
https://pypi.python.org/pypi/latexcodec/
https://pypi.python.org/pypi/mysql-latin1-codec/
https://pypi.python.org/pypi/cbmcodecs/

Perhaps fun variants such as:
https://pypi.python.org/pypi/emoji-encoding/
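
For reference, a third-party package can make an extra encoding
available through codecs.register(). The following is only a sketch of
that mechanism, not the code of any package listed above: the codec name
"whatwg1252sketch" and all helper names are invented for illustration.
The table starts from the stdlib cp1252 mapping and fills the undefined
bytes with the matching C1 control characters, as WHATWG windows-1252
does.

```python
import codecs


def _build_decoding_table():
    """Map each byte via cp1252, falling back to the same code point."""
    chars = []
    for byte in range(256):
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in Python's cp1252; pass the byte value through.
            chars.append(chr(byte))
    return "".join(chars)


DECODING_TABLE = _build_decoding_table()
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)


def _search(name):
    # Hypothetical codec name; return None so other lookups proceed.
    if name == "whatwg1252sketch":
        return codecs.CodecInfo(encode=_encode, decode=_decode, name=name)
    return None


codecs.register(_search)
```

Once the search function is registered, the name works with the usual
str.encode() and bytes.decode() calls, e.g.
b"\xc3\x81".decode("whatwg1252sketch").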

-- 
Marc-Andre Lemburg
eGenix.com
