[Python-ideas] Support WHATWG versions of legacy encodings

Sun Feb 4 22:01:15 EST 2018

On 2 February 2018 at 16:52, Steven D'Aprano <steve at pearwood.info> wrote:
> If it were my decision, I'd have these codecs raise a warning (not an
> error) when used for encoding. But I guess some people will consider
> that either going too far or not far enough :-)

Rob pointed out that one of the main use cases for these codecs is
when going "Oh, this was decoded with a WHATWG encoding, which isn't
right, so I need to re-encode it with that encoding, and then decode
it with the right encoding". So encoding is very much part of the
usage model: it's needed when you've received the data over a
Unicode-based interface rather than a binary one.
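
A minimal sketch of that round trip follows. The helper name
`redecode` is hypothetical, and the standard library's "cp1252" codec
stands in for the WHATWG windows-1252 codec, which the standard
library does not provide. The second half shows where the stand-in
breaks down: Python's "cp1252" leaves byte 0x81 undefined, whereas the
WHATWG version maps it to U+0081.

```python
def redecode(text, wrong_encoding, right_encoding):
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    return text.encode(wrong_encoding).decode(right_encoding)


original = "caf\u00e9"  # "café"

# Simulate the upstream mistake: UTF-8 bytes decoded as windows-1252.
mojibake = original.encode("utf-8").decode("cp1252")

# Re-encode with the wrong codec, then decode with the right one.
fixed = redecode(mojibake, "cp1252", "utf-8")

# "\u00c1" (UTF-8 bytes C3 81) as a WHATWG-style decoder would have rendered it:
# byte C3 becomes U+00C3 and byte 81 becomes the control character U+0081.
whatwg_style = "\u00c3\u0081"
try:
    redecode(whatwg_style, "cp1252", "utf-8")
    strict_codec_failed = False
except UnicodeEncodeError:
    # Python's "cp1252" has no byte for U+0081, so the round trip cannot start.
    strict_codec_failed = True
```

The first case works because every character in the mojibake has a
byte in Python's "cp1252". The second fails at the encoding step,
which is the gap a WHATWG-compatible codec would fill.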

So I think the *use case* for the WHATWG encodings has been pretty
well established. What hasn't been established is whether our answer
to "How do I handle the WHATWG encodings?" is going to be:

* "Here they are in the standard library (for 3.8+)!"; or
* "These are available as part of the 'ftfy' library on PyPI, which
also helps fix various other problems in decoded text"

Personally, I think a See Also note pointing to ftfy in the "codecs"
module documentation would be quite a reasonable outcome of the
thread. When it comes to consuming arbitrary data from the internet
and cleaning up decoding issues, ftfy's approach of introspecting the
data is likely to be far easier to start with than characterising the
common errors for specific data sources and applying the fixes
individually. And if you're already using ftfy to figure out which
fixes are needed, then it shouldn't be a big deal to keep it around
for the more relaxed codecs that it provides.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia