Re: [Python-ideas] Support WHATWG versions of legacy encodings

Feb. 5, 2018

On 5 February 2018 at 06:40, Serhiy Storchaka <storchaka@gmail.com> wrote:
...
05.02.18 05:01, Nick Coghlan wrote:
...
On 2 February 2018 at 16:52, Steven D'Aprano <steve@pearwood.info> wrote:
...
If it were my decision, I'd have these codecs raise a warning (not an
error) when used for encoding. But I guess some people will consider
that either going too far or not far enough :-)
Rob pointed out that one of the main use cases for these codecs is
when going "Oh, this was decoded with a WHATWG encoding, which isn't
right, so I need to re-encode it with that encoding, and then decode
it with the right encoding". So encoding is very much part of the
usage model: it's needed when you've received the data over a Unicode
based interface rather than a binary one.
Wasn't the "surrogateescape" error handler designed for this purpose?
WHATWG encodings solve the same problem as "surrogateescape", but
1) They use a different range for representing unmapped characters.
2) Not all unmapped characters can be decoded, so decoding is lossy and
a round trip does not always work.
Surrogateescape is for when the source of the Unicode data is also
Python. The WHATWG encodings (AIUI) can be used by any tool to attempt
to decode data. If that "I think this is what it is" data is passed as
Unicode to Python, and the Python code determines that the guess was
wrong, then re-encoding it using the WHATWG encoding lets you try
again to decode it properly. The result would be lossy, yes. Whether
this is a problem, I can't say, as I've never encountered the sorts of
use cases being discussed here. I assume that the people advocating
for this have, and consider this option, even if it's lossy, to be the
best approach.
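
To make the two round trips concrete, here is a rough sketch as I
understand them. The whatwg_1252_decode/whatwg_1252_encode helpers are
hypothetical stand-ins I wrote for illustration, not the proposed codecs:
they only assume that the WHATWG flavour of windows-1252 maps the five
bytes Python's cp1252 leaves undefined to the same-numbered C1 control
characters.

```python
# Hypothetical helpers approximating a WHATWG-style windows-1252 codec.
# Python's own cp1252 leaves these five bytes undefined.
CP1252_GAPS = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def whatwg_1252_decode(data: bytes) -> str:
    """Decode like cp1252, but map the undefined bytes to C1 controls."""
    return "".join(
        chr(b) if b in CP1252_GAPS else bytes([b]).decode("cp1252")
        for b in data
    )


def whatwg_1252_encode(text: str) -> bytes:
    """Inverse of whatwg_1252_decode for text that it produced."""
    return b"".join(
        bytes([ord(ch)]) if ord(ch) in CP1252_GAPS else ch.encode("cp1252")
        for ch in text
    )


# UTF-8 bytes that some other tool wrongly guessed to be windows-1252.
raw = "Атлас".encode("utf-8")

# Case 1: the wrong guess arrives as str, so undo it and decode again.
mojibake = whatwg_1252_decode(raw)
repaired = whatwg_1252_encode(mojibake).decode("utf-8")

# Case 2: Python did the decoding itself, using surrogateescape.
escaped = raw.decode("cp1252", errors="surrogateescape")
restored = escaped.encode("cp1252", errors="surrogateescape")

print(repaired == "Атлас", restored == raw)
print("\x90" in mojibake, "\udc90" in escaped)
```

The second print shows Serhiy's point 1: the same undecodable byte 0x90
ends up as U+0090 in one scheme and as the lone surrogate U+DC90 in the
other, so neither can undo the other's output.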

For a non-stdlib based solution, I see no problem with this. If the
codecs are to go into the stdlib, then I do think we should be able to
document clearly what the use case is for these encodings, and why a
user reading the codecs docs should pick these encodings over another
one. That's where I think the proposal currently falls down - not in
the usefulness of the codecs, nor in the naming (both of which seem to
me to have been covered) but in providing a good enough explanation
*to non-specialists* of why these codecs exist, how they should be
used, and what the caveats are. Something that we'd be comfortable
including in the docs.

Paul

Paul Moore