[Python-ideas] Support WHATWG versions of legacy encodings

Mon Feb 5 05:07:03 EST 2018

On 5 February 2018 at 06:40, Serhiy Storchaka <storchaka at gmail.com> wrote:
> 05.02.18 05:01, Nick Coghlan wrote:
>>
>> On 2 February 2018 at 16:52, Steven D'Aprano <steve at pearwood.info> wrote:
>>>
>>> If it were my decision, I'd have these codecs raise a warning (not an
>>> error) when used for encoding. But I guess some people will consider
>>> that either going too far or not far enough :-)
>>
>>
>> Rob pointed out that one of the main use cases for these codecs is
>> when going "Oh, this was decoded with a WHATWG encoding, which isn't
>> right, so I need to re-encode it with that encoding, and then decode
>> it with the right encoding". So encoding is very much part of the
>> usage model: it's needed when you've received the data over a Unicode
>> based interface rather than a binary one.
>
>
> Wasn't the "surrogateescape" error handler designed for this purpose?
>
> WHATWG encodings solve the same problem as "surrogateescape", but
>
> 1) They use a different range for representing unmapped characters.
> 2) Not all unmapped characters can be decoded, thus decoding is lossy, and
> a round-trip does not always work.

Surrogateescape is for when the source of the Unicode data is also
Python. The WHATWG encodings (AIUI) can be used by any tool to attempt
to decode data. If that "I think this is what it is" data is passed as
Unicode to Python, and the Python code determines that the guess was
wrong, then re-encoding it using the WHATWG encoding lets you try
again to decode it properly. The result would be lossy, yes. Whether
this is a problem, I can't say, as I've never encountered the sorts of
use cases being discussed here. I assume that the people advocating
for this have, and consider this option, even if it's lossy, to be the
best approach.
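
To make that concrete, here is a rough sketch of the re-encode-then-decode
step as I understand it. The assumptions are mine: the other tool guessed
windows-1252, the data was really UTF-8, and the stdlib "cp1252" codec
stands in for the WHATWG one (which it only approximates). The helper name
repair() is made up for illustration.

```python
def repair(text, guessed="cp1252", actual="utf-8"):
    """Undo a wrong decoding guess, then decode with the right codec."""
    return text.encode(guessed).decode(actual)

# Some other tool received UTF-8 bytes but guessed windows-1252.
original = "café".encode("utf-8")
wrong_guess = original.decode("cp1252")   # mojibake: 'cafÃ©'
repaired = repair(wrong_guess)            # back to 'café'

# surrogateescape round-trips unmapped bytes, but only between Python
# programs: byte 0x81 becomes the lone surrogate U+DC81.
escaped = b"\x81".decode("cp1252", errors="surrogateescape")
round_tripped = escaped.encode("cp1252", errors="surrogateescape")

# A WHATWG decoder maps byte 0x81 to U+0081 instead. The stdlib cp1252
# codec has no mapping for that character, so it cannot undo the guess.
try:
    repair("\u00c3\u0081")   # UTF-8 for 'Á' as a WHATWG decoder shows it
    stdlib_handles_c1 = True
except UnicodeEncodeError:
    stdlib_handles_c1 = False
```

The last case is the gap as I read it: text that came from a WHATWG
decoder can contain characters the stdlib codec refuses to encode, so the
repair step fails exactly where it is needed.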

For a non-stdlib based solution, I see no problem with this. If the
codecs are to go into the stdlib, then I do think we should be able to
document clearly what the use case is for these encodings, and why a
user reading the codecs docs should pick these encodings over another
one. That's where I think the proposal currently falls down - not in
the usefulness of the codecs, nor in the naming (both of which seem to
me to have been covered) but in providing a good enough explanation
*to non-specialists* of why these codecs exist, how they should be
used, and what the caveats are. Something that we'd be comfortable
including in the docs.

Paul