
On 5 February 2018 at 06:40, Serhiy Storchaka <storchaka@gmail.com> wrote:
05.02.18 05:01, Nick Coghlan wrote:
On 2 February 2018 at 16:52, Steven D'Aprano <steve@pearwood.info> wrote:
If it were my decision, I'd have these codecs raise a warning (not an error) when used for encoding. But I guess some people will consider that either going too far or not far enough :-)
Rob pointed out that one of the main use cases for these codecs is when going "Oh, this was decoded with a WHATWG encoding, which isn't right, so I need to re-encode it with that encoding, and then decode it with the right encoding". So encoding is very much part of the usage model: it's needed when you've received the data over a Unicode based interface rather than a binary one.
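As a rough illustration of the re-encode-then-decode workflow described above, here is a minimal Python sketch. It uses the stdlib "cp1252" codec as a stand-in for the WHATWG windows-1252 encoding (an assumption: the proposed codecs are not in the stdlib, and the two differ for a handful of bytes). The redecode helper name is hypothetical.

```python
def redecode(text, wrong_encoding, right_encoding):
    """Undo a wrong decode: recover the original bytes, then decode them again."""
    raw = text.encode(wrong_encoding)   # back to the bytes that were received
    return raw.decode(right_encoding)   # decode with the encoding actually used

# UTF-8 bytes for "café" that some tool mistakenly decoded as windows-1252.
# cp1252 stands in for the WHATWG codec here (assumption, see above).
mojibake = "café".encode("utf-8").decode("cp1252")
fixed = redecode(mojibake, "cp1252", "utf-8")
```

This only works when the wrong encoding can turn every character it produced back into the original byte, which is why encoding support matters for this use case.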
Wasn't the "surrogateescape" error handler designed for this purpose?
WHATWG encodings solve the same problem as "surrogateescape", but:
1) They use a different range for representing unmapped characters.
2) Not all unmapped characters can be decoded, so decoding is lossy and a round trip does not always work.
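The first point can be seen with a byte that Python's own "cp1252" codec leaves undefined, such as 0x81. The following is only a sketch: it assumes the WHATWG windows-1252 behaviour of mapping such a byte to the C1 control character with the same value (U+0081), and emulates that by hand because no WHATWG codec ships in the stdlib.

```python
raw = b"\x81"  # undefined in Python's cp1252 codec

# Strict decoding fails outright.
try:
    raw.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape carries the byte through as a lone surrogate in U+DC80..U+DCFF.
escaped = raw.decode("cp1252", errors="surrogateescape")
round_tripped = escaped.encode("cp1252", errors="surrogateescape")

# Assumed WHATWG-style result: the C1 control character with the same value
# as the byte (emulated by hand, not produced by a real codec).
whatwg_style = chr(raw[0])
```

The lone surrogate is only meaningful to code that knows about "surrogateescape", whereas the C1 control character is an ordinary code point that any Unicode-aware tool could have produced.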
Surrogateescape is for when the source of the Unicode data is also Python. The WHATWG encodings (AIUI) can be used by any tool to attempt to decode data. If that "I think this is what it is" data is passed as Unicode to Python, and the Python code determines that the guess was wrong, then re-encoding it using the WHATWG encoding lets you try again to decode it properly.

The result would be lossy, yes. Whether this is a problem, I can't say, as I've never encountered the sorts of use cases being discussed here. I assume that the people advocating for this have, and consider this option, even if it's lossy, to be the best approach.

For a non-stdlib based solution, I see no problem with this. If the codecs are to go into the stdlib, then I do think we should be able to document clearly what the use case is for these encodings, and why a user reading the codecs docs should pick these encodings over another one. That's where I think the proposal currently falls down - not in the usefulness of the codecs, nor in the naming (both of which seem to me to have been covered) but in providing a good enough explanation *to non-specialists* of why these codecs exist, how they should be used, and what the caveats are. Something that we'd be comfortable including in the docs.

Paul