[Python-ideas] Support WHATWG versions of legacy encodings

Fri Jan 12 02:48:48 EST 2018

On 12 January 2018 at 14:55, Steve Dower <steve.dower at python.org> wrote:
> On 12Jan2018 0342, Random832 wrote:
>>
>> On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote:
>>>
>>> The way of solving this issue in Python is using an error handler. The
>>> "surrogateescape" error handler is specially designed for lossless
>>> reversible decoding. It maps every unassigned byte in the range
>>> 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows
>>> you to distinguish correctly decoded characters from the escaped bytes,
>>> perform character by character processing of the decoded text, and
>>> encode the result back with the same encoding.
>>
>> Maybe we need a new error handler that maps unassigned bytes in the range
>> 0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the
>> encodings being discussed have behavior other than the "normal" version of
>> the encoding plus what I just described?
>
>
> +1 on this being an error handler (if possible). I suspect the semantics
> will be more complex than suggested above, but as this seems to be about
> handling normally un[en/de]codable characters, using an error handler to
> return something more sensible best represents what is going on. Call it
> something like 'web' or 'relaxed' or 'whatwg'.
>
> I don't know if error handlers have enough context for this though. If not,
> we should ensure they can have it. I'd much rather explain one new error
> handler to most people (and a more complex API for implementing them to the
> few people who do it) than explain a whole suite of new encodings.

+1 from me, which shifts my position to be:

1. If we can make a decoding-only error handler that does the desired
thing in combination with our existing codecs, let's do that (perhaps
using a name like "controlpass", since the intent is to pass through
otherwise unassigned latin-1 control characters, similar to the way
"surrogatepass" allows lone surrogates)

2. Only if 1 fails for some reason would we look at adding the extra
decode-only codec variants.
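
For reference, here is a quick illustration of the status quo that
option 1 would build on. It assumes cp1252 as the example codec and
byte 0x81 as the unassigned byte (my choice of sample data, not taken
from the thread): strict decoding fails, while "surrogateescape"
round-trips the byte through a lone surrogate rather than a C1 control
character.

```python
# Sample data is an assumption: 0xE9 is assigned in cp1252, 0x81 is not.
raw = b"caf\xe9 \x81"

# Strict decoding rejects the unassigned byte.
try:
    raw.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the undecodable byte 0x81 to U+DC81 ...
escaped = raw.decode("cp1252", errors="surrogateescape")

# ... and encoding with the same handler restores the original bytes.
roundtrip = escaped.encode("cp1252", errors="surrogateescape")
```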

Given the power of error handlers, though, I expect the
surrogatepass-style error handler approach will work (see
https://docs.python.org/3/library/codecs.html#codecs.register_error
and https://docs.python.org/3/library/exceptions.html#UnicodeError for
an overview of the information they're given and what they can do
about it).
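
As a rough sketch of what such a handler could look like (not a
worked-out proposal): the name "controlpass" is just the one floated
above, the function is hypothetical, and cp1252 with byte 0x81 is
assumed as the example. It relies only on codecs.register_error and
the start/end/object attributes of UnicodeDecodeError, and it is
decode-only, so anything else is re-raised.

```python
import codecs


def controlpass(exc):
    """Hypothetical decode-only handler: pass bytes 0x80-0x9F through
    as the code points U+0080-U+009F, and re-raise everything else."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= b <= 0x9F for b in bad):
            # Return the replacement text and the position to resume at.
            return "".join(map(chr, bad)), exc.end
    raise exc


# "controlpass" is an illustrative name, not an existing stdlib handler.
codecs.register_error("controlpass", controlpass)

# 0x93 and 0x94 are assigned in cp1252 (curly quotes); 0x81 is not.
raw = b"\x93hi\x94 \x81"
text = raw.decode("cp1252", errors="controlpass")
```

Here the assigned bytes still decode through the normal cp1252 table,
and only the unassigned 0x81 reaches the handler and comes out as
U+0081. Encoding the result back to cp1252 would need a matching
encode-side rule, which this sketch deliberately leaves out.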

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia