[Python-ideas] Support WHATWG versions of legacy encodings

Steve Dower steve.dower at python.org
Thu Jan 11 23:55:01 EST 2018


On 12Jan2018 0342, Random832 wrote:
> On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote:
>> The way of solving this issue in Python is using an error handler. The
>> "surrogateescape" error handler is specially designed for lossless
>> reversible decoding. It maps every unassigned byte in the range
>> 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows
>> you to distinguish correctly decoded characters from the escaped bytes,
>> perform character by character processing of the decoded text, and
>> encode the result back with the same encoding.
>
> Maybe we need a new error handler that maps unassigned bytes in the range 0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the encodings being discussed have behavior other than the "normal" version of the encoding plus what I just described?

+1 on this being an error handler (if possible). I suspect the semantics
will be more complex than suggested above, but as this seems to be about
handling normally un[en/de]codable characters, using an error handler to
return something more sensible best represents what is going on. Call it
something like 'web' or 'relaxed' or 'whatwg'.
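
A minimal sketch of what such a handler might look like for decoding,
using the existing codecs.register_error() hook. The handler name
"c1passthrough" and the function name are made up for illustration, and
it only covers the simple 0x80-0x9f case described in the quoted
message, not whatever the WHATWG tables actually require:

```python
import codecs

def c1_passthrough(exc):
    """Hypothetical handler: map undecodable bytes 0x80-0x9F to U+0080-U+009F."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= b <= 0x9F for b in bad):
            # Replace each offending byte with the code point of the same
            # value, then resume decoding right after the bad span.
            return "".join(chr(b) for b in bad), exc.end
    # Anything else (encoding errors, bytes outside the range) stays an error.
    raise exc

# "c1passthrough" is an illustrative name, not an existing handler.
codecs.register_error("c1passthrough", c1_passthrough)

# 0xE9 is defined in cp1252; 0x81 is not, so the handler is called for it.
decoded = b"caf\xe9 \x81".decode("cp1252", errors="c1passthrough")
```

Here decoded should end with U+0081 where the strict handler would have
raised UnicodeDecodeError.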

I don't know if error handlers have enough context for this though. If 
not, we should ensure they can have it. I'd much rather explain one new 
error handler to most people (and a more complex API for implementing 
them to the few people who do it) than explain a whole suite of new 
encodings.
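
For reference, a rough way to check what context a decode error handler
receives today is to register one that just records the attributes of
the exception it is given. The handler name "record_context" and the
list "seen" are made up for this sketch:

```python
import codecs

seen = []  # filled in by the handler below

def record_context(exc):
    """Hypothetical handler: record what the codec tells us, then substitute U+FFFD."""
    if isinstance(exc, UnicodeDecodeError):
        seen.append({
            "encoding": exc.encoding,   # name reported by the codec
            "bytes": exc.object[exc.start:exc.end],
            "start": exc.start,
            "end": exc.end,
            "reason": exc.reason,
        })
        return "\ufffd", exc.end
    raise exc

codecs.register_error("record_context", record_context)

result = b"a\x81b".decode("cp1252", errors="record_context")
```

The handler sees the full input, the offending span and a reason
string. As far as I can tell, on CPython the table-driven codecs such
as cp1252 report the generic name 'charmap' in exc.encoding, so a
handler may not be able to tell which legacy encoding it is being
asked to relax.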

Cheers,
Steve


