Re: [Python-ideas] Support WHATWG versions of legacy encodings

Jan. 12, 2018

On 12 January 2018 at 14:55, Steve Dower <steve.dower@python.org> wrote:
...
On 12Jan2018 0342, Random832 wrote:
...
On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote:
...
The way of solving this issue in Python is using an error handler. The
"surrogateescape" error handler is specially designed for lossless
reversible decoding. It maps every unassigned byte in the range
0x80-0xff to a single character in the range U+DC80-U+DCFF. This allows
you to distinguish correctly decoded characters from the escaped bytes,
perform character by character processing of the decoded text, and
encode the result back with the same encoding.
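
A minimal sketch of the round trip described above, assuming cp1252 as
the codec (in Python's cp1252 the byte 0x81 is unassigned); the variable
names are only illustrative:

```python
# 0xE9 decodes normally in cp1252; 0x81 has no mapping there.
raw = b"caf\xe9 \x81"

# The unassigned byte 0x81 is escaped as the lone surrogate U+DC81.
text = raw.decode("cp1252", errors="surrogateescape")

# Escaped bytes can be told apart from correctly decoded characters.
escaped = [ch for ch in text if "\udc80" <= ch <= "\udcff"]

# Encoding with the same codec and handler restores the original bytes.
roundtrip = text.encode("cp1252", errors="surrogateescape")
```
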
Maybe we need a new error handler that maps unassigned bytes in the range
0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the
encodings being discussed have behavior other than the "normal" version of
the encoding plus what I just described?
+1 on this being an error handler (if possible). I suspect the semantics
will be more complex than suggested above, but as this seems to be about
handling normally un[en/de]codable characters, using an error handler to
return something more sensible best represents what is going on. Call it
something like 'web' or 'relaxed' or 'whatwg'.
I don't know if error handlers have enough context for this though. If not,
we should ensure they can have it. I'd much rather explain one new error
handler to most people (and a more complex API for implementing them to the
few people who do it) than explain a whole suite of new encodings.
+1 from me, which shifts my position to be:

1. If we can make a decoding-only error handler that does the desired
thing in combination with our existing codecs, let's do that (perhaps
using a name like "controlpass", since the intent is to pass through
otherwise unassigned latin-1 control characters, similar to the way
"surrogatepass" allows lone surrogates)

2. Only if 1 fails for some reason would we look at adding the extra
decode-only codec variants.

Given the power of error handlers, though, I expect the
surrogatepass-style error handler approach will work (see
https://docs.python.org/3/library/codecs.html#codecs.register_error
and https://docs.python.org/3/library/exceptions.html#UnicodeError for
an overview of the information they're given and what they can do
about it).
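
As a rough sketch of what such a handler might look like: the name
"controlpass" is only the suggestion from point 1 above and is not an
existing Python error handler, and the exact semantics here (pass
through 0x80-0x9F, re-raise for anything else) are an assumption, not a
settled design. It uses only codecs.register_error and the attributes
of UnicodeDecodeError documented at the links above.

```python
import codecs


def controlpass(exc):
    """Hypothetical decode-only handler: map undecodable bytes in
    0x80-0x9F to the code points U+0080-U+009F."""
    # Decode-only: re-raise for encode or translate errors.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Only the C1 control range is passed through; anything else
    # remains an error.
    if any(not 0x80 <= b <= 0x9F for b in bad):
        raise exc
    # Return the replacement text and the position to resume decoding at.
    return "".join(map(chr, bad)), exc.end


codecs.register_error("controlpass", controlpass)

# 0x93 and 0x94 are assigned in cp1252 (curly quotes); 0x81 is not.
decoded = b"\x93hi\x94 \x81".decode("cp1252", errors="controlpass")
```

With this sketch, the assigned bytes still decode through the normal
cp1252 table, and only the unassigned byte reaches the handler, so the
existing codec needs no changes.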

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia