[Python-ideas] Support WHATWG versions of legacy encodings

Mon Jan 29 00:13:50 EST 2018

Sorry for the long delay.  I had a lot on my plate at work, and was
spending 14 hours a day sleeping because of the flu.  "It got better."

Rob Speer writes:

 > I don't really understand what you're doing when you take a
 > fragment of my sentence where I explain a wrong understanding of
 > WHATWG encodings, and say "that's wrong, as you explain". I know
 > it's wrong. That's what I was saying.

Sure, but you're not my entire audience: the part I care most about is
the committers.  I've seen proposals to "fill in" seriously made in
other contexts, and I wanted to agree that's wrong for Python.

 > In this pseudocode that implements a "whatwg_error_mode", can you describe
 > what the Python code to call it would look like?

There isn't any Python code that calls it.  It's an error handler,
like 'strict' or 'surrogateescape', and all the functions that call it
are in C.
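
To make "it's an error handler" concrete, here is a minimal sketch of
how any handler gets registered and selected by name.  The handler
name 'demo-question-mark' and the function are made up for
illustration; only codecs.register_error and the errors= argument are
existing API.

```python
import codecs

def question_mark(exc):
    # The codec machinery calls this with a UnicodeError instance.
    if isinstance(exc, UnicodeDecodeError):
        # Return (replacement text, position at which to resume decoding).
        return ('?', exc.end)
    # Anything else (e.g. an encoding error) is not handled here.
    raise exc

# The name is arbitrary; it is what callers pass as errors=...
codecs.register_error('demo-question-mark', question_mark)

# Byte 0x81 is undefined in Python's cp1252, so the handler is invoked.
decoded = b'abc\x81def'.decode('cp1252', errors='demo-question-mark')
```

The caller never calls the handler directly; the codec does, and only
when it hits input it can't map.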

 > Does every call to .encode and .decode now have a
 > "whatwg_error_mode" parameter, in addition to the "errors"
 > parameter? Or are there twice as many possible strings you could
 > pass as the "errors" parameter, so you can have "replace",
 > "replace-whatwg", "surrogateescape", "surrogateescape-whatwg", etc?

It would be the latter.

I haven't thought about it carefully, but what I would likely do is
define a factory function taking an encoding name (str), an error
handler name, and a bytes-str mapping for the exceptional cases like
windows-1255 where WHAT-WG enhances the graphic repertoire, and
returns a name like "whatwg-windows-1255-fatal".  Internally it
would

1.  Check if the error handler name is 'fatal' or 'strict', or 'html'
    or 'xmlcharrefreplace' ('strict' and 'xmlcharrefreplace' would be
    used internally to the factory function, the registered name would
    be 'fatal' or 'html').  'replace' has the same semantics in Python
    and in WHAT-WG, and other error handlers 'backslashreplace',
    'ignore', and 'surrogateescape' would be up to the programmer to
    use or avoid.  They'd go by their Python names.  Alternatively we
    could follow the strict WHAT-WG standard and not allow those, or
    provide another argument to allow "lax" checking of the handler
    argument.
2.  Check if the name is already registered.  If so, return it.
3.  Otherwise, def a function that takes a Unicode error and a
    mapping that defaults to the one passed to the factory, and
    a.  passes C0 and C1 control characters through, else
    b.  returns the mapped value if present, else
    c.  passes the Unicode error to the named error handler and
        returns what that returns
4.  Register the new handler with that name, and return the name.
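
The four steps above could be sketched in pure Python roughly as
follows.  This is only a sketch under assumptions: the name `factory`,
the helper `_is_control`, and the `_ALIASES` table are hypothetical;
the mapping is captured in a closure rather than passed as a defaulted
argument; encoding errors are handled one character at a time; and it
only makes sense for single-byte encodings.  If the mapping is not
1-1, the reverse map simply keeps the last pair for a given character.

```python
import codecs

# Accepted name -> (suffix used in the registered name, Python handler used internally)
_ALIASES = {
    'fatal': ('fatal', 'strict'),
    'strict': ('fatal', 'strict'),
    'html': ('html', 'xmlcharrefreplace'),
    'xmlcharrefreplace': ('html', 'xmlcharrefreplace'),
}

def _is_control(code):
    # C0 controls are 0x00-0x1F, C1 controls are 0x80-0x9F.
    return code < 0x20 or 0x80 <= code <= 0x9F

def factory(encoding, errors='strict', mapping=()):
    # Step 1: normalize the handler name; other names keep their Python spelling.
    suffix, internal = _ALIASES.get(errors, (errors, errors))
    name = 'whatwg-%s-%s' % (encoding, suffix)
    # Step 2: if a handler with this name is already registered, reuse it.
    try:
        codecs.lookup_error(name)
        return name
    except LookupError:
        pass
    fallback = codecs.lookup_error(internal)  # LookupError if unknown
    decode_map = dict(mapping)                          # bytes -> str
    encode_map = {text: raw for raw, text in mapping}   # str -> bytes

    # Step 3: control characters pass through, then the mapping, then delegate.
    def handler(exc):
        if isinstance(exc, UnicodeDecodeError):
            bad = bytes(exc.object[exc.start:exc.end])
            if len(bad) == 1 and _is_control(bad[0]):
                return chr(bad[0]), exc.end
            if bad in decode_map:
                return decode_map[bad], exc.end
        elif isinstance(exc, UnicodeEncodeError):
            # Encoders may report a run of characters; narrow to the first one.
            exc.end = exc.start + 1
            char = exc.object[exc.start]
            if _is_control(ord(char)):
                return bytes([ord(char)]), exc.end
            if char in encode_map:
                return encode_map[char], exc.end
        return fallback(exc)

    # Step 4: register under the composed name and hand the name back.
    codecs.register_error(name, handler)
    return name
```

With this sketch, b'a\x81b'.decode('cp1252', errors=factory('windows-1252', 'fatal'))
passes the C1 byte through as U+0081 instead of raising.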

You would use it like

# placeholder mapping, for illustration only
handler = factory('windows-1255', 'html', [(b'\x00', '\U0010ffff')])
b'deadbeef'.decode('windows-1255', errors=handler)

The mapping would default to [], and the remaining question would be
what the default for the error handler should be.  I guess that would
be 'strict' (the problem is that the WHAT-WG defaults differ for
decoding and encoding).  (The choice of a list of tuples for the
mapping is due to JIS, where the map is not 1-1, and a specific
reverse mapping is defined.)

 > My objection here isn't efficiency, it's adding confusing extra
 > options to .encode() and .decode() that aren't relevant in most
 > cases.

There wouldn't be extra *arguments*, but there would be additional
handler names to use as values.  We'd want three standard handlers for
everything but windows-1255 and JIS (AFAIK).  One would be mainly for
validating XML, and the name would be 'whatwg-any-fatal'.  (Note that
the name of the encoding is actually only used in the name of the
handler, and that only to identify auxiliary mappings, such as that
for windows-1255.)  The others would be for everyday HTML (and maybe
for XHTML form input?).  They would be named 'whatwg-any-replace' and
'whatwg-any-html'.

I'm not sure whether to have a separate suite for windows-1255, or let
the programmer take care of that.  Also, since 'replace' is a pretty
simplistic handler, I suspect a lot of programmers would like to use
surrogateescape, but since WHAT-WG explicitly restricts error modes to
fatal, replace, and html, that's on the programmer to define, at least
until it's clear there's overwhelming demand for it.

 > I'd like to limit this proposal to single-byte encodings,
 > addressing the discrepancies in the C1 characters and possibly that
 > Hebrew vowel point.

I wonder what Microsoft's representatives to Unicode and WHAT-WG would
say about that.  I think it should definitely be handled somehow.  I
find adding it to the stdlib 1255 codec attractive, and I think the
chance that Microsoft would sign off on that is nonzero.  If they
didn't, it would go into 1255-specific handlers.

 > If there are differences in the JIS encodings, that is a can of
 > worms I'd like to not open at the moment.

Addressed by the factory function, which is needed anyway as discussed
above.

Footnotes: 
[1]  I had this wrong.  It's not the number of tokens to skip, it's
the position to restart reading the input.

[2]  The actual handlers are all in C, and return 0 if they don't know
what to do.  I haven't had time to figure out what actually happens
here (None is an actual object and I'm sure it doesn't live at 0x0).
I'm guessing that a pure Python handler would return None, but perhaps
it should reraise.  That doesn't affect the ability to construct a
chaining handler, only what such a handler would do if it "knows" the
input is *bad* and decides to stop rather than delegate.
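
For what it's worth, the question in [2] can be checked from pure
Python.  A minimal sketch, assuming CPython 3 behavior; the handler
names 'demo-returns-none' and 'demo-reraises' and the helper `outcome`
are made up for illustration.

```python
import codecs

def returns_none(exc):
    # Not a (replacement, position) tuple, so the codec should reject it.
    return None

def reraises(exc):
    # Stop by raising the error the codec handed us.
    raise exc

codecs.register_error('demo-returns-none', returns_none)
codecs.register_error('demo-reraises', reraises)

def outcome(errors):
    # Report which exception type, if any, escapes from decoding a bad byte.
    try:
        b'\xff'.decode('ascii', errors=errors)
    except Exception as e:
        return type(e).__name__
    return 'no error'

none_result = outcome('demo-returns-none')
reraise_result = outcome('demo-reraises')
```

As far as I can tell, returning None is reported as a TypeError by the
codec machinery, so a handler that decides to stop should reraise.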