
Sorry for the long delay. I had a lot on my plate at work, and was spending 14 hours a day sleeping because of the flu. "It got better."

Rob Speer writes:
> I don't really understand what you're doing when you take a fragment of my sentence where I explain a wrong understanding of WHATWG encodings, and say "that's wrong, as you explain". I know it's wrong. That's what I was saying.
Sure, but you're not my entire audience: the part I care most about is the committers. I've seen proposals to "fill in" seriously made in other contexts, and I wanted to agree that it's wrong for Python.
> In this pseudocode that implements a "whatwg_error_mode", can you describe what the Python code to call it would look like?
There isn't any Python code that calls it. It's an error handler, like 'strict' or 'surrogateescape', and all the functions that call it are in C.
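To make "selected by name" concrete, here is a minimal sketch using only handlers that already exist in the stdlib. The sample bytes are made up for illustration; nothing here is part of the proposal itself.

```python
import codecs

raw = b"caf\xe9"  # 0xE9 is not valid ASCII, so decoding hits the error path

# The handler is chosen purely by the string passed as errors=...
replaced = raw.decode("ascii", errors="replace")
escaped = raw.decode("ascii", errors="surrogateescape")

# The registry maps each name to a callable. User code rarely calls it
# directly; the codec machinery does that when it meets a bad byte.
strict_handler = codecs.lookup_error("strict")

print(repr(replaced))  # U+FFFD stands in for the bad byte
print(repr(escaped))   # the bad byte is smuggled through as a lone surrogate
print(escaped.encode("ascii", errors="surrogateescape") == raw)
```

A WHAT-WG-flavoured handler would plug into the same registry under a new name, so call sites would change only in the value they pass as `errors`.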
> Does every call to .encode and .decode now have a "whatwg_error_mode" parameter, in addition to the "errors" parameter? Or are there twice as many possible strings you could pass as the "errors" parameter, so you can have "replace", "replace-whatwg", "surrogateescape", "surrogateescape-whatwg", etc?
It would be the latter. I haven't thought about it carefully, but what I would likely do is define a factory function that takes an encoding name (str), an error handler name, and a bytes-str mapping for the exceptional cases like windows-1255 where WHAT-WG enhances the graphic repertoire, and returns a name like "whatwg-windows-1255-fatal". Internally it would

1. Check whether the error handler name is 'fatal' or 'strict', or 'html' or 'xmlcharrefreplace'. ('strict' and 'xmlcharrefreplace' would be used internally to the factory function; the registered name would use 'fatal' or 'html'.) 'replace' has the same semantics in Python and in WHAT-WG, and the other error handlers 'backslashreplace', 'ignore', and 'surrogateescape' would be up to the programmer to use or avoid. They'd go by their Python names. Alternatively we could follow the strict WHAT-WG standard and not allow those, or provide another argument to allow "lax" checking of the handler argument.
2. Check if the name is already registered. If so, return it.
3. Otherwise, def a function that takes a Unicode error and a mapping that defaults to the one passed to the factory, and
   a. passes C0 and C1 control characters through, else
   b. returns the mapped value if present, else
   c. passes the Unicode error to the named error handler and returns what that returns.
4. Register the new handler with that name, and return the name.

You would use it like

    handler = factory('windows-1255', 'html', [(b'0x00', '\Udeadbeef')])
    b'deadbeef'.decode('windows-1255', errors=handler)

The mapping would default to [], and the remaining question would be what the default for the error handler should be. I guess that would be 'strict' (the problem is that the WHAT-WG defaults differ for decoding and encoding). (The choice of a list of tuples for the mapping is due to JIS, where the map is not 1-1, and a specific reverse mapping is defined.)
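The steps above can be sketched in pure Python on top of `codecs.register_error`. This is only an illustration under several assumptions: `factory` and its helper names are hypothetical, the mapping is captured in a closure because the codec machinery calls a handler with the exception alone, a 1-1 mapping is assumed (so the JIS case is not covered), and the `(b'\xca', '\u05ba')` entry reflects my understanding that WHAT-WG assigns that byte in windows-1255 while Python's cp1255 leaves it undefined.

```python
import codecs

# WHAT-WG mode names and the Python handlers assumed to implement them.
_TO_PYTHON = {"fatal": "strict", "html": "xmlcharrefreplace"}
_TO_WHATWG = {"strict": "fatal", "xmlcharrefreplace": "html"}


def _is_control(code):
    """True for C0 (0x00-0x1F) and C1 (0x80-0x9F) code points."""
    return code < 0x20 or 0x80 <= code <= 0x9F


def factory(encoding, handler="strict", mapping=()):
    """Register a chaining error handler and return its name (hypothetical)."""
    # Step 1: normalize the handler name; unknown names raise LookupError.
    python_name = _TO_PYTHON.get(handler, handler)
    delegate = codecs.lookup_error(python_name)
    public = _TO_WHATWG.get(python_name, python_name)
    name = "whatwg-{}-{}".format(encoding, public)

    # Step 2: reuse an existing registration.
    try:
        codecs.lookup_error(name)
        return name
    except LookupError:
        pass

    # Assumes a 1-1 mapping; later duplicates would silently win.
    decode_map = dict(mapping)
    encode_map = {text: raw for raw, text in mapping}

    # Step 3: controls pass through, then the mapping, then the delegate.
    def whatwg_handler(exc):
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.start + 1]
            if _is_control(bad[0]):
                return chr(bad[0]), exc.start + 1
            if bad in decode_map:
                return decode_map[bad], exc.start + 1
        elif isinstance(exc, UnicodeEncodeError):
            # The codec may report a run of bad characters; handle one at a
            # time so that later characters in the run get their own check.
            exc.end = exc.start + 1
            char = exc.object[exc.start]
            if _is_control(ord(char)):
                return bytes([ord(char)]), exc.end
            if char in encode_map:
                return encode_map[char], exc.end
        return delegate(exc)

    # Step 4: register under the generated name and hand the name back.
    codecs.register_error(name, whatwg_handler)
    return name


h1252 = factory("windows-1252", "replace")
h1255 = factory("windows-1255", "replace", [(b"\xca", "\u05ba")])
print(h1252, repr(b"a\x81b".decode("cp1252", errors=h1252)))
print(h1255, repr(b"\xca".decode("cp1255", errors=h1255)))
```

Note that 'xmlcharrefreplace' only handles encoding errors, so a handler built with 'html' will still fail on undecodable input that is neither a control character nor in the mapping.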
> My objection here isn't efficiency, it's adding confusing extra options to .encode() and .decode() that aren't relevant in most cases.
There wouldn't be extra *arguments*, but there would be additional handler names to use as values. We'd want three standard handlers for everything but windows-1255 and JIS (AFAIK). One would be mainly for validating XML, and the name would be 'whatwg-any-fatal'. (Note that the name of the encoding is actually only used in the name of the handler, and that only to identify auxiliary mappings, such as that for windows-1255.) The others would be for everyday HTML (and maybe for XHTML form input?). They would be named 'whatwg-any-replace' and 'whatwg-any-html'. I'm not sure whether to have a separate suite for windows-1255, or let the programmer take care of that. Also, since 'replace' is a pretty simplistic handler, I suspect a lot of programmers would like to use surrogateescape, but since WHAT-WG explicitly restricts error modes to fatal, replace, and html, that's on the programmer to define, at least until it's clear there's overwhelming demand for it.
> I'd like to limit this proposal to single-byte encodings, addressing the discrepancies in the C1 characters and possibly that Hebrew vowel point.
I wonder what Microsoft's representatives to Unicode and WHAT-WG would say about that. I think it should definitely be handled somehow. I find adding it to the stdlib 1255 codec attractive, and I think the chance that Microsoft would sign off on that is nonzero. If they didn't, it would go into 1255-specific handlers.
> If there are differences in the JIS encodings, that is a can of worms I'd like to not open at the moment.
Addressed by the factory function, which is needed anyway as discussed above.

Footnotes:

[1] I had this wrong. It's not the number of tokens to skip, it's the position to restart reading the input.

[2] The actual handlers are all in C, and return 0 if they don't know what to do. I haven't had time to figure out what actually happens here (None is an actual object and I'm sure it doesn't live at 0x0). I'm guessing that a pure Python handler would return None, but perhaps it should reraise. That doesn't affect the ability to construct a chaining handler, only what such a handler would do if it "knows" the input is *bad* and decides to stop rather than delegate.
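Both footnotes can be checked with a small experiment. The handler names below are hypothetical and exist only for this demonstration. The first handler shows that the second element of the returned tuple is the absolute position at which reading resumes; the second shows what current CPython does, as far as I can tell, when a pure Python handler returns None instead of raising.

```python
import codecs


def skip_pair(exc):
    """Demo handler: emit one marker and resume two bytes after the error."""
    if isinstance(exc, UnicodeDecodeError):
        # The integer is an index into exc.object, not a count of skipped bytes.
        return "<?>", exc.start + 2
    raise exc


def returns_none(exc):
    """Demo handler that declines by returning None rather than reraising."""
    return None


codecs.register_error("skip-pair-demo", skip_pair)
codecs.register_error("returns-none-demo", returns_none)

# The error is reported at index 2; the handler asks to resume at index 4.
result = b"ab\xff\xfecd".decode("ascii", errors="skip-pair-demo")
print(result)

try:
    b"\xff".decode("ascii", errors="returns-none-demo")
    outcome = "no error"
except TypeError as err:
    outcome = type(err).__name__
print(outcome)
```

So a chaining handler written in Python that decides to stop should reraise the exception it was given (or raise a new one) rather than return None.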