
I don't really understand what you're doing when you take a fragment of my sentence where I explain a wrong understanding of WHATWG encodings, and say "that's wrong, as you explain". I know it's wrong. That's what I was saying. You quoted the part where I said "Filling in all the gaps with Latin-1", cut out the part where I said "is wrong", and replied with "that's wrong". I guess I'm glad we're in agreement, but this has been a strange bit of discourse.
In this pseudocode that implements a "whatwg_error_mode", can you describe what the Python code to call it would look like? Does every call to .encode and .decode now have a "whatwg_error_mode" parameter, in addition to the "errors" parameter? Or are there twice as many possible strings you could pass as the "errors" parameter, so you can have "replace", "replace-whatwg", "surrogateescape", "surrogateescape-whatwg", etc?
My objection here isn't efficiency, it's adding confusing extra options to .encode() and .decode() that aren't relevant in most cases. I'd like to limit this proposal to single-byte encodings, addressing the discrepancies in the C1 characters and possibly that Hebrew vowel point. If there are differences in the JIS encodings, that is a can of worms I'd like to not open at the moment.
-- Rob Speer
On Mon, 22 Jan 2018 at 01:43 Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
I don't expect to change your mind about the "right" way to deal with this, but this is a more explicit description of what those of us who advocate error handlers are thinking about. It may be useful in writing your PEP (PEPs describe rejected counterproposals and amendments along with adopted proposals and rationale in either case).
Rob Speer writes:
> > The question to my mind is whether or not this "latin1replace" handler, in conjunction with existing codecs, will do the same thing as the WHATWG codecs. If I have understood you correctly, I think it will. Have I missed something?
> It won't do the same thing, and neither will the "chaining coders" proposal.
The "chaining coders" proposal isn't well-enough specified to be sure.
However, for practical purposes you may think of a Python *codec* as a "whole array" decoder/encoder, and an *error handler* as a "token-by-token" decoder/encoder. The distinction in type is for efficiency, of course. Codecs can't be "chained" (I think, but I didn't think very hard), but handlers can, in the sense that each handler can handle some input values and delegate anything it can't deal with to the next handler in the chain (under the hood, handler implementations are just Python functions with a particular signature, so this is just "loop until non-None").
It's easy to miss details like this in all the counterproposals.
I see no reason why a 'whatwgreplace' error handler with the logic
    # I am assuming decoding, and single-byte encodings.  Encoding
    # with 'html' error mode would insert format("&#%d;", ord(unicode)).
    # Multibyte is a little harder.

    # ASCII bytes never error except maybe in UTF16, UTF32, Shift JIS
    # and Big5.
    assert the_byte >= 0x80
    # Handle C1 control characters.
    if the_byte < 0xA0:
        append_to_output(chr(the_byte))
    # Handle extended repertoire with a dict.
    # This condition will depend on the particular codec.
    elif the_byte in additional_code_points:
        append_to_output(additional_code_points[the_byte])
    # Implement WHATWG error modes.
    elif whatwg_error_mode is replacement:
        append_to_output("\uFFFD")
    else:
        raise
doesn't have the effect you want. This can be done in pure Python. (Note: The actions in the pseudocode are not accurate. IIRC real handlers take a UnicodeError as argument, and return a tuple of the text to append to output and number of input tokens to skip, or return None to indicate an unhandled error, rather than doing the appending and raising themselves.)
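For concreteness, here is a minimal sketch of how the C1 branch of that logic might look as a real handler registered through codecs.register_error. The handler name "whatwg-c1" and the function name are hypothetical, not an existing or proposed stdlib handler. It covers decoding only and leaves out the per-codec dict of additional code points. As far as I know, a registered handler has to either return a (replacement, resume position) tuple or raise, so this one re-raises the error for anything it does not handle.

```python
import codecs


def whatwg_c1(exc):
    """Hypothetical handler: decode undefined C1 bytes as same-numbered code points."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Only the C1 range (0x80 to 0x9F) is handled here.
        if all(0x80 <= b < 0xA0 for b in bad):
            return "".join(chr(b) for b in bad), exc.end
    # Anything else stays an error, as with 'strict'.
    raise exc


# "whatwg-c1" is an illustrative name chosen for this sketch.
codecs.register_error("whatwg-c1", whatwg_c1)

# Byte 0x81 is unassigned in Python's cp1252, so 'strict' would raise here.
text = b"caf\xe9 \x81".decode("cp1252", errors="whatwg-c1")
```

On the calling side nothing changes except the string passed as "errors"; the codec itself is the existing one, and the handler only runs for bytes the codec rejects.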
The main objection to doing it this way would be efficiency. To be honest, I personally don't think that's an important objection, since this handler is invoked frequently only if the source text is badly broken. (Remember, you'll already be greatly expanding the repertoire of at least ASCII and ISO 8859/1 by promoting to windows-1252.) And it would surely be "fast enough" if written in C.
Caveat: I'm not sure I agree with MAL about windows-1255. I think it's arguable that the WHAT-WG index is a better approximation to reality, and I'd like to hear Hebrew speakers argue about that (I'm not one).
> The difference between WHATWG encodings and the ones in Python is, in all but one case, *only* in the C1 control character range (0x80 to 0x9F),
Also in Japanese, where "corporate characters" have been added (frequently twice, preventing round-tripping ... yuck) to the JIS standard. I haven't checked the Chinese and Korean tables for similar damage, but they're not quite as wacky about this stuff as the JISC is, so they're probably OK (and of course Big5 was "corporate" from the get-go).
> a range of Unicode characters that has historically evaded standardization because they never had a clear purpose even before Unicode. Filling in all the gaps with Latin-1
That's wrong, as you explain:
> [Eg, in Greek, some code points] are simply unassigned. Other software sometimes maps them to the Private Use Area, but this is not standardized at all, and it seems clear that Python should handle them with its usual error handler for unassigned bytes. (Which is one of the reasons not to replace the error handler with something different: we still need the error handler.)
The logic above handles all this. As mentioned, a stdlib error handler ('strict', 'replace', or 'xmlcharrefreplace' for WHAT-WG conformance, or 'surrogateescape' for the Pythonic equivalent of mapping to the private area) could be chained if desired, and the defaults could be changed and the names aliased to the WHAT-WG terms.
This could be automated with a factory function that takes a list of predefined handlers and composes them, although that would add another layer of inefficiency (the composition would presumably be done in a loop, and possibly using try, although I think the error handler convention is to return the text to insert if handled, and None if the error can't be handled).
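A rough sketch of such a factory, under some assumptions of mine: the names compose_handlers, c1_passthrough and "c1-then-replace" are made up for illustration, and the "return None if unhandled" convention applies only to the partial handlers inside the chain. The composed handler that actually gets registered falls back to a stdlib handler looked up with codecs.lookup_error, so it always returns a tuple or raises.

```python
import codecs


def c1_passthrough(exc):
    """Partial handler: undefined C1 bytes become same-numbered code points, else None."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= b < 0xA0 for b in bad):
            return "".join(map(chr, bad)), exc.end
    return None  # not handled; let the next handler in the chain try


def compose_handlers(name, partial_handlers, fallback="strict"):
    """Register `name` as a chain of partial handlers ending in a stdlib handler."""
    final = codecs.lookup_error(fallback)

    def handler(exc):
        # "Loop until non-None", then delegate to the stdlib fallback.
        for partial in partial_handlers:
            result = partial(exc)
            if result is not None:
                return result
        return final(exc)

    codecs.register_error(name, handler)
    return handler


compose_handlers("c1-then-replace", [c1_passthrough], fallback="replace")

# In Python's cp1253 both 0x81 and 0xD2 are unassigned: the first is in the
# C1 range, the second is not and falls through to 'replace'.
decoded = b"\x81\xd2".decode("cp1253", errors="c1-then-replace")
```

The extra cost is one Python-level loop per offending byte, which matters only for text that hits the handler often.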
Steve