[Python-ideas] Support WHATWG versions of legacy encodings
Stephen J. Turnbull
turnbull.stephen.fw at u.tsukuba.ac.jp
Wed Jan 17 00:30:40 EST 2018
Soni L. writes:
> This is surprising to me because I always took those encodings to
> have those fallbacks [to raw control characters].
ISO-8859-1 implementations do, for historical reasons AFAICT. And
they frequently produce mojibake and occasionally wilder behavior.
Most legacy encodings don't. Their standards documents frequently
leave the behavior undefined for control character codes (which means
an implementation may raise an error on them) and define use of
unassigned codes as an error.
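
To make the difference concrete, here is a small sketch using CPython's
bundled codecs. It assumes the stdlib "cp1252" codec leaves byte 0x81
unassigned, which is my understanding of the current behavior; the
variable names are just for illustration.

```python
# Byte 0x81 is a C1 control position in ISO-8859-1 and, as far as I know,
# an unassigned position in CPython's cp1252 codec.
raw = bytes([0x81])

# ISO-8859-1 maps every byte to the code point with the same number,
# so the control code passes straight through as U+0081.
as_latin1 = raw.decode("iso-8859-1")

# cp1252 treats the unassigned position as an error (strict mode).
try:
    raw.decode("cp1252")
    cp1252_rejected = False
except UnicodeDecodeError:
    cp1252_rejected = True

# ASCII likewise rejects anything outside its defined range.
try:
    b"\x80".decode("ascii")
    ascii_rejected = False
except UnicodeDecodeError:
    ascii_rejected = True
```
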
> It's pretty wild to think someone wouldn't want them.
In what context? WHATWG's encoding standard is *all about browsers*.
If a codec is feeding text into a process that renders it all as
glyphs for a human to look at, that's one thing. The codec doesn't
want to raise a fatal error there, and the likely fallback glyph is
something from the control glyphs block if even windows-125x doesn't
have a glyph there. I guess it sort of makes sense.
If you're feeding a program, the codec has no idea when or how its
output is going to get interpreted. (That is the case with JSON data,
which I believe is "supposed" to be UTF-8, though many developers use
the legacy charsets they're used to, which are often embedded in the
underlying databases etc. Ditto XML.) One application I've
maintained, an editor, has to deal with whatever characters are sent
to it. But we preferred to take charset designations seriously,
because users were able to change those flexibly if they wanted to.
So the error handler is some form of replacement with a
human-readable representation (not pass-through), except for the
usual HT, CR, LF, FF, and DEL (and ESC in encodings using ISO 2022
extensions). Mostly users would use the editor to remove or replace
invalid codes, although of course they could just leave them in (and
they would be converted from display form back to the original codes
on output).
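
A rough sketch of that kind of error handler in Python, using
codecs.register_error. This is not the editor's actual code. The
handler name "visible_hex", the "<XX>" display form, and the helper
decode_for_display are all made up for illustration, and the sketch
covers only undecodable bytes, not the control-character exceptions or
the conversion back on output.

```python
import codecs


def visible_hex(exc):
    """Replace undecodable bytes with a readable <XX> marker (hypothetical form)."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch handles decoding errors only.
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(f"<{b:02X}>" for b in bad)
    # Resume decoding right after the offending bytes.
    return replacement, exc.end


codecs.register_error("visible_hex", visible_hex)


def decode_for_display(data: bytes, encoding: str) -> str:
    """Decode per the declared charset; show invalid codes, don't pass them through."""
    return data.decode(encoding, errors="visible_hex")
```

For example, decode_for_display(b"caf\x81", "cp1252") should give
"caf<81>", so the user sees exactly which code was invalid.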
In another, a mailing list manager, codes outside the defined
repertoires were a recurring nightmare that crashed server processes
and blocked queues. It took a decade before we sealed the last known
"leak" and I am not confident there are no leaks left.
So I don't actually have experience of a use case for control
character pass-through, and I wouldn't even automate the superset
substitutions if I could avoid it. (In the editor case, I would
provide a dialog saying "This is supposed to be iso-8859-1, but I'm
seeing C1 control codes. Would you like me to try windows-1252, which
uses those codes for graphic characters?")
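
The check behind such a dialog could look something like the sketch
below. The function name should_offer_windows_1252 is hypothetical, and
the rule (offer the switch only if C1 bytes are present and the whole
input decodes cleanly as cp1252) is just one plausible policy, not
something any existing editor is known to use.

```python
def should_offer_windows_1252(data: bytes) -> bool:
    """Return True if data labeled iso-8859-1 looks more like windows-1252.

    Hypothetical policy: the data contains bytes in the C1 range
    (0x80-0x9F) and all of it is valid under Python's strict cp1252 codec.
    """
    has_c1 = any(0x80 <= b <= 0x9F for b in data)
    if not has_c1:
        return False
    try:
        data.decode("cp1252")
    except UnicodeDecodeError:
        # Contains a byte that cp1252 leaves unassigned, so don't guess.
        return False
    return True
```

The point of returning a bool rather than silently re-decoding is that
the user, not the codec, makes the final call.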
So to my mind, the use case here is relatively restricted (writing
user display interfaces) and does not need to be in the stdlib.
Putting it there would create an attractive nuisance (developers would
say "these users will stop complaining about inability to process
their dirty data if I use a WHATWG version of a codec, then they don't
have to clean up"). I don't have an objection to supporting even that use
case, but I don't see why that support needs to be available in the