
Soni L. writes:
This is surprising to me because I always took those encodings to have those fallbacks [to raw control characters].
ISO-8859-1 implementations do, for historical reasons AFAICT. And they frequently produce mojibake, and occasionally wilder behavior. Most legacy encodings don't: their standards documents frequently leave the behavior undefined for the control-character code points (which means an implementation is free to error on them) and define the use of unassigned codes as an error.
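As a concrete illustration of the difference, using Python's stdlib codecs: ISO-8859-1 maps every byte, including 0x80-0x9F, straight to the corresponding C1 control character, while windows-1252 assigns most of those bytes to graphic characters and errors on the few it leaves unassigned:

```python
# ISO-8859-1 passes C1 bytes through as control characters;
# windows-1252 maps most of them to graphic characters instead.
assert b"\x93".decode("iso-8859-1") == "\x93"      # C1 control code, mojibake risk
assert b"\x93".decode("windows-1252") == "\u201c"  # left double quotation mark

# windows-1252 leaves a handful of codes unassigned, and Python's
# codec treats decoding them as an error rather than passing them through.
try:
    b"\x81".decode("windows-1252")
except UnicodeDecodeError:
    pass  # 0x81 has no assignment in windows-1252
```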
It's pretty wild to think someone wouldn't want them.
In what context? The WHATWG encoding standard is *all about browsers*. If a codec is feeding text into a process that renders everything as glyphs for a human to look at, that's one thing: the codec doesn't want to raise a fatal error there, and the likely fallback glyph is something from the Control Pictures block if even windows-125x doesn't have a glyph at that position. I guess that sort of makes sense. But if you're feeding a program (as with JSON data, which I believe is "supposed" to be UTF-8, but many developers use the legacy charsets they're used to, which are often embedded in the underlying databases and so on; ditto XML), the codec has no idea when or how that text is going to get interpreted.

In one application I've maintained, an editor, the program has to deal with whatever characters are sent to it, but we preferred to take charset designations seriously, because users could flexibly change those if they wanted to. So the error handler is some form of replacement with a human-readable representation (not pass-through), except for the usual HT, CR, LF, FF, and DEL (and ESC in encodings using ISO 2022 extensions). Mostly users would use the editor to remove or replace invalid codes, although of course they could just leave them in (and they would be converted from display form back to the original codes on output).

In another, a mailing list manager, codes outside the defined repertoires were a recurring nightmare that crashed server processes and blocked queues. It took a decade before we sealed the last known "leak", and I am not confident there are no leaks left.

So I don't actually have experience of a use case for control-character pass-through, and I wouldn't even automate the superset substitutions if I could avoid it. (In the editor case, I would provide a dialog saying "This is supposed to be iso-8859-1, but I'm seeing C1 control codes. Would you like me to try windows-1252, which uses those codes for graphic characters?")

So to my mind, the use case here is relatively restricted (writing user display interfaces), does not need to be in the stdlib, and would constitute an attractive nuisance there (developers would say "these users will stop complaining about inability to process their dirty data if I use a WHATWG version of a codec, then they don't have to clean up"). I don't have an objection to supporting even that use case, but I don't see why that support needs to be available in the stdlib.
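A minimal sketch of the display-side replacement I described for the editor (the names and the escape format here are my own invention for illustration, not the editor's actual code): after decoding, replace disallowed C0/C1 control characters with a reversible, human-readable form, leaving HT, LF, CR, FF, and DEL alone:

```python
# Hypothetical sketch, not the editor's real implementation.
ALLOWED_CONTROLS = {"\t", "\n", "\r", "\f", "\x7f"}  # HT, LF, CR, FF, DEL

def show_controls(text: str) -> str:
    """Replace C0/C1 control characters (except the allowed set) with a
    reversible, human-readable escape such as '\\x9b'."""
    def visible(ch: str) -> str:
        if ch in ALLOWED_CONTROLS:
            return ch                    # pass the conventional controls through
        if ch < "\x20" or "\x7f" <= ch <= "\x9f":
            return "\\x%02x" % ord(ch)   # display form; mapped back on output
        return ch                        # ordinary graphic character
    return "".join(visible(ch) for ch in text)
```

On output the editor would convert these display forms back to the original codes, so leaving them in place round-trips losslessly.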