[Python-ideas] Support WHATWG versions of legacy encodings

Soni L. fakedme+py at gmail.com
Wed Jan 17 06:52:29 EST 2018


On 2018-01-17 03:30 AM, Stephen J. Turnbull wrote:
> Soni L. writes:
>
>   > This is surprising to me because I always took those encodings to
>   > have those fallbacks [to raw control characters].
>
> ISO-8859-1 implementations do, for historical reasons AFAICT.  And
> they frequently produce mojibake and occasionally wilder behavior.
> Most legacy encodings don't, and their standards documents frequently
> leave the behavior undefined for control character codes (which means
> you can error on them) and define use of unassigned codes as an error.
>
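
For reference, that split is visible directly in CPython's bundled
codecs (the bytes here are just arbitrary examples):

    >>> b"Caf\xe9 \x85".decode("latin-1")   # all 256 bytes map; C1 passes through
    'Café \x85'
    >>> b"Caf\xe9 \x81".decode("cp1252")    # 0x81 is unassigned in Python's cp1252
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 5: character maps to <undefined>
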
>   > It's pretty wild to think someone wouldn't want them.
>
> In what context?  WHATWG's encoding standard is *all about browsers*.
> If a codec is feeding text into a process that renders it all as
> glyphs for a human to look at, that's one thing.  The codec doesn't
> want to fatal there, and the likely fallback glyph is something from
> the control glyphs block if even windows-125x doesn't have a glyph
> there.  I guess it sort of makes sense.
>
> If you're feeding a program (as with JSON data, which I believe is
> "supposed" to be UTF-8, but many developers use the legacy charsets
> they're used to and which are often embedded in the underlying
> databases etc, ditto XML), the codec has no idea when or how that's
> going to get interpreted.  One application I've maintained, an
> editor, has to deal with whatever characters are sent to it, but we
> preferred to take charset designations seriously because users could
> change them flexibly if they wanted to.  So the error handler is some
> form of replacement with a human-readable representation (not
> pass-through), except for the usual HT, CR, LF, FF, and DEL (and ESC
> in encodings using ISO 2022 extensions).  Mostly users would use the
> editor to remove or replace invalid codes, although of course they
> could just leave them in (and they would be converted from display
> form back to the original codes on output).
>
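
That kind of replacement handler is easy to sketch on top of
codecs.register_error (the handler name and the <0xNN> notation below
are just made up for illustration):

    import codecs

    # Rough sketch: anything the codec can't map becomes a visible
    # <0xNN> token instead of being passed through or dropped.
    def visible_replace(exc):
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            return "".join("<0x%02X>" % b for b in bad), exc.end
        raise exc

    codecs.register_error("visible-replace", visible_replace)

    print(b"ok \x81\x8d".decode("cp1252", errors="visible-replace"))
    # -> ok <0x81><0x8D>
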
> In another, a mailing list manager, codes outside the defined
> repertoires were a recurring nightmare that crashed server processes
> and blocked queues.  It took a decade before we sealed the last known
> "leak" and I am not confident there are no leaks left.
>
> So I don't actually have experience of a use case for control
> character pass-through, and I wouldn't even automate the superset
> substitutions if I could avoid it.  (In the editor case, I would
> provide a dialog saying "This is supposed to be iso-8859-1, but I'm
> seeing C1 control codes.  Would you like me to try windows-1252, which
> uses those codes for graphic characters?")
>
> So to my mind, the use case here is relatively restricted (writing
> user display interfaces) and does not need to be in the stdlib, and
> would constitute an attractive nuisance there (developers would say
> "these users will stop complaining about inability to process their
> dirty data if I use a WHATWG version of a codec, and then they don't have
> to clean up").  I don't have an objection to supporting even that use
> case, but I don't see why that support needs to be available in the
> stdlib.
>

We use control characters as formatting codes on IRC all the time.

ISO-8859-1 (as registered with IANA, at least) does define the 
\x80-\x9F range as the C1 control characters, IIRC.

Windows codepages define control characters in that range only 
implicitly, but they're still technically defined. It's a de-facto 
standard for those encodings.

I think Python should follow the (de-facto) standard, and the WHATWG 
encoding spec is that standard written down.
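
For what it's worth, for windows-1252 the gap can already be papered 
over with an error handler, because the bytes Python's cp1252 leaves 
undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are exactly the ones WHATWG 
decodes to the matching C1 controls.  A rough sketch, not a full 
implementation of the WHATWG codecs (the handler name "whatwg-c1" is 
made up, and this only covers decoding):

    import codecs

    # Map each byte the codec rejects to the C1 control character with
    # the same value, which is what the WHATWG windows-1252 table does.
    def whatwg_c1(exc):
        if isinstance(exc, UnicodeDecodeError):
            return ("".join(chr(b) for b in exc.object[exc.start:exc.end]),
                    exc.end)
        raise exc

    codecs.register_error("whatwg-c1", whatwg_c1)

    print(ascii(b"\x80\x81\x90\x9d".decode("cp1252", errors="whatwg-c1")))
    # -> '\u20ac\x81\x90\x9d'  (0x80 is the euro sign; the rest fall back)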

