[Python-ideas] Support WHATWG versions of legacy encodings

Soni L. fakedme+py at gmail.com
Thu Jan 18 18:21:48 EST 2018

On 2018-01-18 04:12 PM, Stephen J. Turnbull wrote:
> Soni L. writes:
>   > ISO-8859-1 explicitly defines control characters in the \x80-\x9F range,
>   > IIRC.
> You recall incorrectly.  You're probably thinking of RFC 1345.  But
> I've never seen that cited except in the IANA registry.
>
> All of ISO 2022, ISO 4873, ISO 8859, and Unicode suggest the ISO 6429
> primary and supplementary control sets as good choices.  (Unicode goes
> so far as to use ISO 6429's names for the supplementary set for C1
> code points while explicitly denying them *any* semantics.)  But
> none specifies a default, and as far as I know there is no widespread
> agreement on what control codes are good for, except for a handful of
> "whitespace" characters in C0, and a couple of C1 controls that are
> used by (and reserved to) ISO 2022.  In fact, Python ISO-8859 codecs
> do pass them through (both C0 and C1), and the UTF-8 codec passes
> through C0 and allows encoding and decoding of C1 code points.
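
For reference, the pass-through behavior described above is easy to check. This is only a quick sketch, assuming CPython's bundled codecs; `roundtrip_c1` is a helper name made up for this example, not something in the stdlib.

```python
def roundtrip_c1(codec):
    """Decode bytes 0x80-0x9F with `codec` and return the resulting code points."""
    c1_bytes = bytes(range(0x80, 0xA0))
    text = c1_bytes.decode(codec)
    # Encoding the result again should reproduce the original bytes.
    assert text.encode(codec) == c1_bytes
    return [ord(ch) for ch in text]

# The ISO-8859 codecs map each C1 byte to the code point of the same value.
for codec in ("iso-8859-1", "iso-8859-2", "iso-8859-15"):
    assert roundtrip_c1(codec) == list(range(0x80, 0xA0))

# UTF-8 encodes and decodes C1 code points too; U+0085 takes two bytes.
assert "\x85".encode("utf-8") == b"\xc2\x85"
assert b"\xc2\x85".decode("utf-8") == "\x85"
```
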
> On the other hand, the ISO standards forbid use of unassigned graphic
> code points as characters (graphic or control), and codecs quite
> reasonably treat unassigned graphic code points as errors.  In Python,
> that practice is extended to the windows-* sets, which seems
> reasonable to me.  But the windows-* encodings do not support C1
> controls.  Instead the entire right half of the code page is graphic
> (per Microsoft's IANA registrations), and that, I suppose, is why
> Python does not allow fallthrough of unassigned code points 0x80-0x9F
> in windows-* codecs.
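
To make the contrast concrete, here is a rough sketch of how the stdlib cp1252 codec treats the bytes in question, assuming CPython's bundled codec tables. `unassigned_high_bytes` is a hypothetical helper written for this example.

```python
def unassigned_high_bytes(codec):
    """Return the byte values 0x80-0xFF that `codec` refuses to decode."""
    rejected = []
    for value in range(0x80, 0x100):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            # The codec has no mapping for this byte.
            rejected.append(value)
    return rejected

# 0x80 is assigned in windows-1252 (the euro sign), so it decodes fine.
assert b"\x80".decode("cp1252") == "\u20ac"

# These five bytes have no mapping in the stdlib table and raise an error.
assert unassigned_high_bytes("cp1252") == [0x81, 0x8D, 0x8F, 0x90, 0x9D]

# ISO-8859-1, by comparison, rejects nothing in the upper half.
assert unassigned_high_bytes("iso-8859-1") == []
```

Those five bytes are the ones the WHATWG index maps to the C1 code points of the same value, which is the difference this thread is about.
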
>   > I think python should follow the (de-facto) standard. This is it.
> WHAT-WG encoding isn't a "de facto" standard, it's a published
> standard by a recognized (though forked) standards body.  However,
> different standards are designed for different contexts, and WHAT-WG's
> encoding standard is clearly specifically aimed at browsers.  It also
> may be useful for more specialized UI applications such as your IRC
> client, although IMO that's asking for trouble.  Note also that the
> WHAT-WG standard is in a peculiar limbo between informative and
> normative.  The standard encoding is UTF-8, end-of-story.  What we're
> talking about here is best practices for UIs that are faced with
> non-conformant "legacy" documents, and want to display something
> anyway.
>
> But Python is a general-purpose programming language, and should
> cleave to the most generally-accepted, well-defined standards, which
> are the ISO standards themselves in the case of ISO-defined coded
> character sets.  Aliasing the ISO character sets (and ASCII! oh, my
> aching RFC 822 header!) to the corresponding windows-* as a *general*
> practice is pretty abominable, though it makes some sense in the case
> of browsers.  For windows-* character sets, ISTM that the WHAT-WG
> repertoires of graphic characters are improvements of Microsoft's
> (assuming that WHAT-WG version their standards).
>
> Applications can do what they want, of course, and I'm all for a PyPI
> package to make it easier to do that, whether by providing additional
> codecs, additional error handlers, or by post-processing surrogate-
> escaped bytes.  I still don't think the WHAT-WG approach is a good fit
> for most use cases, nor should it be included in the stdlib.  Most of
> the use cases I've seen proposed so far are well-served by existing
> Python features like errors='surrogateescape'.
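
As a rough illustration of the surrogateescape route mentioned above, and of post-processing the escaped bytes, something like the following seems to work. It is a sketch only: `decode_with_c1_fallthrough` and `_C1_FROM_SURROGATE` are names invented here, and the fallthrough mapping is my reading of the browser-style behavior for 0x80-0x9F, not a complete WHATWG codec.

```python
# surrogateescape turns an undecodable byte 0xNN into the lone surrogate U+DCNN.
# This table maps the escaped forms of 0x80-0x9F to the matching C1 code points.
_C1_FROM_SURROGATE = {0xDC00 + value: value for value in range(0x80, 0xA0)}

def decode_with_c1_fallthrough(data, codec="cp1252"):
    """Decode `data`, letting unassigned 0x80-0x9F bytes fall through as C1."""
    text = data.decode(codec, errors="surrogateescape")
    return text.translate(_C1_FROM_SURROGATE)

raw = b"\x81\x80"

# Plain surrogateescape is lossless: the original bytes can be recovered.
escaped = raw.decode("cp1252", errors="surrogateescape")
assert escaped == "\udc81\u20ac"
assert escaped.encode("cp1252", errors="surrogateescape") == raw

# Post-processing gives U+0081 for the unassigned byte instead of a surrogate.
assert decode_with_c1_fallthrough(raw) == "\x81\u20ac"
```

Note that the post-processed string no longer round-trips through cp1252, since U+0081 has no byte assigned in that codec.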

I'm just glad I *always* use bytestrings when dealing with network 
protocols, I guess. It's the only reasonable option.

> Steve
