[Python-ideas] Support WHATWG versions of legacy encodings

Thu Jan 18 13:12:17 EST 2018

Soni L. writes:

 > ISO-8859-1 explicitly defines control characters in the \x80-\x9F range, 
 > IIRC.

You recall incorrectly.  You're probably thinking of RFC 1345.  But
I've never seen that cited except in the IANA registry.

All of ISO 2022, ISO 4873, ISO 8859, and Unicode suggest the ISO 6429
primary and supplementary control sets as good choices.  (Unicode goes
so far as to use ISO 6429's names for the supplementary set for C1
code points while explicitly denying them *any* semantics.)  But
none specifies a default, and as far as I know there is no widespread
agreement on what control codes are good for, except for a handful of
"whitespace" characters in C0, and a couple of C1 controls that are
used by (and reserved to) ISO 2022.  In fact, Python's ISO-8859 codecs
pass control characters through unchanged (both C0 and C1), and the
UTF-8 codec passes through C0 and allows encoding and decoding of C1
code points.
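
A quick way to check the pass-through behavior is sketched below.  It
assumes CPython 3's bundled codecs; the variable names are only
illustrative, and ISO-8859-15 is picked arbitrarily as a second part
of ISO 8859.

```python
# Every byte value decodes under ISO-8859-1, including C0 and C1.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("iso-8859-1")

# Each byte maps to the code point with the same number.
latin1_is_identity = all(
    ord(ch) == b for ch, b in zip(latin1_text, all_bytes)
)

# Another part of ISO 8859 also passes 0x80-0x9F through as
# U+0080..U+009F rather than rejecting them.
c1_bytes = bytes(range(0x80, 0xA0))
c1_text = c1_bytes.decode("iso-8859-15")

# UTF-8 encodes a C1 code point as a two-byte sequence and decodes
# it back unchanged.  U+0085 carries the ISO 6429 name NEXT LINE.
nel = "\u0085"
nel_utf8 = nel.encode("utf-8")
nel_back = nel_utf8.decode("utf-8")
```

None of these calls raises; the codecs attach no meaning to the
controls, they just carry them across.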

On the other hand, the ISO standards forbid use of unassigned graphic
code points as characters (graphic or control), and codecs quite
reasonably treat unassigned graphic code points as errors.  In Python,
that practice is extended to the windows-* sets, which seems
reasonable to me.  But the windows-* encodings do not support C1
controls.  Instead the entire right half of the code page is graphic
(per Microsoft's IANA registrations), and that, I suppose, is why
Python does not allow fallthrough of unassigned code points 0x80-0x9F
in windows-* codecs.
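
The contrast can be seen with a small probe.  This is a sketch
assuming CPython 3's bundled codec tables; `decodes` is a hypothetical
helper, not a stdlib function, and the ISO-8859-7 byte is just one
example of an unassigned graphic code point.

```python
def decodes(byte_value, encoding):
    """Return True if the single byte decodes without error."""
    try:
        bytes([byte_value]).decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# 0x80 is assigned in windows-1252 (EURO SIGN); 0x81 is not.
euro_ok = decodes(0x80, "cp1252")
hole_ok = decodes(0x81, "cp1252")

# The same byte is a C1 control position in ISO-8859-1, so it passes.
c1_ok = decodes(0x81, "iso-8859-1")

# An unassigned graphic code point in an ISO set is an error too:
# 0xD2 in ISO-8859-7 sits between capital rho and capital sigma.
iso_hole_ok = decodes(0xD2, "iso-8859-7")

# All positions in 0x80-0x9F that the cp1252 codec refuses.
cp1252_holes = [b for b in range(0x80, 0xA0) if not decodes(b, "cp1252")]
```

With the tables shipped in CPython 3, `cp1252_holes` comes out as
0x81, 0x8D, 0x8F, 0x90 and 0x9D.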

 > I think python should follow the (de-facto) standard. This is it.

The WHAT-WG encoding standard isn't a "de facto" standard; it's a
standard published by a recognized (though forked) standards body.
However, different standards are designed for different contexts, and
WHAT-WG's encoding standard is clearly aimed specifically at browsers.  It also
may be useful for more specialized UI applications such as your IRC
client, although IMO that's asking for trouble.  Note also that the
WHAT-WG standard is in a peculiar limbo between informative and
normative.  The standard encoding is UTF-8, end-of-story.  What we're
talking about here is best practices for UIs that are faced with
non-conformant "legacy" documents, and want to display something
anyway.

But Python is a general-purpose programming language, and should
cleave to the most generally-accepted, well-defined standards, which
are the ISO standards themselves in the case of ISO-defined coded
character sets.  Aliasing the ISO character sets (and ASCII! oh, my
aching RFC 822 header!) to the corresponding windows-* as a *general*
practice is pretty abominable, though it makes some sense in the case
of browsers.  For windows-* character sets, ISTM that the WHAT-WG
repertoires of graphic characters are improvements on Microsoft's
(assuming that WHAT-WG versions its standards).

Applications can do what they want, of course, and I'm all for a PyPI
package to make it easier to do that, whether by providing additional
codecs, additional error handlers, or by post-processing surrogate-
escaped bytes.  I still don't think the WHAT-WG approach is a good fit
for most use cases, nor should it be included in the stdlib.  Most of
the use cases I've seen proposed so far are well-served by existing
Python features like errors='surrogateescape'.
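
As a rough sketch of the surrogateescape route, assuming CPython 3:
the sample bytes and the helper `escaped_c1_to_controls` are
hypothetical, and the helper only approximates what I understand the
WHAT-WG tables to do for the 0x80-0x9F holes in windows-1252.

```python
raw = b"caf\xe9 \x81 \x80"  # 0x81 is unassigned in windows-1252

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead
# of raising UnicodeDecodeError.
text = raw.decode("cp1252", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("cp1252", errors="surrogateescape")

def escaped_c1_to_controls(s):
    """Hypothetical post-processing step: turn escaped bytes 0x80-0x9F
    into the code points U+0080-U+009F, leaving everything else alone."""
    return "".join(
        chr(ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDC9F else ch
        for ch in s
    )

# Only an application that wants browser-like display needs this step.
display_text = escaped_c1_to_controls(text)
```

The decode and encode steps are lossless on their own, so the mapping
to C1 code points stays an explicit, application-level choice rather
than something the codec does silently.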

Steve