On 2018-01-18 04:12 PM, Stephen J. Turnbull wrote:
Soni L. writes:
ISO-8859-1 explicitly defines control characters in the \x80-\x9F range, IIRC.
You recall incorrectly. You're probably thinking of RFC 1345. But I've never seen that cited except in the IANA registry.
All of ISO 2022, ISO 4873, ISO 8859, and Unicode suggest the ISO 6429 primary and supplementary control sets as good choices. (Unicode goes so far as to use ISO 6429's names for the supplementary set for C1 code points while explicitly denying them *any* semantics.) But none specifies a default, and as far as I know there is no widespread agreement on what control codes are good for, except for a handful of "whitespace" characters in C0, and a couple of C1 controls that are used by (and reserved to) ISO 2022. In fact, Python ISO-8859 codecs do pass them through (both C0 and C1), and the UTF-8 codec passes through C0 and allows encoding and decoding of C1 code points.
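The pass-through behavior described above is easy to check at the interpreter. This is only a minimal sketch, assuming CPython's built-in `iso8859-1` and `utf-8` codecs; the variable names are illustrative and not from the thread.

```python
# C1 bytes 0x80-0x9F, the range under discussion.
c1_bytes = bytes(range(0x80, 0xA0))

# The ISO-8859-1 codec maps each byte to the code point of the same value,
# so the C1 range decodes to U+0080..U+009F without error.
text = c1_bytes.decode("iso8859-1")
c1_passthrough = [ord(ch) for ch in text] == list(range(0x80, 0xA0))

# UTF-8 encodes C1 code points (as two-byte sequences) and decodes them back.
utf8_roundtrip = text.encode("utf-8").decode("utf-8") == text

# C0 controls pass through both codecs unchanged as single bytes.
c0_bytes = bytes(range(0x00, 0x20))
c0_passthrough = c0_bytes.decode("iso8859-1").encode("utf-8") == c0_bytes

print(c1_passthrough, utf8_roundtrip, c0_passthrough)
```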
On the other hand, the ISO standards forbid use of unassigned graphic code points as characters (graphic or control), and codecs quite reasonably treat unassigned graphic code points as errors. In Python, that practice is extended to the windows-* sets, which seems reasonable to me. But the windows-* encodings do not support C1 controls. Instead the entire right half of the code page is graphic (per Microsoft's IANA registrations), and that, I suppose, is why Python does not allow fallthrough of unassigned code points 0x80-0x9F in windows-* codecs.
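To see the contrast with the ISO-8859 codecs, one can probe which bytes in 0x80-0x9F the windows-1252 codec rejects. A small sketch, assuming CPython's `cp1252` codec (the stdlib name for windows-1252); the list name `unassigned` is just for illustration.

```python
# Collect the bytes in 0x80-0x9F that the cp1252 codec refuses to decode.
unassigned = []
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        # No graphic character is assigned here, so the codec raises
        # rather than falling through to a C1 control.
        unassigned.append(value)

# For comparison, the ISO-8859-1 codec accepts the whole range.
latin1_ok = len(bytes(range(0x80, 0xA0)).decode("iso8859-1")) == 32

print([hex(v) for v in unassigned], latin1_ok)
```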
I think python should follow the (de-facto) standard. This is it.
WHAT-WG encoding isn't a "de facto" standard, it's a published standard by a recognized (though forked) standards body. However, different standards are designed for different contexts, and WHAT-WG's encoding standard is clearly specifically aimed at browsers. It also may be useful for more specialized UI applications such as your IRC client, although IMO that's asking for trouble. Note also that the WHAT-WG standard is in a peculiar limbo between informative and normative. The standard encoding is UTF-8, end-of-story. What we're talking about here is best practices for UIs that are faced with non-conformant "legacy" documents, and want to display something anyway.
But Python is a general-purpose programming language, and should cleave to the most generally-accepted, well-defined standards, which are the ISO standards themselves in the case of ISO-defined coded character sets. Aliasing the ISO character sets (and ASCII! oh, my aching RFC 822 header!) to the corresponding windows-* as a *general* practice is pretty abominable, though it makes some sense in the case of browsers. For windows-* character sets, ISTM that the WHAT-WG repertoires of graphic characters are improvements of Microsoft's (assuming that WHAT-WG version their standards).
Applications can do what they want, of course, and I'm all for a PyPI package to make it easier to do that, whether by providing additional codecs, additional error handlers, or by post-processing surrogate-escaped bytes. I still don't think the WHAT-WG approach is a good fit for most use cases, nor should it be included in the stdlib. Most of the use cases I've seen proposed so far are well-served by existing Python features like errors='surrogateescape'.
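As a rough illustration of the post-processing route, the sketch below decodes with `errors='surrogateescape'`, shows that the original bytes survive a round trip, and then optionally rewrites the escaped bytes as C1 code points, which approximates the lenient browser-style result. The helper `escaped_to_c1` and the sample bytes are hypothetical, made up for this example; they are not an existing stdlib or PyPI API.

```python
# Sample input: mostly valid windows-1252, plus the unassigned byte 0x81.
raw = b"caf\xe9 \x81 \x93quoted\x94"

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
text = raw.decode("cp1252", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("cp1252", errors="surrogateescape")


def escaped_to_c1(s):
    """Hypothetical helper: turn surrogate-escaped bytes 0x80-0x9F into C1 code points."""
    return "".join(
        chr(ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDC9F else ch
        for ch in s
    )


# Application-level choice: treat the escaped byte 0x81 as U+0081.
lenient = escaped_to_c1(text)

print(roundtrip == raw, ascii(lenient))
```

Keeping the mapping in application code, rather than in the codec, leaves the strict behavior as the default and makes the lenient interpretation an explicit opt-in.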
I'm just glad I *always* use bytestrings when dealing with network protocols, I guess. It's the only reasonable option.
Steve