Re: [Python-ideas] Support WHATWG versions of legacy encodings

On Tue, Jan 16, 2018 at 9:30 PM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
> In what context? WHAT-WG's encoding standard is *all about browsers*. If a codec is feeding text into a process that renders them all as glyphs for a human to look at, that's one thing. The codec doesn't want to fatal there, and the likely fallback glyph is something from the control glyphs block if even windows-125x doesn't have a glyph there. I guess it sort of makes sense.

Sure it does -- and Python is not a browser, and Python itself has nothing visual -- but we sure want to be able to write code that produces visual representations of maybe-messy text...

> if you're feeding a program ... the codec has no idea when or how that's going to get interpreted.

Sure -- which is why others have suggested that if WHATWG is supported, then it *should* only be used for decoding, not encoding. But we are supposed to be consenting adults here -- I see no reason to prevent encoding -- maybe it would be useful for testing???

> (as with JSON data, which I believe is "supposed" to be UTF-8, but many developers use the legacy charsets they're used to and which are often embedded in the underlying databases etc, ditto XML),

OK -- if developers do the wrong thing, then they do the wrong thing -- we can't prevent that! And Python's lovely "text is unicode" model actually makes that hard to do wrong. But we do need a way to decode messy text, and then send it off to JSON or whatever properly encoded.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

Chris.Barker@noaa.gov

On Wed, 17 Jan 2018 at 13:00 Chris Barker <chris.barker@noaa.gov> wrote:
> sure -- which is why others have suggested that if WHATWG is supported, then it *should* only be used for decoding, not encoding.

I'm going to push back on the idea that this should only be used for decoding, not encoding.

The use case I started with -- showing people how to fix mojibake using Python -- would *only* use these codecs in the encoding direction. To fix the most common case of mojibake, you encode it as web-1252 and decode it as UTF-8 (because you got the data from someone who did the opposite).

I have implemented some decode-only codecs (such as CESU-8), for exactly the reason of "why would you want more text in this encoding", but the situation is different here.
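Rob's mojibake recipe can be sketched with the stdlib's cp1252 codec standing in for the proposed web-1252 (no "web-1252" codec exists in Python today; the cp1252 version works only while the bytes avoid the five values cp1252 leaves unmapped):

```python
# Simulate the common failure: UTF-8 bytes wrongly decoded as Windows-1252.
original = "café"
mojibake = original.encode("utf-8").decode("cp1252")   # 'cafÃ©'

# Repairing it runs the codecs in the opposite direction: encode back
# to the original bytes, then decode them with the right codec.
fixed = mojibake.encode("cp1252").decode("utf-8")
assert fixed == original
```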

On Wed, Jan 17, 2018 at 10:13 AM, Rob Speer <rspeer@luminoso.com> wrote:
> I'm going to push back on the idea that this should only be used for decoding, not encoding. The use case I started with -- showing people how to fix mojibake using Python -- would *only* use these codecs in the encoding direction. To fix the most common case of mojibake, you encode it as web-1252 and decode it as UTF-8 (because you got the data from someone who did the opposite).

It's also nice to be able to parse some HTML data, make a few changes in memory, and then serialize it back to HTML. Having this crash on random documents is rather irritating, esp. if these documents are standards-compliant HTML as in this case.

-n

--
Nathaniel J. Smith -- https://vorpus.org

Nathaniel Smith writes:
> It's also nice to be able to parse some HTML data, make a few changes in memory, and then serialize it back to HTML. Having this crash on random documents is rather irritating, esp. if these documents are standards-compliant HTML as in this case.

This example doesn't make sense to me. Why would *conformant* HTML crash the codec? Unless you're saying the source is non-conformant and *lied* about the encoding? Then errors=surrogateescape should do what you want here, no? If not, new codecs won't help you---the "crash" is somewhere else.

Similarly, Soni's use case of control characters for formatting in an IRC client. If they're C0, then AFAICT all of the ASCII-compatible codecs do pass all of those through.[1] If they're C1, then you've got big trouble because the multibyte encodings will either error due to a malformed character or produce an unintended character (except for UTF-8, where you can encode the character in UTF-8). The windows-* encodings are quite inconsistent about the graphics they put in C1 space as well as where they leave holes, so this is not just application-specific, it's even encoding-specific behavior.

The more examples of claimed use cases I see, the more I think most of them are already addressed more safely by Python's existing mechanisms, and the less I see a real need for this in the stdlib, with the single exception that WHAT-WG may be a better authority to follow than Microsoft for windows-* codecs.

Footnotes:
[1] I don't like that much; I'd rather restrict to the ones that have universally accepted semantics, including CR, LF, HT, ESC, BEL, and FF. But passthrough is traditional there, a few more are in somewhat common use, and I'm not crazy enough to break backward compatibility.
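The existing mechanism Stephen points to can be shown in a few lines: errors='surrogateescape' lets bytes that the codec can't map survive a decode/encode round trip unchanged (a sketch using Python's cp1252, which rejects 0x81 under strict errors):

```python
# Bytes that claim to be Windows-1252 but include 0x81, one of the five
# bytes Python's cp1252 leaves unmapped (a strict decode would raise).
data = b"caf\xe9 \x81"

text = data.decode("cp1252", errors="surrogateescape")
# The unmappable byte is smuggled through as the lone surrogate U+DC81 ...
assert text == "café \udc81"

# ... and re-encoding with the same error handler restores the exact bytes.
assert text.encode("cp1252", errors="surrogateescape") == data
```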

On Thu, Jan 18, 2018, at 11:04, Stephen J. Turnbull wrote:
> Nathaniel Smith writes:
> > It's also nice to be able to parse some HTML data, make a few changes in memory, and then serialize it back to HTML. Having this crash on random documents is rather irritating, esp. if these documents are standards-compliant HTML as in this case.
>
> This example doesn't make sense to me. Why would *conformant* HTML crash the codec? Unless you're saying the source is non-conformant and *lied* about the encoding?

I think his point is that the WHATWG standard is the one that governs HTML, so HTML that uses these encodings (including the C1 characters) is conformant to *that* standard, regardless of its status with respect to anything published by Unicode. The new encodings (whatever they are called), including the round-trip of b'\x81' as \u0081, are the ones identified by a statement in an HTML document that it uses windows-1252, and therefore such a statement is not a lie.
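The behavioral gap Random832 describes is easy to demonstrate: Python's cp1252 rejects the five bytes Microsoft left undefined, while the WHATWG windows-1252 index maps them to the C1 controls with the same value. A per-byte fallback sketch (the function name is mine; it is not a real codec, and byte-at-a-time decoding is fine here only because windows-1252 is single-byte):

```python
def decode_whatwg_1252(data: bytes) -> str:
    # WHATWG windows-1252 matches Python's cp1252 except that the five
    # bytes cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D)
    # decode to the C1 control characters with the same code point.
    chars = []
    for b in data:
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(b))  # e.g. b'\x81' -> U+0081
    return "".join(chars)

assert decode_whatwg_1252(b"\x80\x81") == "\u20ac\u0081"  # euro sign, C1 control
```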

Random832 writes:
> I think his point is that the WHATWG standard is the one that governs HTML and therefore HTML that uses these encodings (including the C1 characters) are conformant to *that* standard,

I don't think that is a tenable interpretation of this standard. The WHAT-WG standard encoding for HTML is UTF-8. This is what https://encoding.spec.whatwg.org/#names-and-labels says:

    Authors must use the UTF-8 encoding and must use the ASCII case-insensitive "utf-8" label to identify it. New protocols and formats, as well as existing formats deployed in new contexts[1], must use the UTF-8 encoding exclusively. If these protocols and formats need to expose the encoding's name or label, they must expose it as "utf-8".

Non-UTF-8 *documents* do not conform. There's nothing anywhere that says you may use other encodings, with the single exception of implied permission when encoding form input to send to the server (and that's not even HTML!). Even there you're encouraged to use UTF-8.

The rest of the standard provides for how *processes* should handle encodings in purported HTML documents that fail the requirement to encode in UTF-8. That doesn't mean such documents conform; it simply *gives permission* to a conformant process to try to deal with them, and rules for doing that.

Yes, it's true that WHAT-WG processing probably would have saved Nathaniel some aggravation with his manipulations of HTML. It's equally likely that errors='surrogateescape' would do so, and a better job on encodings like Hebrew that leave code points in graphic regions undefined.

Footnotes:
[1] I take this to mean that when I take an EUC-JP HTML document and move it from my legacy document tree to my new Django static resource collection, I *must* transcode it to UTF-8.
participants (5)
- Chris Barker
- Nathaniel Smith
- Random832
- Rob Speer
- Stephen J. Turnbull