Ian Bicking writes:
On Sun, Jan 9, 2011 at 1:47 AM, Stephen J. Turnbull wrote:
Robert Brewer writes:
Python 3.1 was released June 27th, 2009. We're coming up faster on the two-year period than we seem to be on a revised WSGI spec. Maybe we should shoot for a "bytes of a known encoding" type first.
You have one. It's called "ISO 2022: Information processing -- ISO 7-bit and 8-bit coded character sets -- Code extension techniques". The popularity of that standard speaks for itself.
The kind of object PJE was referring to is more like Ruby's strings,
Notice that Ruby was written by a Japanese, the same culture that brought us Mule, TRON, X Compound Text, and ISO-2022 in the first place. Matsumoto himself probably isn't infected with the "Unicode is going to be the death of all Japanese culture" bug, but that's the attitude that is behind ISO 2022.
which do not embed the encoding inside the bytes themselves but have the encoding as a kind of annotation on the bytes,
My point is that ISO-2022 is basically just a serialization of that. And it sucks; nobody uses it, except in Japanese and Korean email. Maybe Mandarin (but Taiwan and Hong Kong use Big5 or EUC, not an escape-extended representation).
and do lazy transcoding when combining strings of different encodings.
Which buys WSGI nothing, AIUI, since the people who want this claim that translating to Unicode either correctly or as "big bytes" (i.e., zero-extension) is inefficient. They're shoveling bits; much of the time, by the time the out-of-band information catches up, it's going to be too late.
The goal with respect to WSGI is that you could annotate bytes with an encoding but also change or fix that encoding if other out-of-band information implied that you got the encoding wrong (e.g., some data is submitted with the encoding of the page the browser was on, and so nothing inside the request itself will indicate the encoding of the data).
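To make the idea above concrete, here is a minimal Python sketch of bytes that carry their encoding as an annotation which can be corrected later. TaggedBytes, retag() and text() are hypothetical names invented for this illustration; nothing like them exists in WSGI or the standard library, and this is not how Ruby implements its strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedBytes:
    """Hypothetical type: raw bytes plus an encoding annotation."""
    data: bytes
    encoding: str = "latin-1"  # assumption: provisional guess until told otherwise

    def retag(self, encoding: str) -> "TaggedBytes":
        # Change only the annotation; the raw bytes are left untouched.
        return TaggedBytes(self.data, encoding)

    def text(self) -> str:
        # Decode lazily, only when a str is actually requested.
        return self.data.decode(self.encoding)

    def __add__(self, other):
        if not isinstance(other, TaggedBytes):
            return NotImplemented
        if self.encoding == other.encoding:
            # Same annotation: plain byte concatenation, no transcoding.
            return TaggedBytes(self.data + other.data, self.encoding)
        # Different annotations: transcode both sides to UTF-8 at this point.
        joined = self.text() + other.text()
        return TaggedBytes(joined.encode("utf-8"), "utf-8")


# A form field arrives with nothing in the request stating its encoding.
field = TaggedBytes(b"caf\xc3\xa9")
guessed = field.text()               # mojibake under the Latin-1 guess
fixed = field.retag("utf-8").text()  # correct once out-of-band info arrives

# Combining differently tagged values forces a transcode.
mixed = TaggedBytes(b"\xe9", "latin-1") + TaggedBytes(b"\xc3\xa9", "utf-8")
```

Note that retag() only helps if the bytes have not already been combined with something else; once __add__ has transcoded under a wrong guess, the late correction no longer applies cleanly.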
A noble goal, but nobody's gonna bell that cat. This is all just wishful thinking. Two decades of experience with Emacs/Mule and similar efforts show that if you provide this facility, people will use it, and that use will include a lot of abuse (i.e., throwing the garbage into somebody else's backyard, rather than disposing of it yourself) -- in the end, the garbage gets piled high enough that it's not worth the effort to try to make it work.
Latin1 is kind of the poor man's version of this -- it's a good guess at an encoding, that at worst requires transcoding that can be done in a predictable way. (Personally I think Latin1 gets us 99% of the way there, and so bytes-of-a-known-encoding are not really that important to the WSGI case.)
In particular, it gets PJE 100% of the way there, since he proposes always targeting ISO 8859/1, anyway. And if it's not useful to WSGI, who is it useful to?
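For readers who have not seen the Latin-1 trick discussed above, a short Python sketch of why the transcoding is predictable: decoding as Latin-1 maps each byte to the code point of the same value, so it never fails and can always be undone. The sample string and variable names are made up for this illustration.

```python
# Bytes as they might arrive on the wire; here they happen to be UTF-8.
raw = "na\u00efve caf\u00e9".encode("utf-8")

# Decoding as Latin-1 cannot fail: byte N becomes code point N.
as_latin1 = raw.decode("latin-1")

# The guess is wrong for this data, but nothing has been lost:
# encoding back to Latin-1 reproduces the original bytes exactly.
recovered = as_latin1.encode("latin-1")

# Once the real encoding is known, decode the recovered bytes properly.
text = recovered.decode("utf-8")

# The round trip holds for every possible byte value.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin-1").encode("latin-1")
```

The cost is the one raised earlier in the thread: every value passes through a str, which is exactly the "big bytes" conversion some people consider too expensive.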