[Python-Dev] PEP 3333: wsgi_string() function

Tue Jan 11 02:03:12 CET 2011

Ian Bicking writes:
 > On Sun, Jan 9, 2011 at 1:47 AM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
 > 
 > > Robert Brewer writes:
 > >
 > >  > Python 3.1 was released June 27th, 2009. We're coming up faster on the
 > >  > two-year period than we seem to be on a revised WSGI spec. Maybe we
 > >  > should shoot for a "bytes of a known encoding" type first.
 > >
 > > You have one.  It's called "ISO 2022: Information processing -- ISO
 > > 7-bit and 8-bit coded character sets -- Code extension techniques".
 > > The popularity of that standard speaks for itself.
 > >
 > 
 > The kind of object PJE was referring to is more like Ruby's strings,

Notice that Ruby was written by a Japanese developer, from the same culture that
brought us Mule, TRON, X Compound Text, and ISO-2022 in the first
place.  Matsumoto himself probably isn't infected with the "Unicode is
going to be the death of all Japanese culture" bug, but that's the
attitude that is behind ISO 2022.

 > which do not embed the encoding inside the bytes themselves but have the encoding
 > as a kind of annotation on the bytes,

My point is that ISO-2022 is basically just a serialization of that.

And it sucks; nobody uses it, except in Japanese and Korean email.
Maybe Chinese (but Taiwan and Hong Kong use Big5 or EUC, not an
escape-extended representation).
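
To make the "serialization of an annotation" point concrete, here is a
minimal sketch using Python's built-in iso2022_jp codec.  The sample
string is my own choice, not anything from the thread; the escape
sequences are the ones ISO-2022-JP uses to switch between ASCII and
JIS X 0208.

```python
# Sketch: ISO-2022-JP marks charset switches in-band with escape sequences.
# The sample text is an illustrative assumption.
text = "abc\u65e5\u672c\u8a9e"  # "abc" followed by three kanji

encoded = text.encode("iso2022_jp")

ESC_JISX0208 = b"\x1b$B"  # switch to JIS X 0208
ESC_ASCII = b"\x1b(B"     # switch back to ASCII

# The ASCII prefix is emitted as-is, then the charset "annotation" is
# embedded in the byte stream itself before the kanji.
print(encoded)
print(encoded.startswith(b"abc" + ESC_JISX0208))
print(encoded.endswith(ESC_ASCII))
```

The encoding information travels inside the bytes, which is exactly
what a separate annotation on the string object would otherwise carry.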

 > and do lazy transcoding when combining strings of different
 > encodings.

Which buys WSGI nothing, AIUI, since the people who want this claim
that translating to Unicode either correctly or as "big bytes" (i.e.,
zero-extension) is inefficient.  They're shoveling bits; much of the
time, by the time the out-of-band information catches up, it's going
to be too late.
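
For anyone unsure what "big bytes" means here, a rough sketch: decoding
as Latin-1 maps each byte to the code point with the same numeric
value, so it is plain zero-extension and it round-trips.  The variable
names are mine, purely for illustration.

```python
# Sketch: Latin-1 decoding as zero-extension of bytes to code points.
raw = bytes(range(256))          # every possible byte value

as_text = raw.decode("latin-1")  # never fails: all 256 values are mapped

# Each character's code point equals the original byte value.
same_values = [ord(c) for c in as_text] == list(raw)

# Encoding back recovers the original bytes exactly.
round_trips = as_text.encode("latin-1") == raw

print(same_values, round_trips)
```

The cost being complained about is this extra copy in each direction,
not any loss of information.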

 > The goal with respect to WSGI is that you could annotate bytes with
 > an encoding but also change or fix that encoding if other
 > out-of-band information implied that you got the encoding wrong
 > (e.g., some data is submitted with the encoding of the page the
 > browser was on, and so nothing inside the request itself will
 > indicate the encoding of the data).

A noble goal, but nobody's gonna bell that cat.  This is all just
wishful thinking.  Two decades of experience with Emacs/Mule and similar
efforts show that if you provide this facility, people will use it,
and that use will include a lot of abuse (i.e., throwing the garbage
into somebody else's backyard, rather than disposing of it yourself)
-- in the end, the garbage gets piled high enough that it's not worth
the effort to try to make it work.

 > Latin1 is kind of the poor man's version of this -- it's a good
 > guess at an encoding, that at worst requires transcoding that can
 > be done in a predictable way.  (Personally I think Latin1 gets us
 > 99% of the way there, and so bytes-of-a-known-encoding are not
 > really that important to the WSGI case.)

In particular, it gets PJE 100% of the way there, since he proposes
always targeting ISO 8859/1, anyway.
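
A sketch of how the "predictable transcoding" works in practice,
assuming the server hands the application Latin-1-decoded strings and
the real encoding (UTF-8 in this made-up example) only becomes known
later.  The helper name redecode is hypothetical, not part of any WSGI
API.

```python
# Sketch: fix up a provisionally Latin-1-decoded string once the real
# encoding is known.  Names and sample data are illustrative assumptions.

def redecode(provisional: str, encoding: str) -> str:
    """Recover the original bytes and decode them with the real encoding."""
    return provisional.encode("latin-1").decode(encoding)

# Bytes as they arrived on the wire (really UTF-8).
wire = "na\u00efve caf\u00e9".encode("utf-8")

# What the application sees first: lossless, but mojibake.
provisional = wire.decode("latin-1")

# Later, out-of-band information says the data was UTF-8.
fixed = redecode(provisional, "utf-8")

print(repr(provisional))
print(repr(fixed))
```

Because the Latin-1 step is lossless, getting the first guess wrong
costs a transcoding pass rather than data.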

And if it's not useful to WSGI, who is it useful to?