[Web-SIG] bytes, strings, and Unicode in Jython, IronPython, and CPython 3.0
Phillip J. Eby
pje at telecommunity.com
Wed Sep 15 18:23:41 CEST 2004
At 03:28 PM 9/15/04 +0100, Alan Kennedy wrote:
>String parameters in jython are always passed as unicode strings,
>containing either textual strings or the binary-string/byte-arrays
>described above. So the strings received by the jython
>start_response_callable will be either textual or binary unicode strings.
>The start_response_callable has to be able to operate on these strings
>regardless, i.e. transform them using standard python functions, e.g.
>.split(' '), int(), etc. If these functions fail to operate correctly on a
>binary string, then there is little the start_response_callable can do,
>without knowing the encoding of the binary string so that it can decode to
>a textual string. If the operations fail on a textual string, it is
>because the string contains invalid data for the operation.
The point here is that a Jython WSGI server should do one of two things
with every string supplied to it (whether in 'start_response()', 'write()',
or yielded by the iterable): either invoke '.encode("latin1")' on it, or
otherwise verify that it contains no non-latin1 characters. For the status
and headers only, it may instead (optionally) transcode any non-latin1
characters as prescribed by RFC 2047. It should be a fatal error, however,
to send a non-latin1 string to 'write()' or yield one from the iterable.
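A minimal sketch of the first option, assuming a modern Python 3 runtime
(where 'str' is Unicode-based) as a stand-in for Jython. The helper name
'to_wire_bytes' and its 'where' argument are hypothetical, not part of
WSGI:

```python
def to_wire_bytes(value, where):
    """Encode a Unicode-based string to ISO-8859-1 bytes for the wire.

    'where' is a label (e.g. "write()") used only in the error message.
    Hypothetical helper: any character outside latin1 is a fatal error.
    """
    try:
        return value.encode("latin-1")
    except UnicodeEncodeError as exc:
        bad = value[exc.start:exc.end]
        raise ValueError(
            "non-latin1 character(s) %r passed to %s" % (bad, where)
        ) from exc
```

A server could call this on every string it receives from
'start_response()', 'write()', or the iterable, and treat the ValueError
as the fatal error described above.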
>What jython should do
>So any jython middleware, gateway or server that receives a Unicode string
>for a header value must
>A: Send it without transformation if all upper-bytes are zero.
>B: Encode it according to RFC 2047 if there are non-zero upper-bytes, then
>In the case of B, how should the jython code know which iso-8859-X charset
>to use for RFC 2047? Is there library code? Is mimify the right module to use?
Actually, 'B' is optional. (Note that my proposal said a server *may*
accept Unicode, not that it was required to do so.) It is also perfectly
valid for a server or gateway to reject Unicode that can't be rendered as
latin1. In other words, only 'A' is required. That's because applications
are already required to do their own latin1/RFC 2047 encoding.
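For a server that does choose to support the optional 'B', one possible
approach is sketched below. It assumes modern Python 3 and the standard
library's 'email.header.Header' class for the RFC 2047 encoding. The
function name and the default of UTF-8 as the charset are assumptions
made for illustration, not something the proposal prescribes:

```python
from email.header import Header


def header_value_for_wire(value, charset="utf-8"):
    """Return a header value that is safe to send as latin1.

    Option A: pass the value through untouched if it is already latin1.
    Option B (optional): otherwise wrap it as an RFC 2047 encoded-word.
    The 'charset' default is an assumption for this sketch.
    """
    try:
        value.encode("latin-1")
        return value
    except UnicodeEncodeError:
        return Header(value, charset).encode()
```

A server that does not want to implement 'B' can simply raise an error in
the 'except' branch instead.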
But after looking at all of your comments and thinking this over a bit, I
think there's a simpler way to specify the intent of my proposal:
"""On Python platforms where the 'str' or 'StringType' type is
Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all strings
must contain only characters representable in ISO-8859-1 encoding (\u0000
through \u00FF, inclusive). It should be considered a fatal error for an
application to supply strings containing any other Unicode character,
whether the string is in the 'headers', the 'status', supplied to
'write()', or is produced by the application's returned iterable."""
Adding this to the current "Unicode" section would suffice, I think, to
deal with the current and future platforms in a cleanly compatible way. It
also makes it clear that there is no additional burden on either the
server/gateway or application sides: it's just a clarification of what it
means to be a 'str' for WSGI's purposes.
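As an illustration of the proposed wording, here is a rough sketch of the
check it implies, written for modern Python 3. It follows the text above
in treating status, headers, and body chunks all as Unicode-based 'str'.
The function names are hypothetical:

```python
def first_bad_index(value):
    """Return the index of the first character above \\u00FF, or None."""
    for index, ch in enumerate(value):
        if ord(ch) > 0xFF:
            return index
    return None


def check_wsgi_strings(status, headers, body_chunks):
    """Raise ValueError if any supplied string leaves the latin1 range.

    'headers' is a list of (name, value) pairs and 'body_chunks' is an
    iterable of strings passed to write() or yielded by the iterable.
    """
    items = [("status", status)]
    for name, value in headers:
        items.append(("header name", name))
        items.append(("header value", value))
    items.extend(("body chunk", chunk) for chunk in body_chunks)

    for label, value in items:
        index = first_bad_index(value)
        if index is not None:
            raise ValueError(
                "%s contains non-ISO-8859-1 character %r at index %d"
                % (label, value[index], index)
            )
```

Note that the check only inspects code points, so it adds no encoding or
decoding work on either side.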