[Web-SIG] bytes, strings, and Unicode in Jython, IronPython,
and CPython 3.0
Alan Kennedy
py-web-sig at xhaus.com
Wed Sep 15 16:28:14 CEST 2004
[Phillip J. Eby]
> I've reviewed last month's Python-Dev discussion about the future
> Python 'bytes()' type, and the eventual transition away from Python's
> current 8-bit strings.
>
> Mainly, the impression I get is that significant change in this
> respect really can't happen until Python 3.0, because too many
> things have to change at once for it to work.
>
> So, here's what I propose to do about the open issue in PEP 333.
> Servers and gateways that run under Python implementations where all
> strings are Unicode (e.g. Jython) *may*:
Encoding issues? "Oh no", screams Alan, turning tail and sprinting away!
;-)
Before starting my response, I just want to point out two things:
1. I'm no bot when it comes to python and character encodings.
2. The text below may come across a little cold. I've spent a few
hours thinking through the issues, checking code, rewriting text,
rewriting, rewriting, .... I think the below is the most accurate
picture I can present: it won't win any poetry competitions.
Before getting into the WSGI parameter encoding issues, just a quick
overview of character strings vs. binary strings in jython.
Strings in jython: textual vs. binary
=====================================
Java stores all textual strings as unicode strings, i.e. sequences of
2-byte characters. These strings can be transcoded to any encoding; the
result of transcoding is a sequence of bytes.
Java keeps the concept of textual unicode strings and byte sequences
separate, through the use of (rigidly enforced) method signatures. This
ensures both static type correctness and memory efficiency.
Jython blends the two concepts, by using java.lang.String's to store
both python text strings and python binary strings, i.e. byte arrays. It
stores the latter by the trick of only using the lower byte of each
two-byte unicode character to store data, leaving the upper byte unused.
You can see this by running this code on jython.
#--------------------------------------------
s = u'\u00E1\u00E9\u00ED\u00F3\u00FA'
u8 = s.encode('utf-8')
u16 = s.encode('utf-16')
for x in [s, u8, u16]:
    print "%d:%s:%s" % (len(x), str(type(x)), `x`)
#--------------------------------------------
which outputs
"""
5:org.python.core.PyString:'\xE1\xE9\xED\xF3\xFA'
10:org.python.core.PyString:'\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA'
12:org.python.core.PyString:'\xFE\xFF\x00\xE1\x00\xE9\x00\xED\x00\xF3\x00\xFA'
"""
The only way to create binary strings in jython is to create them
explicitly, for example, by transcoding text strings as above, or by
reading from a byte-oriented stream like a socket, or binary file. These
binary strings do not have their encoding metadata associated with them,
in common with cpython: the programmer must know the encoding of the
byte-array/binary-string they're handling.
When these binary strings are created, and stored as textual unicode
strings, they look like latin-1 textual strings, since all of the
upper-bytes of the characters are zero. So on jython, a binary encoded
latin-1 string and a unicode string containing only latin-1 characters
are represented identically.
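The same equivalence can be checked on a Python that keeps text and bytes apart. The following is only a sketch, and it assumes CPython 3 (separate `str` and `bytes` types) rather than jython; the variable names are purely illustrative.

```python
# Assumption: CPython 3, where bytes and str are separate types.
text = u'\u00E1\u00E9\u00ED\u00F3\u00FA'   # five latin-1 characters
raw = text.encode('utf-8')                  # ten bytes of utf-8

# Treating each byte as a character with a zero upper byte is exactly
# what decoding as latin-1 does: one byte in, one code point out.
as_chars = raw.decode('latin-1')

assert len(as_chars) == len(raw) == 10
assert [ord(c) for c in as_chars] == list(raw)

# The mapping is lossless, so the original bytes come back unchanged.
assert as_chars.encode('latin-1') == raw

# A text string holding only latin-1 characters has the same shape:
# every code point is below 256.
assert all(ord(c) < 256 for c in text)
```

Nothing in `as_chars` records that it started life as utf-8, which is the point made above: the programmer has to know.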
In jython, any other time a string is created, by assignment to a string
literal ('', "", """ """), or by reading from a text file, text stream,
etc, the result is always a textual unicode string.
So, on to WSGI
[Phillip J. Eby]
> * accept Unicode statuses and headers, so long as they properly encode
> them for transmission (latin-1 + RFC 2047)
String parameters in jython are always passed as unicode strings,
containing either textual strings or the binary-string/byte-arrays
described above. So the strings received by the jython
start_response_callable will be either textual or binary unicode strings.
The start_response_callable has to be able to operate on these strings
regardless, i.e. transform them using standard python functions, e.g.
.split(' '), int(), etc. If these functions fail to operate correctly on
a binary string, then there is little the start_response_callable can
do, without knowing the encoding of the binary string so that it can
decode to a textual string. If the operations fail on a textual string,
it is because the string contains invalid data for the operation.
Note that this is in common with cpython, under which code must also
assume that .split() and int() will simply work on the string passed,
without knowing its encoding.
Status
======
So, in the case of the http status value, as long as
int(status_str.split(' ')[0]) returns an integer, that's fine. Which should
be the case all of the time, as long as what was passed really was a
string containing an ascii integer followed by a space.
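As a rough illustration of that check, here is a small sketch. The function name `parse_status_code` is made up for this example and is not part of WSGI or of any server.

```python
def parse_status_code(status_str):
    """Return the numeric code from a status string such as '200 OK'.

    Hypothetical helper, for illustration only.
    """
    # Split once on the first space: [code, reason-phrase].
    code_part = status_str.split(' ', 1)[0]
    # int() raises ValueError if the code part is not a plain integer.
    return int(code_part)
```

The same code runs whether the string arrived as a textual or a binary string, provided the leading characters really are ascii digits.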
Headers
=======
In the case of the header list, both header names and header values
could also be passed as either textual or binary strings. There are
three scenarios for the content of those strings
1. They are binary strings, i.e. have zero upper-bytes, and are
presumably suitable (application knows best) for use as http headers
without transformation.
2. They are latin-1 strings, i.e. have zero upper-bytes, and are thus
suitable for use as http headers without transformation.
3. They are non latin-1 strings, i.e. have non-zero upper-bytes, and so
will have to be encoded before transmission, according to RFC 2047.
What jython should do
=====================
So any jython middleware, gateway or server that receives a Unicode
string for a header value must
A: Send it without transformation if all upper-bytes are zero.
B: Encode it according to RFC 2047 if there are non-zero upper-bytes,
then send it.
In the case of B, how should the jython code know which iso-8859-X
charset to use for RFC 2047? Is there library code? Is mimify the right
module to use?
A couple of notes about J2EE
============================
1. Under J2EE, the HttpServletResponse method signatures specify that a
java.lang.String, i.e. 2-byte unicode, value must be given for header
names and values (although see next point).
2. The most recent 2.4 version of the servlet specification now permits
header strings to be an "octet string ... encoded according to RFC
2047". This was not specified in previous versions of the spec, i.e. 2.3
or 2.2.
http://java.sun.com/j2ee/1.4/docs/api/javax/servlet/http/HttpServletResponse.html#addHeader(java.lang.String,%20java.lang.String)
3. Which indicates to me that J2EE expects that you have completely
taken care of encoding yourself, i.e. that you will have RFC-2047
encoded your header, if required, before passing it to J2EE.
4. So if a jython start_response_callable receives a binary string, it
should simply transmit it directly. If it receives a unicode string with
non-zero upper-bytes, it should attempt to encode it in RFC-2047 before
transmission. This could be done like so
unicode_header = "my value"
try:
    wire_string = unicode_header.encode('latin-1')
except UnicodeError:
    wire_string = encode_in_rfc2047(unicode_header)
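One possible shape for the `encode_in_rfc2047` helper named above, as a sketch only. It assumes CPython 3's standard `email.header` module is available (not verified on jython), and it picks utf-8 as the RFC 2047 charset, which is my choice here and not something the WSGI discussion has settled. `header_to_wire` is likewise a made-up name.

```python
from email.header import Header

def encode_in_rfc2047(value, charset='utf-8'):
    # Header.encode() returns the RFC 2047 "encoded-word" form as
    # ascii-only text. Long values may be folded across lines, which
    # a real server would have to deal with.
    return Header(value, charset).encode()

def header_to_wire(value):
    """Return the bytes to transmit for one header value (hypothetical)."""
    try:
        # All upper bytes zero: send without transformation.
        return value.encode('latin-1')
    except UnicodeError:
        # Non-zero upper bytes: RFC 2047 encode first, then send.
        return encode_in_rfc2047(value).encode('ascii')
```

With this sketch, `header_to_wire(u'caf\u00e9')` goes out as plain latin-1 bytes, while a value containing characters above U+00FF goes out as an `=?utf-8?...?=` encoded word.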
Standalone pure jython server
=============================
When running a standalone pure jython WSGI server, jython code will be
writing header values directly to the client socket. In this case, the
jython start_response_callable/server needs latin-1/RFC2047 strings to
transmit down the socket. The same rules as J2EE above apply to the
treatment of strings in this case.
So, in regards to the WSGI requirement above, the application *must*
transmit Unicode statuses and headers to a jython
start_response_callable, which will attempt to appropriately RFC-2047
encode the strings if they contain anything other than latin-1 characters.
Which I think completely agrees with your requirement as stated, just
with different wording.
[Phillip J. Eby]
> * accept Unicode for response body segments, so long as each segment
> may be encoded as latin-1 (i.e. only uses chars 0-255)
I would say "jython servers can *only* accept unicode strings for
response body segments", since this is the jython mechanism for passing
binary strings.
As you (kind-of) specify, the response body segment is not really a
latin-1 encoded textual string, it is really a binary string of varying
encoding, depending on the application. But treating it as a latin-1
string has the effect of preserving its content as a binary string.
So again, I think that this meets with your requirement, except stated
differently.
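To make the "chars 0-255" rule concrete, here is a sketch of what a server might do with each body segment. It assumes CPython 3, and `body_segment_to_bytes` is a hypothetical name, not part of any spec.

```python
def body_segment_to_bytes(segment):
    """Turn a 'binary string' (chars 0-255 only) back into real bytes."""
    # latin-1 maps code point N to byte N, so the content is untouched.
    # Raises UnicodeEncodeError if any character is above U+00FF.
    return segment.encode('latin-1')

# A utf-8 encoded body, carried as a string of chars 0-255.
payload = u'\u65e5\u672c'.encode('utf-8')
segment = payload.decode('latin-1')
assert body_segment_to_bytes(segment) == payload
```

A segment containing genuine non-latin-1 text fails loudly here; it does not get silently mangled. That seems the safer behaviour for a server.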
If WSGI response bodies "crossed over" somehow from a cpython
application to a jython application, through either swig-style linkage
or through some form of http relay protocol such as FastCGI, the jython
receiving end of that would have to produce a response body encoded as a
jython binary string. Which is exactly what jython socket operations,
etc, produce. So pure python middleware code that distributes WSGI
requests over, say a network socket, should run identically between
jython and cpython. Which is nice to know.
And which would probably be true for IronPython too: That Jim Hugunin is a
clever lad. Jython really does all this stuff pretty seamlessly in
relation to cpython.
[Phillip J. Eby]
> * produce Unicode input headers and body strings by decoding from
> latin-1, as long as the produced values are considered type 'str' for
> that Python implementation.
On jython, there is no point in decoding latin-1 strings to unicode
strings, because their representations are identical: both are
types.StringType, both take 2 bytes per character/byte, with the upper
byte as zero.
If the recipient is another jython component, all string types will be
received correctly.
If the recipient is a cpython component, then it will still receive the
correct string, because whatever interface lies between the cpython and
the jython will have correctly converted the data (if it was latin-1 data).
So perhaps this requirement could be stated as "jython
components/applications must produce unicode input headers and body
strings, which must only contain latin-1 characters"?
Whew! That turned out to be not so bad after all! (Alan crosses his
fingers behind his back :-)
Regards,
Alan.