[Web-SIG] bytes, strings, and Unicode in Jython, IronPython, and CPython 3.0

Wed Sep 15 16:28:14 CEST 2004

[Phillip J. Eby]
 > I've reviewed last month's Python-Dev discussion about the future
 > Python  'bytes()' type, and the eventual transition away from Python's
 > current 8-bit strings.
 >
 > Mainly, the impression I get is that significant change in this
 > respect really can't happen until Python 3.0, because too many
 > things have to  change at once for it to work.
 >
 > So, here's what I propose to do about the open issue in PEP 333.
 > Servers and gateways that run under Python implementations where all
 > strings are Unicode (e.g. Jython) *may*:

Encoding issues? "Oh no", screams Alan, turning tail and sprinting away!

;-)

Before starting my response, I just want to point out two things:

1. I'm no bot when it comes to python and character encodings.

2. The text below may come across a little cold. I've spent a few 
hours thinking through the issues, checking code, rewriting text, 
rewriting, rewriting, .... I think the below is the most accurate 
picture I can present: it won't win any poetry competitions.

Before getting into the WSGI parameter encoding issues, just a quick 
overview of character strings vs. binary strings in jython.

Strings in jython: textual vs. binary
=====================================

Java stores all textual strings as unicode strings, i.e. sequences of 
2-byte characters. These strings can be transcoded to any encoding, and 
transcoding yields a sequence of bytes.

Java keeps the concept of textual unicode strings and byte sequences 
separate, through the use of (rigidly enforced) method signatures. This 
ensures both static type correctness and memory efficiency.

Jython blends the two concepts, by using java.lang.String's to store 
both python text strings and python binary strings, i.e. byte arrays. It 
stores the latter by the trick of only using the lower byte of each 
two-byte unicode character to store data, leaving the upper byte unused. 
You can see this by running this code on jython.

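#--------------------------------------------
# A text string: five accented latin-1 characters.
s = u'\u00E1\u00E9\u00ED\u00F3\u00FA'

# Transcoding produces binary strings (byte sequences).
u8 = s.encode('utf-8')
u16 = s.encode('utf-16')

# Print the length, type and repr of each string.
for x in [s, u8, u16]:
    print "%d:%s:%s" % (len(x), str(type(x)), repr(x))
#--------------------------------------------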

which outputs

"""
5:org.python.core.PyString:'\xE1\xE9\xED\xF3\xFA'
10:org.python.core.PyString:'\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA'
12:org.python.core.PyString:'\xFE\xFF\x00\xE1\x00\xE9\x00\xED\x00\xF3\x00\xFA'
"""

The only way to create binary strings in jython is to create them 
explicitly, for example, by transcoding text strings as above, or by 
reading from a byte-oriented stream like a socket, or binary file. These 
binary strings do not have their encoding metadata associated with them, 
in common with cpython: the programmer must know the encoding of the 
byte-array/binary-string they're handling.

When these binary strings are created, and stored as textual unicode 
strings, they look like latin-1 textual strings, since all of the 
upper-bytes of the characters are zero. So on jython, a binary encoded 
latin-1 string and a unicode string containing only latin-1 characters 
are represented identically.
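
A minimal sketch of that equivalence, written for CPython 3 as an 
assumption (there bytes and str are separate types, which makes the 
mapping visible). Decoding a byte sequence as latin-1 gives a text 
string whose code points equal the original byte values, which is the 
same layout described above for jython binary strings. The variable 
names are illustrative only.

```python
# Every possible byte value, as a real byte sequence.
raw = bytes(range(256))

# Decoding as latin-1 maps byte N to code point N, so the "upper byte"
# of every resulting character is zero.
as_text = raw.decode('latin-1')

code_points = [ord(ch) for ch in as_text]
all_upper_bytes_zero = all(cp < 256 for cp in code_points)
```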

In jython, any other time a string is created, by assignment to a string 
literal ('', "", """ """), or by reading from a text file, text stream, 
etc, the result is always a textual unicode string.

So, on to WSGI

[Phillip J. Eby]
 >  * accept Unicode statuses and headers, so long as they properly encode
 > them for transmission (latin-1 + RFC 2047)

String parameters in jython are always passed as unicode strings, 
containing either textual strings or the binary-string/byte-arrays 
described above. So the strings received by the jython 
start_response_callable will be either textual or binary unicode strings.

The start_response_callable has to be able to operate on these strings 
regardless, i.e. transform them using standard python functions, e.g. 
.split(' '), int(), etc. If these functions fail to operate correctly on 
a binary string, then there is little the start_response_callable can 
do, without knowing the encoding of the binary string so that it can 
decode to a textual string. If the operations fail on a textual string, 
it is because the string contains invalid data for the operation.

Note that this is in common with cpython, under which code must also 
simply assume that .split() and int() will work on the string passed, 
without knowing its encoding.

Status
======
So, in the case of the http status value, as long as 
int(status_str.split(' ')[0]) returns an integer, that's fine. Which should 
be the case all of the time, as long as what was passed really was a 
string containing an ascii integer followed by a space.
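
As a rough sketch of that check (parse_status is a hypothetical helper 
name, not something from PEP 333), the status string can be split once 
on the first space and the leading token converted with int(). The code 
never needs to know the encoding of the string, only that the first 
token is an ascii integer.

```python
def parse_status(status_str):
    """Split a status string such as '404 Not Found' into
    (integer code, reason phrase)."""
    parts = status_str.split(' ', 1)
    code = int(parts[0])  # raises ValueError if the token is not an integer
    reason = parts[1] if len(parts) > 1 else ''
    return code, reason
```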

Headers
=======
In the case of the header list, both header names and header values 
could also be passed as either textual or binary strings. There are 
three scenarios for the content of those strings

1. They are binary strings, i.e. have zero upper-bytes, and are 
presumably suitable (application knows best) for use as http headers 
without transformation.
2. They are latin-1 strings, i.e. have zero upper-bytes, and are thus 
suitable for use as http headers without transformation.
3. They are non latin-1 strings, i.e. have non-zero upper-bytes, and so 
will have to be encoded before transmission, according to RFC 2047.

What jython should do
=====================

So any jython middleware, gateway or server that receives a Unicode 
string for a header value must

A: Send it without transformation if all upper-bytes are zero.
B: Encode it according to RFC 2047 if there are non-zero upper-bytes, 
then send it.
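
The test that separates case A from case B can be sketched as below. 
fits_in_latin1 is a hypothetical name, and the code assumes a Python 
where ord() of a text character returns its code point.

```python
def fits_in_latin1(value):
    """True if every character has a zero upper byte (code point < 256)."""
    return all(ord(ch) < 256 for ch in value)
```

A header value for which this returns True falls under A. Anything else 
falls under B.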

In the case of B, how should the jython code know which iso-8859-X 
charset to use for RFC 2047? Is there library code? Is mimify the right 
module to use?

A couple of notes about J2EE
============================

1. Under J2EE, the HttpServletResponse method signatures specify that a 
java.lang.String, i.e. 2-byte unicode, value must be given for header 
names and values (although see next point).

2. The most recent 2.4 version of the servlet specification now permits 
header strings to be an "octet string ... encoded according to RFC 
2047". This was not specified in previous versions of the spec (i.e. 2.3 
or 2.2).

http://java.sun.com/j2ee/1.4/docs/api/javax/servlet/http/HttpServletResponse.html#addHeader(java.lang.String,%20java.lang.String)

3. Which indicates to me that J2EE expects that you have completely 
taken care of encoding yourself, i.e. that you will have RFC-2047 
encoded your header, if required, before passing it to J2EE.

4. So if a jython start_response_callable receives a binary string, it 
should simply transmit it directly. If it receives a unicode string with 
non-zero upper-bytes, it should attempt to encode it in RFC-2047 before 
transmission. This could be done like so

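unicode_header = "my value"
try:
    # All characters fit in latin-1: send the value as-is.
    wire_string = unicode_header.encode('latin-1')
except UnicodeError:
    # encode_in_rfc2047 stands for an RFC 2047 encoder, not defined here.
    wire_string = encode_in_rfc2047(unicode_header)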
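
One way the encode_in_rfc2047 placeholder might be filled in, as a 
sketch only: it assumes CPython 3 and its standard email.header module, 
and it picks utf-8 as the charset, which is an assumption of this 
example and not something WSGI or the servlet spec mandates. 
header_to_wire is a hypothetical wrapper name. Whether the same module 
is usable on a given jython has not been checked here.

```python
from email.header import Header

def encode_in_rfc2047(unicode_header, charset='utf-8'):
    """Return an RFC 2047 encoded-word form of a header value.
    The utf-8 default is an assumption of this sketch."""
    return Header(unicode_header, charset).encode()

def header_to_wire(unicode_header):
    """Apply the rule above: latin-1 if possible, RFC 2047 otherwise."""
    try:
        return unicode_header.encode('latin-1')
    except UnicodeError:
        # The encoded-word form is pure ascii by construction.
        return encode_in_rfc2047(unicode_header).encode('ascii')
```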

Standalone pure jython server
=============================

When running a standalone pure jython WSGI server, jython code will be 
writing header values directly to the client socket. In this case, the 
jython start_response_callable/server needs latin-1/RFC2047 strings to 
transmit down the socket. The same rules as J2EE above apply to the 
treatment of strings in this case.

So, with regard to the WSGI requirement above, the application *must* 
transmit Unicode statuses and headers to a jython 
start_response_callable, which will attempt to appropriately RFC-2047 
encode the strings if they contain anything other than latin-1 characters.

Which I think completely agrees with your requirement as stated, just 
with different wording.

[Phillip J. Eby]
 >  * accept Unicode for response body segments, so long as each segment
 > may be encoded as latin-1 (i.e. only uses chars 0-255)

I would say "jython servers can *only* accept unicode strings for 
response body segments", since this is the jython mechanism for passing 
binary strings.

As you (kind-of) specify, the response body segment is not really a 
latin-1 encoded textual string, it is really a binary string of varying 
encoding, depending on the application. But treating it as a latin-1 
string has the effect of preserving its content as a binary string.
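
A small sketch of that point, again assuming CPython 3 so that the two 
types are distinguishable. body_segment_to_wire is a hypothetical server 
helper: it recovers the bytes a text-typed segment stands for, and 
rejects any segment containing characters above 255.

```python
def body_segment_to_wire(segment):
    """Turn a text-typed body segment back into the bytes it stands for."""
    try:
        return segment.encode('latin-1')
    except UnicodeEncodeError:
        raise ValueError('body segment contains characters above 255')

# An application that produced utf-8 output, carried as a "binary string":
payload = 'caf\u00e9 \u20ac'.encode('utf-8')
carried = payload.decode('latin-1')
```

The application's real encoding (utf-8 here) never has to be known by 
the server, because the latin-1 step only preserves byte values.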

So again, I think that this meets with your requirement, except stated 
differently.

If WSGI response bodies "crossed over" somehow from a cpython 
application to a jython application, through either swig-style linkage 
or through some form of http relay protocol such as FastCGI, the jython 
receiving end of that would have to produce a response body encoded as a 
jython binary string. Which is exactly what jython socket operations, 
etc, produce. So pure python middleware code that distributes WSGI 
requests over, say a network socket, should run identically between 
jython and cpython. Which is nice to know.

And which would probably be true for IronPython too: That Jim Hugunin is a 
clever lad. Jython really does all this stuff pretty seamlessly in 
relation to cpython.

[Phillip J. Eby]
 >  * produce Unicode input headers and body strings by decoding from
 > latin-1, as long as the produced values are considered type 'str' for
 > that Python implementation.

On jython, there is no point in decoding latin-1 strings to unicode 
strings, because their representations are identical: both are 
types.StringType, both take 2 bytes per character/byte, with the upper 
byte as zero.

If the recipient is another jython component, all string types will be 
received correctly.

If the recipient is a cpython component, then it will still receive the 
correct string, because whatever interface lies between the cpython and 
the jython will have correctly converted the data (if it was latin-1 data).

So perhaps this requirement could be stated as "jython 
components/applications must produce unicode input headers and body 
strings, which must only contain latin-1 characters"?
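
For a server that does hold real bytes (CPython 3 is assumed here), the 
decoding step might look like the sketch below. environ_from_headers is 
a hypothetical name. The HTTP_ key naming follows the CGI convention 
that PEP 333 uses, and the special handling of Content-Type and 
Content-Length is deliberately left out.

```python
def environ_from_headers(raw_lines):
    """Build HTTP_* environ entries from raw b'Name: value' header lines."""
    environ = {}
    for line in raw_lines:
        # latin-1 decoding cannot fail and keeps every byte value intact.
        name, _, value = line.decode('latin-1').partition(':')
        key = 'HTTP_' + name.strip().upper().replace('-', '_')
        environ[key] = value.strip()
    return environ
```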

Whew! That turned out to be not so bad after all! (Alan crosses his 
fingers behind his back :-)

Regards,

Alan.