[Web-SIG] bytes, strings, and Unicode in Jython, IronPython, and CPython 3.0

Tue Sep 14 18:59:30 CEST 2004

I've reviewed last month's Python-Dev discussion about the future Python 
'bytes()' type, and the eventual transition away from Python's current 
8-bit strings.

My main impression is that any significant change here really can't 
happen until Python 3.0, because too many things have to change at once 
for it to work.

So, here's what I propose to do about the open issue in PEP 333.  Servers 
and gateways that run under Python implementations where all strings are 
Unicode (e.g. Jython) *may*:

  * accept Unicode statuses and headers, so long as they properly encode 
    them for transmission (latin-1, with RFC 2047 encoding for anything 
    outside latin-1)

  * accept Unicode for response body segments, so long as each segment can 
    be encoded as latin-1 (i.e. only uses chars 0-255)

  * produce Unicode input headers and body strings by decoding from 
    latin-1, as long as the produced values are considered type 'str' for 
    that Python implementation.
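
To make the three rules concrete, here is a rough sketch of what a server on an all-Unicode implementation might do. It is only an illustration under some assumptions: the function names are made up for this example, it is written for an interpreter where 'str' is Unicode and a separate byte type exists for the wire, and the choice of UTF-8 inside the RFC 2047 encoded-word is mine, not something the proposal above specifies.

```python
from email.header import Header


def encode_header_value(value):
    """Encode a Unicode status or header value for transmission.

    Hypothetical helper: latin-1 where possible, otherwise an
    RFC 2047 encoded-word (UTF-8 charset is an assumption).
    """
    try:
        return value.encode("latin-1")
    except UnicodeEncodeError:
        # Header.encode() returns an ASCII-only encoded-word string.
        return Header(value, charset="utf-8").encode().encode("ascii")


def encode_body_segment(segment):
    """Encode a response body segment; only chars 0-255 are allowed."""
    try:
        return segment.encode("latin-1")
    except UnicodeEncodeError as exc:
        raise ValueError(
            "body segment contains characters outside 0-255"
        ) from exc


def decode_input(raw):
    """Turn raw input bytes into a 'str' by decoding from latin-1.

    Every byte value maps to exactly one char in 0-255, so this
    never fails and can be reversed with encode('latin-1').
    """
    return raw.decode("latin-1")
```

The point of using latin-1 in both directions is that it maps bytes 0-255 one-to-one onto the first 256 Unicode code points, so an application that only ever treats these strings as opaque 8-bit data gets back out exactly what went in.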

I think that these rules allow conformance with the "letter of the law" for 
the rest of the WSGI spec, since servers, gateways, and applications are 
still required to use 'str' instances in all of the above cases.  The issue 
here is that non-CPython implementations may be able to place arbitrary 
Unicode characters in a 'str' instance, so the encoding rules need to be clear.

I think this is probably the right thing to do, leaving the adoption of any 
"byte array" usage to Python 3.0 and WSGI 2.0 or 3.0 or whatever we're on 
by then.  But I am not a Unicode guru, and I'm definitely not familiar with 
the details of non-CPython 'str' vs. Unicode issues.  So, I hope that there 
are some folks out there (Alan?) who can comment on this.  Thanks.