[Web-SIG] bytes, strings, and Unicode in Jython, IronPython, and CPython 3.0
Phillip J. Eby
pje at telecommunity.com
Tue Sep 14 18:59:30 CEST 2004
I've reviewed last month's Python-Dev discussion about the future Python
'bytes()' type, and the eventual transition away from Python's current
8-bit strings.
Mainly, the impression I get is that significant change in this respect
really can't happen until Python 3.0, because too many things have to
change at once for it to work.
So, here's what I propose to do about the open issue in PEP 333. Servers
and gateways that run under Python implementations where all strings are
Unicode (e.g. Jython) *may*:
* accept Unicode statuses and headers, so long as they properly encode
them for transmission (latin-1 + RFC 2047)
* accept Unicode for response body segments, so long as each segment can
be encoded as latin-1 (i.e. uses only characters 0-255)
* produce Unicode input headers and body strings by decoding from
latin-1, as long as the produced values are considered type 'str' for that
Python implementation.
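
To make the three rules above concrete, here is a rough sketch of how a server or gateway might apply them. It is only an illustration under stated assumptions: it uses Python 3, where 'str' is always Unicode, as a stand-in for an all-Unicode implementation such as Jython, and the helper names (encode_header_value, encode_body_segment, decode_input) are hypothetical, not part of PEP 333 or any server.

```python
from email.header import Header


def encode_header_value(value):
    """Encode a status or header string for transmission (hypothetical helper).

    Values that fit in latin-1 are sent as-is; anything else falls back to
    an RFC 2047 encoded-word, which is pure ASCII. Folding of very long
    encoded values is ignored in this sketch.
    """
    try:
        return value.encode("latin-1")
    except UnicodeEncodeError:
        # Assumption: UTF-8 is used as the RFC 2047 charset.
        return Header(value, "utf-8").encode().encode("ascii")


def encode_body_segment(segment):
    """Encode one response body segment (hypothetical helper).

    Only characters 0-255 are permitted, so a segment containing anything
    else raises UnicodeEncodeError instead of being silently altered.
    """
    return segment.encode("latin-1")


def decode_input(raw):
    """Turn raw request bytes into the implementation's 'str' type.

    latin-1 maps every byte 0-255 to the code point with the same number,
    so this decoding never fails and is fully reversible.
    """
    return raw.decode("latin-1")
```

Because latin-1 is a one-to-one mapping between bytes and the first 256 code points, decode_input followed by encode_body_segment returns the original bytes unchanged, which is what lets such a 'str' stand in for an 8-bit string.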
I think that these rules allow conformance with the "letter of the law" for
the rest of the WSGI spec, since servers, gateways, and applications are
still required to use 'str' instances in all of the above cases. The issue
here is that non-CPython implementations may be able to place arbitrary
Unicode characters in a 'str' instance, so the encoding rules need to be clear.
I think this is probably the right thing to do, leaving the adoption of any
"byte array" usage to Python 3.0 and WSGI 2.0 or 3.0 or whatever we're on
by then. But I am not a Unicode guru, and I'm definitely not familiar with
the details of non-CPython 'str' vs. Unicode issues. So, I hope that there
are some folks out there (Alan?) who can comment on this. Thanks.