<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7654.12">
<TITLE>Re: [Web-SIG] Python 3.0 and WSGI 1.0.</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/plain format -->
<P><FONT SIZE=2>P.J. Eby wrote:<BR>
> At 08:07 AM 5/8/2009 -0700, Robert Brewer wrote:<BR>
>> I decided that that single type should be byte strings because I want<BR>
>> WSGI middleware and applications to be able to choose what encoding<BR>
>> their output is. Passing unicode to the server would require some<BR>
>> out-of-band method of telling the server which encoding to use per<BR>
>> response, which seemed unacceptable.<BR>
><BR>
> I find the above baffling, since PEP 333 explicitly states that<BR>
> when using unicode types, they're not actually supposed to *be*<BR>
> unicode -- they're just bytes decoded with latin-1.<BR>
<BR>
It also explicitly states that "HTTP does not directly support Unicode,<BR>
and neither does this interface. All encoding/decoding must be handled<BR>
by the application; all strings passed to or from the server must be<BR>
standard Python BYTE STRINGS (emphasis mine), not Unicode objects. The<BR>
result of using a Unicode object where a string object is required, is<BR>
undefined."<BR>
<BR>
PEP 333 is difficult to interpret because it uses the name "str"<BR>
synonymously with the concept "byte string", which Python 3000 defies. I<BR>
believe the intent was to differentiate unicode from bytes, not elevate<BR>
whatever type happens to be called "str" on your Python du jour. It was<BR>
and is a mistake to standardize on type names ("str") across platforms<BR>
and not on type behavior ("byte string").<BR>
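<BR>
(For concreteness, a small illustration of that behavior split -- the assertions below are mine, not from the PEP:)<BR>
<BR>
import sys<BR>
<BR>
if sys.version_info[0] >= 3:<BR>
&nbsp;&nbsp;&nbsp;&nbsp;# Python 3: "str" is unicode text; the byte string type is "bytes".<BR>
&nbsp;&nbsp;&nbsp;&nbsp;assert isinstance('x', str) and not isinstance('x', bytes)<BR>
else:<BR>
&nbsp;&nbsp;&nbsp;&nbsp;# Python 2 (2.6): "str" IS the byte string type; "bytes" is just an alias for it.<BR>
&nbsp;&nbsp;&nbsp;&nbsp;assert isinstance('x', str) and isinstance('x', bytes)<BR>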
<BR>
If Python3 WSGI apps emit unicode strings (py3k type 'str'), you're<BR>
effectively saying the server will always call<BR>
"chunk.encode('latin-1')". That negates any benefit of using unicode as<BR>
the type for the response. That's not "supporting unicode"; that's using<BR>
unicode exactly as if it were an opaque byte string. That seems silly<BR>
to me when there is a perfectly useful byte string type.<BR>
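<BR>
To spell that round trip out (a Python 3 sketch; the literal is arbitrary): any byte string survives a latin-1 decode/encode cycle unchanged, which is exactly what makes a unicode string an opaque carrier for it:<BR>
<BR>
data = 'caf\u00e9'.encode('utf-8') # the app's "real" encoding is utf-8<BR>
smuggled = data.decode('latin-1') # the same bytes dressed up as a str<BR>
assert smuggled.encode('latin-1') == data # the server recovers identical bytes<BR>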
<BR>
> So, the server doesn't need to know "what encoding to use" -- it's<BR>
> latin-1, plain and simple. (And it's an error for an application to<BR>
> produce a unicode string that can't be encoded as latin-1.)<BR>
><BR>
> To be even more specific: an application that produces strings can<BR>
> "choose what encoding to use" by encoding in it, then decoding those<BR>
> bytes via latin-1. (This is more or less what Jython and IronPython<BR>
> users are doing already, I believe.)<BR>
<BR>
That may make sense for Jython and IronPython if they truly do not have<BR>
a usable byte string type. But it doesn't make as much sense for Python3,<BR>
which does have one. My way:<BR>
<BR>
App:<BR>
&nbsp;&nbsp;&nbsp;&nbsp;bchunk = uchunk.encode('utf-8')<BR>
&nbsp;&nbsp;&nbsp;&nbsp;yield bchunk<BR>
Server:<BR>
&nbsp;&nbsp;&nbsp;&nbsp;write(bchunk)<BR>
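<BR>
Fleshed out as a runnable Python 3 sketch (the helper names app_body and serve are mine, purely for illustration):<BR>
<BR>
# App side: pick an encoding and hand the server finished bytes.<BR>
def app_body():<BR>
&nbsp;&nbsp;&nbsp;&nbsp;uchunk = 'caf\u00e9'<BR>
&nbsp;&nbsp;&nbsp;&nbsp;yield uchunk.encode('utf-8') # the app owns the charset decision<BR>
<BR>
# Server side: pass the app's bytes through untouched.<BR>
def serve(body, write):<BR>
&nbsp;&nbsp;&nbsp;&nbsp;for bchunk in body:<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;write(bchunk)<BR>
<BR>
out = bytearray()<BR>
serve(app_body(), out.extend)<BR>
assert bytes(out) == 'caf\u00e9'.encode('utf-8')<BR>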
<BR>
Your way:<BR>
<BR>
App:<BR>
&nbsp;&nbsp;&nbsp;&nbsp;bchunk = uchunk.encode('utf-8')<BR>
&nbsp;&nbsp;&nbsp;&nbsp;uchunk = bchunk.decode('latin-1')<BR>
&nbsp;&nbsp;&nbsp;&nbsp;yield uchunk<BR>
Server:<BR>
&nbsp;&nbsp;&nbsp;&nbsp;bchunk = uchunk.encode('latin-1')<BR>
&nbsp;&nbsp;&nbsp;&nbsp;write(bchunk)<BR>
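<BR>
The same thing under the latin-1 convention (again a sketch with made-up helper names; it produces identical bytes, just with two extra passes over every chunk):<BR>
<BR>
# App side: encode to the real charset, then disguise the bytes as a str.<BR>
def app_body():<BR>
&nbsp;&nbsp;&nbsp;&nbsp;uchunk = 'caf\u00e9'<BR>
&nbsp;&nbsp;&nbsp;&nbsp;bchunk = uchunk.encode('utf-8')<BR>
&nbsp;&nbsp;&nbsp;&nbsp;yield bchunk.decode('latin-1') # bytes smuggled through a str<BR>
<BR>
# Server side: undo the disguise before writing.<BR>
def serve(body, write):<BR>
&nbsp;&nbsp;&nbsp;&nbsp;for uchunk in body:<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;write(uchunk.encode('latin-1'))<BR>
<BR>
out = bytearray()<BR>
serve(app_body(), out.extend)<BR>
assert bytes(out) == 'caf\u00e9'.encode('utf-8') # same bytes, more steps<BR>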
<BR>
I don't see any benefit to that.<BR>
<BR>
<BR>
Robert Brewer<BR>
fumanchu@aminus.org</FONT>
</P>
</BODY>
</HTML>