[Web-SIG] WSGI, Python 3 and Unicode

Mon Dec 10 19:31:20 CET 2007

On Dec 9, 2007 7:56 PM, Graham Dumpleton <graham.dumpleton at gmail.com> wrote:
> On 09/12/2007, Guido van Rossum <guido at python.org> wrote:
> > On Dec 8, 2007 12:37 AM, Graham Dumpleton <graham.dumpleton at gmail.com> wrote:
> > > On 08/12/2007, Phillip J. Eby <pje at telecommunity.com> wrote:
> > > > * When running under Python 3, servers MUST provide a text stream for
> > > > wsgi.errors
> > >
> > > In Python 3, what happens if user code attempts to write a byte
> > > string to a text stream? I.e., what would be displayed?
> >
> > Nothing. You get a TypeError.
>
> Hmmm, this in itself could be quite a pain for existing code where
> people have added debug code to print out details from request headers
> (if now to be passed as bytes), or part of the request content.

Sorry, I was just talking about the write() method on a text stream.
The print() function in 3.0 will print the repr() of the bytes.
Example:

Python 3.0a2 (py3k, Dec 10 2007, 09:38:42)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = b"xyz"
>>> print(a)
b'xyz'
>>> b = b"abc\377def"
>>> print(b)
b'abc\xffdef'
>>>

(Note that this works because print() always calls str() on the
argument and bytes.__str__ is defined to be the same as bytes.__repr__.)
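
A minimal sketch of the difference between write() and print(), assuming
an io.StringIO object stands in for a text stream such as wsgi.errors
(the names stream, write_failed and printed are only illustrative):

```python
import io

# Hypothetical stand-in for a text stream such as wsgi.errors.
stream = io.StringIO()

# write() on a text stream accepts only str, so bytes raise TypeError.
try:
    stream.write(b"xyz")
    write_failed = False
except TypeError:
    write_failed = True

# print() calls str() on its argument, so bytes come out as their repr().
print(b"abc\377def", file=stream)
printed = stream.getvalue()
```

Here write_failed ends up True, and printed holds the repr of the bytes
followed by a newline.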

> What is the suggested way of best dumping out bytes for debugging
> purposes so one does not have to worry about encoding issues, just use
> repr()?

Just use print().

> > > Also, if wsgi.errors is a text stream, presume that if a WSGI adapter
> > > has to internally map this to a C char* like API for logging that it
> > > would need to apply standard Python encoding to yield usable char*
> > > string for output.
> >
> > The encoding can/must be specified per text stream.
>
> But what should the encoding associated with the wsgi.errors stream be?

Depends on the platform and your requirements.
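
As a rough sketch of what "per text stream" means in practice, assuming
io.TextIOWrapper over an in-memory io.BytesIO plays the role of the
stream (encode_via_stream is a hypothetical helper, not part of any WSGI
API):

```python
import io

def encode_via_stream(text, encoding, errors="strict"):
    """Write text through a text stream and return the bytes that come out."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

message = "caf\u00e9"  # contains one non-ASCII character

# UTF-8 can represent any Unicode character.
utf8_bytes = encode_via_stream(message, "utf-8")

# A strict US-ASCII stream fails as soon as a non-ASCII character appears.
try:
    encode_via_stream(message, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# An error handler is one possible way to keep such a stream from failing.
escaped = encode_via_stream(message, "ascii", errors="backslashreplace")
```

Whether strict failure, an error handler, or a wider encoding is the
right choice still depends on the platform and on what consumes the
bytes.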

> If code which outputs text to wsgi.errors can use any valid Unicode
> character, and one sets it to US-ASCII encoding, then there is a chance
> that logging output will fail because of characters not being valid in
> that character set. If one instead uses UTF-8, then one potentially has
> issues where the byte string coming out the other end of the text stream
> is passed to C API functions. Issues might arise here where the C API is
> not expecting a variable-width character encoding.
>
> I'll freely admit I am not across all this Unicode encode/decode stuff
> as I don't generally have to deal with foreign languages, but seems to
> be a few missing details in this area which need to be filled out for
> a modified WSGI specification.

The goal of this part of Py3k is to make it more obvious when you
haven't thought through your encoding issues enough by failing as soon
as (encoded) bytes meet (decoded) characters.

Of course, you can still run into delayed trouble by using an
inappropriate encoding, which only shows up when there is an actual
encoding or decoding error; but at least you will have carefully
distinguished between encoded and decoded text throughout your
program, so the fix is now to change the encoding rather than having
to restructure your code to properly separate encoded and decoded
text.
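
Both failure modes can be sketched in a few lines, assuming Latin-1 data
that is mistakenly treated as UTF-8 (the variable names are only
illustrative):

```python
encoded = b"abc"
decoded = "def"

# Early failure: encoded bytes meeting decoded characters is a TypeError.
try:
    encoded + decoded
    mixed_failed = False
except TypeError:
    mixed_failed = True

# Delayed trouble: the wrong encoding works until non-ASCII data shows up.
ascii_only = b"cafe".decode("utf-8")      # fine, by luck
data = "caf\u00e9".encode("latin-1")      # b'caf\xe9'
try:
    data.decode("utf-8")
    wrong_encoding_failed = False
except UnicodeDecodeError:
    wrong_encoding_failed = True

# The fix is to change the encoding, not to restructure the code.
text = data.decode("latin-1")
```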

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)