[Python-Dev] [Web-SIG] bytes / unicode

Henry Precheur henry at precheur.org
Wed Jun 23 23:35:38 CEST 2010

On Wed, Jun 23, 2010 at 09:36:45PM +0200, Antoine Pitrou wrote:
> I don't think you can claim, though, that Python 3 makes things
> significantly harder for these frameworks. The proof is that many of
> them already give the user unicode strings in Python 2.x. They must
> have somehow got the decoding right.

Well... Frameworks usually 'simplify' the problem by partly ignoring it.
By default they assume the data in the request is UTF-8. You can specify
an alternative encoding in most of them. Django [1], Werkzeug [2], and
WebOb [3] do that.
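
As a rough illustration of that approach (this is not the actual API of
any of the three frameworks), a request wrapper might decode the raw
query string with a charset that defaults to UTF-8 and can be
overridden. The class name SketchRequest and its attributes are
hypothetical:

```python
from urllib.parse import parse_qsl


class SketchRequest:
    """Hypothetical wrapper; not the API of Django, Werkzeug, or WebOb."""

    def __init__(self, raw_query: bytes, charset: str = "utf-8"):
        self.raw_query = raw_query
        self.charset = charset  # assumed default, overridable per request

    @property
    def params(self) -> dict:
        # A query string is ASCII on the wire; the percent-escapes carry
        # the encoded bytes, which are decoded with the chosen charset.
        # Undecodable bytes become U+FFFD instead of raising.
        text = self.raw_query.decode("ascii")
        return dict(parse_qsl(text, encoding=self.charset, errors="replace"))


utf8_request = SketchRequest(b"name=caf%C3%A9")
latin1_request = SketchRequest(b"name=caf%E9", charset="latin-1")
```

Both requests yield the same name once the right charset is chosen. If
the second one were decoded with the UTF-8 default, the last character
would come out as a replacement character.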

The problem with this approach is that you still have to deal with weird
requests where one part is UTF-8 and another is latin-1. Sometimes
you can even have two different encodings in a single header, such as
Cookie. There's no general solution to this problem; it has to be
solved on a case-by-case basis.
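
One common heuristic for such a case, sketched here under the assumption
that only UTF-8 and latin-1 are in play, is to try UTF-8 first and fall
back to latin-1 for each piece separately. The function name and the
sample cookie are made up for illustration:

```python
def decode_header_value(raw: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Try each candidate encoding in order (a heuristic, not a standard)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin-1")


# Hypothetical Cookie header with two encodings: the first pair is
# UTF-8, the second is latin-1.
raw_cookie = b"name=caf\xc3\xa9; city=Z\xfcrich"
pairs = [decode_header_value(part) for part in raw_cookie.split(b"; ")]
```

This is only a guess: latin-1 text whose bytes happen to form valid
UTF-8 will be decoded as UTF-8, which is why it really has to be decided
case by case.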

There was a big discussion a while ago on web-sig. I think the consensus
was that WSGI for Python 3 should assume that the data is encoded in
latin-1 since it's the default encoding according to the RFC.
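
If the server does decode everything as latin-1, the application can
still get the original bytes back, because latin-1 maps each byte to
exactly one code point. A minimal sketch, with a hand-built environ
dict and a hypothetical helper name:

```python
def recover_text(wsgi_value: str, charset: str = "utf-8") -> str:
    """Undo an assumed latin-1 decode, then decode with the real charset."""
    raw = wsgi_value.encode("latin-1")  # lossless: restores the wire bytes
    return raw.decode(charset, errors="replace")


# Simulate a server that decoded the UTF-8 path bytes as latin-1.
environ = {"PATH_INFO": b"/caf\xc3\xa9".decode("latin-1")}
path = recover_text(environ["PATH_INFO"])
```

The value in environ looks garbled, but no information is lost, so the
application stays free to pick the charset it believes is correct.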

[1] http://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpRequest.encoding
[2] http://werkzeug.pocoo.org/documentation/dev/unicode.html#request-and-response-objects
[3] http://pythonpaste.org/webob/reference.html#unicode-variables

  Henry Prêcheur

