[Web-SIG] Request for Comments on upcoming WSGI Changes

Tue Sep 22 04:40:54 CEST 2009

Henry Precheur wrote:
> On Mon, Sep 21, 2009 at 03:26:35PM -0700, Robert Brewer wrote:
> > It looks simpler until you have a site that is not primarily utf-8.
> > In that case, you multiply your (1 line * number of middlewares in the
> > WSGI stack * each request).
> > With wsgi.uri_encoding you get either (1 line * 1
> > middleware designed to transcode * each request), or even 0 if your
> > whole site uses just one charset.
> 
> I am not sure I understand your point.
> 
> The 0 lines hold true if the whole site is using latin-1 or utf-8 and
> you write your applications/middlewares only for this site. But if it's
> using any other encoding you still have to transcode.
> 
> def middleware(environ, start_response):
>     value = environ['some_key'].\
>         encode('utf8', 'surrogateescape').\
>         decode(SITE_ENCODING)
>     ...

Yes; you have to transcode to the "correct" encoding. Once. Then every
other WSGI application interface "below" that one doesn't have to care.

> With wsgi.uri_encoding you would still have to do the following:
> 
> def middleware(environ, start_response):
>     value = environ['some_key'].\
>         encode(environ['some_key.encoding']).\
>         decode(SITE_ENCODING)
>     ...
> 
> Of course you can directly use `environ['some_key']` if you know you'll
> get the 'right' encoding all the time. But when the encoding changes,
> you'll have to fix all your middlewares.

The decoding doesn't change spontaneously. You either get the correct
one or you get an incorrect one. If it's incorrect, you fix it, one
time, via a WSGI component which you've configured to determine the
"correct" decoding. Then every other WSGI component "below" that one can
go back to trusting the decoding was correct. In fact, if you do that
transcoding right away, no other WSGI components need to be rewritten to
take advantage of unicode. You just have to deploy a single transcoder;
that's 6 lines of code max. I know PJE will chime in here and say you
can't deploy a website that works differently if you happen to forget to
turn on a given piece of middleware, but I also know the rest of you
will drown him out from personal experience because you've *never* done
that. ;)
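
A rough sketch of what such a single transcoder might look like, assuming
the server publishes the codec it used under the proposed
`wsgi.uri_encoding` key and that latin-1 is a safe guess when the key is
absent (both are assumptions here). `SITE_ENCODING`, `transcode_uri`, and
the choice of koi8-r are illustrative names, not part of any spec:

```python
SITE_ENCODING = "koi8-r"  # hypothetical site charset, for illustration only


def transcode_uri(app, site_encoding=SITE_ENCODING):
    """Wrap `app` so SCRIPT_NAME/PATH_INFO are re-decoded once, up front."""

    def wrapper(environ, start_response):
        # Assumption: the server records the codec it used in this key.
        current = environ.get("wsgi.uri_encoding", "latin-1")
        if current != site_encoding:
            for key in ("SCRIPT_NAME", "PATH_INFO"):
                raw = environ.get(key, "").encode(current)
                environ[key] = raw.decode(site_encoding)
            # Record the new codec so components below can trust the values.
            environ["wsgi.uri_encoding"] = site_encoding
        return app(environ, start_response)

    return wrapper
```

Everything wrapped by `transcode_uri(app)` then sees already-corrected
strings and needs no decoding logic of its own.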

With utf8+surrogateescape, you don't transcode once, you transcode in
every WSGI component in your stack that needs to "correct" the decoding.
You have to do it more than once because, each time you
encode/re-decode, you use the result and then throw it away. Any
subsequent WSGI components have to encode/re-decode--you cannot store
the redecoded URI in SCRIPT_NAME/PATH_INFO, because the
utf8+surrogateescape scheme says...well, it's always utf8-decoded. In
addition, *every* component that needs to compare URIs then has to be
configured with the same logic, however convoluted, to perform the
"correct" decoding again. It's not just routing middleware: caches need
to reliably compare decoded URIs; so do sessions; so does auth
(especially!); so do static files. And Heaven forfend you actually
decode differently in two different components!
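
To illustrate, here is a sketch of what each component ends up doing
under utf8+surrogateescape. The helper name `redecode` and the two
example codecs are made up for illustration; the point is only that the
environ value stays utf8-decoded, so every component repeats the step,
and two components configured differently get different strings for the
same request:

```python
def redecode(value, site_encoding):
    """Recover the original bytes, then decode them with the site's codec."""
    return value.encode("utf8", "surrogateescape").decode(site_encoding)


# Bytes as sent by a client of a (hypothetical) koi8-r site.
raw = "/привет".encode("koi8-r")

# What the server would store under this scheme: always utf8-decoded,
# with undecodable bytes carried through as lone surrogates.
path_info = raw.decode("utf8", "surrogateescape")

# Each component has to redo the work on the same unchanged environ value.
cache_key = redecode(path_info, "koi8-r")
auth_path = redecode(path_info, "koi8-r")

# A component configured with a different codec silently disagrees.
static_path = redecode(path_info, "cp1251")
```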

Robert Brewer
fumanchu at aminus.org