[Web-SIG] Request for Comments on upcoming WSGI Changes
graham.dumpleton at gmail.com
Tue Sep 22 06:30:06 CEST 2009
2009/9/22 Henry Precheur <henry at precheur.org>:
> On Mon, Sep 21, 2009 at 07:40:54PM -0700, Robert Brewer wrote:
>> The decoding doesn't change spontaneously.
>> You either get the correct one or you get an incorrect one. If it's
>> incorrect, you fix it, one time, via a WSGI component which you've
>> configured to determine the "correct" decoding. Then every other WSGI
>> component "below" that one can go back to trusting the decoding was
>> correct. In fact, if you do that transcoding right away, no other WSGI
>> components need to be rewritten to take advantage of unicode. You just
>> have to deploy a single transcoder, that's 6 lines of code max.
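A minimal sketch of the kind of single transcoder described above, assuming the gateway records the encoding it used in environ['wsgi.uri_encoding'] as under the proposal being discussed. The names make_transcoder and correct_encoding are hypothetical, and the fallback to latin-1 when the key is absent is an assumption, not part of any spec.

```python
def make_transcoder(app, correct_encoding):
    """Hypothetical middleware: re-decode the URI variables once, up front."""
    def transcoder(environ, start_response):
        # Assumption: the gateway stored the encoding it decoded with here.
        gateway_encoding = environ.get('wsgi.uri_encoding', 'latin-1')
        if gateway_encoding != correct_encoding:
            for key in ('SCRIPT_NAME', 'PATH_INFO'):
                # Undo the gateway's decoding, then decode with the right one.
                # Raises UnicodeDecodeError if correct_encoding does not fit.
                raw = environ.get(key, '').encode(gateway_encoding)
                environ[key] = raw.decode(correct_encoding)
            # Record the new encoding so components below can trust it.
            environ['wsgi.uri_encoding'] = correct_encoding
        return app(environ, start_response)
    return transcoder
```

Wrapped around an application as make_transcoder(app, 'utf-8'), everything below it would see the re-decoded values together with the updated encoding name.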
> And you can do that with utf8+surrogateescape too. Except that you don't
> have to determine what encoding the gateway sent you, it's always
>> With utf8+surrogateescape, you don't transcode once, you transcode in
>> every WSGI component in your stack that needs to "correct" the
>> decoding. You have to do it more than once because, each time you
>> encode/re-decode, you use the result and then throw it away. Any
>> subsequent WSGI components have to encode/re-decode--you cannot store
>> the redecoded URI in SCRIPT_NAME/PATH_INFO, because the
>> utf8+surrogateescape scheme says...well, it's always utf8-decoded.
> You don't get something REALLY important with surrogateescape: You can
> ALWAYS get the original bytes back.
> >>> b = b'fran\xe7cois'
> >>> s = b.decode('utf8', 'surrogateescape')
> >>> s
> 'fran\udce7cois'
> >>> s.encode('utf8', 'surrogateescape')
> b'fran\xe7cois'
Hooray, an example finally which shows what the data looks like. If one reads:
there is no actual example in it which shows what is actually in the
unicode string. So unless you go play with the code it is hard to
understand what is actually happening.
Yeah, yeah, I may be slow to get things but I don't have the time to
go playing with every suggestion. ;-)
Note, still not saying whether surrogateescape is good or not, but
this is helping me to understand.
Someone did say something about being able to half make it work on
Python 2.X. Can someone provide proper example code for Python 2.X?
If we want uniformity in how the interface works on Python 2.X and 3.X,
then we have to be able to use the same method without tricks. This is why
wsgi.uri_encoding at the moment seems better, as it is not reliant on a
feature only in Python 3.1+.
> See? I got my latin-1 character '\xe7' back! Because '\udce7' is not a
> normal UTF-8 character, this character uses some 'free space' in the
> unicode supplementary characters.
> The only thing you have to do is to pass 'surrogateescape' each time you
> call encode/decode.
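To make the "pass 'surrogateescape' each time" point concrete, here is a small sketch for Python 3.1+ using the same bytes as the quoted example. It shows that the undecodable byte 0xE7 becomes the lone surrogate code point U+DCE7, that a plain strict encode refuses that code point, and that the original bytes come back only when the handler is passed again. The variable names are illustrative.

```python
raw = b'fran\xe7cois'  # 0xE7 followed by 'c' is not valid UTF-8

# The bad byte is smuggled through as the lone surrogate U+DCE7.
text = raw.decode('utf8', 'surrogateescape')
print(ascii(text))  # 'fran\udce7cois'

# Forgetting the handler on the way out fails: lone surrogates
# cannot be encoded as strict UTF-8.
try:
    text.encode('utf8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With the handler, the original bytes are restored exactly.
restored = text.encode('utf8', 'surrogateescape')
print(strict_ok, restored == raw)  # False True
```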
>> In addition, *every* component that needs to compare URI's then has to
>> be configured with the same logic, however convoluted, to perform the
>> "correct" decoding again. It's not just routing middleware: caches
>> need to reliably compare decoded URI's; so do sessions; so does auth
>> (especially!); so do static files. And Heaven forfend you actually
>> decode differently in two different components!
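As an illustration of the comparison concern, the following sketch decodes the same path bytes in two different ways and shows that the resulting strings do not compare equal, while the underlying bytes still do. The path and the helper name uri_bytes are made up for the example.

```python
raw = b'/caf\xe9/menu'  # hypothetical request path, Latin-1 encoded

# Two components that decode the same bytes differently.
as_latin1 = raw.decode('latin-1')                    # '/caf\xe9/menu'
as_escaped = raw.decode('utf8', 'surrogateescape')   # '/caf\udce9/menu'

# A cache or auth check comparing these strings sees two different URIs.
strings_match = (as_latin1 == as_escaped)

def uri_bytes(value, encoding, errors='strict'):
    """Return the bytes a decoded URI string was originally built from."""
    return value.encode(encoding, errors)

# Comparing at the byte level only works if each side knows how it was decoded.
bytes_match = (uri_bytes(as_latin1, 'latin-1')
               == uri_bytes(as_escaped, 'utf8', 'surrogateescape'))
print(strings_match, bytes_match)  # False True
```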
> I don't understand why I would need to throw away the decoded string.
> This works perfectly well as far as I know:
> environ['PATH_INFO'] = environ['PATH_INFO'].\
> encode('utf8', 'surrogateescape').\
> utf8+surrogateescape provides the same possibilities as
> wsgi.uri_encoding. You can transcode without losing information when you
> know what the correct encoding is. But utf8+surrogateescape is simpler
> because there's no need to pass around the name of the encoding in an
> additional variable.
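For reference, a sketch of what an in-place re-decode along those lines might look like on Python 3.1+. The quoted snippet is cut off before its final step, so the closing decode() call and the choice of latin-1 below are assumptions made for illustration, as is the helper name redecode_path.

```python
def redecode_path(environ, correct_encoding):
    """Hypothetical helper: replace PATH_INFO with a re-decoded value."""
    # Recover the original bytes from the utf8+surrogateescape string.
    original = environ['PATH_INFO'].encode('utf8', 'surrogateescape')
    # Assumption: the caller knows the real encoding of the request path.
    environ['PATH_INFO'] = original.decode(correct_encoding)
    return original

environ = {'PATH_INFO': b'/caf\xe9/menu'.decode('utf8', 'surrogateescape')}
original = redecode_path(environ, 'latin-1')

# The stored value is now '/caf\xe9/menu', a normal string. Encoding it
# again with utf8+surrogateescape gives UTF-8 bytes for the accented
# letter, which differ from the bytes the client originally sent.
reencoded = environ['PATH_INFO'].encode('utf8', 'surrogateescape')
print(reencoded == original)  # False
```

Whether that last difference matters is the point under debate: a component further down that assumes the value is still utf8+surrogateescape would recover different bytes.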
> Henry Prêcheur