[Web-SIG] URL quoting in WSGI (or the lack thereof)

Sun Jan 27 13:56:29 CET 2008

Hello, it's me again,

Phillip J. Eby wrote:
> MoinMoin, for example, has its own encoding scheme for handling 
> pseudo-slashes in paths, and IMO it's a better way to handle it than  
> trying to rely on finding a server that supports *not* decoding URLs.

I had the abstract knowledge that CGI is still used for deployment, but 
growing up with application servers must have spoiled me. Still, I think 
nothing stops mod_wsgi passing an encoded URL down to my apps but for 
adherence to the CGI spec. I've never checked it, nor the ajp + flup 
combination. Something more for the todo pile.

In the short run I'll $2F my slashes. I can't actually use %252F, 
because everyone seems to think they'll either get an encoded URL to 
unquote() or that unquote(unquote()) is a no-op: Routes was not alone in 
this.
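
To make the double-unquote problem concrete, here is a minimal sketch. It 
assumes Python 3's urllib.parse (not the urllib of the time), and the 
sample name "a/b" is made up for illustration:

```python
from urllib.parse import quote, unquote

name = "a/b"                     # hypothetical page name with a literal slash

once = quote(name, safe="")      # 'a%2Fb'   (the slash is escaped)
twice = quote(once, safe="")     # 'a%252Fb' (the '%' itself is escaped)

# One unquote() peels off exactly one layer of escaping ...
after_one = unquote(twice)       # 'a%2Fb'
# ... and a second unquote() peels off another, so it is not a no-op.
after_two = unquote(after_one)   # 'a/b'

print(once, twice, after_one, after_two)
```

So code that unquotes an already-unquoted path turns my %252F back into 
a plain slash.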

Blake Winton wrote:
> I respectfully disagree.  I've been using %-escapes in urls for years, 
> intending that they get unescaped before being passed to 
> applications...  %7E instead of ~ mainly.
>
> in XML you can't tell the difference between <![CDATA[<]]> and &lt; 
> and &#60

You've given an example of separate ways to escape the same '<' 
character, and I agree that you shouldn't have to distinguish between 
them. But XML does treat '<' differently from '&lt;': if you just want 
to write a '<' instead of starting a tag, you need to escape it.

I don't want my SAX code[*] to deal with all the different ways to write 
a literal '<'. But I expect a "<tag" to generate a start_tag event, and 
"&lt;43" to be decoded into '<43' in some element's text property, *not* 
to generate a start_43 event.
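
A rough sketch of that expectation, using Python 3's xml.sax. The 
Recorder class and the one-element sample document are invented for 
illustration, not taken from any real code:

```python
import xml.sax


class Recorder(xml.sax.ContentHandler):
    """Hypothetical handler that just records what the parser reports."""

    def __init__(self):
        super().__init__()
        self.started = []   # names of elements that were opened
        self.chunks = []    # character data, possibly delivered in pieces

    def startElement(self, name, attrs):
        self.started.append(name)

    def characters(self, content):
        self.chunks.append(content)


handler = Recorder()
xml.sax.parseString(b"<doc>&lt;43</doc>", handler)

started = handler.started        # only the real tag shows up here
text = "".join(handler.chunks)   # the escaped '<' arrives as plain text

print(started, repr(text))
```

The escaped '<' ends up in the text, and only the real "<doc" produces a 
start event.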

I think the same reasoning applies to '/'. Would it apply to '~' and ';' 
too?

[*] I've never actually written SAX-structured code; please pardon any 
mistaeks.

> in urls I would expect the url parser to unescape things, and pass you 
> the unescaped data.

Yeah, me too. I just don't want to lose information: "this was a literal 
slash, not a hierarchy delimiter". But if the framework splits on the 
slash, not an hierarchy delimiter". But if the framework splits on the 
real slashes and *then* unquotes each segment, I'd be happy to get that 
list of unquoted segments. This way, my URLs use the obvious way to 
escape slashes and by the time it gets to my code I have unescaped data.
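
As a sketch of what I mean (Python 3's urllib.parse assumed; the function 
name path_segments and the sample path are hypothetical):

```python
from urllib.parse import unquote


def path_segments(raw_path):
    """Split a still-encoded path on real slashes, then unquote each part."""
    return [unquote(segment) for segment in raw_path.strip("/").split("/")]


raw = "/wiki/AC%2FDC/edit"   # made-up path; %2F is a literal slash in a name

# Split first, unquote second: the literal slash survives inside its segment.
split_then_unquote = path_segments(raw)

# Unquote first, split second: the literal slash becomes a delimiter.
unquote_then_split = unquote(raw).strip("/").split("/")

print(split_then_unquote)
print(unquote_then_split)
```

The first list keeps 'AC/DC' as one segment; the second cannot tell it 
apart from two path levels.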

This could be "dealt with" by using a REQUEST_URI instead. But then I 
have to manually trim the components that URL dispatching moved into 
SCRIPT_NAME. And I don't actually *have* a REQUEST_URI in the environ.
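
For the record, the manual trimming could look roughly like this. It is 
only a sketch under loud assumptions: REQUEST_URI is not a WSGI-mandated 
key, so the server may not provide it; the helper name raw_path_info is 
made up; and it assumes re-quoting SCRIPT_NAME reproduces the prefix 
exactly as the client sent it, which need not hold:

```python
from urllib.parse import quote


def raw_path_info(environ):
    """Return the still-encoded path after SCRIPT_NAME, or None if unknown."""
    uri = environ.get("REQUEST_URI")       # non-standard key, may be absent
    if uri is None:
        return None
    raw_path = uri.split("?", 1)[0]        # drop the query string
    # Assumption: quoting the decoded SCRIPT_NAME gives back the sent prefix.
    prefix = quote(environ.get("SCRIPT_NAME", ""))
    if not raw_path.startswith(prefix):
        return None                        # cannot line the two up; give up
    return raw_path[len(prefix):]


# Made-up environ for illustration.
environ = {"REQUEST_URI": "/app/wiki/AC%2FDC?action=edit", "SCRIPT_NAME": "/app"}
trimmed = raw_path_info(environ)
missing = raw_path_info({"SCRIPT_NAME": "/app"})

print(trimmed, missing)
```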

Ian Bicking wrote:
> distinguishing %2f and / is more of a corner case

I'll call it a canary in the URL mine. Should you have to balance '{' 
and '}' to find the quoted namespaces for GData terms? I haven't touched 
GData, but .split('/') and *then* unquoting looks like exactly what's 
needed in that case.

Thank you,
-- 
Luis Bruno