[Web-SIG] WSGI 2

Wed Aug 12 07:05:40 CEST 2009

On Tue, Aug 11, 2009 at 11:58 PM, Graham Dumpleton <graham.dumpleton at gmail.com> wrote:

> 2009/8/12 Ian Bicking <ianb at colorstudy.com>:
> > On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer <fumanchu at aminus.org> wrote:
> >>
> >> > 5. When running under Python 3, servers MUST provide CGI HTTP and
> >> > server variables as strings. Where such values are sourced from a byte
> >> > string, be that a Python byte string or C string, they should be
> >> > converted as 'UTF-8'. If a specific web server infrastructure is able
> >> > to support different encodings, then the WSGI adapter MAY provide a
> >> > way for a user of the WSGI adapter to customise on a global basis, or
> >> > on a per value basis what encoding is used, but this is entirely
> >> > optional. Note that there is no requirement to deal with RFC 2047.
> >>
> >> We're passing unicode for almost everything.
> >>
> >> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
> >> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
> >> ACTUAL_SERVER_PROTOCOL entries.
> >>
> >> The original bytes of the Request-URI are stored in REQUEST_URI. However,
> >> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
> >> configurable charset, defaulting to UTF-8. If the path cannot be decoded
> >> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
> >> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
> >> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
> >> it, we would make it decoded by the same charset.
> >
> > My understanding is that PATH_INFO *should* be UTF-8 regardless of what
> > encoding a page might be in. At least that's what I got when testing
> > Firefox. It might not be valid UTF-8 if it was manually constructed, but
> > then there's little reason to think it is valid anything; only the bytes or
> > REQUEST_URI are likely to be an accurate representation.
>
> As I understood it, PJE was suggesting that wasn't the case.
>
> For example, what about the case where a URL appears as the target of a
> form POST and the encoding of that form page wasn't UTF-8? What is the
> browser going to send in that case?
>
> Or is this the sort of case you have tested, and one you would file under
> "if manually constructed, anything could happen"?
>

Correct -- you can write any set of % encodings, and I don't think the path
even has to url-decode validly (e.g., /foo%zzz will work). It definitely
doesn't have to be valid in any particular encoding. However, if you actually
include non-ASCII (Unicode) characters in a link, the browser will always
encode them as UTF-8 (consistent with the IRI standard). For example, given
<a href="/some page">, the browser will request /some%20page, because it
escapes unsafe characters. Similarly, given <a href="/français">, it will
encode the ç as UTF-8 and then url-encode it, even if the page itself is
ISO-8859-1. Well, at least on Firefox. I used this to test:
http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py
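
To make the behaviour concrete, here is a rough Python 3 sketch (not the test
script above, and not CherryPy's actual code). The helper names
browser_style_escape and decode_path are made up for illustration. The first
mimics what Firefox appeared to do with link targets; the second follows the
"try the configured charset, then fall back to ISO-8859-1" approach Robert
described, and assumes the path arrives as a percent-encoded ASCII string.

```python
from urllib.parse import quote, unquote_to_bytes


def browser_style_escape(path):
    """Hypothetical helper: encode non-ASCII characters as UTF-8, then
    percent-encode anything unsafe, leaving '/' alone."""
    return quote(path, safe="/", encoding="utf-8")


def decode_path(raw_path, charset="utf-8"):
    """Hypothetical helper: url-decode to bytes, then try the configured
    charset, falling back to ISO-8859-1 (which accepts any byte sequence).

    Returns (decoded_text, encoding_used) so a caller could record the
    encoding, e.g. under a key like REQUEST_URI_ENCODING.
    """
    # Malformed escapes such as %zz are left in place rather than rejected.
    raw = unquote_to_bytes(raw_path)
    try:
        return raw.decode(charset), charset
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    print(browser_style_escape("/some page"))
    print(browser_style_escape("/français"))
    print(decode_path("/fran%C3%A7ais"))  # UTF-8 bytes, as a browser sends
    print(decode_path("/fran%E7ais"))     # hand-written Latin-1 escape
    print(decode_path("/foo%zzz"))        # not a valid escape at all
```

Since the fallback can never fail, a hand-constructed path that is not really
ISO-8859-1 will still decode to something, which is why keeping the original
bytes around seems worthwhile.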

-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker