On Tue, Aug 11, 2009 at 11:58 PM, Graham Dumpleton <span dir="ltr"><<a href="mailto:graham.dumpleton@gmail.com">graham.dumpleton@gmail.com</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
2009/8/12 Ian Bicking <<a href="mailto:ianb@colorstudy.com">ianb@colorstudy.com</a>>:<br>
<div><div></div><div class="h5">> On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer <<a href="mailto:fumanchu@aminus.org">fumanchu@aminus.org</a>> wrote:<br>
>><br>
>> > 5. When running under Python 3, servers MUST provide CGI HTTP and<br>
>> > server variables as strings. Where such values are sourced from a byte<br>
>> > string, be that a Python byte string or C string, they should be<br>
>> > converted as 'UTF-8'. If a specific web server infrastructure is able<br>
>> > to support different encodings, then the WSGI adapter MAY provide a<br>
>> > way for a user of the WSGI adapter to customise on a global basis, or<br>
>> > on a per value basis what encoding is used, but this is entirely<br>
>> > optional. Note that there is no requirement to deal with RFC 2047.<br>
>><br>
>> We're passing unicode for almost everything.<br>
>><br>
>> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and<br>
>> must be ascii-decodable. So are SERVER_PROTOCOL and our custom<br>
>> ACTUAL_SERVER_PROTOCOL entries.<br>
>><br>
>> The original bytes of the Request-URI are stored in REQUEST_URI. However,<br>
>> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a<br>
>> configurable charset, defaulting to UTF-8. If the path cannot be decoded<br>
>> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at<br>
>> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if<br>
>> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated<br>
>> it, we would make it decoded by the same charset.<br>
><br>
> My understanding is that PATH_INFO *should* be UTF-8 regardless of what<br>
> encoding a page might be in. At least that's what I got when testing<br>
> Firefox. It might not be valid UTF-8 if it was manually constructed, but<br>
> then there's little reason to think it is valid anything; only the bytes or<br>
> REQUEST_URI are likely to be an accurate representation.<br>
<br>
</div></div>As I understood it, PJE was suggesting that wasn't the case.<br>
<br>
For example, what about the case where a URL appears as the target of a form POST<br>
and the encoding of that form page wasn't UTF-8? What is the browser<br>
going to send in that case?<br>
<br>
Or is this the sort of case you have tested, and the one you mean when you say<br>
that anything could happen if the URL is manually constructed?<br>
<div><div></div><div class="h5"></div></div></blockquote><div><br></div><div>Correct -- you can write any set of % encodings, and I don't think the path even has to url-decode validly (e.g., /foo%zzz will work). It definitely doesn't have to be valid in any particular character encoding. However, if you actually include Unicode characters, they will always be encoded as UTF-8 (in line with the IRI standard). For example, given a link like &lt;a href="/some page"&gt;, the browser will request /some%20page, because it escapes unsafe characters. Similarly, for &lt;a href="/français"&gt; it will encode the ç in UTF-8 and then url-encode it, even if the page itself is ISO-8859-1. Well, at least on Firefox. I used this to test: <a href="http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py">http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py</a></div>
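<div><br></div><div>A minimal Python 3 sketch of the behaviour discussed in this thread, using only the standard library. The helper names browser_style_path and decode_path are hypothetical and exist only for illustration; the fallback order in decode_path follows Robert's description quoted above (configured charset first, then ISO-8859-1) and is not taken from any server's actual code.</div>

```python
from urllib.parse import quote, unquote_to_bytes


def browser_style_path(path: str) -> str:
    """Hypothetical helper: encode non-ASCII characters as UTF-8,
    then percent-escape unsafe bytes, leaving '/' alone."""
    return quote(path, safe="/", encoding="utf-8")


def decode_path(raw: bytes, charset: str = "utf-8"):
    """Hypothetical helper: try the configured charset, then fall back
    to ISO-8859-1. Returns the decoded text and the charset that worked,
    so a caller could record it (e.g. in REQUEST_URI_ENCODING)."""
    try:
        return raw.decode(charset), charset
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    print(browser_style_path("/some page"))
    print(browser_style_path("/français"))

    # Malformed escapes are passed through untouched rather than rejected.
    print(unquote_to_bytes("/foo%zzz"))

    # A UTF-8 path decodes on the first attempt; a hand-written Latin-1
    # escape only decodes via the fallback.
    print(decode_path(unquote_to_bytes("/fran%C3%A7ais")))
    print(decode_path(unquote_to_bytes("/fran%E7ais")))
```

<div>Returning the charset alongside the text is one way to let middleware transcode later; whether that is worth the extra environ key is a design choice, not something this sketch settles.</div>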
</div><br>-- <br>Ian Bicking | <a href="http://blog.ianbicking.org">http://blog.ianbicking.org</a> | <a href="http://topplabs.org/civichacker">http://topplabs.org/civichacker</a><br>