On Tue, Aug 11, 2009 at 11:58 PM, Graham Dumpleton <span dir="ltr">&lt;<a href="mailto:graham.dumpleton@gmail.com">graham.dumpleton@gmail.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


2009/8/12 Ian Bicking &lt;<a href="mailto:ianb@colorstudy.com">ianb@colorstudy.com</a>&gt;:<br>

<div><div></div><div class="h5">&gt; On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer &lt;<a href="mailto:fumanchu@aminus.org">fumanchu@aminus.org</a>&gt; wrote:<br>

&gt;&gt;<br>

&gt;&gt; &gt; 5. When running under Python 3, servers MUST provide CGI HTTP and<br>

&gt;&gt; &gt; server variables as strings. Where such values are sourced from a byte<br>

&gt;&gt; &gt; string, be that a Python byte string or C string, they should be<br>

&gt;&gt; &gt; converted as &#39;UTF-8&#39;. If a specific web server infrastructure is able<br>

&gt;&gt; &gt; to support different encodings, then the WSGI adapter MAY provide a<br>

&gt;&gt; &gt; way for a user of the WSGI adapter to customise on a global basis, or<br>

&gt;&gt; &gt; on a per value basis what encoding is used, but this is entirely<br>

&gt;&gt; &gt; optional. Note that there is no requirement to deal with RFC 2047.<br>

&gt;&gt;<br>

&gt;&gt; We&#39;re passing unicode for almost everything.<br>

&gt;&gt;<br>

&gt;&gt; REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and<br>

&gt;&gt; must be ascii-decodable. So are SERVER_PROTOCOL and our custom<br>

&gt;&gt; ACTUAL_SERVER_PROTOCOL entries.<br>

&gt;&gt;<br>

&gt;&gt; The original bytes of the Request-URI are stored in REQUEST_URI. However,<br>

&gt;&gt; PATH_INFO and QUERY_STRING are parsed from it, and decoded via a<br>

&gt;&gt; configurable charset, defaulting to UTF-8. If the path cannot be decoded<br>

&gt;&gt; with that charset, ISO-8859-1 is tried. Whichever is successful is stored at<br>

&gt;&gt; environ[&#39;REQUEST_URI_ENCODING&#39;] so middleware and apps can transcode if<br>

&gt;&gt; needed. Our origin server always sets SCRIPT_NAME to &#39;&#39;, but if we populated<br>

&gt;&gt; it, we would make it decoded by the same charset.<br>

&gt;<br>

&gt; My understanding is that PATH_INFO *should* be UTF-8 regardless of what<br>

&gt; encoding a page might be in. At least that&#39;s what I got when testing<br>

&gt; Firefox.  It might not be valid UTF-8 if it was manually constructed, but<br>

&gt; then there&#39;s little reason to think it is valid anything; only the bytes or<br>

&gt; REQUEST_URI are likely to be an accurate representation.<br>

<br>

</div></div>As I understood it, PJE was suggesting that wasn&#39;t the case.<br>

<br>

For example, what about the case where a URL appears as the target of a form POST<br>

and the encoding of that form page wasn&#39;t UTF-8? What is the browser<br>

going to send in that case?<br>

<br>

Or is this the sort of case you have tested and would file under<br>

&quot;if manually constructed, anything could happen&quot;?<br>

<div><div></div><div class="h5"></div></div></blockquote><div><br></div><div>Correct -- you can write any set of % escapes, and I don&#39;t think the path even has to url-decode validly (e.g., /foo%zzz will work).  It definitely doesn&#39;t have to be a valid encoding.  However, if you actually include Unicode characters, they will always be encoded as UTF-8 (in line with the IRI standard).  For example, with &lt;a href=&quot;/some page&quot;&gt; the browser will request /some%20page, because it escapes unsafe characters.  Similarly, with &lt;a href=&quot;/français&quot;&gt; it will encode the ç as UTF-8, then url-encode it, even if the page itself is ISO-8859-1.  Well, at least on Firefox.  I used this to test: <a href="http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py">http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py</a></div>
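<div><br></div><div>A minimal sketch of the behaviour discussed above, assuming Python 3 and only the standard <code>urllib.parse</code> module. The helper names <code>browser_style_escape</code> and <code>decode_path</code> are hypothetical, and the UTF-8-then-ISO-8859-1 fallback only mirrors the approach Robert describes; it is not taken from any server&#39;s actual code or from the test script linked above.</div>

```python
from urllib.parse import quote, unquote_to_bytes


def browser_style_escape(path):
    """Approximate what the browser does with an href: encode any
    non-ASCII characters as UTF-8, then percent-encode the bytes.
    (Hypothetical helper, for illustration only.)"""
    return quote(path, safe="/", encoding="utf-8")


def decode_path(raw_path, charset="utf-8"):
    """Percent-decode a request path to bytes, then try the configured
    charset and fall back to ISO-8859-1, which accepts any byte value.
    Returns the decoded text and the charset that succeeded.
    (Hypothetical helper, for illustration only.)"""
    raw = unquote_to_bytes(raw_path)  # malformed escapes such as %zz are left as-is
    try:
        return raw.decode(charset), charset
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    print(browser_style_escape("/some page"))
    print(browser_style_escape("/fran\u00e7ais"))
    print(decode_path("/fran%C3%A7ais"))   # valid UTF-8
    print(decode_path("/fran%E7ais"))      # not valid UTF-8, falls back
    print(decode_path("/foo%zzz"))         # invalid escape passes through
```

<div>The second element of the tuple returned by <code>decode_path</code> plays the role of the REQUEST_URI_ENCODING entry mentioned in the quoted text, so that middleware could transcode if it needs to.</div>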


</div><br>-- <br>Ian Bicking  |  <a href="http://blog.ianbicking.org">http://blog.ianbicking.org</a>  |  <a href="http://topplabs.org/civichacker">http://topplabs.org/civichacker</a><br>