[Web-SIG] Unicode in Python 3

Armin Ronacher armin.ronacher at active-4.com
Sat Sep 19 15:56:21 CEST 2009


Hi,

Graham Dumpleton wrote:
> So, no strict need to make the WSGI adapter do it differently. You may
> want to only do that if concerned about overhead of transcoding.
> 
> Transcoding just these is most probably going to be less overhead than
> the WSGI adapter having to set up both unicode and raw values in a
> dictionary for everything.
So if I understand you correctly, wsgi.uri_encoding would be used
*only* as information about what the URI encoding was, while the
application should use whatever internal encoding it wants?  That sounds
right, but then let's make that "should" a MUST.

Your query_string example is flawed: the query string is always
percent-quoted, and encoding/decoding an ASCII-only string does not
change it as long as the encoding is a superset of ASCII, which is
required anyway for various reasons.
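
To illustrate the point about quoted query strings, here is a minimal
sketch.  The sample value and the variable names are made up for the
example; only urllib.parse.quote is a real standard-library call.

```python
from urllib.parse import quote

# Hypothetical query value with one non-ASCII character (e with acute).
raw_value = "caf\u00e9"

# Percent-quoting leaves nothing but ASCII on the wire.
quoted = quote(raw_value.encode("utf-8"))
wire_bytes = quoted.encode("ascii")

# Decoding ASCII-only bytes gives the same text under any ASCII superset,
# so transcoding the still-quoted query string is effectively a no-op.
decoded = {enc: wire_bytes.decode(enc)
           for enc in ("latin1", "utf-8", "iso-8859-7")}
```

The encoding only starts to matter once the application unquotes the
value and has to interpret the resulting bytes.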

I would go with this wording for the spec then:

    wsgi.uri_encoding holds the encoding of the URI that was used to
    decode the SCRIPT_NAME and PATH_INFO.  If the application decodes
    the query string it MUST obey the encoding here.  If REQUEST_URI
    is available, the server will use the URI encoding to decode this
    value as well.

    However, for encoding URIs the application MUST NOT use the
    wsgi.uri_encoding information but MUST use UTF-8 to encode the URI.

    Backwards compatibility for URIs: If the application depends on
    non-UTF-8 URIs and the fallback encoding is NOT latin1, the
    application will have to check wsgi.uri_encoding for latin1
    and, if it detects it, encode back to latin1 and decode
    from the fallback encoding (e.g. iso-8859-7).

    WSGI 2.0 however requires the application to use UTF-8 for
    generated URIs.
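
The rule for generated URIs could look roughly like this in an
application.  This is only a sketch: encode_uri_path is a hypothetical
helper name, not part of any spec or library, and it covers the path
component only.

```python
from urllib.parse import quote

def encode_uri_path(path):
    """Percent-encode a unicode path for use in a generated URI.

    Always encodes as UTF-8 and deliberately ignores whatever
    wsgi.uri_encoding said about the incoming request.
    """
    return quote(path.encode("utf-8"), safe="/")

generated = encode_uri_path("/caf\u00e9/menu")
```

The incoming encoding and the outgoing encoding are kept separate on
purpose: the former is whatever the client sent, the latter is fixed.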

I checked the browser implementations now and for arbitrary URIs (not
generated URIs in a page) the browser will always try UTF-8.  RFC 3987
also recommends UTF-8 for URIs.

> Even with your iso-8859-4 example, can't see how you can without
> knowing lose what original characters are, as wsgi.uri_encoding being
> provided always allows you to transcode to what you needed it to be
> when what was supplied didn't match.
Assuming the only possible values for wsgi.uri_encoding are
latin1/iso-8859-1 and utf-8 when the application is invoked, I'm totally
fine with that.  Because if the application's fallback URI encoding is
something like iso-8859-4, the application can itself check for latin1
and re-encode the data.  I could live with that.  What I don't want to
see in WSGI is that the fallback encoding (latin1) could be changed in
the server configuration.
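
A rough sketch of what that check could look like on the application
side, assuming the wsgi.uri_encoding key as proposed in this thread.
The function name and the default fallback are placeholders chosen for
the example.

```python
def path_info_for(environ, fallback="iso-8859-4"):
    """Return PATH_INFO, re-decoded if the server fell back to latin1."""
    path = environ["PATH_INFO"]
    used = environ.get("wsgi.uri_encoding", "").lower()
    if used in ("latin1", "iso-8859-1"):
        # latin1 maps every byte to exactly one code point, so encoding
        # back to latin1 recovers the original bytes without loss.
        path = path.encode("latin1").decode(fallback)
    return path

# Simulate a server that could not decode the path as UTF-8.
original = "/\u0101"  # "a" with macron, present in iso-8859-4
environ = {
    "PATH_INFO": original.encode("iso-8859-4").decode("latin1"),
    "wsgi.uri_encoding": "latin1",
}
recovered = path_info_for(environ)
```

This only works because the fallback is known to be latin1.  If the
server could be configured with a different fallback, the application
could no longer rely on the round trip being lossless.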

> Now you can go back to monologue, as definitely sleeping now. ;-)
\o/


Regards,
Armin

