[Web-SIG] Unicode in Python 3
Armin Ronacher
armin.ronacher at active-4.com
Sat Sep 19 15:56:21 CEST 2009
Hi,
Graham Dumpleton wrote:
> So, no strict need to make the WSGI adapter do it differently. You may
> want to only do that if concerned about overhead of transcoding.
>
> Transcoding just these is most probably going to be less overhead than
> the WSGI adapter having to set up both unicode and raw values in a
> dictionary for everything.
So if I understand you correctly, wsgi.uri_encoding would be used
*only* as information about what the URI encoding was, while the
application should use whatever internal encoding it wants? That sounds
right, but then let's make that "should" a MUST.
Your query_string example is flawed: the query string is always quoted,
and encoding/decoding an ASCII-only string will not change anything as
long as the encoding is a superset of ASCII, which is required anyway
for various reasons.
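To illustrate the point about quoted query strings, here is a minimal sketch. The sample value and the choice of iso-8859-7 are assumptions made for the example, not something taken from the thread.

```python
from urllib.parse import quote, unquote

# A query string as it travels on the wire: percent-quoted, ASCII only.
# The Greek sample value and iso-8859-7 are illustrative assumptions.
raw = "q=" + quote("Ελλάδα", encoding="iso-8859-7")

# Transcoding the still-quoted string between ASCII supersets is a no-op.
assert raw.encode("latin1").decode("utf-8") == raw
assert raw.encode("utf-8").decode("latin1") == raw

# The encoding only matters once the percent escapes are resolved.
value = unquote(raw[2:], encoding="iso-8859-7")
```

So the transcoding step costs nothing for QUERY_STRING itself; the application only needs the right encoding at the point where it unquotes.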
I would go with this wording for the spec then:
wsgi.uri_encoding holds the encoding of the URI that was used to
decode the SCRIPT_NAME and PATH_INFO. If the application decodes
the query string it MUST obey the encoding here. If REQUEST_URI
is available, the server will use the URI encoding to decode this
value as well.
However, when encoding URIs the application MUST NOT use the
wsgi.uri_encoding information but MUST use UTF-8.
Backwards compatibility for URIs: If the application depends on
non-UTF-8 URIs and its fallback encoding is NOT latin1, the
application will have to check wsgi.uri_encoding for latin1 and,
if it detects it, encode the value back to latin1 and decode it
from the fallback encoding (e.g. iso-8859-7).
WSGI 2.0 however requires the application to use UTF-8 for
generated URIs.
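A rough sketch of how an application might follow the first two paragraphs of this wording. The helper names decode_query and build_uri are hypothetical, and "wsgi.uri_encoding" is the environ key proposed in this thread, so the server is assumed to set it.

```python
from urllib.parse import parse_qs, quote


def decode_query(environ):
    """Decode QUERY_STRING with the encoding the server announced."""
    # "wsgi.uri_encoding" is the key proposed in this thread (assumed set).
    encoding = environ["wsgi.uri_encoding"]
    return parse_qs(environ.get("QUERY_STRING", ""), encoding=encoding)


def build_uri(path):
    """Quote a generated URI path: always UTF-8, never wsgi.uri_encoding."""
    return quote(path, encoding="utf-8")


# Hand-built environ for illustration; a real server would supply it.
environ = {"wsgi.uri_encoding": "latin1", "QUERY_STRING": "name=J%FCrgen"}
args = decode_query(environ)
link = build_uri("/straße")
```

Decoding follows whatever the server reports, while generated URIs ignore it and are quoted as UTF-8.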
I have now checked the browser implementations, and for arbitrary URIs
(not generated URIs in a page) the browser will always try UTF-8. RFC
3987 also recommends UTF-8 for URIs.
> Even with your iso-8859-4 example, can't see how you can without
> knowing loose what original characters are, as wsgi.uri_encoding being
> provided always allows you to transcode to what you needed it to be
> when what was supplied didn't match.
Assuming the only possible values for wsgi.uri_encoding are
latin1/iso-8859-1 and utf-8 when the application is invoked, I'm totally
fine with that. If the application's fallback URI encoding is something
like iso-8859-4, the application can check for latin1 itself and
re-encode the data. I could live with that. What I don't want to see in
WSGI is a fallback encoding (latin1) that can be changed in the server
configuration.
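The check-and-re-encode step described above could look roughly like this. It is only a sketch: the function name path_info and the set of latin1 aliases are assumptions, and "wsgi.uri_encoding" is the proposed key, not an existing one.

```python
LATIN1_NAMES = {"latin1", "latin-1", "iso-8859-1"}


def path_info(environ, fallback="iso-8859-4"):
    """Return PATH_INFO, transcoded if the server fell back to latin1."""
    value = environ["PATH_INFO"]
    if environ["wsgi.uri_encoding"].lower() in LATIN1_NAMES:
        # latin1 maps every byte to one code point, so encoding back
        # recovers the original bytes without loss.
        return value.encode("latin1").decode(fallback)
    return value


# Simulate a server that could not decode the path as UTF-8.
original = "/kāja"
wire_bytes = original.encode("iso-8859-4")
environ = {
    "PATH_INFO": wire_bytes.decode("latin1"),
    "wsgi.uri_encoding": "latin1",
}
recovered = path_info(environ)
```

This only works because latin1 is a fixed, known fallback. If the server could be configured to fall back to something else, the application would have no reliable way to get back to the original bytes.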
> Now you can go back to monologue, as definitely sleeping now. ;-)
\o/
Regards,
Armin