[Web-SIG] Unicode in Python 3
ianb at colorstudy.com
Sun Sep 20 06:46:34 CEST 2009
I can't read this whole thread carefully; there's too much of it.
I will note however that people are STILL ignoring surrogateescape
(http://www.python.org/dev/peps/pep-0383/). This is like the third or
fourth time I've brought it up. It was added to Python 3.1 for some
of the exact issues we are encountering.
In particular, imagine someone requests /foo%efbar (the unquoted byte 0xEF is not valid UTF-8 in this position).
>>> SCRIPT_NAME = b'/foo\xefbar' # after url unquoting (urllib.request.unquote doesn't work for this currently)
>>> s = SCRIPT_NAME.decode('utf8', 'surrogateescape')
>>> s
'/foo\udcefbar'
>>> s.encode('utf8', 'surrogateescape')
b'/foo\xefbar'
So we can have unicode values that can be safely and correctly
transcoded to other encodings (or handled in their raw form).
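
Here is a minimal, self-contained sketch of that round trip. It assumes urllib.parse.unquote_to_bytes for the unquoting step (my choice for illustration, not something the example above uses); the variable names are made up for this sketch.

```python
from urllib.parse import unquote_to_bytes

# Unquote to bytes first, so the invalid byte is not lost or replaced.
raw = unquote_to_bytes("/foo%efbar")

# Decode with surrogateescape: the undecodable byte 0xEF becomes the
# lone surrogate U+DCEF instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")

# Without the handler, the lone surrogate cannot be encoded at all.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

print(repr(text), restored == raw, strict_encode_failed)
```

The str value can be passed around like any other string; it only needs the handler again at the point where it is turned back into bytes.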
The constraints on surrogateescape are:
* You have to use 'surrogateescape' during decoding and encoding (I
think for decoding it should be part of the spec)
* You have to know the encoding; doing s.encode('latin1',
'surrogateescape') wouldn't necessarily preserve the correct bytes (it
does for this example, but wouldn't if there were a mix of valid
non-ASCII UTF-8 and invalid bytes)
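
Both constraints can be checked with a short sketch. The helper name roundtrip and the sample byte strings are hypothetical, chosen only for illustration.

```python
def roundtrip(raw, decode_as="utf-8", encode_as="utf-8"):
    """Decode and re-encode with surrogateescape on both sides."""
    text = raw.decode(decode_as, "surrogateescape")
    return text.encode(encode_as, "surrogateescape")

only_invalid = b"/foo\xefbar"      # ASCII plus one invalid byte
mixed = b"/caf\xc3\xa9\xefbar"     # valid UTF-8 (e-acute) plus one invalid byte

# Same codec in both directions: bytes are always preserved.
same_codec_ok = roundtrip(mixed) == mixed

# latin1 on the way out happens to work when everything valid is ASCII...
latin1_simple_ok = roundtrip(only_invalid, encode_as="latin-1") == only_invalid

# ...but not when valid multi-byte UTF-8 is mixed in: the two bytes
# C3 A9 come back as the single latin1 byte E9.
latin1_mixed = roundtrip(mixed, encode_as="latin-1")

print(same_codec_ok, latin1_simple_ok, latin1_mixed == mixed)
```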
It's also a bit annoying that SCRIPT_NAME/PATH_INFO should always be
treated as UTF-8 (which might sometimes be wrong, but for any modern
app/browser will be right), while other parts (HTTP_COOKIE?) may be
in a "native" encoding. Well,
besides HTTP_COOKIE, I don't know what else would be in a different
encoding. Atompub adds Slug, but it's a URL/IRI, so it should be
ASCII. I have seen proposals for a Title header (e.g., when PUTting
an image and giving it a title), and that could be unicode. But in
all those cases it'll be a modern app and modern clients, and in those
cases people just use UTF-8.
Frankly, I'm open to UTF-8-everywhere. People mentioned Jack and Rack,
and to whatever degree that works, it probably works because everyone uses
UTF-8. With surrogateescape we allow transcoding when needed (e.g.,
if you wanted to handle redirects from old/weird non-UTF-8 URLs) but
keep things reasonably simple otherwise.
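
As a rough sketch of the redirect case: assume the server has already decoded PATH_INFO as UTF-8 with surrogateescape, and assume the old URLs were Latin-1 (that encoding is purely a guess for illustration, as is the function name redirect_target).

```python
from urllib.parse import quote

def redirect_target(path_info, legacy_encoding="latin-1"):
    """Return a UTF-8 percent-encoded path to redirect to, or None.

    path_info is assumed to be a str decoded from the request bytes
    with ('utf-8', 'surrogateescape'); legacy_encoding is an assumption
    about what old clients used.
    """
    # Recover the exact bytes the client sent.
    raw = path_info.encode("utf-8", "surrogateescape")
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: reinterpret in the legacy encoding, then
        # re-quote (quote() encodes as UTF-8 by default).
        return quote(raw.decode(legacy_encoding))
    return None  # already valid UTF-8, no redirect needed

old_style = b"/caf\xe9".decode("utf-8", "surrogateescape")
print(redirect_target(old_style))   # /caf%C3%A9
print(redirect_target("/caf\u00e9"))  # None
```

This only catches paths that fail to decode as UTF-8; legacy bytes that happen to form valid UTF-8 would pass through unnoticed.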