[Python-Dev] bytes / unicode
Stephen J. Turnbull
stephen at xemacs.org
Mon Jun 21 13:19:50 CEST 2010
Robert Collins writes:
> Also, url's are bytestrings - by definition;
Eh? RFC 3986 explicitly says
    A URI is an identifier consisting of a sequence of characters
    matching the syntax rule named <URI> in Section 3.
(where the phrase "sequence of characters" appears in all ancestors I
found back to RFC 1738), and
    2. Characters

    The URI syntax provides a method of encoding data, presumably for
    the sake of identifying a resource, as a sequence of characters.
    The URI characters are, in turn, frequently encoded as octets for
    transport or presentation. This specification does not mandate any
    particular character encoding for mapping between URI characters
    and the octets used to store or transmit those characters. When a
    URI appears in a protocol element, the character encoding is
    defined by that protocol; without such a definition, a URI is
    assumed to be in the same character encoding as the surrounding
    text.
> if the standard library has made them unicode objects in 3, I
> expect a lot of pain in the webserver space.
Yup. But pain is inevitable if people are treating URIs (whether URLs
or otherwise) as octet sequences. Then your base URL is gonna be
b'mailto:stephen@xemacs.org', but the natural thing the UI will want
to do is
    formurl = baseurl + '?subject=うるさいやつだなぁ…'
IMO, the UI is right. "Something" like the above "ought" to work.
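For the record, in Python 3 that concatenation does not work as written, because bytes and str refuse to mix. A minimal sketch of what the UI code would run into (the names `query` and `error` are only for illustration):

```python
# Python 3: a bytes base URL cannot be concatenated with a str query.
baseurl = b'mailto:stephen@xemacs.org'
query = '?subject=うるさいやつだなぁ…'

try:
    formurl = baseurl + query   # bytes + str
except TypeError as exc:
    # The interpreter rejects the mix instead of guessing an encoding.
    formurl = None
    error = exc
```

So with a bytes URL, the UI either has to pick an encoding and an escaping scheme itself, or hand the problem to something that knows both.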
So the function that actually handles composing the URL should take a
string (i.e., unicode), and do all escaping. The UI code should not
need to know about escaping. If nothing escapes except the function
that puts the URL in composed form, and that function always escapes,
life is easy.
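Something along these lines is what I have in mind. This is only a sketch: `compose_mailto` is a hypothetical helper, not an existing stdlib function, and it assumes UTF-8 as the encoding behind the percent-escapes (which `urllib.parse.quote` uses by default).

```python
from urllib.parse import quote

def compose_mailto(address, **headers):
    """Hypothetical helper: build a mailto: URL from unicode strings.

    All percent-escaping happens here, so callers pass plain str
    values and never escape anything themselves.
    """
    # Keep '@' literal in the address; escape everything else unsafe.
    url = 'mailto:' + quote(address, safe='@')
    if headers:
        pairs = [
            quote(name, safe='') + '=' + quote(value, safe='')
            for name, value in headers.items()
        ]
        url += '?' + '&'.join(pairs)
    return url

# The UI passes unicode and knows nothing about escaping.
formurl = compose_mailto('stephen@xemacs.org', subject='うるさいやつだなぁ…')

# The composed form is pure ASCII, so going to the wire is trivial.
wire = formurl.encode('ascii')
```

The point of the design is that there is exactly one place where escaping happens, so nothing gets escaped twice or not at all.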
Of course, in real life it's not that easy. But it's possible to make
things unnecessarily hard for the users of your URI API(s), and one
way to do that is to make URIs into "just bytes" (and "just unicode"
is probably nearly as bad, except that at least you know it's not
ready for the wire).