[Python-Dev] bytes / unicode

Stephen J. Turnbull stephen at xemacs.org
Mon Jun 21 13:19:50 CEST 2010


Robert Collins writes:

 > Also, url's are bytestrings - by definition;

Eh?  RFC 3986 explicitly says

    A URI is an identifier consisting of a sequence of characters
    matching the syntax rule named <URI> in Section 3.

(where the phrase "sequence of characters" appears in all ancestors I
found back to RFC 1738), and

    2.  Characters

    The URI syntax provides a method of encoding data, presumably for
    the sake of identifying a resource, as a sequence of characters.
    The URI characters are, in turn, frequently encoded as octets for
    transport or presentation.  This specification does not mandate any
    particular character encoding for mapping between URI characters
    and the octets used to store or transmit those characters.  When a
    URI appears in a protocol element, the character encoding is
    defined by that protocol; without such a definition, a URI is
    assumed to be in the same character encoding as the surrounding
    text.
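
To make the character/octet distinction concrete, here is a small
sketch.  The address is a placeholder of mine, and the two encodings
are just examples of what a protocol might choose:

```python
# The URI itself is a sequence of characters (a str in Python 3).
uri = "mailto:user@example.org"   # hypothetical address

# Two protocols could map those same characters to different octets.
as_ascii = uri.encode("ascii")
as_utf16 = uri.encode("utf-16-le")

# Different octets on the wire, same URI once decoded.
same_octets = (as_ascii == as_utf16)
same_uri = (as_ascii.decode("ascii") == as_utf16.decode("utf-16-le") == uri)
```

The octets differ, but the identifier does not, which is why the RFC
talks about characters rather than bytes.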

 > if the standard library has made them unicode objects in 3, I
 > expect a lot of pain in the webserver space.

Yup.  But pain is inevitable if people are treating URIs (whether URLs
or otherwise) as octet sequences.  Then your base URL is gonna be
b'mailto:stephen@xemacs.org', but the natural thing the UI will want
to do is 

    formurl = baseurl + '?subject=うるさいやつだなぁ…'

IMO, the UI is right.  "Something" like the above "ought" to work.
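
For the record, this is roughly what Python 3 does with that
concatenation when the base URL is bytes.  A minimal sketch, with a
placeholder address:

```python
baseurl = b'mailto:user@example.org'   # hypothetical address, kept as octets

try:
    formurl = baseurl + '?subject=うるさいやつだなぁ…'
except TypeError as exc:
    # Python 3 refuses to mix bytes and str implicitly.
    error = exc
else:
    error = None
```

So with a bytes URL the UI has to pick an encoding and do the escaping
itself before it can even build the string.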

So the function that actually handles composing the URL should take a
string (i.e., unicode) and do all escaping.  The UI code should not
need to know about escaping.  If nothing escapes except the function
that puts the URL in composed form, and that function always escapes,
life is easy.
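
Something along these lines is what I have in mind.  It is only a
sketch: compose_mailto and to_wire are names I made up, the address is
a placeholder, and I am assuming UTF-8 percent-encoding (the default
for urllib.parse.quote) is what the receiving end expects:

```python
from urllib.parse import quote, unquote

def compose_mailto(address, **headers):
    """Hypothetical helper: build a mailto URL from unescaped text.

    Callers pass plain str values; all escaping happens here.
    """
    # Keep '@' readable in the address; everything else unsafe is escaped.
    url = "mailto:" + quote(address, safe="@")
    if headers:
        # Percent-encode each header name and value as UTF-8.
        query = "&".join(
            quote(name, safe="") + "=" + quote(value, safe="")
            for name, value in headers.items()
        )
        url += "?" + query
    return url

def to_wire(url):
    """Hypothetical helper: a composed URL is pure ASCII, so encode it."""
    return url.encode("ascii")

# The UI hands over text and never sees the escaping.
formurl = compose_mailto("user@example.org", subject="うるさいやつだなぁ…")
wire = to_wire(formurl)

# Round trip: unescaping the query value gives the original text back.
subject = unquote(formurl.split("=", 1)[1])
```

The point of the design is that escaping happens in exactly one place,
so the UI cannot forget it or do it twice.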

Of course, in real life it's not that easy.  But it's possible to make
things unnecessarily hard for the users of your URI API(s), and one
way to do that is to make URIs into "just bytes" (and "just unicode"
is probably nearly as bad, except that at least you know it's not
ready for the wire).


