urllib.quote and unicode

Fri Dec 6 03:42:13 EST 2002

"Martin v. Löwis" <martin at v.loewis.de> wrote...
> Tim Roberts <timr at probo.com> writes:
>
> > URLs have to be ISO-8859-1,
> > so they cannot include Unicode characters.
>
> I believe URLs are just bytes, with no character set implied.

Correct.

Various URI schemes ('http' being a conspicuous exception) and various
application-level uses of URIs sometimes mandate the use of utf-8. Examples
of the latter include the cleanup of non-ASCII characters in URI-type
attribute values by HTML 4 user-agents and by XSLT's HTML output method.
Future standards are gravitating toward utf-8 as well. In practice, though,
you will find that the encoding used as the basis for %-escaping of
non-ASCII characters is often chosen arbitrarily.
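
To make that concrete, here is a minimal sketch of how the same text ends up
with two different %-escaped forms depending on which encoding is applied
first. It assumes Python 3's urllib.parse module rather than the urllib.quote
of the subject line, and the sample string is only an illustration.

```python
from urllib.parse import quote, unquote_to_bytes

text = "Löwis"

# Encode to bytes first, then %-escape the bytes. The escaped form
# depends entirely on the encoding chosen in the first step.
utf8_escaped = quote(text.encode("utf-8"))         # 'L%C3%B6wis'
latin1_escaped = quote(text.encode("iso-8859-1"))  # 'L%F6wis'

print(utf8_escaped)
print(latin1_escaped)

# Unescaping only recovers bytes; nothing in the URL says how to
# turn those bytes back into characters.
print(unquote_to_bytes(latin1_escaped))            # b'L\xf6wis'
```

Both results are legitimate URL text. Whoever receives one of them has to
know, or guess, which encoding produced it.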

Many servers that provide APIs to access HTML form data, for example, assume
iso-8859-1 or windows-1252. Many browsers sending HTML form data use the
same encoding as the HTML document that contained the form (which the user
can override), and they don't tell the server which encoding they used.
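
Because the encoding is not communicated, a receiving application can only
guess. The following is a rough sketch of one common heuristic, not anything
a standard prescribes: try utf-8 first and fall back to a single-byte
encoding when the bytes are not valid utf-8. The function name
decode_form_value and its fallback parameter are hypothetical, and the code
assumes Python 3's urllib.parse.

```python
from urllib.parse import unquote_to_bytes

def decode_form_value(escaped, fallback="iso-8859-1"):
    """Guess the characters behind a %-escaped form value (hypothetical helper)."""
    raw = unquote_to_bytes(escaped)
    try:
        # Strict utf-8 decoding rejects most byte sequences that were
        # really produced from a single-byte encoding.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a character, so this cannot fail,
        # but the result is only a guess.
        return raw.decode(fallback)

print(decode_form_value("caf%C3%A9"))  # escaped from utf-8 bytes
print(decode_form_value("caf%E9"))     # escaped from iso-8859-1 bytes
```

The guess can be wrong: some iso-8859-1 byte sequences also happen to be
valid utf-8, and in that case this sketch silently picks utf-8.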