urllib.quote and unicode

Martin v. Löwis martin at v.loewis.de
Fri Dec 6 02:28:59 EST 2002


Kelly <kkranabetter at yahoo.com> writes:

> Urllib's quoting of Unicode characters doesn't seem to work right in Python 
> 2.2.1:
> 
> >>> urllib.quote(unichr(8225))    # double dagger
> '%2021'
> >>> urllib.unquote("%2021")
> ' 21'
> 
> I couldn't find anything very useful on the web about quoting Unicode but 
> Microsoft IIS does understand Unicode characters when quoted like: 
> "%u2021".

URLs currently don't support Unicode, period. There is a draft RFC
about IRIs (Internationalized Resource Identifiers), which are like
URLs but are conceptually sequences of Unicode characters rather than
byte sequences, and thus need no quoting; see

http://www.ietf.org/internet-drafts/draft-duerst-iri-02.txt

That draft explains that, in order to obtain a URI from an IRI, you
encode the IRI in UTF-8 and then percent-escape every byte greater
than 127.
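
As a rough sketch of that conversion (not part of urllib; the function
name iri_to_uri is made up for illustration, and the code assumes a
current Python 3 rather than 2.2), the two steps could look like this:

```python
def iri_to_uri(iri):
    """Hypothetical helper: UTF-8 encode, then %-escape bytes > 127."""
    pieces = []
    for byte in iri.encode("utf-8"):   # iterating bytes yields ints
        if byte > 127:
            # Non-ASCII byte: emit it as %XX with two hex digits.
            pieces.append("%%%02X" % byte)
        else:
            # ASCII byte: passed through unchanged in this sketch.
            pieces.append(chr(byte))
    return "".join(pieces)


# The double dagger from the original question, U+2021.
print(iri_to_uri("\u2021"))   # %E2%80%A1
```

Note that this only covers the non-ASCII step; it leaves ASCII
characters such as spaces alone, so anything that needs the usual URL
quoting still has to go through that separately.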

Python does not implement this draft (and likely won't until it
becomes an RFC).

As you can see, the Microsoft extension to the URI syntax is *not*
likely to become an Internet standard. If you need to interoperate
with Microsoft servers, you'll have to implement it yourself. You may
want to test whether the method proposed in the draft RFC also works
with Microsoft servers; if it does, you would be better off
implementing that method instead.
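
If you do end up implementing the "%u2021" style yourself, a minimal
sketch might look like the following. The function name
quote_percent_u is hypothetical, the code assumes a current Python 3,
and the only thing taken from the report above is that U+2021 is
written as "%u2021"; how characters outside the 16-bit range should be
written is not covered here, so the sketch simply refuses them.

```python
def quote_percent_u(text):
    """Hypothetical helper: write non-ASCII characters as %uXXXX."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            # Assumption: no guess is made for characters beyond 16 bits.
            raise ValueError("cannot encode U+%X as %%uXXXX" % code)
        if code > 127:
            # Non-ASCII character: four hex digits after "%u".
            pieces.append("%%u%04X" % code)
        else:
            # ASCII character: passed through unchanged in this sketch.
            pieces.append(ch)
    return "".join(pieces)


print(quote_percent_u("\u2021"))   # %u2021
```

As with any hand-rolled quoting, ASCII characters that are reserved in
URLs are not touched by this sketch and would need separate handling.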

HTH,
Martin


