[Python-3000] urllib.quote/unquote behavior?

Antoine Pitrou solipsis at pitrou.net
Fri May 30 16:07:32 CEST 2008


Oleg Broytmann <phd <at> phd.pp.ru> writes:
> On Fri, May 30, 2008 at 02:19:23PM +0200, Georg Brandl wrote:
> > Python 3.0's urllib.quote() and unquote() handle non-ASCII data strangely.
> > quote() encodes characters with codepoint < 256 using latin-1, but others
> > using utf-8. unquote() decodes everything using latin-1.
> > 
> > Is the correct behavior to always use utf-8?
> 
>    Always UTF-8. See
> http://en.wikipedia.org/wiki/Percent-encoding#Current_standard

Well, according to your link things are not that simple:
""" This requirement was introduced in January 2005 with the publication of RFC
3986. URI schemes introduced before this date are not affected. """

In practice, in the particular case of HTTP, you probably have to distinguish
between the file path part (before the ? sign) and the query string part (after
the ? sign). The percent-encoding of the file path may depend on the actual
filesystem encoding or on the Web server configuration. The percent-encoding of
the query string may depend on the Web application being queried, on the
programming language it is written in, or on anything else altogether :-)
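
A rough sketch of that distinction, not a definitive recipe. It assumes the
urllib.parse API of later Python 3 releases (quote() and unquote() with an
explicit encoding argument), not the urllib.quote()/unquote() functions
discussed in this thread. The helper decode_url_parts and the example.com URL
are made up for illustration, and the choice of utf-8 for the path and latin-1
for the query is just an assumed server setup:

```python
from urllib.parse import quote, unquote, urlsplit

# The same character percent-encodes differently depending on the charset.
latin1_form = quote("é", encoding="latin-1")  # single byte 0xE9
utf8_form = quote("é", encoding="utf-8")      # two bytes 0xC3 0xA9


def decode_url_parts(url, path_encoding="utf-8", query_encoding="utf-8"):
    """Hypothetical helper: unquote the path and the query string
    separately, each with its own (assumed) charset."""
    parts = urlsplit(url)
    path = unquote(parts.path, encoding=path_encoding)
    query = unquote(parts.query, encoding=query_encoding)
    return path, query


# Assumed setup: the server stores file names as utf-8, while the
# Web application behind it expects latin-1 in its query strings.
path, query = decode_url_parts(
    "http://example.com/caf%C3%A9?name=caf%E9",
    path_encoding="utf-8",
    query_encoding="latin-1",
)
print(path)   # /café
print(query)  # name=café
```

Decoding the query part as utf-8 instead does not raise an error by default:
unquote() uses errors='replace', so the lone 0xE9 byte silently becomes U+FFFD,
which is why guessing a single charset for the whole URL is risky.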

Regards

Antoine.



