URLs and ampersands
mattheww at chiark.greenend.org.uk
Tue Aug 5 19:06:58 CEST 2008
Steven D'Aprano wrote:
> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
> snag with URLs containing ampersands:
> Somewhere in the process, urls like the above are escaped to:
> which naturally fails to exist.
> I could just do a string replace, but is there a "right" way to escape
> and unescape URLs? I've looked through the standard lib, but I can't find
> anything helpful.
I don't believe there is a concept of 'escaping a URL' as such. How you
escape or unescape a URL depends on what context you're embedding it in
or extracting it from.
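
To illustrate the point about context, here is a minimal sketch using a current Python 3 standard library. The URL is a made-up example (not the one from the original question), and the variable names are purely illustrative. The same URL is escaped in two different ways depending on where it is going to be embedded:

```python
from html import escape
from urllib.parse import quote

# Hypothetical URL, invented for illustration only.
url = "http://example.com/page?a=1&b=2"

# Context 1: embedding the URL in an HTML attribute value.
# '&' becomes '&amp;' so an HTML parser does not treat it as the
# start of a character reference.
as_html_attribute = escape(url, quote=True)

# Context 2: embedding the whole URL as a single query-string value
# inside another URL. Reserved characters are percent-encoded instead.
as_query_value = quote(url, safe="")

print(as_html_attribute)
print(as_query_value)
```

Neither result is "the escaped URL" in general; each is only correct for its own context, and each needs the matching inverse operation.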
In this case, it looks like you have URLs which have been escaped to go
into an HTML CDATA attribute value (such as <a href="...">).
I believe there is no documented function in the Python standard library
which reverses this escaping (short of putting your string into a
larger document and parsing that with a full HTML or XML parser).
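
For readers on Python 3.4 or later (released well after this message was written), the standard library does include a documented function, html.unescape(), that converts HTML character references back to plain characters. The following is a rough sketch under that assumption; the helper name and the example URL are hypothetical and not taken from the original question:

```python
from html import unescape


def unescape_attribute_url(escaped_url):
    """Hypothetical helper: undo HTML character-reference escaping.

    Assumes the input really was taken from an HTML attribute value;
    applying it to a URL that was never HTML-escaped could alter it.
    """
    return unescape(escaped_url)


# Made-up example of a URL as it might appear inside href="...".
escaped = "http://example.com/page?a=1&amp;b=2"
restored = unescape_attribute_url(escaped)

print(restored)
```

The restored string can then be passed to the download call. This only reverses HTML escaping; percent-encoding in the URL, if any, is left untouched.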