Unescaping URLs in Python
John Nagle
nagle at animats.com
Mon Dec 25 13:12:39 EST 2006
Lawrence D'Oliveiro wrote:
> In message <hWHjh.25037$Gr2.6406 at newssvr21.news.prodigy.net>, John Nagle
> wrote:
>
>
>>Here's a URL from a link on the home page of a major company.
>>
>><a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>
>>
>>What's the appropriate Python function to call to unescape a URL
>>which might contain things like that?
>
>
> Just use any HTML-parsing library. I think the standard Python HTMLParser
> will do the trick, provided there aren't any errors in the HTML.
I'm using BeautifulSoup because I need to process real-world
HTML. At least by default, it doesn't unescape URLs like that.
Nor, on the output side, does it escape standalone "&" characters,
as in text like "Sales & Advertising Department".
But there are various BeautifulSoup options; more on this later.
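
For illustration, here is a minimal sketch of both directions using only the
standard library of a current Python 3 (the html module, which did not exist
under that name in 2006). It assumes the href in the page source carries the
ampersand as "&amp;"; the names raw_href and LinkCollector are hypothetical
and are not part of BeautifulSoup or of anything mentioned above.

```python
from html import escape, unescape
from html.parser import HTMLParser

# Assumed raw attribute text, as it would appear in the page source.
raw_href = "/adsk/servlet/index?siteID=123112&amp;id=1860142"

# Input side: turn character references such as &amp; back into characters.
url = unescape(raw_href)


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over attribute values already unescaped.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


collector = LinkCollector()
collector.feed(
    '<a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>'
)

# Output side: escape a standalone "&" before writing text back out as HTML.
escaped_text = escape("Sales & Advertising Department")

print(url)
print(collector.hrefs)
print(escaped_text)
```

Unescaping should be applied once, to the raw attribute text. Running it a
second time on an already-decoded URL can corrupt query strings that happen
to contain something that looks like an entity.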
John Nagle