Unescaping URLs in Python

John Nagle nagle at animats.com
Mon Dec 25 13:12:39 EST 2006


Lawrence D'Oliveiro wrote:
> In message <hWHjh.25037$Gr2.6406 at newssvr21.news.prodigy.net>, John Nagle
> wrote:
> 
> 
>>Here's a URL from a link on the home page of a major company.
>>
>><a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>
>>
>>What's the appropriate Python function to call to unescape a URL
>>which might contain things like that?
> 
> 
> Just use any HTML-parsing library. I think the standard Python HTMLParser
> will do the trick, provided there aren't any errors in the HTML.

    I'm using BeautifulSoup, because I need to process real-world
HTML.  At least by default, it doesn't unescape URLs like that.
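
    For illustration only, here is a minimal sketch of the unescaping step.
It assumes a current Python 3 standard library (html.unescape and
html.parser, neither of which is the BeautifulSoup setup described here) and
assumes the href in the page source carries its ampersand as the entity
"&amp;". The names raw_href, plain_href and HrefCollector are made up for
the example.

```python
from html import unescape
from html.parser import HTMLParser

# Assumed input: the href as it appears in the page source, with the
# ampersand written as an HTML entity.
raw_href = "/adsk/servlet/index?siteID=123112&amp;id=1860142"

# Option 1: unescape an attribute value that has already been extracted.
plain_href = unescape(raw_href)


# Option 2: let the standard-library parser do it. HTMLParser hands
# attribute values to handle_starttag with entity references replaced.
class HrefCollector(HTMLParser):
    """Hypothetical helper that collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


collector = HrefCollector()
collector.feed('<a href="' + raw_href + '">About Us</a>')
collector.close()

print(plain_href)
print(collector.hrefs)
```

Both routes should yield the same URL with a bare "&"; whether the
standard parser copes with badly broken markup is a separate question.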

    Nor, on the output side, does it escape standalone "&" characters,
as in text like "Sales & Advertising Department".
But there are various BeautifulSoup options; more on this later.
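
    On the output side, a rough sketch of escaping text before it is
written back into HTML, again assuming a current Python 3 standard library
(html.escape) rather than any BeautifulSoup option. The variable names are
illustrative.

```python
from html import escape

# Text containing a standalone ampersand, as in the example above.
text = "Sales & Advertising Department"

# escape() replaces &, < and > (and, by default, quote characters)
# with their entity forms.
escaped = escape(text)

# Caveat: escape() does not check for existing entities, so running it
# on text that is already escaped turns "&amp;" into "&amp;amp;".
double_escaped = escape(escaped)

print(escaped)
print(double_escaped)
```

This is only safe on raw text that has not been escaped yet, as the
second call shows.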

				John Nagle


