Replacing utf-8 characters

Mike no at spam
Wed Oct 5 10:35:36 EDT 2005


Hi, I am using Python to scrape web pages and I do not have any problems 
unless I run into a site that is utf-8.  It seems & is changed to &amp; 
when the site is utf-8.

If I try to replace it with .replace('&amp;','&') it for some reason 
does not replace it.

For example: http://today.reuters.co.uk/news/default.aspx

The url in the page looks like this

http://today.reuters.co.uk/news/NewsArticle.aspx?type=topNews&storyID=2005-10-05T140937Z_01_MCC423599_RTRUKOC_0_UK-BRITAIN-CONSERVATIVES.xml

However when I pull it into python the URL ends up looking like this 
(notice the &amp; instead of just & in the URL)

http://today.reuters.co.uk/news/newsArticle.aspx?type=businessNews&amp;storyID=2005-10-05T094354Z_01_MOL530411_RTRUKOC_0_UK-CONSTRUCTION-BPB-STGOBAIN.xml

Any ideas?
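
For reference, here is a minimal sketch of the replacement I am attempting. It assumes the scraped text contains the literal five-character entity `&amp;` (the HTML-escaped form of `&`); the variable names are only illustrative, and `html.unescape` needs a recent Python 3. One thing worth checking is that `str.replace` returns a new string rather than changing the original, so the result has to be assigned.

```python
from html import unescape

# Hypothetical sample: the link as it might appear in the raw page source,
# with the ampersand HTML-escaped as &amp;.
raw_href = (
    "http://today.reuters.co.uk/news/newsArticle.aspx"
    "?type=businessNews&amp;storyID=2005-10-05T094354Z_01_MOL530411"
    "_RTRUKOC_0_UK-CONSTRUCTION-BPB-STGOBAIN.xml"
)

# str.replace does not modify raw_href in place; it returns a new string,
# so the result must be bound to a name to have any visible effect.
fixed_by_replace = raw_href.replace("&amp;", "&")

# html.unescape decodes every HTML entity in the text, not only &amp;.
fixed_by_unescape = unescape(raw_href)

print(fixed_by_replace)
```

If the printed URL still shows the entity, the text in hand may not contain the literal `&amp;` at all, and printing `repr(raw_href)` would show what is actually there.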


