BeautifulSoup vs. loose & chars
placid
Bulkan at gmail.com
Tue Dec 26 07:22:38 EST 2006
John Nagle wrote:
> I've been parsing existing HTML with BeautifulSoup, and occasionally
> hit content which has something like "Design & Advertising", that is,
> an "&" instead of an "&amp;". Is there some way I can get BeautifulSoup
> to clean those up? There are various parsing options related to "&"
> handling, but none of them seem to do quite the right thing.
>
> If I write the BeautifulSoup parse tree back out with "prettify",
> the loose "&" is still in there. So the output is
> rejected by XML parsers. Which is why this is a problem.
> I need valid XML out, even if what went in wasn't quite valid.
>
> John Nagle
So do you want to remove the "&" characters or replace them with "&amp;"?
If you want to replace them, try the following:
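import urllib, sys

# `url` is assumed to already hold the address of the page to fetch
try:
    location = urllib.urlopen(url)
except IOError, (errno, strerror):
    sys.exit("I/O error(%s): %s" % (errno, strerror))

content = location.read()
# Escape every ampersand so the output is acceptable to an XML parser
content = content.replace("&", "&amp;")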
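One caveat with a blanket replace: any entity that is already correct, such as "&amp;" or "&#38;", gets escaped a second time and comes out as "&amp;amp;". A rough sketch of a more selective version follows. It uses a regular expression with a negative lookahead so that only an "&" that does not already start something shaped like an entity is rewritten. The function name `escape_loose_ampersands` and the pattern are my own illustration, not part of BeautifulSoup, and the pattern only checks the shape of an entity, so an unknown name like "&foo;" is left untouched.

```python
import re

# Match an "&" that is NOT followed by something shaped like an entity:
#   a name      (&amp;  &nbsp;)
#   a decimal   (&#38;)
#   a hex code  (&#x26;)
# The pattern is an illustrative assumption and does not validate entity names.
LOOSE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def escape_loose_ampersands(text):
    """Replace bare ampersands with &amp;, leaving existing entities alone."""
    return LOOSE_AMP.sub("&amp;", text)


if __name__ == "__main__":
    sample = "Design & Advertising, R&amp;D"
    print(escape_loose_ampersands(sample))
```

Running this over the page content before handing it to the parser should leave well-formed entities as they were while fixing the loose ones.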
To do this with BeautifulSoup, I think you need to go through every
Tag, get its content, check whether it contains an "&", and then replace
the Tag with the same Tag whose content has "&amp;" instead.
Hope this helps.
Cheers