BeautifulSoup bug when ">>>" found in attribute value
John Nagle
nagle at animats.com
Tue Dec 26 16:36:14 EST 2006
This, which is from a real web site, went into BeautifulSoup:
<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />
And this came out, via prettify:
<addresssnippet siteurl="http%3A//apartmentsapart.com"
url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
>>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
</param>
BeautifulSoup seems to have been confused by the ">>>" inside
a quoted attribute value. It parsed the tag correctly at first, but then
emitted an extra, totally bogus line. Note the entity "&linkurl;", which
appears nowhere in the original. It looks like the code that handles a
missing quote mark did the wrong thing.
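
For comparison, here is a minimal sketch that feeds the same tag to a different parser, the html.parser module in the Python 3 standard library, to show what the attribute value ought to be. It does not use BeautifulSoup and does not reproduce the bug; the TagRecorder class and parse_snippet function are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser

# The tag from the post, with its line breaks kept inside the quoted value.
SNIPPET = (
    '<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer\n'
    'fantastic rates for selected weeks or days!!&blinkt=Click here\n'
    '>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />'
)


class TagRecorder(HTMLParser):
    """Hypothetical helper: records start tags and any text between tags."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <param ... />.
        self.start_tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        # Anything that leaks out of the tag as text would show up here.
        self.text.append(data)


def parse_snippet(markup):
    """Parse the markup and return the recorder holding what was seen."""
    recorder = TagRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder


recorder = parse_snippet(SNIPPET)
tag, attrs = recorder.start_tags[0]
print(tag, sorted(attrs))
print(repr(attrs["value"]))
print("stray text:", recorder.text)
```

If the ">>>" stays inside the value and no stray text is reported, the input itself is a single well-formed tag, which suggests the extra line comes from the parser's recovery logic and not from the markup.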
John Nagle