BeautifulSoup bug when ">>>" found in attribute value
John Nagle
nagle at animats.com
Wed Dec 27 13:38:14 EST 2006
Duncan Booth wrote:
> John Nagle <nagle at animats.com> wrote:
>
>
>>And this came out, via prettify:
>>
>><addresssnippet siteurl="http%3A//apartmentsapart.com"
>>url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
>> <param name="movie"
>> value="/images/offersBanners/sw04.swf?binfot=We offer
>>fantastic rates for selected weeks or days!!&blinkt=Click here
>>>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
>>
>>>>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>>
>></param>
>>
>>BeautifulSoup seems to have become confused by the ">>>" within
>>a quoted attribute value. It first parsed it right, but then stuck
>>in an extra, totally bogus line. Note the entity "&linkurl;", which
>>appears nowhere in the original. It looks like code to handle a
>>missing quote mark did the wrong thing.
>
>
> I don't think I would quibble with what BeautifulSoup extracted from that
> mess. The input isn't valid HTML so any output has to be guessing at what
> was meant. A lot of code for parsing html would assume that there was a
> quote missing and the tag was terminated by the first '>'. IE and Firefox
> seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
> seems to have given you the best of both worlds: the attribute is parsed to
> the closing quote, but the tag itself ends at the first '>'.
>
> As for inserting a semicolon after linkurl, I think you'll find it is just
> being nice and cleaning up an unterminated entity. Browsers (or at least
> IE) will often accept entities without the terminating semicolon, so that's
> a common problem in badly formed html that BeautifulSoup can fix.
It's worse than that. Look at the last line of BeautifulSoup output:
&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
That "/>" doesn't match anything. We're outside a tag at that point.
And BeautifulSoup introduced it. That's both wrong and puzzling:
since this output was generated from a parse tree, that type of
error should never happen. It looks as though the parser didn't
delete a string item after deciding it was actually part of a tag.
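
For comparison, here is a minimal sketch of how a tolerant tokenizer
can be checked on this kind of input. It uses the standard library's
html.parser from a current Python 3, not BeautifulSoup, so it says
nothing about BeautifulSoup's internals. The names TagRecorder,
parse_fragment and SAMPLE are made up for illustration, and SAMPLE is
a shortened stand-in for the original tag, not the real page source.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper: records start tags and non-blank text."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # list of (tag name, attribute dict)
        self.text = []        # text found outside of any tag

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        # Ignore whitespace-only runs between tags.
        if data.strip():
            self.text.append(data)


def parse_fragment(markup):
    """Tokenize markup and return (start_tags, stray_text)."""
    recorder = TagRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.start_tags, recorder.text


# Shortened stand-in for the problem tag: ">>>" inside a quoted value.
SAMPLE = (
    '<param name="movie" '
    'value="sw04.swf?blinkt=Click here >>>&linkurl=/Offer/2408">'
)

tags, stray_text = parse_fragment(SAMPLE)
print(tags)
print(stray_text)
```

If the quoted value is kept whole, there should be exactly one start
tag, its "value" attribute should still contain ">>>", and the list
of stray text should be empty. Any leftover text such as a trailing
'" />' would indicate the kind of double handling described above.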
John Nagle