BeautifulSoup bug when ">>>" found in attribute value
Duncan Booth
duncan.booth at invalid.invalid
Wed Dec 27 13:38:57 EST 2006
John Nagle <nagle at animats.com> wrote:
> It's worse than that. Look at the last line of BeautifulSoup
> output:
>
> &linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>
> That "/>" doesn't match anything. We're outside a tag at that point.
> And it was introduced by BeautifulSoup. That's both wrong and
> puzzling; given that this was created from a parse tree, that type
> of error shouldn't ever happen. This looks like the parser didn't
> delete a string item after deciding it was actually part of a tag.
The /> was in the original input that you gave it:
<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
offer fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />
You don't actually *have* to escape > when it appears in html.
As I said before, it looks like BeautifulSoup decided that the tag ended
at the first > although it took text beyond that up to the closing " as
the value of the attribute. The remaining text was then simply treated
as text content of the unclosed param tag. Finally it inserted a
</param> to close the unclosed param tag.
... some time later ...
Ok, it looks like I was wrong and this is a bug in BeautifulSoup: it
seems that it *is* legal to have an unescaped > in an attribute value,
although it should (not must) be escaped:
From the HTML 4.01 spec:
> Similarly, authors should use "&gt;" (ASCII decimal 62) in text
> instead of ">" to avoid problems with older user agents that
> incorrectly perceive this as the end of a tag (tag close delimiter)
> when it appears in quoted attribute values.
Thank you, it looks like I just learned something new.
Mind you, the sentence before that says 'should' for quoting < characters
which is just plain silly.
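
For anyone who wants to check the spec's reading against a parser, here is
a minimal sketch. It assumes Python 3 and the standard library's html.parser
rather than BeautifulSoup, so it does not reproduce the bug itself; it only
shows what a parser that honours quoted attribute values is expected to do
with an unescaped ">". The AttrCollector class, the parse() helper and the
shortened SAMPLE string are made up for illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records start tags and any non-blank text."""

    def __init__(self):
        super().__init__()
        self.tags = []   # list of (tag name, attribute dict)
        self.text = []   # text found outside of tags

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <param ... />,
        # because the default handle_startendtag() delegates here.
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data)


# Shortened stand-in for the <param> tag from the thread; the attribute
# value still contains an unescaped ">>>".
SAMPLE = ('<param name="movie" '
          'value="sw04.swf?blinkt=Click here >>>&linkurl=/Offer/2408" />')


def parse(markup):
    """Feed markup to the collector and return (tags, stray text)."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags, parser.text


tags, text = parse(SAMPLE)
print(tags)   # one param tag, with ">>>" still inside its value attribute
print(text)   # any text the parser thought was outside the tag
```

If the tag were cut short at the first ">", the rest of the value would turn
up in the text list instead of in the attribute, which is what the
BeautifulSoup output in this thread looks like.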