"<![CDATA[]]" vs. BeautifulSoup

Ian Kelly ian.g.kelly at gmail.com
Thu May 3 19:02:02 EDT 2012


On Thu, May 3, 2012 at 1:59 PM, John Nagle <nagle at animats.com> wrote:
>  An HTML page for a major site (http://www.chase.com) has
> some incorrect HTML.  It contains
>
>        <![CDATA[]]
>
> which is not valid HTML, XML, or SGML.  However, most browsers
> ignore it.  BeautifulSoup treats it as the start of a CDATA section,
> and consumes the rest of the document in CDATA format.
>
>  Bug?

Seems like a bug to me.  BeautifulSoup is supposed to parse like a
browser would, so if most browsers just ignore an unterminated CDATA
section, then BeautifulSoup probably should too.
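
Until that is fixed, one possible workaround is to clean the markup
before handing it to the parser. The following is only a minimal
sketch, not BeautifulSoup's own behavior or fix: the function name
strip_unterminated_cdata is hypothetical, and it assumes that a CDATA
opener with no "]]>" anywhere after it can simply be dropped, along
with a dangling "]]" directly behind it as in the chase.com case.

```python
CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def strip_unterminated_cdata(html):
    """Drop CDATA openers that are never closed (hypothetical helper).

    Properly terminated sections (<![CDATA[ ... ]]>) are kept verbatim.
    Limitation: an unterminated opener followed later by a valid CDATA
    section is treated as terminated by that later "]]>".
    """
    out = []
    pos = 0
    while True:
        start = html.find(CDATA_OPEN, pos)
        if start == -1:
            # No more openers: keep the remainder as-is.
            out.append(html[pos:])
            break
        end = html.find(CDATA_CLOSE, start + len(CDATA_OPEN))
        if end == -1:
            # Unterminated: keep the text before the opener, skip the opener.
            out.append(html[pos:start])
            pos = start + len(CDATA_OPEN)
            # Also skip a dangling "]]" immediately after the opener.
            if html.startswith("]]", pos):
                pos += 2
        else:
            # Terminated: copy the whole section through its "]]>".
            out.append(html[pos:end + len(CDATA_CLOSE)])
            pos = end + len(CDATA_CLOSE)
    return "".join(out)
```

The cleaned string can then be passed to BeautifulSoup as usual, so the
rest of the page is parsed as markup rather than swallowed as CDATA.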


