libxml2dom - parsing maligned html
Stefan Behnel
stefan_ml at behnel.de
Tue Aug 26 12:02:12 EDT 2008
bruce wrote:
> I'm using a quick test with libxml2dom
>
> ===============
> import libxml2dom
>
> aa=libxml2dom.parseString(foo)
> ff=libxml2dom.toString(aa)
>
> print ff
> ===============
>
> ----------------------------------
> when i start, foo is:
> <html>
> <body>
> </body>
> </html>
>
> <html>
> <body>
> .
> .
> .
> </body>
> </html>
> -------------------------------
> when i print ff it's:
> <html>
> <body>
> </body>
> </html>
> -------------------------------
>
> so it's as if the parseString only reads the initial "html" tree. i've
> reviewed as much as i can find regarding libxml2dom to try to figure out how
> i can get it to read/parse/handle both html trees/nodes.
>
> i know, the html is maligned/screwed-up, but i can't seem to find any app
> (tidy/beautifulsoup) that can "know" which one of the html trees to throw
> out/remove!!
>
> technically, both html trees are valid, it's just that they both shouldn't
> be in the file!!!
What about splitting the string on "<html" and then parsing each part on its own?
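
A minimal sketch of that idea, using only the standard library. The helper name split_html_documents is made up for illustration, and it assumes every document in the string starts with a literal "<html" tag; anything before the first one is dropped.

```python
import re


def split_html_documents(text):
    """Split text into chunks, each starting at an "<html" opening tag.

    Hypothetical helper, not part of libxml2dom. Text that precedes the
    first "<html" is discarded.
    """
    # Find the offset of every "<html" tag, ignoring case.
    starts = [m.start() for m in re.finditer(r"<html\b", text, re.IGNORECASE)]
    if not starts:
        # No "<html" at all: hand back the input unchanged as one chunk.
        return [text]
    # Each chunk runs from one "<html" up to the next one (or end of text).
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


foo = (
    "<html>\n<body>\n</body>\n</html>\n"
    "\n"
    "<html>\n<body>\n<p>content</p>\n</body>\n</html>\n"
)
parts = split_html_documents(foo)
```

Each element of parts could then be passed to libxml2dom.parseString() on its own, the same way foo is in your test, and you can decide afterwards which of the resulting trees to keep.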
Stefan