Incorrect handling of dt/dl in parsing HTML?
G'day,

I'm trying to parse the "usual" Netscape-format bookmarks.html produced by Firefox (and, recently, Opera). The tree that lxml builds differs from what it should be and from how both Firefox and Opera parse the file. (Sub)folders are represented there as a [dl] inside a [dt], but lxml puts the [dl] next to the [dt], on the same level as the [dt].

Is this a mistake or something intentional?

ciao
svil

(angle brackets replaced with [] below)

    [DL][p]
        [DT][A HREF="hrf1/"]name1[/A]
        [DD]
        [DT][A HREF="hrf2"]name2[/A]
        [DD]
        [DT][H3]folder1[/H3]
        [DL][p]
            [DT][A HREF="hrf11"]name3[/A]
            [DD]dd1
            [DT][H3]folder2[/H3]
            [DL][p]
                [DT][A HREF="hrf22"]name4[/A]
                [DD]dd2
            [/DL][p]
            [DT][A HREF="hrf33"]name5[/A]
            [DD]dd3
        [/DL][p]
    [/DL]
svilen wrote on 20.10.2017 at 18:55:
That makes it a bit difficult to read, but to me it looks like several closing tags are missing. That means that this is actually broken HTML, and that there is no "correct" way to parse this.
I'm surprised by the behaviour you describe, because the current libxml2 does not allow <dl> inside of another <dl>:
https://git.gnome.org/browse/libxml2/tree/HTMLparser.c?h=v2.9.6#n706

This seems correct according to HTML5:
https://www.w3.org/TR/html5/grouping-content.html#the-dl-element

In any case, there is nothing lxml can do about this, since the parsing is done by libxml2.

Stefan
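To check what a particular lxml/libxml2 build actually does with this input, one option is to list the parent element of every dl in the parsed tree. The following is only a minimal sketch: `SAMPLE` is the example from the question with angle brackets restored, and `dl_parents` is a helper name made up for this illustration. The result depends on the installed libxml2 version, so no expected output is given. A parent of `dt` corresponds to the nesting the browsers produce; any other parent means the list was moved out of its dt.

```python
import lxml.html

# Example from the question, with the angle brackets restored.
SAMPLE = """\
<DL><p>
    <DT><A HREF="hrf1/">name1</A>
    <DD>
    <DT><A HREF="hrf2">name2</A>
    <DD>
    <DT><H3>folder1</H3>
    <DL><p>
        <DT><A HREF="hrf11">name3</A>
        <DD>dd1
        <DT><H3>folder2</H3>
        <DL><p>
            <DT><A HREF="hrf22">name4</A>
            <DD>dd2
        </DL><p>
        <DT><A HREF="hrf33">name5</A>
        <DD>dd3
    </DL><p>
</DL>
"""


def dl_parents(markup):
    """Return the tag name of the parent of each <dl>, in document order."""
    # document_fromstring always returns the <html> root, so every <dl>
    # found below it has a parent element.
    root = lxml.html.document_fromstring(markup)
    return [dl.getparent().tag for dl in root.iter("dl")]


# Prints one parent tag per <dl>; the values depend on the libxml2 version.
print(dl_parents(SAMPLE))
```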
Ah, thanks. I tried several libraries, and each one interprets it in its own wrong way. It seems they all expect correct XHTML input, with all the closing tags etc.

For the record, I found that html5lib is the only (Python) library that parses such things as they should be, as the HTML4/1999 spec says. And yes, it is not correct XHTML: it allows things like auto-closing of [p] [dt] [dd] [li] [tr] [td] etc., but it's designed that way for common, comfortable usage. Not very XML-ish.

ciao
svil

On Sat, 21 Oct 2017 16:06:19 +0200 Stefan Behnel <stefan_ml@behnel.de> wrote:

> svilen wrote on 20.10.2017 at 18:55:
>
> That makes it a bit difficult to read, but to me it looks like several closing tags are missing. That means that this is actually broken HTML, and that there is no "correct" way to parse this.
>
> I'm surprised about what you describe as behaviour, because the current libxml2 does not allow <dl> inside of another <dl>: https://git.gnome.org/browse/libxml2/tree/HTMLparser.c?h=v2.9.6#n706
> This seems correct according to HTML5: https://www.w3.org/TR/html5/grouping-content.html#the-dl-element
> In any case, there is nothing lxml can do about this since the parsing is done by libxml2.
>
> Stefan
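When only the folder hierarchy is needed, the dt/dd placement question can be sidestepped by not building a DOM at all and reacting to the tag events instead. The following is a minimal sketch using the standard library's `html.parser`, not something proposed in this thread. It assumes that every H3 names a folder and that the next DL holds that folder's content. `BookmarkParser`, `new_folder`, `parse_bookmarks`, `outline` and `SAMPLE` are names made up for the illustration.

```python
from html.parser import HTMLParser

# Example from the question, with the angle brackets restored.
SAMPLE = """\
<DL><p>
    <DT><A HREF="hrf1/">name1</A>
    <DD>
    <DT><A HREF="hrf2">name2</A>
    <DD>
    <DT><H3>folder1</H3>
    <DL><p>
        <DT><A HREF="hrf11">name3</A>
        <DD>dd1
        <DT><H3>folder2</H3>
        <DL><p>
            <DT><A HREF="hrf22">name4</A>
            <DD>dd2
        </DL><p>
        <DT><A HREF="hrf33">name5</A>
        <DD>dd3
    </DL><p>
</DL>
"""


def new_folder():
    return {"name": "", "links": [], "folders": []}


class BookmarkParser(HTMLParser):
    """Collect links and folders; DT, DD and P tags are simply ignored."""

    def __init__(self):
        super().__init__()
        self.root = new_folder()
        self.stack = [self.root]  # open folders, innermost last
        self.pending = None       # folder named by <H3>, waiting for its <DL>
        self.target = None        # link or folder currently receiving text

    def handle_starttag(self, tag, attrs):
        current = self.stack[-1]
        if tag == "h3":
            self.pending = self.target = new_folder()
            current["folders"].append(self.pending)
        elif tag == "a":
            self.target = {"href": dict(attrs).get("href"), "name": ""}
            current["links"].append(self.target)
        elif tag == "dl":
            # A <DL> opens the folder announced by the preceding <H3>.
            # The outermost <DL> has no <H3>, so the current folder is
            # pushed again to keep pushes and pops balanced.
            if self.pending is not None:
                self.stack.append(self.pending)
            else:
                self.stack.append(current)
            self.pending = None

    def handle_endtag(self, tag):
        if tag in ("a", "h3") and self.target is not None:
            self.target["name"] = self.target["name"].strip()
            self.target = None
        elif tag == "dl" and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if self.target is not None:
            self.target["name"] += data


def parse_bookmarks(markup):
    parser = BookmarkParser()
    parser.feed(markup)
    parser.close()
    return parser.root


def outline(folder, depth=0):
    """Return indented text lines: a folder's links first, then its subfolders."""
    pad = "  " * depth
    lines = [f"{pad}{link['name']} -> {link['href']}" for link in folder["links"]]
    for sub in folder["folders"]:
        lines.append(f"{pad}[{sub['name']}]")
        lines.extend(outline(sub, depth + 1))
    return lines


print("\n".join(outline(parse_bookmarks(SAMPLE))))
```

The sketch drops the DD descriptions and does not keep the relative order of links and subfolders inside a folder. Both could be added by storing the items of a folder in a single list.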
participants (2)
- Stefan Behnel
- svilen