Probable bug with nested lists in v4.3.3
For nested lists in HTML, the nested list is supposed to sit inside an <li> element rather than be a direct child of the <ol> or <ul>. Given this first example of a slightly non-conformant HTML snippet produced by Gmail:
The above code snippet works the way I would expect: even though the input isn't entirely valid HTML, the nested ordered list is still within the outer ordered list. However, if we look at this code snippet below, which is identical except that the nested list is now an unordered list:
Suddenly the unordered list is no longer nested inside the ordered list. Unless there is something I don't know about in the HTML spec prohibiting nested lists of mixed types, this seems both inconsistent with the first example and broken in general. I would instead expect lxml to produce one of the following two trees:

<ol><li>1</li><ul><li>*</li></ul></ol>
<ol><li>1</li><li><ul><li>*</li></ul></li></ol>

Either would be fine, but what currently gets produced seems pretty suboptimal because it substantially changes the meaning of the text.

Alex

--
Alex Krupp
Cell: (607) 351 2671
Read my Email: www.fwdeveryone.com/u/alex3917
Subscribe to my blog: http://alexkrupp.typepad.com/
My homepage: www.alexkrupp.com
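For anyone who wants to test parser output against the expectation above, here is a minimal sketch of a nesting check. It uses only the standard library's xml.etree.ElementTree, which works here because both expected trees happen to be well-formed XML. The helper name inner_lists_are_nested and the "flattened" tree are assumptions made up for illustration; the flattened tree is a hypothetical shape in which the <ul> has escaped the <ol>, not a copy of actual lxml output.

```python
import xml.etree.ElementTree as ET


def inner_lists_are_nested(root, inner="ul", outer="ol"):
    """Return True if every `inner` element has an `outer` ancestor.

    Hypothetical helper for illustration; returns True when the tree
    contains no `inner` elements at all.
    """
    # ElementTree elements carry no parent pointer, so build a parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in root.iter(inner):
        node = parent_of.get(el)
        # Walk upwards until an `outer` element or the top of the tree.
        while node is not None and node.tag != outer:
            node = parent_of.get(node)
        if node is None:
            return False
    return True


# The two trees named in the message as acceptable results.
expected_a = ET.fromstring("<ol><li>1</li><ul><li>*</li></ul></ol>")
expected_b = ET.fromstring("<ol><li>1</li><li><ul><li>*</li></ul></li></ol>")

# Hypothetical flattened shape (an assumption, not real parser output):
# the <ul> is a sibling of the <ol> instead of a descendant.
flattened = ET.fromstring("<div><ol><li>1</li></ol><ul><li>*</li></ul></div>")

for name, tree in [("a", expected_a), ("b", expected_b), ("flat", flattened)]:
    print(name, inner_lists_are_nested(tree))
```

The same function should also accept a tree returned by lxml.html.fromstring(), since it relies only on .iter(), .tag and iteration over children, so it can be pointed at the real parser output for the Gmail snippet.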
On 19.05.2019 at 15:29, Stefan Behnel <stefan_ml@behnel.de> wrote:
There are also alternative HTML parsers that produce lxml trees, such as https://github.com/kovidgoyal/html5-parser. Maybe this one handles your cases better.

--dirk
participants (3)
- Alex Krupp
- Dirk Rothe
- Stefan Behnel