problem with getprevious in lxml
```
from lxml import html
import sys

doc = html.parse(sys.stdin, parser=html.HTMLParser(encoding='utf-8'))
r = doc.xpath(sys.argv[1])[0]
p = r.getprevious()
if p is None:
    print p
else:
    print html.tostring(r.getprevious())
```

I have the Python code above. The following runs show its output. I would expect the first run to print `None`, but it prints `<p>x</p>`. That does not make sense to me. Is it a bug?

```
$ ./main.py //div <<< '<html>x<body><div>abc</div></html>'
<p>x</p>
$ ./main.py //div <<< '<html><body><h1>x<div>abc</div><h1></html>'
None
```

--
Regards,
Peng
Peng Yu wrote on 24.08.19 at 15:38:
The fixups that the parser does to the tree look at least somewhat reasonable to me:

    $ echo '<html>x<body><div>abc</div></html>' | xmllint --html -
    -:1: HTML parser error : htmlParseStartTag: misplaced <body> tag
    <html>x<body><div>abc</div></html>
    ^
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body>
    <p>x</p>
    <div>abc</div>
    </body></html>

    $ echo '<html><body><h1>x<div>abc</div><h1></html>' | xmllint --html -
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><h1>x<div>abc</div>
    <h1></h1>
    </h1></body></html>

There isn't really any "right" or "wrong" when it comes to parsing broken HTML, given that libxml2 only has an HTML 4 parser. If you want more modern fixups, you'll need to resort to an HTML 5 parser instead, e.g. https://pypi.org/project/html5-parser/

Stefan
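A minimal sketch of why the two runs differ, assuming the repaired trees shown in the xmllint output. In lxml, `getprevious()` returns only a preceding sibling *element*. Text that precedes an element is stored in the parent's `.text` (or in a previous sibling's `.tail`), so it never appears as a sibling. The XML strings below are hand-written stand-ins for the repaired trees, not actual parser output, and the variable names are illustrative only.

```python
from lxml import etree

# Stand-in for the repaired first input: the stray "x" was wrapped in <p>,
# so <div> now has an element sibling in front of it.
fixed_first = etree.fromstring("<body><p>x</p><div>abc</div></body>")
prev_first = fixed_first.find("div").getprevious()

# Stand-in for the second input: "x" is plain text inside <h1>,
# so <div> has no preceding element sibling.
fixed_second = etree.fromstring("<h1>x<div>abc</div></h1>")
prev_second = fixed_second.find("div").getprevious()

print(prev_first.tag)     # p
print(prev_second)        # None
print(fixed_second.text)  # x  (the text lives on the parent element)
```

Seen this way, the `<p>x</p>` result follows from the parser's repair of the misplaced text, while `getprevious()` itself behaves the same in both runs.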
If you load the same HTML in a browser and inspect it (right click, then Inspect), the result is different.

In Google Chrome, I see:

    <html><head></head><body>x<div>abc</div></html>

In Firefox, I see:

    <html><head></head><body data-gr-c-s-loaded="true">x<div>abc</div></html>

If lxml has to apply a fixup, I think one similar to what browsers do would be a better alternative.

--
Regards,
Peng
participants (2)
- Peng Yu
- Stefan Behnel