lxml children vs. descendant not working correctly (I think)

Here's a bit of code I was trying to parse:

    <div data-flag="TODO">
      <p>This content is flagged as <code>TODO</code>. It gets a background colour (red, in this case) and the label <code>TODO</code> is rendered into the document margin.</p>
    </div>

I was using

    .//div/child::*

What I got back was a list of three items: <p>, <code>, <code>. Those aren't the children of div! Those are the descendant elements of div. I decided to test this using

    .//div/descendant::*

which gave me the proper <p>, <code>, <code> elements. To further confirm this I used other parsers, and they returned the proper child (p in this case).

How do we go about getting this fixed in lxml?

Hi,

Curious, I tried this out, first creating some data in the file child.py:

    #!/usr/bin/env python3

    import lxml.etree as ET

    t = ET.fromstring(
    """
    <html>
      <body>
        <div data-flag="TODO">
          <p>This content is flagged as <code>TODO</code>. It gets a background colour (red, in this case) and the label <code>TODO</code> is rendered into the document margin.</p>
        </div>
      </body>
    </html>
    """)

And then applied your query:

    for n, i in enumerate(t.xpath(r'.//div/child::*')):
        print(f"[{n}] {i}")

Which gave me the following result:

    ❯ ./child.py
    [0] <Element p at 0x1023c3d00>

However, when experimenting, I put two forward slashes before the child axis, and then I saw the output you describe, i.e.:

    for n, i in enumerate(t.xpath(r'.//div//child::*')):
        print(f"[{n}] {i}")

gives:

    ❯ ./child.py
    [0] <Element p at 0x10f2a3d40>
    [1] <Element code at 0x10f2a3e40>
    [2] <Element code at 0x10f2b4140>

(Although, as far as I can see, that output is correct when the two forward slashes are present: in XPath, `//` abbreviates `/descendant-or-self::node()/`, so `//child::*` selects the children of the div and of every node beneath it.)

The above was performed on Python 3.11.3 and lxml version 4.9.2.

Could a double slash have sneaked into your query before "child"...?

Kind regards
aid

On 29 Jun 2023, at 19:17, wayneb--- via lxml - The Python XML Toolkit <lxml@python.org> wrote:
> Here's a bit of code I was trying to parse:
>
>     <div data-flag="TODO">
>       <p>This content is flagged as <code>TODO</code>. It gets a background colour (red, in this case) and the label <code>TODO</code> is rendered into the document margin.</p>
>     </div>
>
> I was using .//div/child::*
>
> What I got back was a list of three items: <p>, <code>, <code>. Those aren't the children of div! Those are the descendant elements of div. I decided to test this using .//div/descendant::* which gave me the proper <p>, <code>, <code> elements. To further confirm this I used other parsers, and they returned the proper child (p in this case).
>
> How do we go about getting this fixed in lxml?
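
For anyone wanting to check the axes side by side, here is a minimal sketch that runs several spellings of the query against the same fragment. It assumes lxml is installed; the helper name `tags` and the shortened paragraph text are illustrative choices rather than anything from the original messages. The last query spells out what `//` abbreviates according to the XPath 1.0 specification.

```python
from lxml import etree

# Same structure as the fragment in the thread (paragraph text shortened).
doc = etree.fromstring(
    """
<html>
  <body>
    <div data-flag="TODO">
      <p>This content is flagged as <code>TODO</code>. The label
      <code>TODO</code> is rendered into the document margin.</p>
    </div>
  </body>
</html>
"""
)


def tags(expr):
    """Return the tag names matched by an XPath expression (hypothetical helper)."""
    return [el.tag for el in doc.xpath(expr)]


queries = [
    ".//div/child::*",        # direct element children of div
    ".//div/*",               # abbreviated form of the child axis
    ".//div/descendant::*",   # every element below div
    ".//div//child::*",       # children of div and of every node below it
    ".//div/descendant-or-self::node()/child::*",  # long form of the previous query
]

for q in queries:
    print(f"{q:45} {tags(q)}")
```

With a single slash, only `p` should come back; the double-slash form should match the same elements as the `descendant` axis, which would explain the three results in the original report.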
Participants (2): Adrian Bool, wayneb@mac.com