Missing tail in iterparse
Hello,

I have encountered something that feels like a bug in lxml.

Given this minimal parser:
https://gist.github.com/jasonaowen/2c98ebe9515918eebc86eeeb3706e6cf
and this HTML file:
https://innovation.isotropic.org/gamelog/201307/18/game-20130718-235322-1acbb5dc.html

I expect running the parser on the HTML file to include, somewhere in its output, the string "splays their blue cards left", as it is present at line 289 in the HTML file, but it does not. Changing the file by deleting things before that line seems to avoid triggering this issue and causes this to be printed, which is the desired behavior:

('span', {'class': 'age e'}, '4', '.\n... Nnastya splays their blue cards left.\n')

I have reproduced this on two machines:

Ubuntu 16.04.2 LTS
Linux 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Python 3.5.2
LIBXML_COMPILED_VERSION (2, 9, 3)
LIBXML_VERSION (2, 9, 3)
LIBXSLT_COMPILED_VERSION (1, 1, 29)
LIBXSLT_VERSION (1, 1, 29)
LXML_VERSION (3, 7, 3, 0)
(lxml installed via pip)

and

Debian GNU/Linux 8.7 (jessie)
Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u1 (2017-02-22) x86_64 GNU/Linux
Python 3.4.2
LIBXML_COMPILED_VERSION (2, 9, 1)
LIBXML_VERSION (2, 9, 1)
LIBXSLT_COMPILED_VERSION (1, 1, 28)
LIBXSLT_VERSION (1, 1, 28)
LXML_VERSION (3, 7, 3, 0)
(lxml installed via pip)

I also reproduced on the Debian system with the system-installed python3-lxml package, version 3.4.0-1:

LIBXML_COMPILED_VERSION (2, 9, 1)
LIBXML_VERSION (2, 9, 1)
LIBXSLT_COMPILED_VERSION (1, 1, 28)
LIBXSLT_VERSION (1, 1, 28)
LXML_VERSION (3, 4, 0, 0)

Is this a bug in lxml, or am I using it wrong?

Thanks,
Jason
Jason Owen wrote on 17.04.2017 at 05:59:
I have encountered something that feels like a bug in lxml.
Given this minimal parser: https://gist.github.com/jasonaowen/2c98ebe9515918eebc86eeeb3706e6cf and this HTML file: https://innovation.isotropic.org/gamelog/201307/18/game-20130718-235322-1acbb5dc.html
I expect running the parser on the HTML file to include, somewhere in its output, the string "splays their blue cards left", as it is present at line 289 in the HTML file, but it does not. Changing the file by deleting things before that line seems to avoid triggering this issue and causes this to be printed, which is the desired behavior:
('span', {'class': 'age e'}, '4', '.\n... Nnastya splays their blue cards left.\n')
This is a bit of a known quirk. It happens because the incremental parser sees the closing tag, and potentially but not necessarily more of the following content, and then yields the end event for the tag without making sure that the tail string is also completely parsed already.

This could be fixed by making sure that the parser receives the complete tail text data before generating the end event for the element. But that means that there will be extreme cases where it needs to wait for a lot more data than currently that simply isn't going to be seen by anyone, especially because tail text is entirely irrelevant for many use cases.

It could be argued that it's often relevant for HTML parsing, but introducing such a difference between the HTML and XML parsers would easily produce bugs on user side - see your way of shadowing the problem by passing slightly different data.

The implementation that creates parse events from SAX events is in saxparser.pxi. You can take a look at the spots where _pushSaxEndEvent() is called, but the whole machinery is a bit complex overall, e.g. because it allows matching only specific tag names (which shouldn't impact the "tail finished" detection). Postponing the end event creation might not be all that trivial.

Stefan
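To make the mechanism above concrete, here is a minimal sketch that feeds a document in two chunks, with the chunk boundary placed inside the tail text of <a>. The chunk contents and the helper name observe_tails are made up for illustration, and the fallback to the standard library's ElementTree (which offers the same XMLPullParser interface) is only an assumption so that the snippet also runs without lxml. What is seen at the time of the end event depends on the parser and on where the boundary falls, so no particular output is claimed.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback with the same pull-parser API
    import xml.etree.ElementTree as etree


def observe_tails(chunks):
    """Feed chunks one by one and record each element's tail as seen
    at the moment its "end" event is read (hypothetical helper)."""
    parser = etree.XMLPullParser(events=("end",))
    observed = []

    def drain():
        for _event, element in parser.read_events():
            observed.append((element, element.tail))

    for chunk in chunks:
        parser.feed(chunk)
        drain()
    parser.close()
    drain()  # pick up events that only became available on close()
    return observed


if __name__ == "__main__":
    # The first chunk ends in the middle of the tail text of <a>.
    chunks = [b"<root><a>x</a>tail te", b"xt<b/></root>"]
    for element, early_tail in observe_tails(chunks):
        # early_tail: value when the end event was read;
        # element.tail: value after the whole document was parsed.
        print(element.tag, repr(early_tail), "->", repr(element.tail))
```

If the two printed values differ for <a>, that is the quirk described above: the end event was delivered before the tail was complete.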
Thank you for looking into this!
On Fri, Apr 21, 2017 at 9:37 AM, Stefan Behnel wrote:
This is a bit of a known quirk. It happens because the incremental parser sees the closing tag, and potentially but not necessarily more of the following content, and then yields the end event for the tag without making sure that the tail string is also completely parsed already.
Is this behavior documented somewhere? I found it pretty surprising, and searched but could not find anything about it. Also, as a user, is there a way to work around this quirk?
This could be fixed by making sure that the parser receives the complete tail text data before generating the end event for the element. But that means that there will be extreme cases where it needs to wait for a lot more data than currently that simply isn't going to be seen by anyone, especially because tail text is entirely irrelevant for many use cases.
I'll defer to your judgment here, because you have clearly thought about this problem more than I have, but I would love to understand more. Missing data feels like a big deal to me - why is tail text irrelevant for many use cases? In particular, is it often both irrelevant and present in the document to be parsed?
It could be argued that it's often relevant for HTML parsing, but introducing such a difference between the HTML and XML parsers would easily produce bugs on user side - see your way of shadowing the problem by passing slightly different data.
I'm not sure I follow; can you say more?
The implementation that creates parse events from SAX events is in saxparser.pxi. You can take a look at the spots where _pushSaxEndEvent() is called, but the whole machinery is a bit complex overall, e.g. because it allows matching only specific tag names (which shouldn't impact the "tail finished" detection). Postponing the end event creation might not be all that trivial.
I looked at that file, but it's not clear to me how or why the tail is sometimes missing, and I'm afraid I don't have much bandwidth to work on this. I've opened a bug report to track it: https://bugs.launchpad.net/lxml/+bug/1684273 .

Thanks,
Jason
Jason Owen wrote on 21.04.2017 at 18:52:
On Fri, Apr 21, 2017 at 9:37 AM, Stefan Behnel wrote:
This is a bit of a known quirk. It happens because the incremental parser sees the closing tag, and potentially but not necessarily more of the following content, and then yields the end event for the tag without making sure that the tail string is also completely parsed already.
Is this behavior documented somewhere? I found it pretty surprising, and searched but could not find anything about it.
I agree that it's surprising. And no, it's not documented. This would be the usual place to point to, but the warning there doesn't include it: http://lxml.de/parsing.html#modifying-the-tree
Also, as a user, is there a way to work around this quirk?
You could look ahead by one element. That would make sure that the parser has always gone beyond the current element, and thus parsed the complete tail, before you process it.
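One way to sketch this look-ahead is a small generator that always stays one event behind iterparse. The names delay_events and iterparse_delayed are hypothetical, not part of lxml, and the standard-library fallback is only an assumption so the snippet runs without lxml installed.

```python
from collections import deque
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same iterparse interface
    import xml.etree.ElementTree as etree


def delay_events(events, lag=1):
    """Yield items from `events`, staying `lag` items behind the source."""
    pending = deque()
    for item in events:
        pending.append(item)
        if len(pending) > lag:
            # The source has already produced a later event, so the parser
            # has moved past this element and its tail.
            yield pending.popleft()
    # Source exhausted: the whole document has been parsed, flush the rest.
    while pending:
        yield pending.popleft()


def iterparse_delayed(source, **kwargs):
    """Hypothetical wrapper: iterparse with a one-event look-ahead."""
    return delay_events(etree.iterparse(source, **kwargs))


if __name__ == "__main__":
    doc = BytesIO(b"<r><a>x</a>t1<b>y</b>t2</r>")
    for _event, element in iterparse_delayed(doc, events=("end",)):
        print(element.tag, repr(element.tail))
```

For HTML input with lxml, html=True can be passed through the keyword arguments. The cost is that one extra event is held back at any time, so an element should only be cleared after it has been yielded by the wrapper, not when the underlying parser first reports it.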
This could be fixed by making sure that the parser receives the complete tail text data before generating the end event for the element. But that means that there will be extreme cases where it needs to wait for a lot more data than currently that simply isn't going to be seen by anyone, especially because tail text is entirely irrelevant for many use cases.
I'll defer to your judgment here, because you have clearly thought about this problem more than I have, but I would love to understand more. Missing data feels like a big deal to me - why is tail text irrelevant for many use cases? In particular, is it often both irrelevant and present in the document to be parsed?
The bulk of XML document formats only contain text content within elements (.text), not between different elements (.tail), except for formatting whitespace, which is usually irrelevant for the receiver. As I said, it's different for HTML, where both text and tail are important.
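For readers less familiar with the distinction, a tiny illustration of .text versus .tail on mixed content follows. The markup is invented for the example, and the standard-library fallback is an assumption so that it runs without lxml; both libraries share this part of the ElementTree API.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same .text/.tail model
    import xml.etree.ElementTree as etree

root = etree.fromstring("<p>Hello <b>bold</b> world</p>")
b = root.find("b")

print(repr(root.text))  # 'Hello '  (text inside <p>, before its first child)
print(repr(b.text))     # 'bold'    (text inside <b>)
print(repr(b.tail))     # ' world'  (text after </b>, still inside <p>)
print(repr(root.tail))  # None      (nothing follows </p>)
```

In a data-oriented XML format, b.tail would typically be None or whitespace only, which is why the missing tail often goes unnoticed there.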
It could be argued that it's often relevant for HTML parsing, but introducing such a difference between the HTML and XML parsers would easily produce bugs on user side - see your way of shadowing the problem by passing slightly different data.
I'm not sure I follow; can you say more?
Your example shows that it's data dependent whether the problem shows or not. That means that many user tests will not reveal the problem for the code they've written, thus making it error prone. Basically, you wouldn't expect it, your tests won't catch it, your code will be broken without you knowing it, and will fail in production (often even silently) when it hits real world data. And if we fix it for the HTML parser but not for the XML parser, it's even worse because users would expect that even less.
The implementation that creates parse events from SAX events is in saxparser.pxi. You can take a look at the spots where _pushSaxEndEvent() is called, but the whole machinery is a bit complex overall, e.g. because it allows matching only specific tag names (which shouldn't impact the "tail finished" detection). Postponing the end event creation might not be all that trivial.
I looked at that file, but it's not clear to me how or why the tail is sometimes missing, and I'm afraid I don't have much bandwidth to work on this. I've opened a bug report to track it: https://bugs.launchpad.net/lxml/+bug/1684273 .
Thanks. I think this should be fixed inside of lxml.

Stefan
participants (2)
- Jason Owen
- Stefan Behnel