Get text content of an element with iterparse
Hi,

I am trying to emulate the API of QXmlStreamReader using iterparse(). My problem is that I cannot find a reliable way of getting the characters between elements. There are two cases: characters that are element text and characters that are element tail. On a "start" event I can read both text and tail, but the values are not reliable, because the text has not always been parsed at that point. Is there a way around this issue?

Thanks,
Bogdan
Bogdan Cristea, 08.11.2013 11:36:
I am trying to emulate the API of QXmlStreamReader and I am using iterparse().
For reference, Bogdan posted his code here: https://bugs.launchpad.net/lxml/+bug/1249254
My problem is that I cannot find a reliable way of getting the characters between elements. There are two cases: characters that are element text and characters that are element tail. On a "start" event I can read both text and tail, but the values are not reliable, because the text has not always been parsed at that point. Is there a way around this issue?
If you really need the text content when a "start" event is being reported, then you could read one iteration ahead and handle the event before that. That will make sure that any text between the currently reported opening tag and the next tag has been parsed. The same applies to tail text during an "end" event.

Speaking of this approach, I wonder if lxml shouldn't just do that internally. It would make it less easy for users to write broken code, at the cost of a bit more content being held in memory internally on average. The only really problematic thing would be large blocks of text, which may then end up being loaded into memory completely, even if the user stops iterating right before them. Not an unrealistic scenario, though.

Opinions?

Maybe an "ensure_text" option would help?

Stefan
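The "read one iteration ahead" suggestion above could be sketched roughly as follows. This is only an illustration, not code from the thread or from lxml itself: the helper names `delayed_events` and `stream_chars` are made up for this example, and the fallback to the standard library's `xml.etree.ElementTree` is an assumption so that the snippet also runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib parser is close enough for this small demo.
    import xml.etree.ElementTree as etree


def delayed_events(source):
    """Yield (event, element) pairs one step behind the parser.

    An event is only handed out after the parser has reported the
    following one, so the characters up to the next tag have been seen.
    """
    pending = None
    for item in etree.iterparse(source, events=("start", "end")):
        if pending is not None:
            yield pending
        pending = item
    if pending is not None:
        # The last event has nothing after it; report it as is.
        yield pending


def stream_chars(source):
    """Hypothetical helper: pair each event with its adjacent characters.

    "start" events carry the element text, "end" events the element tail.
    """
    for event, elem in delayed_events(source):
        if event == "start":
            yield event, elem.tag, elem.text
        else:
            yield event, elem.tag, elem.tail


xml = b"<root>hello<a>one</a>between<b/>after</root>"
result = list(stream_chars(io.BytesIO(xml)))
for entry in result:
    print(entry)
```

A document this small is parsed in a single chunk, so it does not reproduce the original problem; it only shows the shape of the workaround. Note also that calling `elem.clear()` while an element is still pending may discard the very text or tail this approach is waiting for.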
On 11/08/2013 12:11 PM, Stefan Behnel wrote:
Opinions?
Maybe an "ensure_text" option would help?
Stefan
IMHO a "chars" event would solve this issue completely.
[replying on-list] Bogdan Cristea, 08.11.2013 12:20:
IMHO a "chars" event would solve this issue completely.
No way. Stefan
participants (2)
- Bogdan Cristea
- Stefan Behnel