Bogdan Cristea, 08.11.2013 11:36:
I am trying to emulate the API of QXmlStreamReader and I am using iterparse().
For reference, Bogdan posted his code here: https://bugs.launchpad.net/lxml/+bug/1249254
My problem is that I cannot find a reliable way of getting the characters between the elements. There are two cases: characters that are an element's text and characters that are its tail. When I receive a "start" event I can read both text and tail, but the values are not reliable, because the text has not always been parsed at that point. Is there a way around this issue?
If you really need the text content when a "start" event is being reported, then you could read one iteration ahead and handle the event before that. That will make sure that any text between the currently reported opening tag and the next tag has been parsed. The same applies to tail text during an "end" event.

Speaking of this approach, I wonder if lxml shouldn't just do that internally. It would make it less easy for users to write broken code, at the cost of a bit more content being held in memory internally on average. The only really problematic thing would be large blocks of text, which may then end up being loaded into memory completely, even if the user stops iterating right before them. Not an unrealistic scenario, though.

Opinions? Maybe an "ensure_text" option would help?

Stefan
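
The read-one-ahead idea described above could look roughly like the following. This is only a sketch, not lxml code: it uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies, on the assumption that lxml.etree.iterparse() can be dropped in with the same "events" argument. The helper names lookahead_events() and collect_chars() are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET  # assumption: lxml.etree.iterparse() is used the same way


def lookahead_events(source, events=("start", "end")):
    """Yield (event, element) pairs one iteration late.

    An event is handed out only after the parser has produced the
    following one, so the text after the reported tag (.text on
    "start", .tail on "end") has already been parsed.
    """
    pending = None
    for item in ET.iterparse(source, events=events):
        if pending is not None:
            yield pending
        pending = item
    # The last event has no successor; nothing can follow it anyway.
    if pending is not None:
        yield pending


def collect_chars(source):
    """Hypothetical consumer: record the characters that follow each tag."""
    chars = []
    for event, elem in lookahead_events(source):
        if event == "start":
            chars.append(("text", elem.tag, elem.text))
        else:
            chars.append(("tail", elem.tag, elem.tail))
    return chars


doc = b"<root>a<child>b</child>c<empty/>d</root>"
result = collect_chars(io.BytesIO(doc))
```

The price is the one mentioned above: the parser always runs one tag ahead of the consumer, so a large block of text after the current tag is read into memory before the event for that tag is delivered.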