
Hiya,

I was recently wondering about the best way to edit XML documents using both a streaming reader and writer. I'm sure this is possible using iterparse and xmlfile but I seem to remember that iterparse produces the full tree so that parent elements and their children are returned.

Has anyone had any experience with this?

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D- 40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226

Hi Charlie,

Charlie Clark wrote on 18.01.24 at 12:13:
> I was recently wondering about the best way to edit XML documents using both a streaming reader and writer. I'm sure this is possible using iterparse and xmlfile but I seem to remember that iterparse produces the full tree so that parent elements and their children are returned.
You might want to look into the more general XMLPullParser, but yes, both that and iterparse() build a full XML tree in the background. The idea is that you actively delete parts of it when you're done with them, and in return you gain easy tree navigation. If you need to do somewhat complex and non-local tree transformations, the additional tree building and cleanup work is a price you might want to pay.

Alternatively, for the parsing side, there's also still SAX (i.e. pass a "target" object into the parser). It matches somewhat well with xmlfile(), at the cost of requiring separate callback methods and thus, probably, some state keeping on your side. But depending on the kind of "editing" that you're doing on your XML documents, it might not be too bad.

Basically, lxml can do all the state keeping for you if you let it build a tree (but then you have to clean up after yourself to save memory), or you can do all the state keeping yourself, take the bare parse events, and have full control over the amount of state that you keep. Whichever is better for your use case.

Stefan
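A minimal sketch of the "let lxml build the tree, then delete what you're done with" approach, using XMLPullParser. The function name `rows_of_refs`, the `row`/`c` tag names and the `r` attribute are assumptions chosen for illustration; real worksheet XML is namespaced, so the tags would need to be given in `{namespace}tag` form.

```python
from lxml import etree


def rows_of_refs(chunks, row_tag="row", cell_tag="c"):
    """Yield the "r" attribute of each row's cells, freeing rows as we go."""
    # Only report "end" events for row elements; the cells are reached
    # through ordinary tree navigation once the row is complete.
    parser = etree.XMLPullParser(events=("end",), tag=row_tag)
    for chunk in chunks:
        parser.feed(chunk)
        for _, row in parser.read_events():
            yield [cell.get("r") for cell in row.iter(cell_tag)]
            # Clean up: empty the finished row, then drop the siblings
            # that precede it so the tree does not keep growing.
            row.clear()
            parent = row.getparent()
            if parent is not None:
                while row.getprevious() is not None:
                    del parent[0]
    parser.close()
```

The tree navigation (`row.iter(...)`) comes for free here; the cost is the explicit cleanup after each row.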

On 18 Jan 2024, at 15:48, Stefan Behnel wrote:

Hi Stefan,
> You might want to look into the more general XMLPullParser, but yes, both that and iterparse() generate a full XML tree in the back. The idea is that you actively delete parts of it when you're done with them, but you gain easy tree navigation for that. If you need to do somewhat complex and non-local tree transformations, the additional tree building and cleanup work is a price you might want to pay.
Right, but I guess I'm not sure about what happens to the XML document itself: I only need to delete some attributes and elements. Basically, I'm looking at writing a self-help page for openpyxl users on how they can fix broken OOXML files before they try to pass them to openpyxl. I'm obviously not understanding things quite right, but I do remember that iterparse loops through all elements, so clear() needs to be called at the right level to avoid essentially the same elements being handled twice.

I have a very naive and currently not working example that uses coroutines:

```python
from lxml.etree import iterparse, xmlfile


def parser(sheet_src):
    """Gets a worksheet XML file"""
    xml = iterparse(sheet_src)
    for _, element in xml:
        if element.tag == CELL_TAG:
            element.set("r", None)
        yield element


def writer(output):
    """Gets a BytesIO object or file"""
    with xmlfile(output) as xf:
        try:
            while True:
                el = (yield)
                if el is True:
                    yield xf
                xf.write(el)
        except GeneratorExit:
            pass
```

Apart from the fact that this currently doesn't work, I imagine that both elements and their children would happily be passed to the writer, which could lead to an almighty mess. Getting this to work properly, possibly rewritten for async to avoid the awfully awful `(yield)` hack, could be a nice addition to the documentation.
> Alternatively, for the parsing side, there's also still SAX (i.e. pass a "target" object into the parser). It matches somewhat well with xmlfile(), at the cost of requiring separate callback methods and thus, probably, some state keeping on your side. But depending on the kind of "editing" that you're doing on your XML documents, it might not be too bad.
It's a bit of a blunderbuss, to be honest. But I'm also wondering whether it wouldn't be an idea to switch to a pull parser for my work anyway, because I don't keep the XML objects in memory. This occurred to me after looking at Haki Benita's review of Excel processing libraries:

https://hakibenita.com/fast-excel-python

I know for a fact that the XML parsing (whether it's lxml or xml.etree) is a key determinant here. And, having played around with PugiXML last year, I know that there's a huge penalty when the Python objects are instantiated: the underlying libraries can parse the files in a couple of seconds, but iterating over them in Python takes much, much longer. My guess is that it's probably much the same with libxml2 and would be the same with a putative PyO3 wrapper round Rust's quick_xml. For most parts of an Excel file, this is completely irrelevant, but worksheets can have potentially millions of cell objects, so optimisations here would have tangible benefits.
> Basically, lxml can do all the state keeping for you if you let it build a tree (but then you have to clean up after yourself to save memory), or you choose to do all the state keeping yourself and take the bare parse events, and then have full control over the amount of state that you keep. Whatever is better for your use case.
I have a feeling that it's this "state" thing that I don't really understand…

Charlie

On 18 Jan 2024, at 18:10, Charlie Clark wrote:
> Apart from the fact that this currently doesn't work, I imagine that both Elements and their children would happily be passed to the write, which could lead to an almighty mess. Getting this to work properly, possibly rewritten for async to avoid the awfully awful (yield) hack could be a nice addition to the documentation.
Thinking about this again, I think a pull parser is probably the way to go, as I really don't want or need to create elements. It's probably fine if I just make the changes to what's coming through and stream the text straight back into another file. I'll give that a go.

Charlie

Charlie Clark wrote on 19.01.24 at 15:00:
> On 18 Jan 2024, at 18:10, Charlie Clark wrote:
>> Apart from the fact that this currently doesn't work, I imagine that both Elements and their children would happily be passed to the write, which could lead to an almighty mess. Getting this to work properly, possibly rewritten for async to avoid the awfully awful (yield) hack could be a nice addition to the documentation.
> Thinking about this again, I think a pull parser is probably the way to go as I really don't want or need to create elements, it's probably fine if I just make the changes to what's coming through and stream the text straight back into another file. I'll give that a go.
If you want to avoid creating element objects altogether, and maybe don't even need a full (sub-)tree structure to get all relevant information, I suggest you try the low-level SAX interface.

https://lxml.de/parsing.html#the-target-parser-interface

It's quite efficient and usable for locally constrained XML transformations, e.g. filtering elements or attributes.

And you can still parse input chunk by chunk, if you need that:

https://lxml.de/parsing.html#the-feed-parser-interface

Stefan
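A rough sketch of how the target interface might be paired with xmlfile() to filter attributes and elements without building any tree. The class `FilterTarget`, the function `filter_xml` and their parameters are hypothetical names, not part of lxml. The only state kept is a stack of open output elements and a counter for skipped subtrees. Comments and processing instructions are not copied, and namespaced tags arrive in `{namespace}tag` form, so the output prefixes may differ from the input.

```python
from io import BytesIO
from lxml import etree


class FilterTarget:
    """Parser target that replays parse events into an open xmlfile writer."""

    def __init__(self, writer, drop_attrs=(), drop_tags=()):
        self.writer = writer
        self.drop_attrs = set(drop_attrs)
        self.drop_tags = set(drop_tags)
        self.open_elements = []  # stack of element context managers
        self.skip_depth = 0      # > 0 while inside a dropped element

    def start(self, tag, attrib):
        if self.skip_depth or tag in self.drop_tags:
            self.skip_depth += 1
            return
        kept = {k: v for k, v in attrib.items() if k not in self.drop_attrs}
        ctx = self.writer.element(tag, kept)
        ctx.__enter__()  # writes the opening tag
        self.open_elements.append(ctx)

    def end(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.open_elements.pop().__exit__(None, None, None)  # closing tag

    def data(self, text):
        if not self.skip_depth:
            self.writer.write(text)  # text is escaped by the writer

    def close(self):
        return None


def filter_xml(source, output, chunk_size=65536, **options):
    """Feed source to the parser chunk by chunk, writing filtered XML."""
    with etree.xmlfile(output, encoding="utf-8") as xf:
        parser = etree.XMLParser(target=FilterTarget(xf, **options))
        for chunk in iter(lambda: source.read(chunk_size), b""):
            parser.feed(chunk)
        parser.close()
```

For example, `filter_xml(src, out, drop_attrs=["r"], drop_tags=["ext"])` would copy a binary file-like `src` to `out` while removing every `r` attribute and every `ext` subtree.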

On 21 Jan 2024, at 12:42, Stefan Behnel wrote:

Hi Stefan,
> If you want to avoid creating element objects all together, maybe even don't need a full (sub-)tree structure to get all relevant information, I suggest you try the low-level SAX interface.
> https://lxml.de/parsing.html#the-target-parser-interface
> It's quite efficient and usable for locally constrained XML transformations, e.g. filtering elements or attributes.
> And you can still parse input chunk by chunk, if you need that:
Yes, I've read about both of those but always shied away from them. The ETree interface really is a joy to work with and in most cases is all you need, but I guess I've come across the two edge cases where the performance overhead can be considered an issue.

On a slightly related note, is there any way of getting the parser to treat some attributes as numbers to avoid casting in Python?

Charlie
participants (2)
- Charlie Clark
- Stefan Behnel