[lxml] Re: Streaming read/write

Jan. 18, 2024

On 18 Jan 2024, at 15:48, Stefan Behnel wrote:

Hi Stefan,
...
> You might want to look into the more general XMLPullParser, but yes, 
> both that and iterparse() generate a full XML tree in the back. The 
> idea is that you actively delete parts of it when you're done with 
> them, but you gain easy tree navigation for that. If you need to do 
> somewhat complex and non-local tree transformations, the additional 
> tree building and cleanup work is a price you might want to pay.

Right, but I guess I'm not sure what happens to the XML document 
itself: I only need to delete some attributes and elements. Basically, 
I'm looking at writing a self-help page for openpyxl users on how they 
can fix broken OOXML files before they try to pass them to openpyxl.

I'm obviously not understanding things quite right, but I do remember 
that iterparse() loops through all elements, so clear() needs to be 
called at the right level to avoid essentially the same elements being 
handled twice.
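
A minimal sketch of what "the right level" could mean, assuming the rows of a worksheet are the unit of work. The sample document, the `ROW_TAG`/`CELL_TAG` constants and the `count_cells_per_row` helper are made up for illustration; `tag=` limits the events to rows, so cells are only ever reached through their parent row.

```python
from io import BytesIO
from lxml.etree import iterparse

NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
ROW_TAG = f"{{{NS}}}row"
CELL_TAG = f"{{{NS}}}c"

# Tiny stand-in for a worksheet part (illustrative data only).
SHEET = (
    f'<worksheet xmlns="{NS}"><sheetData>'
    '<row r="1"><c r="A1"><v>1</v></c><c r="B1"><v>2</v></c></row>'
    '<row r="2"><c r="A2"><v>3</v></c></row>'
    '</sheetData></worksheet>'
).encode()

def count_cells_per_row(source):
    """Return {row number: cell count}, clearing each row once it is complete."""
    counts = {}
    # Only "end" events for rows are reported, so no element is handled twice.
    for _, row in iterparse(source, events=("end",), tag=ROW_TAG):
        counts[row.get("r")] = len(row.findall(CELL_TAG))
        row.clear()  # drop the row's children, attributes and text
        # The parent still references earlier (now empty) rows; drop those too.
        while row.getprevious() is not None:
            del row.getparent()[0]
    return counts

result = count_cells_per_row(BytesIO(SHEET))
```

Clearing at the row level keeps memory roughly constant per row, whereas clearing at the cell level would leave nothing for the row's own "end" event to look at.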

I have a very naive and currently not working example that uses 
coroutines:

```python
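from lxml.etree import iterparse, xmlfile

# Assumed: the cell tag in the SpreadsheetML main namespace
CELL_TAG = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}c"

def parser(sheet_src):
    """Gets a worksheet XML file"""
    xml = iterparse(sheet_src)  # default: an "end" event for every element
    for _, element in xml:
        if element.tag == CELL_TAG:
            # set("r", None) raises TypeError; remove the attribute instead
            element.attrib.pop("r", None)
        # Note: this yields children first and their parents later on
        yield element

def writer(output):
    """Gets a BytesIO object or file"""
    with xmlfile(output) as xf:
        try:
            while True:
                el = (yield)
                if el is True:
                    # Hand the open writer to the caller, then wait for
                    # the next element to be sent
                    el = yield xf
                xf.write(el)
        except GeneratorExit:
            # close() was called: leave the with block and finish the output
            pass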
```

Apart from the fact that this currently doesn't work, I imagine that 
both elements and their children would happily be passed to the writer, 
which could lead to an almighty mess. Getting this to work properly, 
possibly rewritten for async to avoid the awfully awful `(yield)` hack, 
could be a nice addition to the documentation.
...
> Alternatively, for the parsing side, there's also still SAX (i.e. pass 
> a "target" object into the parser). It matches somewhat well with 
> xmlfile(), at the cost of requiring separate callback methods and 
> thus, probably, some state keeping on your side. But depending on the 
> kind of "editing" that you're doing on your XML documents, it might 
> not be too bad.

It's a bit of a blunderbuss, to be honest. But I'm also wondering 
whether it wouldn't be an idea to switch to a pull parser for my work 
anyway, because I don't keep the XML objects in memory. This occurred 
to me after looking at Haki Benita's review of Excel processing libraries.

https://hakibenita.com/fast-excel-python
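
For reference, a small sketch of what a pull parser loop might look like with lxml's XMLPullParser, assuming the data arrives in chunks and only the row numbers are of interest. The sample data, the chunk size and the `iter_row_numbers` helper are invented for the example.

```python
from lxml.etree import XMLPullParser

NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
ROW_TAG = f"{{{NS}}}row"

SHEET = (
    f'<worksheet xmlns="{NS}"><sheetData>'
    '<row r="1"><c r="A1"><v>1</v></c><c r="B1"><v>2</v></c></row>'
    '<row r="2"><c r="A2"><v>3</v></c></row>'
    '</sheetData></worksheet>'
).encode()

def iter_row_numbers(chunks):
    """Feed byte chunks to the parser and yield row numbers as rows complete."""
    parser = XMLPullParser(events=("end",), tag=ROW_TAG)

    def drain():
        # Collect whatever events the data fed so far has produced.
        for _, row in parser.read_events():
            yield row.get("r")
            row.clear()  # the tree is still built, so clean up as we go

    for chunk in chunks:
        parser.feed(chunk)
        yield from drain()
    parser.close()
    yield from drain()  # events that only became available on close()

chunks = [SHEET[i:i + 16] for i in range(0, len(SHEET), 16)]
rows = list(iter_row_numbers(chunks))
```

The caller decides when to feed data and when to read events, but a tree is still built behind the scenes, so the same cleanup applies as with iterparse().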

I know for a fact that the XML parsing (whether it's lxml or xml.etree) 
is a key determinant of performance here. And, having played around with 
PugiXML last year, I know that there's a huge penalty when the Python 
objects are instantiated: the underlying libraries can parse the files 
in a couple of seconds, but iterating over them in Python takes much, 
much longer. My guess is that it's probably much the same with libxml2 
and would be the same with a putative PyO3 wrapper around Rust's quick_xml.

For most parts of an Excel file, this is completely irrelevant, but 
worksheets can have potentially millions of cell objects, so 
optimisations here would have tangible benefits.
...
> Basically, lxml can do all the state keeping for you if you let it 
> build a tree (but then you have to clean up after yourself to save 
> memory), or you choose to do all the state keeping yourself and take 
> the bare parse events, and then have full control over the amount of 
> state that you keep. Whatever is better for your use case.

I have a feeling that it's this "state" thing that I don't really 
understand…
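
One way to picture the "state" in question, as a sketch under assumptions: with a parser target no tree is built, so anything that has to be remembered between callbacks (here, which row is currently open) must be stored by hand. The `CellCounter` class and the sample data are hypothetical, and cells are assumed to appear only inside rows.

```python
from lxml.etree import XMLParser

NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
ROW_TAG = f"{{{NS}}}row"
CELL_TAG = f"{{{NS}}}c"

SHEET = (
    f'<worksheet xmlns="{NS}"><sheetData>'
    '<row r="1"><c r="A1"><v>1</v></c><c r="B1"><v>2</v></c></row>'
    '<row r="2"><c r="A2"><v>3</v></c></row>'
    '</sheetData></worksheet>'
).encode()

class CellCounter:
    """Parser target that counts cells per row without building a tree."""

    def __init__(self):
        self.current_row = None  # state: the row we are inside, if any
        self.cells_per_row = {}  # state: the result collected so far

    def start(self, tag, attrib):
        if tag == ROW_TAG:
            self.current_row = attrib.get("r")
            self.cells_per_row[self.current_row] = 0
        elif tag == CELL_TAG:
            # Without a tree there is no getparent(); we rely on our own state.
            self.cells_per_row[self.current_row] += 1

    def end(self, tag):
        if tag == ROW_TAG:
            self.current_row = None

    def close(self):
        # Whatever is returned here is what parser.close() returns.
        return self.cells_per_row

parser = XMLParser(target=CellCounter())
parser.feed(SHEET)
result = parser.close()
```

With iterparse() the same question would be answered by asking the row element for its children; here `current_row` plays that role, and nothing needs to be cleared afterwards.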

Charlie

--
Charlie Clark