
On 18 Jan 2024, at 15:48, Stefan Behnel wrote:

Hi Stefan,
You might want to look into the more general XMLPullParser, but yes, both that and iterparse() generate a full XML tree in the back. The idea is that you actively delete parts of it when you're done with them, but you gain easy tree navigation for that. If you need to do somewhat complex and non-local tree transformations, the additional tree building and cleanup work is a price you might want to pay.
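To make the "build a tree, then delete what you're done with" idea concrete, here is a minimal sketch using `XMLPullParser`. It uses the standard library's `xml.etree.ElementTree`, which offers an `XMLPullParser` with the same `feed()` / `read_events()` shape; the tag names and the `row_refs` helper are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET


def row_refs(chunks, row_tag="row", cell_tag="c"):
    """Feed XML chunk by chunk and collect the cell references of each row.

    `row_tag` and `cell_tag` are illustrative defaults, not taken from any
    real schema; namespaced documents would need "{namespace}row" etc.
    """
    parser = ET.XMLPullParser(events=("end",))
    refs = []
    for chunk in chunks:
        parser.feed(chunk)
        for _, elem in parser.read_events():
            if elem.tag == row_tag:
                # The row is complete here, so normal tree navigation works.
                refs.append([cell.get("r") for cell in elem.iter(cell_tag)])
                # Done with this row: drop its children to keep memory flat.
                elem.clear()
    parser.close()
    return refs
```

The tree is still built behind the scenes, but each row is emptied as soon as it has been handled, so only one row's worth of cells is alive at a time (the emptied row shells stay attached to their parent).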
Right, but I guess I'm not sure about what happens to the XML document itself: I only need to delete some attributes and elements. Basically, I'm looking at writing a self-help page for Openpyxl users on how they can fix broken OOXML files before they try to pass them to Openpyxl.

I'm obviously not understanding things quite right, but I do remember that iterparse loops through all elements, so clear() needs to be called at the right level to avoid essentially the same elements being handled twice. I have a very naive and currently non-working example that uses coroutines:

```python
def parser(sheet_src):
    """Gets a worksheet XML file"""
    xml = iterparse(sheet_src)
    for _, element in xml:
        if element.tag == CELL_TAG:
            element.set("r", None)
        yield element


def writer(output):
    """Gets a BytesIO object or file"""
    with xmlfile(output) as xf:
        try:
            while True:
                el = (yield)
                if el is True:
                    yield xf
                xf.write(el)
        except GeneratorExit:
            pass
```

Apart from the fact that this currently doesn't work, I imagine that both elements and their children would happily be passed to the writer, which could lead to an almighty mess. Getting this to work properly, possibly rewritten for async to avoid the awfully awful `(yield)` hack, could be a nice addition to the documentation.
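One possible shape for the parsing half, as a sketch rather than a tested recipe: yield only completed rows, so that cells are never handed over separately from their parent, and remove the attribute with `attrib.pop()`. As far as I know, neither lxml nor `xml.etree` treats `set("r", None)` as a deletion. The sketch uses the standard library's `iterparse`; `ROW_TAG` and the `cleaned_rows` name are assumptions added for illustration.

```python
import io
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace; ROW_TAG is an addition for this sketch.
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
ROW_TAG = f"{{{NS}}}row"
CELL_TAG = f"{{{NS}}}c"


def cleaned_rows(sheet_src, drop_attr="r"):
    """Yield each completed row once, with `drop_attr` removed from its cells."""
    for _, elem in ET.iterparse(sheet_src, events=("end",)):
        if elem.tag != ROW_TAG:
            # Cells and values are reached through their row below,
            # so they are never passed to the consumer on their own.
            continue
        for cell in elem.iter(CELL_TAG):
            cell.attrib.pop(drop_attr, None)  # delete the attribute if present
        yield elem
        # Runs when the consumer asks for the next row: free this one.
        elem.clear()
```

With lxml, each yielded row could presumably be passed to `xf.write(row)` inside a `with xf.element(...)` block of `xmlfile`, which would replace the coroutine-based writer with an ordinary `for` loop. The enclosing worksheet and sheetData elements would still have to be opened by hand on the writing side.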
Alternatively, for the parsing side, there's also still SAX (i.e. pass a "target" object into the parser). It matches somewhat well with xmlfile(), at the cost of requiring separate callback methods and thus, probably, some state keeping on your side. But depending on the kind of "editing" that you're doing on your XML documents, it might not be too bad.
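For comparison, a callback-based edit can be sketched with the standard library's `xml.sax` standing in for lxml's target interface (that swap is an assumption of this example, not something from the thread). The handler subclasses `XMLGenerator`, so every event is copied straight to the output and only the cell start tags are altered; `DropCellRefs` and `strip_cell_refs` are made-up names.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class DropCellRefs(XMLGenerator):
    """Copy every SAX event to `out`, minus one attribute on cell elements."""

    def __init__(self, out, cell_name="c", drop_attr="r"):
        super().__init__(out, encoding="utf-8", short_empty_elements=True)
        self.cell_name = cell_name  # element name as written in the file
        self.drop_attr = drop_attr

    def startElement(self, name, attrs):
        if name == self.cell_name and self.drop_attr in attrs:
            kept = {k: v for k, v in attrs.items() if k != self.drop_attr}
            attrs = AttributesImpl(kept)
        super().startElement(name, attrs)


def strip_cell_refs(src, out):
    """Stream `src` (file name or file object) to `out` without cell refs."""
    xml.sax.parse(src, DropCellRefs(out))
```

No tree is built at all here, so memory use does not depend on the number of cells. The cost is that anything non-local (for example, a decision that depends on a later sibling) has to be remembered by the handler itself. Element names are matched as they appear in the file, so a prefixed document would need `cell_name="x:c"` or similar.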
It's a bit of a blunderbuss, to be honest. But I'm also wondering whether it wouldn't be an idea to switch to a pull parser for my work anyway, because I don't keep the XML objects in memory. This occurred to me after looking at Haki Benita's review of Excel processing libraries: https://hakibenita.com/fast-excel-python

I know for a fact that the XML parsing (whether it's lxml or xml.etree) is a key determinant here. And, having played around with PugiXML last year, I know that there's a huge penalty when the Python objects are instantiated: the underlying libraries can parse the files in a couple of seconds, but iterating over them in Python takes much, much longer. My guess is that it's probably much the same with libxml2 and would be the same with a putative PyO3 wrapper around Rust's quick_xml. For most parts of an Excel file this is completely irrelevant, but worksheets can have potentially millions of cell objects, so optimisations here would have tangible benefits.
Basically, lxml can do all the state keeping for you if you let it build a tree (but then you have to clean up after yourself to save memory), or you choose to do all the state keeping yourself and take the bare parse events, and then have full control over the amount of state that you keep. Whatever is better for your use case.
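To illustrate what "doing the state keeping yourself" can look like, here is a small sketch of a parser target. The parser only reports bare start/data/end events, so the target has to remember on its own whether it is currently inside a value element. It uses the standard library's `XMLParser(target=...)`; the `CellValues` class and the `"v"` tag default are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


class CellValues:
    """Parser target that collects the text of value elements."""

    def __init__(self, value_tag="v"):
        self.value_tag = value_tag
        self.values = []     # results gathered so far
        self._buffer = None  # state: a list only while inside a value element

    def start(self, tag, attrib):
        if tag == self.value_tag:
            self._buffer = []

    def data(self, text):
        # Text may arrive in several pieces; keep it only inside a value.
        if self._buffer is not None:
            self._buffer.append(text)

    def end(self, tag):
        if tag == self.value_tag and self._buffer is not None:
            self.values.append("".join(self._buffer))
            self._buffer = None

    def close(self):
        return self.values


def read_values(chunks, value_tag="v"):
    """Feed chunks to a target parser and return the collected values."""
    parser = ET.XMLParser(target=CellValues(value_tag))
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()  # returns whatever the target's close() returns
```

The single `_buffer` attribute is the whole "state" in this case. With a tree-building parser the same question ("which element does this text belong to?") is answered by the tree itself, at the cost of creating an object per element.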
I have a feeling that it's this "state" thing that I don't really understand…

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D- 40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226