Optimizing lxml for Handling Large XML Files: Tips and Experiences

I've been using lxml to process large XML files recently, and I'm looking for ways to optimize performance. Specifically, I'm trying to filter specific nodes from large datasets and manage memory usage more effectively. I'd appreciate any tips or best practices from your experiences. Are there any techniques you use to enhance performance, or potential pitfalls I should watch out for? Thanks in advance for your insights!

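On 25 Oct 2024, at 10:10, Lily Parker via lxml - The Python XML Toolkit wrote:

Hi Lily,

Can you explain a little more about what you're trying to do? If you want to manipulate files, you're probably best off combining an iterative, incremental parser with an incremental writer. This is something I used recently to fix broken Excel worksheets, where the incorrect "r" attribute needs removing. You should be able to adapt it to your needs.

```python
from lxml.etree import iterparse, xmlfile

# Assumption: namespaced SpreadsheetML cell tag; change it to the element you need to fix.
CELL_TAG = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}c"


def parser(sheet_src):
    """Yield each parsed element, with the "r" attribute removed from cells."""
    xml = iterparse(sheet_src)
    for _, element in xml:
        if element.tag == CELL_TAG:
            # Remove the incorrect "r" attribute if it is present.
            element.attrib.pop("r", None)
        yield element


def writer_coroutine(output):
    """Coroutine variant: elements passed to send() are written as they arrive."""
    with xmlfile(output) as xf:
        try:
            while True:
                el = (yield)
                if el is True:
                    # True is a control value: hand back the writer, do not write it.
                    yield xf
                    continue
                xf.write(el)
        except GeneratorExit:
            pass


def writer(out_stream, in_stream):
    """Plain variant: write every element from in_stream to out_stream."""
    with xmlfile(out_stream) as xf:
        for el in in_stream:
            xf.write(el)
```

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf D- 40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226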
