On 2/13/19 10:42 AM, René Dudfield wrote:
You can run it as a daemon/server (for example, a little Flask app). This optimization also works for CPython apps if you want to avoid the startup/import time.
That would be a big change for an uncertain improvement, so I'm not willing to go there yet. Performance is okay, but I want to see if I can improve it further as a stand-alone program.
Can the work be split up per XML file easily? Then perhaps multiprocessing will work nicely for you.
Do you need to process all the files each time? Or can you avoid work?
I've tried multiprocessing too, but it is slower. Parsing the XML can be done in parallel, but I suspect the overhead of multiprocessing was the drag. I've also thought about caching the XML parse trees, but serializing and reloading pickles of unchanged parse trees seems slower than just parsing the XML anew. Using lru_cache on text processing functions (e.g., removing accents) didn't help either. I haven't been able to find good examples of people using multiprocessing or PyPy for XML processing; perhaps this is why. Thank you all for the suggestions!
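
For anyone who wants to try the same comparison, here is a minimal sketch of the two ideas mentioned above: lru_cache on an accent-removal helper, and a serial versus multiprocessing.Pool run over XML documents. It is not the code from this thread. The function names (strip_accents, extract_words, process_serial, process_parallel) and the sample document are made up for illustration, and it assumes the stdlib xml.etree.ElementTree parser.

```python
import unicodedata
import xml.etree.ElementTree as ET
from functools import lru_cache
from multiprocessing import Pool


@lru_cache(maxsize=None)
def strip_accents(text):
    """Remove combining marks, e.g. 'René' -> 'Rene' (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def extract_words(xml_text):
    """Parse one XML document and return its accent-stripped words."""
    root = ET.fromstring(xml_text)
    words = []
    for piece in root.itertext():
        words.extend(strip_accents(word) for word in piece.split())
    return words


def process_serial(docs):
    """Parse every document in the current process."""
    return [extract_words(doc) for doc in docs]


def process_parallel(docs, workers=2):
    """Parse documents in worker processes.

    Every input string and every result list is pickled across the
    process boundary, and each worker keeps its own lru_cache.
    """
    with Pool(workers) as pool:
        return pool.map(extract_words, docs)


if __name__ == "__main__":
    import time

    # Made-up sample data; real documents will behave differently.
    docs = ["<doc><p>René café</p><p>naïve résumé</p></doc>"] * 2000
    for fn in (process_serial, process_parallel):
        start = time.perf_counter()
        fn(docs)
        print(fn.__name__, round(time.perf_counter() - start, 3), "s")
```

Which variant wins depends on document size and on how much work is done per document, so the timings are only meaningful on real input.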