Does lxml run under PyPy, and would it make a difference to my project?
I looked at the PyPy site, which says it can be a lot faster in some circumstances. Would it help in my case?
I run quite primitive lxml scripts across very large data sets: 50,000 Early Modern texts that have been linguistically annotated so that every token is a <w> element with a set of attributes. There are a lot of errors in the original annotation, and I use various heuristics to spot and correct them, which mainly involves changing the @lemma, @pos, and @reg attributes.
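For concreteness, a pass of that kind looks roughly like this (the correction table, the @pos values, and the function name are made up for illustration; my real heuristics are more involved):

```python
from lxml import etree

def fix_lemmas(root, corrections):
    """Rewrite @lemma on every <w> element using an {old: new} correction table."""
    changed = 0
    for w in root.iter('w'):
        lemma = w.get('lemma')
        if lemma in corrections:
            w.set('lemma', corrections[lemma])
            changed += 1
    return changed

# Tiny in-memory document standing in for one of the annotated texts.
root = etree.fromstring('<text><w lemma="bee" pos="vvb"/><w lemma="be" pos="vvb"/></text>')
n = fix_lemmas(root, {'bee': 'be'})
```

In the real scripts the tree is parsed from a file, corrected, and serialized back out, so each text is parsed in full whether or not it contains any of the errors being hunted.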
The texts vary in length from 100 KB to 250 MB. It appears to me that building the document tree is the most expensive operation in the enterprise. If an error has 1,000 occurrences but you don't know which texts they occur in, you have to run the script across the entire set. That is an operation that takes between six and eight hours, so you don't want to run it until you've gathered a lot of errors.
Shaving a quarter off that running time wouldn't make much difference. Cutting it in half would be well worth it.
I haven't experimented with running things concurrently. I use PyCharm and could in theory do two concurrent runs, dividing the texts into two groups of 25,000. I have a Mac with 32 GB of memory and a four-core 4 GHz i7 processor. I don't know enough about the internals of the machine to figure out whether the two processes would just get in each other's way.
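Rather than managing two runs by hand, I gather the standard library could do the splitting for me. A minimal sketch with multiprocessing (the file names and the body of process_file are placeholders, not my actual script):

```python
from multiprocessing import Pool

def process_file(path):
    # Stand-in for: parse the file with lxml, apply corrections, write it back.
    return 'done:' + path

def run_parallel(paths, workers=2):
    """Farm the per-file pass out to a pool of worker processes."""
    with Pool(workers) as pool:
        return pool.map(process_file, paths)
```

Called as run_parallel(list_of_paths, workers=2), this would keep two cores busy without my having to divide the texts into groups myself, if separate processes don't just get in each other's way.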
I'll be grateful for any advice.
Martin Mueller
Professor emeritus of English and Classics
Northwestern University