Mailman 3 lxml and pypy - lxml - The Python XML Toolkit

Jan. 21, 2018

      Does lxml run under pypy and would it make a difference to my project?

I looked at pypy where you learn that it can be a lot faster in some circumstances. Would it help me?

I run quite primitive lxml scripts across very large data sets, in particular 50,000 Early Modern texts that have been linguistically annotated so that every token is a <w> element with a set of attributes. There are a lot of errors in the original annotation, and I use various heuristics to spot errors and correct them, which mainly involves changing @lemma, @pos and @reg attributes.

The texts vary in length from 100K to 250MB.  It appears to me that building the document tree is the most expensive operation in the enterprise. If you have an error with 1,000 occurrences but you don't know the texts in which they occur you have to run the script across the entire set.  That's an operation that takes between six and eight hours. So you don't want to run it unless you've gathered a lot of errors.

Shaving a quarter off that running time wouldn't make much difference. Cutting it in half would be well worth it.

I haven't experimented with running things concurrently. I use Pycharm and cut theoretically do two concurrent runs, dividing the texts into two groups of 25,000. I have a Mac with 32 GB of memory and a four core 4 GHz i7 processor.  I don't know enough about the inside of machines to figure out whether the two processes would just get in each other's way.

I'll be grateful for any advice.

Martin Mueller

Professor emeritus of English and Classics

Northwestern University

lxml and pypy

Martin Mueller

Pim van der Eijk (Lists)

Martin Mueller

Charlie Clark

Holger Joukl

Martin Mueller

Holger Joukl

Pim van der Eijk (Lists)

Martin Mueller

Charlie Clark

Holger Joukl

Martin Mueller

Holger Joukl

tags

participants (4)