
Does lxml run under PyPy, and would it make a difference to my project? I looked at PyPy, where you learn that it can be a lot faster in some circumstances. Would it help me?

I run quite primitive lxml scripts across very large data sets, in particular 50,000 Early Modern texts that have been linguistically annotated so that every token is a <w> element with a set of attributes. There are a lot of errors in the original annotation, and I use various heuristics to spot and correct them, which mainly involves changing @lemma, @pos and @reg attributes. The texts vary in length from 100 KB to 250 MB.

It appears to me that building the document tree is the most expensive operation in the enterprise. If you have an error with 1,000 occurrences but don't know the texts in which they occur, you have to run the script across the entire set. That's an operation that takes between six and eight hours, so you don't want to run it unless you've gathered a lot of errors. Shaving a quarter off that running time wouldn't make much difference; cutting it in half would be well worth it.

I haven't experimented with running things concurrently. I use PyCharm and could theoretically do two concurrent runs, dividing the texts into two groups of 25,000. I have a Mac with 32 GB of memory and a four-core 4 GHz i7 processor. I don't know enough about the inside of machines to figure out whether the two processes would just get in each other's way.

I'll be grateful for any advice.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
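A minimal sketch of the kind of attribute-fixing pass described above. The correction table, element names, and flat document structure here are hypothetical stand-ins (real TEI-style files are typically namespaced, which changes the tag lookup); the point is only the shape of the per-token loop:

```python
from lxml import etree

# Hypothetical correction table: (token text, wrong lemma) -> corrected lemma.
# The real heuristics are more involved; this only illustrates the loop.
LEMMA_FIXES = {("loue", "loue"): "love"}

def apply_fixes(tree):
    # Walk every <w> token and patch @lemma in place.
    for w in tree.iter("w"):
        key = (w.text, w.get("lemma"))
        if key in LEMMA_FIXES:
            w.set("lemma", LEMMA_FIXES[key])
    return tree

doc = etree.fromstring('<text><w lemma="loue" pos="vvb">loue</w></text>')
apply_fixes(doc)
print(etree.tostring(doc).decode())
```

Since in-memory attribute changes like this are cheap relative to parsing, the tree-building cost the question mentions is indeed where such a script spends most of its time.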

It works, but last time I checked it was slower than CPython, because the use of the Cython/CPython API (instead of CFFI) breaks many of the PyPy speed-ups. The PyPy team has since improved their CPython API support, so the penalty will be smaller now. But the performance benefits of PyPy are mostly seen with pure Python, or Python and CFFI.

On 21-01-18 16:37, Martin Mueller wrote:
Does lxml run under pypy and would it make a difference to my project?

Which raises the further question of how much of lxml is "pure Python". If I understand it correctly, lxml is a frontend of sorts for libxml2, and it needs libxml2 to do its work. But libxml2 isn't Python. Is the act of building a tree "pure Python"? What about looping over a set of two million tokens in a very long document and changing the @lemma attribute?

On .01.2018 at 18:46, Martin Mueller <martinmueller@northwestern.edu> wrote:
Not really: lxml isn't pure Python. It relies on libxml2 and Cython. Depending on what you're doing, lxml might not be for you. It's fantastic for transforming structures in memory, and its ability to write to a stream is a godsend. In a parsing-heavy environment you might be better off with CPython's etree. At least, that's what I found for parsing Excel files. As always: measure and profile your own project.

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf D-40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226
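"Measure and profile" can be as simple as wrapping the per-file processing in cProfile to see whether parsing, the token loop, or serialisation dominates. A sketch with a synthetic document (the `process` stages are placeholders for the real script):

```python
import cProfile
import io
import pstats

from lxml import etree

def process(xml_bytes):
    # stand-ins for the real parse / fix / serialise stages
    tree = etree.fromstring(xml_bytes)
    for w in tree.iter("w"):
        w.set("reg", w.get("lemma", ""))
    return etree.tostring(tree)

# synthetic input: 10,000 small <w> tokens
sample = b"<text>" + b'<w lemma="x">x</w>' * 10000 + b"</text>"

profiler = cProfile.Profile()
profiler.enable()
process(sample)
profiler.disable()

# print the five most expensive calls by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

If `fromstring`/`tostring` dominate the cumulative times, the bottleneck is in libxml2's C code and neither PyPy nor Python-level tweaks will help much; if the loop dominates, there is room to optimise the Python side.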

Hi,
If the actual processing is more or less a function independently applied to each input text file, it should be very easy to add concurrency with a process or thread pool, mapping over the files using multiprocessing (and maybe a helper library). Threading parallelism in Python is usually hampered by the Global Interpreter Lock for pure-Python code, but lxml is a C extension and releases the GIL where appropriate. So you'll really want to measure what works fastest for your use case.

You might want to look here for an idea: http://chriskiehl.com/article/parallelism-in-one-line/

Try this with different pool sizes and a thread-based (multiprocessing.dummy) vs process-based ("regular" multiprocessing) approach, and measure. joblib seems like a helper library along the same lines; not sure if this would make life any easier: https://pythonhosted.org/joblib/index.html

Holger

Landesbank Baden-Wuerttemberg
Anstalt des oeffentlichen Rechts
Hauptsitze: Stuttgart, Karlsruhe, Mannheim, Mainz
HRA 12704 Amtsgericht Stuttgart
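The thread-vs-process comparison suggested here can be tried with a small synthetic benchmark along these lines (a sketch, not a definitive measurement: the document size and worker counts are made up, and only lxml's C-level parsing can run outside the GIL, while the Python-level token loop still holds it, so thread speed-ups are partial at best):

```python
import time
from multiprocessing.dummy import Pool as ThreadPool  # threads, same Pool API

from lxml import etree

# one synthetic "document" per task; real inputs would be file paths
DOC = b"<text>" + b'<w lemma="x" pos="n">x</w>' * 50000 + b"</text>"

def parse_and_touch(_):
    # parse (GIL released inside libxml2), then a Python-level token loop
    tree = etree.fromstring(DOC)
    for w in tree.iter("w"):
        w.set("reg", w.get("lemma"))
    return len(tree)

for workers in (1, 4):
    pool = ThreadPool(workers)
    start = time.perf_counter()
    pool.map(parse_and_touch, range(8))
    pool.close()
    pool.join()
    print(workers, "threads:", round(time.perf_counter() - start, 2), "s")
```

Swapping `multiprocessing.dummy` for plain `multiprocessing` gives the process-based variant with the same API, which sidesteps the GIL entirely at the cost of per-process memory.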

Thank you for your advice. I tried running two separate instances of lxml in PyCharm. From a casual observation of time spent on this or that, it appears that the two processes run at approximately the expected speed. Which means that I can halve the time with procedures I understand.

MM

If need be, you might well be able to further cut down execution times with more parallelization, e.g. something like

    import multiprocessing
    # import multiprocessing.dummy  # for using threads instead of processes

    # Your worker function that does the lxml processing of
    # a single input file
    def fix_annotation(filepath):
        ...

    # your list of input files (as retrieved from the command line,
    # by inspecting a directory or whatever...)
    filepaths = [ ... ]

    # The number of workers defaults to the number of CPU cores of your
    # machine; try different/higher numbers here, e.g.
    #     pool = multiprocessing.Pool(8)
    # Also, try using a thread pool instead of a process pool:
    #     pool = multiprocessing.dummy.Pool()
    pool = multiprocessing.Pool()
    results = pool.map(fix_annotation, filepaths)

    # no more tasks will be submitted; workers exit once pending
    # tasks are finished
    pool.close()
    # wait for the worker processes to exit
    pool.join()

Holger

Landesbank Baden-Wuerttemberg
Anstalt des oeffentlichen Rechts
Hauptsitze: Stuttgart, Karlsruhe, Mannheim, Mainz
HRA 12704 Amtsgericht Stuttgart
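One practical note when adapting a sketch like the one above: on platforms where multiprocessing starts workers by re-importing the main module (macOS and Windows with the spawn start method), the pool setup must sit under an `if __name__ == "__main__":` guard. A hypothetical driver, where the `texts/` directory and the trivial worker body are placeholders for the real corpus and processing:

```python
import glob
import multiprocessing

def fix_annotation(filepath):
    # placeholder for the real lxml processing of one file
    return filepath.upper()

if __name__ == "__main__":
    # hypothetical corpus location; adjust to the real layout
    filepaths = sorted(glob.glob("texts/**/*.xml", recursive=True))
    # the context manager closes and joins the pool on exit
    with multiprocessing.Pool() as pool:
        results = pool.map(fix_annotation, filepaths)
    print(len(results), "files processed")
```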

participants (4)
- Charlie Clark
- Holger Joukl
- Martin Mueller
- Pim van der Eijk (Lists)