lxml and memory consumption

Hello,
We are writing an ETL tool and we use lxml to parse a lot of XML files. The problem we are having is that lxml uses a considerable amount of memory which it doesn't release. I've already disabled caching of IDs. I've read in the archives of this list that lxml also caches a lot of other strings. By itself this isn't a problem, but the fact that this cache isn't cleared when lxml is "done" is a problem: once the XML files have been imported, the other stages in the ETL pipeline also need memory.
Is there a way to clear this cache from client code?
TIA, Wietse

Hi Wietse,
maybe you are keeping references to lxml _ElementStringResults. For example:
    s = myelement.xpath('text()')[0]
This looks like a string but keeps a reference to myelement and therefore keeps the full tree alive.
--dirk
On 11.09.2018 at 10:46, Wietse Jacobs <wietse.j@gmail.com> wrote:
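A minimal sketch of the effect Dirk describes, and two ways around it. The sample document and the variable names are made up for illustration. On Python 3 the result class is typically called `_ElementUnicodeResult` rather than `_ElementStringResult`; the behaviour is the same.

```python
from lxml import etree

# Hypothetical sample document, only for illustration.
root = etree.fromstring("<doc><item>hello</item></doc>")
item = root[0]

# Default behaviour: XPath text results are "smart strings" that
# remember the element they came from, which keeps the tree alive.
smart = item.xpath("text()")[0]
parent_of_smart = smart.getparent()

# Option 1: copy the value into a plain str, which holds no tree reference.
plain = str(smart)

# Option 2: ask XPath not to build smart strings in the first place.
plain2 = item.xpath("text()", smart_strings=False)[0]
```

If the extracted values are stored in long-lived structures (dicts, lists, queues for later pipeline stages), storing `plain` or `plain2` instead of `smart` should let the parsed tree be freed once the last element reference is gone.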

On Tue, 2018-09-11 at 15:48 +0200, Dirk Rothe wrote:
This looks like a string but keeps a reference to myelement and therefore keeps the full tree alive.
This! It is easy to accidentally create references. I also maintain an ETL tool which uses LXML extensively; and it creates and processes XML files as large as 9GB. With attention to detail, memory utilization is reasonable.
Additional advice: if possible, stream the XML rather than using the document model. SAX is old, but still wonderful.
I also make extensive use of XSLT transforms. Leaning on XSLT, and the libxslt library, is a very solid and **fast** way to approach data transformations.
Lastly, in particular operations it is useful to call gc.collect() manually.
-- Adam Tauno Williams <mailto:awilliam@whitemice.org> GPG D95ED383 OpenGroupware Developer <http://www.opengroupware.us/>
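As a rough illustration of the XSLT and gc.collect() suggestions above, the sketch below compiles a stylesheet once with `etree.XSLT`, copies the result into a plain string, drops the tree references, and then runs a manual collection. The stylesheet, the `rows`/`row` element names, and the CSV-like output are invented for the example and are not taken from Adam's tool.

```python
import gc
from lxml import etree

# Hypothetical stylesheet: flattens <row> elements into "id,name" lines.
stylesheet = etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/rows">
    <xsl:for-each select="row">
      <xsl:value-of select="@id"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(stylesheet)  # compile once, reuse for every input file

source = etree.XML(
    "<rows><row id='1'><name>a</name></row>"
    "<row id='2'><name>b</name></row></rows>"
)
result = transform(source)

# Keep only a plain string; it does not reference either tree.
csv_text = str(result)

# Drop the tree references explicitly, then collect any leftover cycles.
del source, result
gc.collect()
```

The transformation itself runs inside libxslt, so no Python objects are created for intermediate nodes. Whether the manual `gc.collect()` helps depends on whether reference cycles are involved, so it is worth measuring rather than assuming.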

Hi,
This! It is easy to accidentally create references. I also maintain an ETL tool which uses LXML extensively; and it creates and processes
Just out of curiosity: Is this tool open or closed source? I happen to maintain a proprietary Integrations tool that uses lxml.objectify and would love to hear a management-convincing "open source it" story. Best, Holger

On .09.2018 at 21:57, <jholg@gmx.de> wrote:
Unless the tool is exactly how the business makes its money, the argument against keeping it proprietary is that only in-house resources will ever go into developing and maintaining it. This is as much about risk as cost: while it is likely that the company will remain the main contributor to the project (you can't just open source something and expect magic things to happen), staying closed may mean missing opportunities and pitfalls that others might spot, and there are costs associated with keeping something closed source: hosting, security, etc. Any open source project stands to benefit from peer review, but avoid any discussion that open source is somehow intrinsically good, because whether it's true or not is irrelevant to the business.
Charlie
-- Charlie Clark Managing Director Clark Consulting & Research German Office Kronenstr. 27a Düsseldorf D- 40217 Tel: +49-211-600-3657 Mobile: +49-178-782-6226

On .09.2018 at 10:46, Wietse Jacobs <wietse.j@gmail.com> wrote:
Is there a way to clear this cache from client code?
How are you doing the parsing? I assume you're using iterparse, in which case it's essential to .clear() elements after parsing them.
Charlie
-- Charlie Clark Managing Director Clark Consulting & Research German Office Kronenstr. 27a Düsseldorf D- 40217 Tel: +49-211-600-3657 Mobile: +49-178-782-6226
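A sketch of the iterparse-plus-clear() pattern Charlie refers to, assuming the input consists of many repeated record elements under one root. The helper name `iter_records`, the `row` tag, and the sample data are hypothetical.

```python
import io
from lxml import etree

def iter_records(source, tag):
    """Yield a plain dict per <tag> element, freeing each element afterwards."""
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        # Copy out what is needed as plain Python objects before clearing.
        record = {child.tag: child.text for child in elem}
        yield record
        # Release the element's children, text and attributes.
        elem.clear()
        # The root still holds the emptied, already-processed siblings;
        # delete them so the tree does not grow with the file size.
        while elem.getprevious() is not None:
            del elem.getparent()[0]

# Usage with an in-memory sample; a file path or open file works the same way.
sample = io.BytesIO(
    b"<rows><row><id>1</id><name>a</name></row>"
    b"<row><id>2</id><name>b</name></row></rows>"
)
records = list(iter_records(sample, "row"))
```

Calling `.clear()` alone leaves an empty element attached to its parent for every record, which is why the sketch also removes preceding siblings. This only pays off if the yielded data holds no references back into the tree.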

Wietse Jacobs wrote on 11.09.2018 at 10:46:
You are probably referring to the global tag name cache. That's usually not an issue because applications do not tend to process an unlimited amount of structurally different XML documents.
You'd have to be more specific to help us understand what you are doing. How do you load the XML files? How do you process them? What data do you keep after processing? Stefan

participants (6): Adam Tauno Williams, Charlie Clark, Dirk Rothe, jholg@gmx.de, Stefan Behnel, Wietse Jacobs