Re: [lxml] Memory leak when parsing XML files in sequence?

In-Reply-To=<4EA045E4.9020508@anaproy.nl>

Have there been any advancements regarding this memory leak?

I built the newest version of lxml (as well as its dependencies) and the problem is still there. I was able to track it down using umdh on Windows:

etree!xmlDictLookup+0000025E (c:\tmp\libxml2-win-binaries\libxml2\dict.c, 933)
etree!xmlHashAddEntry3+00000053 (c:\tmp\libxml2-win-binaries\libxml2\hash.c, 532)
etree!xmlHashAddEntry+00000014 (c:\tmp\libxml2-win-binaries\libxml2\hash.c, 377)
etree!xmlAddID+0000011D (c:\tmp\libxml2-win-binaries\libxml2\valid.c, 2632)
etree!xmlSAX2AttributeInternal+0000078A (c:\tmp\libxml2-win-binaries\libxml2\sax2.c, 1411)
etree!xmlSAX2StartElement+000002AE (c:\tmp\libxml2-win-binaries\libxml2\sax2.c, 1743)
etree!htmlParseStartTag+00000579 (c:\tmp\libxml2-win-binaries\libxml2\htmlparser.c, 3926)
etree!htmlParseElementInternal+00000069 (c:\tmp\libxml2-win-binaries\libxml2\htmlparser.c, 4467)
etree!htmlParseContentInternal+000003D3 (c:\tmp\libxml2-win-binaries\libxml2\htmlparser.c, 4652)
etree!htmlParseDocument+000002A2 (c:\tmp\libxml2-win-binaries\libxml2\htmlparser.c, 4818)
etree!htmlDoRead+00000094 (c:\tmp\libxml2-win-binaries\libxml2\htmlparser.c, 6786)
etree!htmlCtxtReadMemory+00000093 (c:\tmp\libxml2-win-binaries\libxml2\htmlparser.c, 7072)
etree!__pyx_f_4lxml_5etree_11_BaseParser__parseUnicodeDoc+0000028A (c:\tmp\lxml-3.6.4\src\lxml\lxml.etree.c, 109222)
etree!__pyx_f_4lxml_5etree__parseDoc+0000041F (c:\tmp\lxml-3.6.4\src\lxml\lxml.etree.c, 115220)
etree!__pyx_f_4lxml_5etree__parseMemoryDocument+000000E6 (c:\tmp\lxml-3.6.4\src\lxml\lxml.etree.c, 116674)
etree!__pyx_pf_4lxml_5etree_22fromstring+00000086 (c:\tmp\lxml-3.6.4\src\lxml\lxml.etree.c, 77737)
etree!__pyx_pw_4lxml_5etree_23fromstring+00000294 (c:\tmp\lxml-3.6.4\src\lxml\lxml.etree.c, 77687)
python33!PyCFunction_Call+000000F3 (c:\users\martin\33.amd64\python\objects\methodobject.c, 84)
python33!PyObject_Call+00000061 (c:\users\martin\33.amd64\python\objects\abstract.c, 2036)
python33!ext_do_call+00000295 (c:\users\martin\33.amd64\python\python\ceval.c, 4381)
python33!PyEval_EvalFrameEx+00002041 (c:\users\martin\33.amd64\python\python\ceval.c, 2723)
python33!PyEval_EvalCodeEx+0000065C (c:\users\martin\33.amd64\python\python\ceval.c, 3436)
python33!function_call+0000015D (c:\users\martin\33.amd64\python\objects\funcobject.c, 639)
python33!PyObject_Call+00000061 (c:\users\martin\33.amd64\python\objects\abstract.c, 2036)
python33!ext_do_call+00000295 (c:\users\martin\33.amd64\python\python\ceval.c, 4381)
python33!PyEval_EvalFrameEx+00002041 (c:\users\martin\33.amd64\python\python\ceval.c, 2723)
python33!PyEval_EvalCodeEx+0000065C (c:\users\martin\33.amd64\python\python\ceval.c, 3436)
python33!fast_function+0000014D (c:\users\martin\33.amd64\python\python\ceval.c, 4168)
python33!call_function+00000339 (c:\users\martin\33.amd64\python\python\ceval.c, 4088)
python33!PyEval_EvalFrameEx+00001F98 (c:\users\martin\33.amd64\python\python\ceval.c, 2681)

For reference, here are my version numbers:

Python : sys.version_info(major=3, minor=3, micro=5, releaselevel='final', serial=0)
lxml.etree : (3, 6, 4, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

Should I open a new bug?

Thanks!

Benoit Bernard
https://benbernardblog.com

Hi!

Benoit Bernard wrote on 23.11.2016 at 19:44:
Have there been any advancements regarding this memory leak?
I built the newest version of lxml (as well as its dependencies) and the problem is still there. I was able to track it down using umdh on Windows:
etree!xmlDictLookup+0000025E (c:\tmp\libxml2-win-binaries\libxml2\dict.c, 933)
etree!xmlHashAddEntry3+00000053 (c:\tmp\libxml2-win-binaries\libxml2\hash.c, 532)
etree!xmlHashAddEntry+00000014 (c:\tmp\libxml2-win-binaries\libxml2\hash.c, 377)
etree!xmlAddID+0000011D (c:\tmp\libxml2-win-binaries\libxml2\valid.c, 2632)
etree!xmlSAX2AttributeInternal+0000078A (c:\tmp\libxml2-win-binaries\libxml2\sax2.c, 1411)
etree!xmlSAX2StartElement+000002AE (c:\tmp\libxml2-win-binaries\libxml2\sax2.c, 1743)

By default, lxml configures the parser to collect and remember IDs used in the documents. The dict that stores the names is shared globally in order to reduce overall memory consumption across documents. You can disable this for ID names by creating a parser with the option collect_ids=False.

Stefan
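The suggestion above can be sketched roughly as follows. This is a minimal illustration, not a confirmed fix for the growth shown in the trace: it assumes an lxml version whose XMLParser accepts collect_ids, and the names PARSER, parse_many and documents are hypothetical, invented for this example.

```python
from lxml import etree

# One parser, reused for every document. collect_ids=False asks lxml not to
# record ID attributes (for example xml:id) while parsing.
PARSER = etree.XMLParser(collect_ids=False)


def parse_many(documents):
    """Parse each document in turn and return the root tag names seen."""
    seen_tags = []
    for text in documents:
        root = etree.fromstring(text, PARSER)
        seen_tags.append(root.tag)
    return seen_tags


# Hypothetical input: every document carries a different ID value.
documents = ['<root><item xml:id="n%d"/></root>' % i for i in range(100)]
tags = parse_many(documents)
```

The attributes themselves are still parsed and remain available on the elements; only the collection of IDs is switched off. Whether this removes the growth entirely would need to be measured again with umdh.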

In-Reply-To=<4a9c3a57-3cc6-117f-f2f6-0542c7adce16@behnel.de>

Thanks a lot for your reply! Oh, interesting, I didn't know that such a property exists!

However, the problem is that I'm parsing HTML pages using html.fromstring(), which uses an HTMLParser under the hood. Unlike XMLParser, HTMLParser hardcodes the value of collect_ids in its constructor:

    cdef class XMLParser(_FeedParser):
        def __init__(self, *, encoding=None, attribute_defaults=False,
                     dtd_validation=False, load_dtd=False, no_network=True,
                     ns_clean=False, recover=False, XMLSchema schema=None,
                     huge_tree=False, remove_blank_text=False,
                     resolve_entities=True, remove_comments=False,
                     remove_pis=False, strip_cdata=True, collect_ids=True,
                     target=None, compact=True):
            ...
            _BaseParser.__init__(self, parse_options, 0, schema,
                                 remove_comments, remove_pis, strip_cdata,
                                 collect_ids, target, encoding)

    cdef class HTMLParser(_FeedParser):
        def __init__(self, *, encoding=None, remove_blank_text=False,
                     remove_comments=False, remove_pis=False, strip_cdata=True,
                     no_network=True, target=None, XMLSchema schema=None,
                     recover=True, compact=True):
            ...
            _BaseParser.__init__(self, parse_options, 1, schema,
                                 remove_comments, remove_pis, strip_cdata,
                                 True, target, encoding)

Is there any way to set collect_ids=False when we use HTMLParser?

Thanks!

On Sat, Nov 26, 2016 at 8:34 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Hi!

Benoit Bernard wrote on 23.11.2016 at 19:44:
Have there been any advancements regarding this memory leak?
I built the newest version of lxml (as well as its dependencies) and the problem is still there. I was able to track it down using umdh on Windows:
etree!xmlDictLookup+0000025E (c:\tmp\libxml2-win-binaries\libxml2\dict.c, 933)
etree!xmlHashAddEntry3+00000053 (c:\tmp\libxml2-win-binaries\libxml2\hash.c, 532)
etree!xmlHashAddEntry+00000014 (c:\tmp\libxml2-win-binaries\libxml2\hash.c, 377)
etree!xmlAddID+0000011D (c:\tmp\libxml2-win-binaries\libxml2\valid.c, 2632)
etree!xmlSAX2AttributeInternal+0000078A (c:\tmp\libxml2-win-binaries\libxml2\sax2.c, 1411)
etree!xmlSAX2StartElement+000002AE (c:\tmp\libxml2-win-binaries\libxml2\sax2.c, 1743)

By default, lxml configures the parser to collect and remember IDs used in the documents. The dict that stores the names is shared globally in order to reduce overall memory consumption across documents.
You can disable this for ID names by creating a parser with the option collect_ids=False.

Stefan
_________________________________________________________________ Mailing list for the lxml Python XML toolkit - http://lxml.de/ lxml@lxml.de https://mailman-mail5.webfaction.com/listinfo/lxml

On 11/28/16 20:54, Benoit Bernard wrote:
Is there any way to set collect_ids=False when we use HTMLParser?

FYI, I just sent a pull request: https://github.com/lxml/lxml/pull/216

Best,
Burak
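If the installed lxml lets HTMLParser take the same collect_ids keyword as XMLParser, the HTML case from the question could be handled along these lines. That availability is an assumption here: the 3.6.4 constructor quoted earlier in the thread does not accept the keyword, so the sketch falls back to the default parser when it is rejected. The helper name make_html_parser and the sample page are hypothetical.

```python
import lxml.html


def make_html_parser():
    """Return an HTML parser, skipping ID collection where supported."""
    try:
        # Assumption: this lxml build accepts collect_ids on HTMLParser.
        return lxml.html.HTMLParser(collect_ids=False)
    except TypeError:
        # Builds with the constructor quoted in the thread reject the
        # keyword; fall back to the default (ID-collecting) behaviour.
        return lxml.html.HTMLParser()


parser = make_html_parser()

# Hypothetical page content, used only to exercise the parser.
page = '<html><body><div id="main">hello</div></body></html>'
root = lxml.html.fromstring(page, parser=parser)
div_text = root.findtext(".//div")
```

Passing the parser explicitly to html.fromstring() keeps the rest of the calling code unchanged, and lxml.html.HTMLParser (rather than etree.HTMLParser) preserves the HTML element classes that html.fromstring() normally returns.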
participants (3)
- Benoit Bernard
- Burak Arslan
- Stefan Behnel