In-Reply-To=<4a9c3a57-3cc6-117f-f2f6-0542c7adce16@behnel.de>

Thanks a lot for your reply!

Oh, interesting, I didn't know that such a property exists!

However, the problem is that I'm parsing HTML pages using html.fromstring(), which uses an HTMLParser under the hood.
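
For context, here is a minimal sketch of that setup as I understand it. It assumes that lxml.html exposes its default parser as html.html_parser and that html.fromstring() accepts a parser= argument; the sample markup is made up for illustration.

```python
from lxml import etree, html

# The module-level default parser that html.fromstring() falls back to
# when no parser is passed explicitly.
default_parser = html.html_parser
print(isinstance(default_parser, etree.HTMLParser))

# A custom parser instance can be passed explicitly instead.
# (The markup below is only an illustrative sample.)
custom_parser = html.HTMLParser()
root = html.fromstring('<p id="a">hello</p>', parser=custom_parser)
print(root.tag, root.get("id"))
```

So swapping the parser is easy; what matters is which options the HTMLParser constructor accepts.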

HTMLParser hardcodes collect_ids to True in its constructor, unlike XMLParser, which accepts it as a keyword argument:

cdef class XMLParser(_FeedParser):
    def __init__(self, *, encoding=None, attribute_defaults=False,
                 dtd_validation=False, load_dtd=False, no_network=True,
                 ns_clean=False, recover=False, XMLSchema schema=None,
                 huge_tree=False, remove_blank_text=False, resolve_entities=True,
                 remove_comments=False, remove_pis=False, strip_cdata=True,
                 collect_ids=True, target=None, compact=True):
        ...
        _BaseParser.__init__(self, parse_options, 0, schema,
                             remove_comments, remove_pis, strip_cdata,
                             collect_ids, target, encoding)
                             
cdef class HTMLParser(_FeedParser):
    def __init__(self, *, encoding=None, remove_blank_text=False,
                 remove_comments=False, remove_pis=False, strip_cdata=True,
                 no_network=True, target=None, XMLSchema schema=None,
                 recover=True, compact=True):
        ...
        _BaseParser.__init__(self, parse_options, 1, schema,
                             remove_comments, remove_pis, strip_cdata, True,
                             target, encoding)
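
With XMLParser, the option can be passed directly. A minimal sketch, assuming an lxml version whose XMLParser has the collect_ids keyword shown above (the XML snippet is made up):

```python
from lxml import etree

# collect_ids=False asks the parser not to collect and remember IDs
# while building the tree.
parser = etree.XMLParser(collect_ids=False)

# Illustrative input only; any XML with id attributes would do.
root = etree.fromstring('<root><item id="a1">text</item></root>', parser)

# The attribute itself is still parsed and readable as usual.
item = root[0]
print(item.get("id"))
```

This is what I would like to do for HTML as well.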

Is there any way to set collect_ids=False when we use HTMLParser?

Thanks!

On Sat, Nov 26, 2016 at 8:34 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Hi!

Benoit Bernard schrieb am 23.11.2016 um 19:44:
> Have there been any advancements regarding this memory leak?
>
> I built the newest version of lxml (as well as its dependencies) and the
> problem is still there. I was able to track it down using umdh on Windows:
>
> etree!xmlDictLookup+0000025E (c:\tmp\libxml2-win-binaries\libxml2\dict.c, 933)
> etree!xmlHashAddEntry3+00000053 (c:\tmp\libxml2-win-binaries\libxml2\hash.c, 532)
> etree!xmlHashAddEntry+00000014 (c:\tmp\libxml2-win-binaries\libxml2\hash.c, 377)
> etree!xmlAddID+0000011D (c:\tmp\libxml2-win-binaries\libxml2\valid.c, 2632)
> etree!xmlSAX2AttributeInternal+0000078A (c:\tmp\libxml2-win-binaries\libxml2\sax2.c, 1411)
> etree!xmlSAX2StartElement+000002AE (c:\tmp\libxml2-win-binaries\libxml2\sax2.c, 1743)

By default, lxml configures the parser to collect and remember IDs used in
the documents. The dict that stores the names is shared globally in order
to reduce overall memory consumption across documents.

You can disable this for ID names by creating a parser with the option
collect_ids=False.

Stefan
