Hello,
for a while now we have been seeing excessive memory usage in our
application. Yesterday a colleague and I were able to track down the
cause of the memory loss: a cache stores etree instances and uses
deepcopy() to hand out a copy of the tree. The cache is filled and
accessed from multiple threads, hence the need for deepcopy().
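The pattern looks roughly like this (a minimal sketch with made-up
names, not our actual code):

    import threading
    from copy import deepcopy
    from lxml import etree

    class EtreeCache(object):
        # hands out a private deepcopy per reader, so that no two
        # threads ever touch the same underlying tree
        def __init__(self):
            self._lock = threading.Lock()
            self._trees = {}

        def put(self, key, path):
            with self._lock:
                self._trees[key] = etree.parse(path)

        def get(self, key):
            with self._lock:
                return deepcopy(self._trees[key])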
The amount of lost memory depends on the number of threads and, up to
a point, on the number of copy operations in each thread. I've
attached a sample script to reproduce the issue. I've used libxml2's
debug function xmlMemUsed() to verify that the memory isn't lost
inside libxml2. I've also tested several thousand deepcopy() ops in a
single thread and several thousand etree.parse() ops in several
hundred threads; neither showed a similar memory loss. According to my
colleague Dirk Rothe, there is no visible memory loss on Windows.
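The attached script boils down to roughly the following (a condensed
sketch, not the attachment itself; RSS is read from /proc, so
Linux-only):

    import sys
    import threading
    from copy import deepcopy
    from lxml import etree

    def rss_mb():
        # current resident set size in MB (VmRSS is reported in kB)
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1]) / 1024.0

    def worker(tree, n_copies):
        for _ in range(n_copies):
            deepcopy(tree)  # the copy is discarded right away

    if __name__ == '__main__':
        n_threads, n_copies = int(sys.argv[1]), int(sys.argv[2])
        tree = etree.parse('document.xml')  # any ~7 KB document
        print(rss_mb())
        threads = [threading.Thread(target=worker, args=(tree, n_copies))
                   for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print(rss_mb())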
Example output with a 7 KB document
-----------------------------------
The first number is the RSS of the process in MB after the document has
been parsed. The second number is the RSS after all threads have stopped.
$ python etree_deepcopy.py 50 100
10.875
21.1796875
50 threads, 100 copy ops per thread
$ python etree_deepcopy.py 50 1000
10.875
21.1875
50 threads, 1000 copy ops per thread
$ python etree_deepcopy.py 100 100
10.87109375
30.86328125
100 threads, 100 copy ops per thread
$ python etree_deepcopy.py 100 200
10.875
31.19140625
100 threads, 200 copy ops per thread
$ python etree_deepcopy.py 100 300
10.875
31.2109375
100 threads, 300 copy ops per thread
$ python etree_deepcopy.py 200 100
10.875
40.46484375
200 threads, 100 copy ops per thread
Python:
2.7.3 (self-compiled with UCS-2)
Platform:
Ubuntu 12.04 AMD64
>>> etree.LXML_VERSION
(2, 3, 4, 0)
>>> etree.LIBXML_VERSION
(2, 7, 8)
>>> etree.LIBXSLT_VERSION
(1, 1, 26)
Christian
Hi,
while implementing a script that is supposed to extract a few things
from HTML pages, I've run into two problems with lxml.html. I'm not
sure whether these are actually bugs or whether I'm doing something
wrong, so I haven't created entries in the bug tracker yet.
Problem 1: Top-level comment has no parent
,----
| >>> html = "<!-- comment --><html><head><title>foo</title></head></html>"
| >>> tree = lxml.html.fromstring(html)
| >>> [tag.drop_tree() for tag in tree.xpath("//comment()")]
| Traceback (most recent call last):
| File "<stdin>", line 1, in <module>
| File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 169,
| in drop_tree
| assert parent is not None
`----
This method of removing elements works for any comment or element,
just not when the comment is at the top level (which unfortunately
happens on exactly the pages that I'm processing).
Is that behaviour related to the problems mentioned in the FAQ entry
"Why can't I just delete parents or clear the root node in
iterparse()"? I have tried to use tree.remove(tag) as well, but then
I get a ValueError: Element is not a child of this node.
The reason behind removing the comment elements (and others) is that
I would otherwise get them returned by itertext().
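For now, a guard like the following at least avoids the assert (just a
sketch; it assumes getparent() returns None for top-level comments,
and it skips them rather than removing them):
,----
| import lxml.html
|
| html = "<!-- comment --><html><head><title>foo</title></head></html>"
| tree = lxml.html.fromstring(html)
|
| # drop only comments that have a parent element; a top-level comment
| # is a sibling of the root, so drop_tree() has nowhere to reattach
| # its tail text and would hit the assert
| for comment in tree.xpath("//comment()"):
|     if comment.getparent() is not None:
|         comment.drop_tree()
`----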
Problem 2: lxml.html.clean.Cleaner removes meta content
,----
| >>> html = "<html><head><meta name=\"keywords\" content=\"foo\"></head></html>"
| >>> cleaner = lxml.html.clean.Cleaner()
| >>> cleaner.page_structure = False
| >>> cleaner.meta = False
| >>> tree = lxml.html.fromstring(html)
| >>> cleaner(tree)
| >>> lxml.html.tostring(tree)
| '<html><head><meta name="keywords"></head></html>'
`----
To work around the above problem with the comments, I tried using the
cleaner to get rid of them (which works fine), but it then also strips
the content attributes of the meta tags. Is that intended?
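Since the only thing I actually need from the cleaner is the comment
removal, I could try a narrower configuration like the one below (a
sketch with every other option switched off; I haven't verified
whether it still touches the meta content attributes):
,----
| from lxml.html.clean import Cleaner
|
| # everything off except comment removal
| cleaner = Cleaner(
|     scripts=False, javascript=False, comments=True, style=False,
|     links=False, meta=False, page_structure=False,
|     processing_instructions=False, embedded=False, frames=False,
|     forms=False, annoying_tags=False, remove_unknown_tags=False,
|     safe_attrs_only=False,
| )
| cleaner(tree)
`----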
The version numbers in use (I've seen that slightly newer versions are
available, but the changelogs didn't mention anything that looked
related to these problems):
Python : sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Thanks,
Adalbert