A small follow-up: I did get Valgrind to work after all. Here is the
leak summary on the tiny test collection:
==22340== LEAK SUMMARY:
==22340== definitely lost: 0 bytes in 0 blocks
==22340== indirectly lost: 0 bytes in 0 blocks
==22340== possibly lost: 276,408 bytes in 79 blocks
==22340== still reachable: 2,864,995 bytes in 1,917 blocks
==22340== suppressed: 0 bytes in 0 blocks
==22340== Reachable blocks (those to which a pointer was found) are not shown.
==22340== To see them, rerun with: --leak-check=full --show-reachable=yes
And here it is on 300 files:
==22399== LEAK SUMMARY:
==22399== definitely lost: 0 bytes in 0 blocks
==22399== indirectly lost: 0 bytes in 0 blocks
==22399== possibly lost: 513,916 bytes in 344 blocks
==22399== still reachable: 31,140,660 bytes in 175,023 blocks
==22399== suppressed: 0 bytes in 0 blocks
I'd bet this 'still reachable' category keeps growing without bound as more files are processed.
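One way to test that bet without rerunning valgrind each time is to watch Python-side allocations across parse iterations with the stdlib's tracemalloc. A minimal sketch, with two loud assumptions: parse_one() is a hypothetical stand-in for one document from your test script (I use the stdlib's xml.etree here instead of lxml so the snippet is self-contained), and tracemalloc only sees Python-level allocations, so libxml2's C-side "still reachable" blocks will not show up in these numbers; steadily growing figures here would instead point at a reference being kept alive on the Python side.

```python
import tracemalloc
import xml.etree.ElementTree as ET

def parse_one(i):
    # Hypothetical stand-in for parsing one file from the collection;
    # the parsed tree is discarded immediately.
    ET.fromstring("<doc><p>text %d</p></doc>" % i)

tracemalloc.start()

for i in range(200):          # warm-up: let caches and interned data settle
    parse_one(i)
before, _ = tracemalloc.get_traced_memory()

for i in range(2000):         # measurement run
    parse_one(i)
after, _ = tracemalloc.get_traced_memory()

growth = after - before
print("Python-side growth over 2000 parses: %d bytes" % growth)
```

If this stays near zero while the process RSS keeps climbing, the growth is almost certainly happening below the Python level, which fits the valgrind picture above.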
My other experiment is still running, by the way; it has processed 55,000
of 249,780 files so far and is using almost 1 GB of RAM.
Maarten van Gompel (Proycon)
> I ran the test script you sent me through valgrind (a memory analyser,
> amongst other things) and it came out clean:
> ==10062== LEAK SUMMARY:
> ==10062== definitely lost: 0 bytes in 0 blocks
> ==10062== indirectly lost: 0 bytes in 0 blocks
> ==10062== possibly lost: 498,566 bytes in 265 blocks
> ==10062== still reachable: 2,645,015 bytes in 1,709 blocks
> ==10062== suppressed: 0 bytes in 0 blocks
> I looked through the "possibly lost" blocks and they all look reasonable,
> none of them seems to be related to parsing. Basically, they are
> initialisation time global memory allocations that valgrind isn't
> completely sure about.
> If you want to try it on your side, here's my command line:
> valgrind --tool=memcheck --leak-check=full --num-callers=30 \
> --suppressions=lxmldir/valgrind-python.supp python lxml_leak.py