The actual code is below, but I've got it so that it inflates Very Quickly...

[cheshire@edhellond jstor]$ ./memory.py
UID       PID  PPID  C     SZ     RSS PSR STIME TTY   TIME     CMD
cheshire 1778  1154  0   5861   14204   1 14:25 pts/2 00:00:00 /home/cheshire/install/bin/python -i ./memory.py
0
cheshire 1778  1154 99  20753   73820   1 14:25 pts/2 00:00:01 /home/cheshire/install/bin/python -i ./memory.py
238
cheshire 1778  1154 99 140239  551556   1 14:25 pts/2 00:00:08 /home/cheshire/install/bin/python -i ./memory.py
483
cheshire 1778  1154 99 245972  974616   1 14:25 pts/2 00:00:14 /home/cheshire/install/bin/python -i ./memory.py
734
cheshire 1778  1154 99 319488 1268656   1 14:25 pts/2 00:00:24 /home/cheshire/install/bin/python -i ./memory.py
1269

(The bare numbers interleaved with the ps lines are the running nparse
document count printed by the script.) E.g., after parsing 1269 documents
(on average 250k each) it's using a total of 1.5 gigabytes of memory. This
also happens in 2.1.1.

I've used guppy/hpy to check that it's not Python-level code: putting an
hp.heap() call in the loop shows the only per-iteration difference to be
the for loop's frame (a sketch of that check appears after the quoted
message below). The actual production code works in 2.1.1, but has a lot
more xpaths and then a serialization phase in the loop as well.

Code, with comments:

----------------------------
import commands
import os

# parse, db and session are set up elsewhere in the Cheshire3 environment

def build_journal(jrnl):
    global nparse
    # Search for journal descriptions
    q = parse('c3.idx-id-journal exact "%s"' % jrnl)
    rs = db.search(session, q)
    # step through matches
    for rsi in rs:
        nparse += 1
        # fetch record out of storage; uses etree.XML(data) to parse
        rec = rsi.fetch_record(session)
        # process_xpath passes through directly to node.xpath()
        try:
            year = rec.process_xpath(session,
                '/issuemap/issue-meta/numerations/pub-date/year/text()')[0]
            month = rec.process_xpath(session,
                '/issuemap/issue-meta/numerations/pub-date/month/text()')[0]
            day = rec.process_xpath(session,
                '/issuemap/issue-meta/numerations/pub-date/day/text()')[0]
        except IndexError:
            # record has no publication date
            rsi._ymd = (0, 0, 0)
            del rec
            continue
        rsi._ymd = (year, month, day)
        del rec
    # sort result list based on date
    rs._list.sort(key=lambda x: x._ymd)
    del rs

nparse = 0

# scan through all journal identifiers
q = parse('c3.idx-id-journal exact ""')
jids = db.scan(session, q, 1000000)

# get OS memory usage stats for this process
pid = os.getpid()
cmd = "ps -F -p %s" % pid
print commands.getoutput(cmd)
print nparse

# and try to build
for j in jids[100:]:
    build_journal(j[0])
    print commands.getoutput(cmd).split('\n')[1]
    print nparse
----------------------------

Help?

Rob

On Mon, 22 Dec 2008, Dr R. Sanderson wrote:
> Hi all,
> I'm working on a script to replicate it, but with 2.1.2 or more recent,
> no memory is freed when parsing multiple documents in quick succession.
> The changelog says there was a memory issue fixed, so perhaps that fix
> introduced this bug at the same time?
> I've seen (but not consistently) the lxml "memory allocation failed:
> growing buffer" message. Normally it just runs my machine out of memory.
> Rob
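
For reference, the guppy/heapy check described above would look roughly
like this (a sketch: the exact placement and the setrelheap() call are
assumed, not the verbatim script):

----------------------------
from guppy import hpy

hp = hpy()
hp.setrelheap()  # count only objects allocated after this point

for j in jids[100:]:
    build_journal(j[0])
    # per iteration, the only Python-level growth reported is the
    # loop frame itself -- the parsed documents never show up as
    # live Python objects, consistent with the memory being held
    # at the C level rather than by Python code
    print hp.heap()
----------------------------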
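
And a minimal self-contained reproduction along the lines the quoted
message describes might look like the following (a sketch only: the file
glob, the use of plain lxml.etree instead of the Cheshire3 wrappers, and
the single xpath are all assumptions, not the actual replication script):

----------------------------
#!/usr/bin/env python
import commands
import glob
import os

from lxml import etree

cmd = "ps -F -p %s" % os.getpid()

# parse many ~250k XML documents in quick succession, dropping each
# tree as soon as it has been queried
for n, path in enumerate(glob.glob('docs/*.xml')):
    data = open(path).read()
    rec = etree.XML(data)
    rec.xpath('/issuemap/issue-meta/numerations/pub-date/year/text()')
    del rec
    # RSS should stay roughly flat; with the leak it climbs with
    # every document parsed
    print n, commands.getoutput(cmd).split('\n')[1]
----------------------------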