extremely slow exit for program having huge (45G) dict (python 2.5.2)
I have a program that creates a huge (45GB) defaultdict. (The keys are short strings, the values are short lists of pairs (string, int).) Nothing but possibly the strings and ints is shared.

The program takes around 10 minutes to run, but longer than 20 minutes to exit (I gave up at that point). That is, after executing the final statement (a print), it is apparently spending a huge amount of time cleaning up before exiting. I haven't installed any exit handlers or anything like that, all files are already closed and stdout/stderr flushed, and there's nothing special going on. I have done 'gc.disable()' for performance (which is hideous without it)--I have no reason to think there are any loops.

Currently I am working around this by doing an os._exit(), which is immediate, but this seems like a bit of a hack. Is this something that needs fixing, or that has already been fixed?

Mike
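[In outline, the pattern being described; the names and the input loop are made up for illustration, and the gc.disable() and os._exit() calls are the point:]

    import gc, os, sys
    from collections import defaultdict

    gc.disable()                    # cyclic GC off while building the huge dict

    index = defaultdict(list)
    for key, name, count in read_input():    # read_input() is a placeholder
        index[key].append((name, count))

    print len(index)                # the real program ends with a print, too
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(0)                     # exit immediately, skipping interpreter cleanup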
Can you distill the program into something reproducible? Maybe with something slightly less than 45Gb but still exhibiting some degradation of exit performance? I can try to point our commercial profiling tools at it and see what it is doing. K
On Sat, 20 Dec 2008 09:02:38 pm Kristján Valur Jónsson wrote:
Can you distill the program into something reproducible? Maybe with something slightly less than 45Gb but still exhibiting some degradation of exit performance? I can try to point our commercial profiling tools at it and see what it is doing. K
In November 2007, a similar problem was reported on the comp.lang.python newsgroup. 370MB was large enough to demonstrate the problem. I don't know if a bug was ever reported. The thread starts here: http://mail.python.org/pipermail/python-list/2007-November/465498.html or if you prefer Google Groups: http://preview.tinyurl.com/97xsso and it describes extremely long times to populate and destroy large dicts even with garbage collection turned off.

My summary at the time was: "On systems with multiple CPUs or 64-bit systems, or both, creating and/or deleting a multi-megabyte dictionary in recent versions of Python (2.3, 2.4, 2.5 at least) takes a LONG time, of the order of 30+ minutes, compared to seconds if the system only has a single CPU. Turning garbage collection off doesn't help." I make no guarantee that the above is a correct description of the problem, only that this is what I believed at the time.

I'm afraid it is a very long thread, with multiple red herrings, lots of people unable to reproduce the problem, and the usual nonsense that happens on comp.lang.python. I was originally one of the skeptics until I reproduced the original poster's problem. I generated a sample file of 8 million key/value pairs as a 370MB text file. Reading it into a dict took two and a half minutes on my relatively slow computer. But deleting the dict took more than 30 minutes even with garbage collection switched off.

Sample code reproducing the problem on my machine is here: http://mail.python.org/pipermail/python-list/2007-November/465513.html According to this post of mine: http://mail.python.org/pipermail/python-list/2007-November/466209.html deleting 8 million (key, value) pairs stored as a list of tuples was very fast. It was only if they were stored as a dict that deleting it was horribly slow.

Please note that other people have tried and failed to replicate the problem. I suspect the fault (if it is one, and not human error) is specific to some combinations of Python version and hardware. Even if this is a Will Not Fix, I'd be curious if anyone else can reproduce the problem.

Hope this is helpful, Steven.
-- Steven D'Aprano
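[A rough reconstruction of the kind of test described above; Steven's actual script is at the python-list URL he gives, and the keys, value ranges and timings here are purely illustrative:]

    import gc, random, string, time

    gc.disable()                    # matches the "garbage collection switched off" case

    N = 8 * 10**6
    keys = [''.join(random.choice(string.ascii_uppercase) for _ in range(7))
            for _ in xrange(N)]
    d = dict((k, random.randint(0, 10**6)) for k in keys)
    del keys

    start = time.time()
    del d                           # the step reported to take 30+ minutes
    print 'dict deallocation: %.1f s' % (time.time() - start)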
Steven D'Aprano <steve <at> pearwood.info> writes:
In November 2007, a similar problem was reported on the comp.lang.python newsgroup. 370MB was large enough to demonstrate the problem. I don't know if a bug was ever reported.
Do you still reproduce it on trunk? I've tried your scripts on my machine and they work fine, even if I leave garbage collecting enabled during the process. (dual core 64-bit machine but in 32-bit mode)
On Sun, 21 Dec 2008 06:45:11 am Antoine Pitrou wrote:
Steven D'Aprano <steve <at> pearwood.info> writes:
In November 2007, a similar problem was reported on the comp.lang.python newsgroup. 370MB was large enough to demonstrate the problem. I don't know if a bug was ever reported.
Do you still reproduce it on trunk? I've tried your scripts on my machine and they work fine, even if I leave garbage collecting enabled during the process. (dual core 64-bit machine but in 32-bit mode)
I'm afraid that sometime over the last year, I replaced my computer's motherboard, and now I can't reproduce the behaviour at all. I've tried two different boxes, with both Python 2.6.1 and 2.5.1. -- Steven D'Aprano
On Sat, Dec 20, 2008 at 4:02 AM, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
Can you distill the program into something reproducible? Maybe with something slightly less than 45Gb but still exhibiting some degradation of exit performance? I can try to point our commercial profiling tools at it and see what it is doing.
I will try next week to see if I can come up with a smaller, submittable example. Thanks.
On 2008-12-20 17:57, Mike Coleman wrote:
On Sat, Dec 20, 2008 at 4:02 AM, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
Can you distill the program into something reproducible? Maybe with something slightly less than 45Gb but still exhibiting some degradation of exit performance? I can try to point our commercial profiling tools at it and see what it is doing.
I will try next week to see if I can come up with a smaller, submittable example. Thanks.
These long exit times are usually caused by the garbage collection of objects. This can be a very time consuming task. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 20 2008)
On Sat, Dec 20, 2008 at 3:04 PM, M.-A. Lemburg <mal@egenix.com> wrote:
These long exit times are usually caused by the garbage collection of objects. This can be a very time consuming task.
In that case, the question would be "why is the interpreter collecting garbage when it knows we're trying to exit anyway?". -- Cheers, Leif
Leif Walsh wrote:
On Sat, Dec 20, 2008 at 3:04 PM, M.-A. Lemburg <mal@egenix.com> wrote:
These long exit times are usually caused by the garbage collection of objects. This can be a very time consuming task.
In that case, the question would be "why is the interpreter collecting garbage when it knows we're trying to exit anyway?".
Because finalizers are only called when an object is destroyed presumably. Michael -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog
Leif> In that case, the question would be "why is the interpreter collecting garbage when it knows we're trying to exit anyway?".

Because useful side effects are sometimes performed as a result of this activity (flushing disk buffers, closing database connections, etc).

Skip
(@Skip, Michael, Tim) On Sat, Dec 20, 2008 at 3:26 PM, <skip@pobox.com> wrote:
Because useful side effects are sometimes performed as a result of this activity (flushing disk buffers, closing database connections, etc).
Of course they are. But what about the case given above: On Sat, Dec 20, 2008 at 5:55 AM, Steven D'Aprano <steve@pearwood.info> wrote:
I was originally one of the skeptics until I reproduced the original poster's problem. I generated a sample file of 8 million key/value pairs as a 370MB text file. Reading it into a dict took two and a half minutes on my relatively slow computer. But deleting the dict took more than 30 minutes even with garbage collection switched off.
It might be a semantic change that I'm looking for here, but it seems to me that if you turn off the garbage collector, you should be able to expect that either it also won't run on exit, or it should have a way of letting you tell it not to run on exit. If I'm running without a garbage collector, that assumes I'm at least cocky enough to think I know when I'm done with my objects, so I should know to delete the objects that have __del__ functions I care about before I exit. Well, maybe; I'm sure one of you could drag out a programmer that would make that mistake, but turning off the garbage collector to me seems to send the experience message, at least a little.

Does the garbage collector run any differently when the process is exiting? It seems that it wouldn't need to do anything more than run through all objects in the heap and delete them, which doesn't require anything fancy, and should be able to sort by address to aid with caching. If it's already this fast, then I guess it really is the sheer number of function calls necessary that are causing such a slowdown in the cases we've seen, but I find this hard to believe.

-- Cheers, Leif
[Leif Walsh]
... It might be a semantic change that I'm looking for here, but it seems to me that if you turn off the garbage collector, you should be able to expect that either it also won't run on exit,
It won't then, but "the garbage collector" is the gc module, and that only performs /cyclic/ garbage collection. There is no way to stop refcount-based garbage collection. Read my message again.
or it should have a way of letting you tell it not to run on exit. If I'm running without a garbage collector, that assumes I'm at least cocky enough to think I know when I'm done with my objects, so I should know to delete the objects that have __del__ functions I care about before I exit. Well, maybe; I'm sure one of you could drag out a programmer that would make that mistake, but turning off the garbage collector to me seems to send the experience message, at least a little.
This probably isn't a problem with cyclic gc (reread my msg).
Does the garbage collector run any differently when the process is exiting?
No.
It seems that it wouldn't need to do anything more than run through all objects in the heap and delete them, which doesn't require anything fancy,
Reread my msg -- already explained the likely cause here (if "all the objects in the heap" have in fact been swapped out to disk, it can take an enormously long time to just "run through" them all).
and should be able to sort by address to aid with caching.
That one isn't possible. There is no list of "all objects" to /be/ sorted. The only way to find all the objects is to traverse the object graph from its roots, which is exactly what non-cyclic gc does anyway.
If it's already this fast, then I guess it really is the sheer number of function calls necessary that are causing such a slowdown in the cases we've seen, but I find this hard to believe.
My guess remains that CPU usage is trivial here, and 99.99+% of the wall-clock time is consumed waiting for disk reads. Either that, or that platform malloc is going nuts.
Leif Walsh <leif.walsh <at> gmail.com> writes:
It might be a semantic change that I'm looking for here, but it seems to me that if you turn off the garbage collector, you should be able to expect that either it also won't run on exit, or it should have a way of letting you tell it not to run on exit.
[...] I'm skeptical that it's a garbage collector problem. The script creates one dict containing lots of strings and ints. The thing is, strings and ints aren't tracked by the GC as they are simple atomic objects. Therefore, the /only/ object created by the script which is tracked by the GC is the dict. Moreover, since there is no cycle created, the dict should be directly destroyed when its last reference dies (the "del" statement), not go through the garbage collection process.

Given that the problem is reproduced on certain systems and not others, it may be related to an interaction between allocation patterns of the dict implementation, the Python memory allocator, and the implementation of the C malloc() / free() functions. I'm not expert enough to find out more on the subject.
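[A quick way to check that claim interactively; gc.is_tracked() exists only in Python 2.6 and later, so this is with a newer interpreter than the 2.5.2 in the subject line:]

>>> import gc
>>> gc.is_tracked(12345)       # ints are atomic; the cyclic GC ignores them
False
>>> gc.is_tracked("ABCDEFG")   # strings too
False
>>> gc.is_tracked({})          # the dict itself is tracked
True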
Antoine Pitrou wrote:
Leif Walsh <leif.walsh <at> gmail.com> writes:
It might be a semantic change that I'm looking for here, but it seems to me that if you turn off the garbage collector, you should be able to expect that either it also won't run on exit, or it should have a way of letting you tell it not to run on exit. [...]
I'm skeptical that it's a garbage collector problem. The script creates one dict containing lots of strings and ints. The thing is, strings and ints aren't tracked by the GC as they are simple atomic objects. Therefore, the /only/ object created by the script which is tracked by the GC is the dict. Moreover, since there is no cycle created, the dict should be directly destroyed when its last reference dies (the "del" statement), not go through the garbage collection process.
Given that the problem is reproduced on certain systems and not others, it may be related to an interaction between allocation patterns of the dict implementation, the Python memory allocator, and the implementation of the C malloc() / free() functions. I'm not expert enough to find out more on the subject.
I believe the OP engendered a certain amount of confusion by describing object deallocation as being performed by the garbage collector. So he perhaps didn't understand that even decref'ing all the objects only referenced by the dict will take a huge amount of time unless there's enough real memory to hold it. regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/
Steve Holden <steve <at> holdenweb.com> writes:
I believe the OP engendered a certain amount of confusion by describing object deallocation as being performed by the garbage collector. So he perhaps didn't understand that even decref'ing all the objects only referenced by the dict will take a huge amount of time unless there's enough real memory to hold it.
He said he has 64GB RAM so I assume all his working set was in memory, not swapped out.
From Tim Peters: BTW, the original poster should try this: use whatever tools the OS supplies to look at CPU and disk usage during the long exit. What I /expect/ is that almost no CPU time is being used, while the disk is grinding itself to dust. That's what happens when a large number of objects have been swapped out to disk, and exit processing has to page them all back into memory again (in order to decrement their refcounts).
Tim, I left out some details that I believe probably rule out the "swapped out" theory. The machine in question has 64GB RAM, but only 16GB swap. I'd prefer more swap, but in any case only around ~400MB of the swap was actually in use during my program's entire run. Furthermore, during my program's exit, it was using 100% CPU, and I'm 95% sure there was no significant "system" or "wait" CPU time for the system. (All observations via 'top'.) So, I think that the problem is entirely a computational one within this process.

The system does have 8 CPUs. I'm not sure about its memory architecture, but if it's some kind of NUMA box, I guess access to memory could be slower than what we'd normally expect. I'm skeptical about that being a significant factor here, though.

Just to clarify, I didn't gc.disable() to address this problem, but rather because it destroys performance during the creation of the huge dict. I don't have a specific number, but I think disabling gc reduced construction from something like 70 minutes to 5 (or maybe 10). Quite dramatic.

Mike
Re "held" and "intern_it": Haha! That's evil and extremely evil, respectively. :-) I will add these to the Python wiki if they're not already there... Mike
On Sat, Dec 20, 2008 at 6:22 PM, Mike Coleman <tutufan@gmail.com> wrote:
Re "held" and "intern_it": Haha! That's evil and extremely evil, respectively. :-)
P.S. I tried the "held" idea out (interning integers in a list), and unfortunately it didn't make that much difference. In the example I tried, there were 104465178 instances of integers from range(33467). I guess if ints are 12 bytes (per Beazley's book, but not sure if that still holds), then that would correspond to a 1GB reduction. Judging by 'top', it might have been 2 or 3GB instead, from a total of 45G. Mike
Mike Coleman wrote:
I guess if ints are 12 bytes (per Beazley's book, but not sure if that still holds), then that would correspond to a 1GB reduction.
Python 2.6.1 (r261:67515, Dec 11 2008, 20:28:07)
[GCC 4.2.3] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(0)
12
-- Jesus Cea Avion <jcea@jcea.es> - http://www.jcea.es/
Jesus Cea <jcea <at> jcea.es> writes:
Mike Coleman wrote:
I guess if ints are 12 bytes (per Beazley's book, but not sure if that still holds), then that would correspond to a 1GB reduction.
Python 2.6.1 (r261:67515, Dec 11 2008, 20:28:07)
[GCC 4.2.3] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(0)
12
On a 32-bit system, sure, but given Mike creates a 45 GB dict, he has a 64-bit system, where ints are 24 bytes:
>>> sys.getsizeof(0)
24
>>> sys.getsizeof(100000)
24
cheers Antoine.
Mike Coleman <tutufan <at> gmail.com> writes:
Just to clarify, I didn't gc.disable() to address this problem, but rather because it destroys performance during the creation of the huge dict. I don't have a specific number, but I think disabling gc reduced construction from something like 70 minutes to 5 (or maybe 10). Quite dramatic.
There's a pending patch which should fix that problem: http://bugs.python.org/issue4074 Regards Antoine.
[M.-A. Lemburg]
These long exit times are usually caused by the garbage collection of objects. This can be a very time consuming task.
[Leif Walsh]
In that case, the question would be "why is the interpreter collecting garbage when it knows we're trying to exit anyway?".
Because user-defined destructors (like __del__ methods and weakref callbacks) may be associated with garbage, and users presumably want those to execute. Doing so requires identifying garbage and releasing it, same as if the interpreter didn't happen to be exiting.

BTW, the original poster should try this: use whatever tools the OS supplies to look at CPU and disk usage during the long exit. What I /expect/ is that almost no CPU time is being used, while the disk is grinding itself to dust. That's what happens when a large number of objects have been swapped out to disk, and exit processing has to page them all back into memory again (in order to decrement their refcounts).

Python's cyclic gc (the `gc` module) has nothing to do with this -- it's typically the been-there-forever refcount-based non-cyclic gc that accounts for supernaturally long exit times.

If that is the case here, there's no evident general solution. If you have millions of objects still alive at exit, refcount-based reclamation has to visit all of them, and if they've been swapped out to disk it can take a very long time to swap them all back into memory again.
Tim Peters wrote:
If that is the case here, there's no evident general solution. If you have millions of objects still alive at exit, refcount-based reclamation has to visit all of them, and if they've been swapped out to disk it can take a very long time to swap them all back into memory again.
In that case, it sounds like using os._exit() to get out of the program without visiting all that memory *is* the right answer (or as right an answer as is available at least). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
On 2008-12-20 21:20, Leif Walsh wrote:
On Sat, Dec 20, 2008 at 3:04 PM, M.-A. Lemburg <mal@egenix.com> wrote:
These long exit times are usually caused by the garbage collection of objects. This can be a very time consuming task.
In that case, the question would be "why is the interpreter collecting garbage when it knows we're trying to exit anyway?".
It cannot know until the very end, because there may still be some try: ... except SystemExit: ... somewhere in the code waiting to trigger and stop the system exit.

If you want a really fast exit, try this:

    import os
    os.kill(os.getpid(), 9)

But you better know what you're doing if you take this approach...

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 20 2008)
On Sat, Dec 20, 2008 at 2:50 PM, M.-A. Lemburg <mal@egenix.com> wrote:
If you want a really fast exit, try this:
import os
os.kill(os.getpid(), 9)
But you better know what you're doing if you take this approach...
This would work, but I think os._exit(EX_OK) is probably just as fast, and allows you to control the exit status...
I will try next week to see if I can come up with a smaller, submittable example. Thanks.
These long exit times are usually caused by the garbage collection of objects. This can be a very time consuming task.
I doubt that. The long exit times are usually caused by a bad malloc implementation. Regards, Martin
On 2008-12-20 23:16, Martin v. Löwis wrote:
I will try next week to see if I can come up with a smaller, submittable example. Thanks.

These long exit times are usually caused by the garbage collection of objects. This can be a very time consuming task.
I doubt that. The long exit times are usually caused by a bad malloc implementation.
With "garbage collection" I meant the process of Py_DECREF'ing the objects in large containers or deeply nested structures, not the GC mechanism for breaking circular references in Python. This will usually also involve free() calls, so the malloc implementation affects this as well. However, I've seen such long exit times on Linux and Windows, which both have rather good malloc implementations. I don't think there's anything much we can do about it at the interpreter level. Deleting millions of objects takes time and that's not really surprising at all. It takes even longer if you have instances with .__del__() methods written in Python. Applications can choose other mechanisms for speeding up the exit process in various (less clean) ways, if they have a need for this. BTW: Rather than using a huge in-memory dict, I'd suggest to either use an on-disk dictionary such as the ones found in mxBeeBase or a database. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 22 2008)
On Mon, Dec 22, 2008 at 6:20 AM, M.-A. Lemburg <mal@egenix.com> wrote:
BTW: Rather than using a huge in-memory dict, I'd suggest to either use an on-disk dictionary such as the ones found in mxBeeBase or a database.
I really want this to work in-memory. I have 64G RAM, and I'm only trying to use 45G of it ("only" 45G :-), and I don't need the results to persist after the program finishes. Python should be able to do this. I don't want to hear "Just use Perl instead" from my co-workers... ;-)
On Dec 22, 2008, at 1:13 PM, Mike Coleman wrote:
On Mon, Dec 22, 2008 at 6:20 AM, M.-A. Lemburg <mal@egenix.com> wrote:
BTW: Rather than using a huge in-memory dict, I'd suggest to either use an on-disk dictionary such as the ones found in mxBeeBase or a database.
I really want this to work in-memory. I have 64G RAM, and I'm only trying to use 45G of it ("only" 45G :-), and I don't need the results to persist after the program finishes.
It's still not clear to me, from reading the whole thread, precisely what you're seeing. A self-contained test case, preferably with generated random data, would be great, and save everyone a lot of investigation time.

In the meantime, can you 1) turn off all swap files and partitions, and 2) confirm positively that your CPU cycles are burning up in userland?

(In general, unless you know exactly why your workload needs swap, and have written your program to take swapping into account, having _any_ swap on a machine with 64GB RAM is lunacy. The machine will grind to a complete standstill long before filling up gigabytes of swap.)

-- Ivan Krstić <krstic@solarsail.hcs.harvard.edu> | http://radian.org
On Mon, Dec 22, 2008 at 2:54 PM, Ivan Krstić <krstic@solarsail.hcs.harvard.edu> wrote:
It's still not clear to me, from reading the whole thread, precisely what you're seeing. A self-contained test case, preferably with generated random data, would be great, and save everyone a lot of investigation time.
I'm still working on a test case. The first couple of attempts, using a half-hearted attempt to model the application behavior wrt this dict didn't demonstrate bad behavior. My impression is that no one's burning much time on this but me at the moment, aside from offering helpful advice. If you are, you might want to wait. I noticed just now that the original hardware was throwing some chipkills, so I'm retesting on something else.
In the meantime, can you 1) turn off all swap files and partitions, and 2) confirm positively that your CPU cycles are burning up in userland?
For (1), I don't have that much control over the machine. Plus, based on watching with top, I seriously doubt the process is using swap in any way. For (2), yes, 100% CPU usage.
(In general, unless you know exactly why your workload needs swap, and have written your program to take swapping into account, having _any_ swap on a machine with 64GB RAM is lunacy. The machine will grind to a complete standstill long before filling up gigabytes of swap.)
The swap is not there to support my application per se. Clearly if you're swapping, generally you're crawling. This host is used by a reasonably large set of non- and novice programmers, who sometimes vacuum up VM without realizing it. If you have a nice, big swap space, you can 'kill -STOP' these offenders, and allow them to swap out while you have a leisurely discussion with the owner and possibly 'kill -CONT' later, as opposed to having to do a quick 'kill -KILL' to save the machine. That's my thinking, anyway. Mike
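[For reference, the same pause/resume trick from Python, Unix only; offender_pid stands in for whatever pid has been identified:]

    import os, signal

    os.kill(offender_pid, signal.SIGSTOP)   # freeze the process; its pages can be swapped out
    # ... negotiate with the owner ...
    os.kill(offender_pid, signal.SIGCONT)   # or resume it later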
On Dec 22, 2008, at 6:28 PM, Mike Coleman wrote:
For (2), yes, 100% CPU usage.
100% _user_ CPU usage? (I'm trying to make sure we're not chasing some particular degeneration of kmalloc/vmalloc and friends.) -- Ivan Krstić <krstic@solarsail.hcs.harvard.edu> | http://radian.org
2008/12/22 Ivan Krstić <krstic@solarsail.hcs.harvard.edu>:
On Dec 22, 2008, at 6:28 PM, Mike Coleman wrote:
For (2), yes, 100% CPU usage.
100% _user_ CPU usage? (I'm trying to make sure we're not chasing some particular degeneration of kmalloc/vmalloc and friends.)
Yes, user. No noticeable sys or wait CPU going on.
I unfortunately don't have time to work out how obmalloc works myself, but I wonder if any of the constants in that file might need to scale somehow with memory size. That is, is it possible that some of them that work okay with 1G RAM won't work well with (say) 128G or 1024G (coming soon enough)?
On 2008-12-22 19:13, Mike Coleman wrote:
On Mon, Dec 22, 2008 at 6:20 AM, M.-A. Lemburg <mal@egenix.com> wrote:
BTW: Rather than using a huge in-memory dict, I'd suggest to either use an on-disk dictionary such as the ones found in mxBeeBase or a database.
I really want this to work in-memory. I have 64G RAM, and I'm only trying to use 45G of it ("only" 45G :-), and I don't need the results to persist after the program finishes.
Python should be able to do this. I don't want to hear "Just use Perl instead" from my co-workers... ;-)
What kinds of objects are you storing in your dictionary ? Python instances, strings, integers ? The time it takes to deallocate the objects in your dictionary depends a lot on the types you are using. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 22 2008)
On Dec 22, 2008, at 4:07 PM, M.-A. Lemburg wrote:
What kinds of objects are you storing in your dictionary ? Python instances, strings, integers ?
Answered in a previous message: On Dec 20, 2008, at 8:09 PM, Mike Coleman wrote:
The dict keys were all uppercase alpha strings of length 7. I don't have access at the moment, but maybe something like 10-100M of them (not sure how redundant the set is). The values are all lists of pairs, where each pair is a (string, int). The pair strings are of length around 30, and drawn from a "small" fixed set of around 60K strings (). As mentioned previously, I think the ints are drawn pretty uniformly from something like range(10000). The length of the lists depends on the redundancy of the key set, but I think there are around 100-200M pairs total, for the entire dict.
(If you're curious about the application domain, see 'http://greylag.org'.)
-- Ivan Krstić <krstic@solarsail.hcs.harvard.edu> | http://radian.org
On Mon, 22 Dec 2008 11:20:59 pm M.-A. Lemburg wrote:
On 2008-12-20 23:16, Martin v. Löwis wrote:
I will try next week to see if I can come up with a smaller, submittable example. Thanks.
These long exit times are usually caused by the garbage collection of objects. This can be a very time consuming task.
I doubt that. The long exit times are usually caused by a bad malloc implementation.
With "garbage collection" I meant the process of Py_DECREF'ing the objects in large containers or deeply nested structures, not the GC mechanism for breaking circular references in Python.
This will usually also involve free() calls, so the malloc implementation affects this as well. However, I've seen such long exit times on Linux and Windows, which both have rather good malloc implementations.
I don't think there's anything much we can do about it at the interpreter level. Deleting millions of objects takes time and that's not really surprising at all. It takes even longer if you have instances with .__del__() methods written in Python.
This behaviour appears to be specific to deleting dicts, not deleting random objects. I haven't yet confirmed that the problem still exists in trunk (I hope to have time tonight or tomorrow), but in my previous tests deleting millions of items stored in a list of tuples completed in a minute or two, while deleting the same items stored as key:item pairs in a dict took 30+ minutes. I say plus because I never had the patience to let it run to completion, it could have been hours for all I know.
Applications can choose other mechanisms for speeding up the exit process in various (less clean) ways, if they have a need for this.
BTW: Rather than using a huge in-memory dict, I'd suggest to either use an on-disk dictionary such as the ones found in mxBeeBase or a database.
The original poster's application uses 45GB of data. In my earlier tests, I've experienced the problem with ~ 300 *megabytes* of data: hardly what I would call "huge". -- Steven D'Aprano
Steven D'Aprano wrote:
This behaviour appears to be specific to deleting dicts, not deleting random objects. I haven't yet confirmed that the problem still exists in trunk (I hope to have time tonight or tomorrow), but in my previous tests deleting millions of items stored in a list of tuples completed in a minute or two, while deleting the same items stored as key:item pairs in a dict took 30+ minutes. I say plus because I never had the patience to let it run to completion, it could have been hours for all I know.
There's actually an interesting comment in list_dealloc:

    /* Do it backwards, for Christian Tismer.
       There's a simple test case where somehow this reduces
       thrashing when a *very* large list is created and
       immediately deleted. */

The "backwards" the comment is referring to is the fact that it invokes DECREF on the last item in the list first and counts back down to the first item, instead of starting at the first item and incrementing the index each time around the loop.

The revision number on that (13452) indicates that it predates the implementation of PyObject_Malloc and friends, so it was probably avoiding pathological behaviour in platform malloc() implementations by free'ing memory in the reverse order to which it was allocated (assuming the list was built initially starting with the first item). However, I'm now wondering if it also has the side effect of avoiding the quadratic behaviour Mike has found inside the more recent code to release arenas back to the OS.

I'm working on a simple benchmark that looks for non-linear scaling of the deallocation times - I'll include a case of deallocation of a reversed list along with a normal list and a dictionary.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
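[A rough sketch of that kind of benchmark, not Nick's actual script; absolute numbers will depend heavily on the platform malloc:]

    import gc, time

    def as_list(n):
        return [('%07d' % i, i) for i in xrange(n)]

    def as_reversed_list(n):
        items = as_list(n)
        items.reverse()        # so deallocation visits items in creation order
        return items

    def as_dict(n):
        return dict(('%07d' % i, i) for i in xrange(n))

    def timed_del(make, n):
        data = make(n)
        gc.disable()
        start = time.time()
        del data               # last reference dies here; container and contents freed
        elapsed = time.time() - start
        gc.enable()
        return elapsed

    for n in (10**5, 10**6, 10**7):
        for make in (as_list, as_reversed_list, as_dict):
            print '%-16s n=%-9d %6.2f s' % (make.__name__, n, timed_del(make, n))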
On 2008-12-22 22:45, Steven D'Aprano wrote:
On Mon, 22 Dec 2008 11:20:59 pm M.-A. Lemburg wrote:
On 2008-12-20 23:16, Martin v. Löwis wrote:
I will try next week to see if I can come up with a smaller, submittable example. Thanks.

These long exit times are usually caused by the garbage collection of objects. This can be a very time consuming task.

I doubt that. The long exit times are usually caused by a bad malloc implementation.

With "garbage collection" I meant the process of Py_DECREF'ing the objects in large containers or deeply nested structures, not the GC mechanism for breaking circular references in Python.
This will usually also involve free() calls, so the malloc implementation affects this as well. However, I've seen such long exit times on Linux and Windows, which both have rather good malloc implementations.
I don't think there's anything much we can do about it at the interpreter level. Deleting millions of objects takes time and that's not really surprising at all. It takes even longer if you have instances with .__del__() methods written in Python.
This behaviour appears to be specific to deleting dicts, not deleting random objects. I haven't yet confirmed that the problem still exists in trunk (I hope to have time tonight or tomorrow), but in my previous tests deleting millions of items stored in a list of tuples completed in a minute or two, while deleting the same items stored as key:item pairs in a dict took 30+ minutes. I say plus because I never had the patience to let it run to completion, it could have been hours for all I know.
That's interesting. The dictionary dealloc routine doesn't give any hint as to why this should take longer than deallocating a list of tuples. However, due to the way dictionary tables are allocated, it is possible that you create a table that is nearly twice the size of the actual number of items needed by the dictionary. At those dictionary sizes, this can result in a lot of extra memory being allocated, certainly more than the corresponding list of tuples would use.
Applications can choose other mechanisms for speeding up the exit process in various (less clean) ways, if they have a need for this.
BTW: Rather than using a huge in-memory dict, I'd suggest to either use an on-disk dictionary such as the ones found in mxBeeBase or a database.
The original poster's application uses 45GB of data. In my earlier tests, I've experienced the problem with ~ 300 *megabytes* of data: hardly what I would call "huge".
Times have changed, that's true :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 23 2008)
M.-A. Lemburg wrote:
On 2008-12-22 22:45, Steven D'Aprano wrote:
This behaviour appears to be specific to deleting dicts, not deleting random objects. I haven't yet confirmed that the problem still exists in trunk (I hope to have time tonight or tomorrow), but in my previous tests deleting millions of items stored in a list of tuples completed in a minute or two, while deleting the same items stored as key:item pairs in a dict took 30+ minutes. I say plus because I never had the patience to let it run to completion, it could have been hours for all I know.
That's interesting. The dictionary dealloc routine doesn't give any hint as to why this should take longer than deallocating a list of tuples.
Shuffling the list with random.shuffle before deleting it makes a *massive* difference to how long the deallocation takes. Not only that, but after the shuffled list has been deallocated, deleting an unshuffled list subsequently takes significantly longer. (I posted numbers and a test script showing these effects elsewhere in the thread).

The important factor seems to be deallocation order relative to allocation order. A simple list deletes objects in the reverse of the order of creation, while a reversed list deletes them in order of creation. Both of these seem to scale fairly linearly. A dict with a hash order that I believe is a fair approximation of creation order also didn't appear to exhibit particularly poor scaling (at least not within the 20 million objects I could test). The shuffled list, on the other hand, was pretty atrocious, taking nearly twice as long to be destroyed as an unshuffled list of the same size.

I'd like to add another dict to the test which eliminates the current coupling between hash order and creation order, and see if it exhibits poor behaviour which is similar to that of the shuffled list, but I'm not sure when I'll get to that (probably post-Christmas).

Note that I think these results are consistent with the theory that the problem lies in the way partially allocated memory pools are tracked in the obmalloc code - it makes sense that deallocating in creation order or in reverse of creation order would tend to clean up each arena in order and keep the obmalloc internal state neat and tidy, while deallocating objects effectively at random would lead to a lot of additional bookkeeping as the "most used" and "least used" arenas change over the course of the deallocation.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
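[And a sketch of the shuffled-versus-unshuffled comparison described above; again illustrative, not the script actually posted, with N chosen to fit comfortably in RAM:]

    import gc, random, time

    N = 10**7
    gc.disable()

    plain = [('%07d' % i, i) for i in xrange(N)]
    shuffled = [('%07d' % i, i) for i in xrange(N)]
    random.shuffle(shuffled)    # decouple deallocation order from allocation order

    for name, items in (('shuffled', shuffled), ('plain', plain)):
        start = time.time()
        del items[:]            # drop every tuple the list holds; the empty list remains
        print '%-8s %.2f s' % (name, time.time() - start)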
Mike Coleman wrote:
I have a program that creates a huge (45GB) defaultdict. (The keys are short strings, the values are short lists of pairs (string, int).) Nothing but possibly the strings and ints is shared.
The program takes around 10 minutes to run, but longer than 20 minutes to exit (I gave up at that point). That is, after executing the final statement (a print), it is apparently spending a huge amount of time cleaning up before exiting. I haven't installed any exit handlers or anything like that, all files are already closed and stdout/stderr flushed, and there's nothing special going on. I have done 'gc.disable()' for performance (which is hideous without it)--I have no reason to think there are any loops.
Currently I am working around this by doing an os._exit(), which is immediate, but this seems like a bit of a hack. Is this something that needs fixing, or that has already been fixed?
You don't mention the platform, but...

This behaviour was not unknown in the distant past, with much smaller datasets. Most of the problems then related to the platform malloc() doing funny things as stuff was free()ed, like coalescing free space.

[I once sat and watched a Python script run in something like 30 seconds and then take nearly 10 minutes to terminate, as you describe (Python 2.1/Solaris 2.5/Ultrasparc E3500)... and that was only a couple of hundred MB of memory - the Solaris 2.5 malloc() had some undesirable properties from Python's point of view]

PyMalloc effectively removed this as an issue for most cases and platform malloc()s have also become considerably more sophisticated since then, but I wonder whether the sheer size of your dataset is unmasking related issues.

Note that in Python 2.5 PyMalloc does free() unused arenas as a surplus accumulates (2.3 & 2.4 never free()ed arenas). Your platform malloc() might have odd behaviour with 45GB of arenas returned to it piecemeal. This is something that could be checked with a small C program. Calling os._exit() circumvents the free()ing of the arenas.

Also consider that, with the exception of small integers (-1..256), no interning of integers is done. If your data contains large quantities of integers with non-unique values (that aren't in the small integer range) you may find it useful to do your own interning.

-- Andrew I MacIntyre <andymac@bullseye.apana.org.au>
Andrew MacIntyre wrote:
[...]
It's a pity a simplistic approach that redefines all space reclamation activities as null functions won't work. I hate to think of all the cycles that are being wasted reclaiming space just because a program has terminated, when in fact an os.exit() call would work just as well from the user's point of view. Unfortunately there are doubtless programs out there that do rely on actions being taken at shutdown. Maybe os.exit() could be more widely advertised, though ... regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/
Steve> Unfortunately there are doubtless programs out there that do rely on actions being taken at shutdown.

Indeed. I believe any code which calls atexit.register.

Steve> Maybe os.exit() could be more widely advertised, though ...

That would be os._exit(). Calling it avoids calls to exit functions registered with atexit.register(). I believe it is both safe, and reasonable programming practice for modules to register exit functions. Both the logging and multiprocessing modules call it. It's incumbent on the application programmer to know these details of the modules the app uses (perhaps indirectly) to know whether or not it's safe/wise to call os._exit().

-- Skip Montanaro - skip@pobox.com - http://smontanaro.dyndns.org/
skip@pobox.com wrote:
Steve> Unfortunately there are doubtless programs out there that do rely on actions being taken at shutdown.
Indeed. I believe any code which calls atexit.register.
Steve> Maybe os.exit() could be more widely advertised, though ...
That would be os._exit(). Calling it avoids calls to exit functions registered with atexit.register(). I believe it is both safe, and reasonable programming practice for modules to register exit functions. Both the logging and multiprocessing modules call it. It's incumbent on the application programmer to know these details of the modules the app uses (perhaps indirectly) to know whether or not it's safe/wise to call os._exit().
You could call sys.exitfunc() just before os._exit(). -Andrew.
Andrew, this is on an (intel) x86_64 box with 64GB of RAM. I don't recall the maker or details of the architecture off the top of my head, but it would be something "off the rack" from Dell or maybe HP. There were other users on the box at the time, but nothing heavy or that gave me any reason to think was affecting my program.

It's running CentOS 5 I think, so that might make glibc several years old. Your malloc idea sounds plausible to me. If it is a libc problem, it would be nice if there was some way we could tell malloc to "live for today because there is no tomorrow" in the terminal phase of the program.

I'm not sure exactly how to attack this. Callgrind is cool, but no way will work on something this size. Timed ltrace output might be interesting. Or maybe a gprof'ed Python, though that's more work.

Regarding interning, I thought this only worked with strings. Is there some way to intern integers? I'm probably creating 300M integers more or less uniformly distributed across range(10000).

Mike

On Sat, Dec 20, 2008 at 4:08 AM, Andrew MacIntyre <andymac@bullseye.apana.org.au> wrote:
[...]
Mike Coleman wrote:
... Regarding interning, I thought this only worked with strings. Is there some way to intern integers? I'm probably creating 300M integers more or less uniformly distributed across range(10000)?
held = list(range(10000))
...
troublesome_dict[string] = held[number_to_hold]
...

--Scott David Daniels Scott.Daniels@Acm.Org
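[Spelled out slightly; the loop and the 'records' iterable are hypothetical, the point is just where the held list plugs in:]

    from collections import defaultdict

    held = list(range(10000))       # one canonical int object per possible value
    index = defaultdict(list)

    for key, name, count in records:             # 'records' is a placeholder
        index[key].append((name, held[count]))   # reuse the shared int object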
[Mike Coleman]
... Regarding interning, I thought this only worked with strings.
Implementation details. Recent versions of CPython also, e.g., "intern" the empty tuple, and very small integers.
Is there some way to intern integers? I'm probably creating 300M integers more or less uniformly distributed across range(10000)?
Interning would /vastly/ reduce memory use for ints in that case, from gigabytes down to less than half a megabyte. [Scott David Daniels]
held = list(range(10000))
...
troublesome_dict[string] = held[number_to_hold]
...
More generally, but a bit slower, for objects usable as dict keys, change code of the form:

    x = whatever_you_do_to_get_a_new_object()
    use(x)

to:

    x = whatever_you_do_to_get_a_new_object()
    x = intern_it(x, x)
    use(x)

where `intern_it` is defined like so once at the start of the program:

    intern_it = {}.setdefault

This snippet may make the mechanism clearer:

>>> intern_it = {}.setdefault
>>> x = 3000
>>> id(intern_it(x, x))
36166156
>>> x = 1000 + 2000
>>> id(intern_it(x, x))
36166156
>>> x = "works for computed strings too"
>>> id(intern_it(x, x))
27062696
>>> x = "works for computed strings t" + "o" * 2
>>> id(intern_it(x, x))
27062696
You can always try poor-man's profiling, which is surprisingly useful in the face of massive performance problems. Just attach a debugger to the program, and when it is suffering from a performance problem, break the execution on a regular basis. You are statistically very likely to get a callstack representative of the problem you are having. Do this a few times and you will get a fair impression of what the program is spending its time on.
From the debugger, you can also examine the Python call stack of the program by examining the 'f' local variable in the frame evaluation function.
Have fun, K
Mike Coleman wrote:
Andrew, this is on an (intel) x86_64 box with 64GB of RAM. I don't recall the maker or details of the architecture off the top of my head, but it would be something "off the rack" from Dell or maybe HP. There were other users on the box at the time, but nothing heavy, or anything that gave me reason to think it was affecting my program.
It's running CentOS 5, I think, so that might make glibc several years old. Your malloc idea sounds plausible to me. If it is a libc problem, it would be nice if there were some way we could tell malloc to "live for today because there is no tomorrow" in the terminal phase of the program.
I'm not sure exactly how to attack this. Callgrind is cool, but no way will work on something this size. Timed ltrace output might be interesting. Or maybe a gprof'ed Python, though that's more work.
Some malloc()s (notably FreeBSD's) can be externally tuned at runtime via options in environment variables or other mechanisms - the malloc man page on your system might be helpful if your platform has something like this. It is likely that PyMalloc would be better with a way to disable the free()ing of empty arenas, or move to an arrangement where (like the various type free-lists in 2.6+) explicit action can force pruning of empty arenas - there are other usage patterns than yours which would benefit (performance wise) from not freeing arenas automatically. -- ------------------------------------------------------------------------- Andrew I MacIntyre "These thoughts are mine alone..." E-mail: andymac@bullseye.apana.org.au (pref) | Snail: PO Box 370 andymac@pcug.org.au (alt) | Belconnen ACT 2616 Web: http://www.andymac.org/ | Australia
It is likely that PyMalloc would be better with a way to disable the free()ing of empty arenas, or move to an arrangement where (like the various type free-lists in 2.6+) explicit action can force pruning of empty arenas - there are other usage patterns than yours which would benefit (performance wise) from not freeing arenas automatically.
Before such a mechanism is added, I'd like to establish for a fact that this is an actual problem. Regards, Martin
On Fri, Dec 19, 2008 at 6:29 PM, Mike Coleman <tutufan@gmail.com> wrote:
I have a program that creates a huge (45GB) defaultdict. (The keys are short strings, the values are short lists of pairs (string, int).) Nothing but possibly the strings and ints is shared.
That is, after executing the final statement (a print), it is apparently spending a huge amount of time cleaning up before exiting.
I have done 'gc.disable()' for performance (which is hideous without it)--I have no reason to think there are any loops.
[Sorry for the previous garbage post.]
On Fri, Dec 19, 2008 at 6:29 PM, Mike Coleman <tutufan@gmail.com> wrote: I have a program that creates a huge (45GB) defaultdict. (The keys are short strings, the values are short lists of pairs (string, int).) Nothing but possibly the strings and ints is shared.
Could you give us more information about the dictionary? For example, how many objects does it contain? Is 45GB the actual size of the dictionary or of the Python process?
That is, after executing the final statement (a print), it is apparently spending a huge amount of time cleaning up before exiting.
Most of this time is probably spent on DECREF'ing objects in the dictionary. As others have mentioned, it would be useful to have a self-contained example to examine the behavior more closely.
I have done 'gc.disable()' for performance (which is hideous without it)--I have no reason to think there are any loops.
Have you seen any significant difference in the exit time when the cyclic GC is disabled or enabled? -- Alexandre
On Sat, Dec 20, 2008 at 5:40 PM, Alexandre Vassalotti <alexandre@peadrop.com> wrote:
Could you give us more information about the dictionary? For example, how many objects does it contain? Is 45GB the actual size of the dictionary or of the Python process?
The 45G was the VM size of the process (resident size was similar). The dict keys were all uppercase alpha strings of length 7. I don't have access at the moment, but maybe something like 10-100M of them (not sure how redundant the set is). The values are all lists of pairs, where each pair is a (string, int). The pair strings are of length around 30, and drawn from a "small" fixed set of around 60K strings (). As mentioned previously, I think the ints are drawn pretty uniformly from something like range(10000). The length of the lists depends on the redundancy of the key set, but I think there are around 100-200M pairs total, for the entire dict. (If you're curious about the application domain, see 'http://greylag.org'.)
Have you seen any significant difference in the exit time when the cyclic GC is disabled or enabled?
Unfortunately, with GC enabled, the application is too slow to be useful, because of the greatly increased time for dict creation. I suppose it's theoretically possible that with this increased time, the long time for exit will look less bad by comparison, but I'd be surprised if it makes any difference at all. I'm confident that there are no loops in this dict, and nothing for cyclic gc to collect. Mike
Mike Coleman <tutufan <at> gmail.com> writes:
The 45G was the VM size of the process (resident size was similar).
Can you reproduce it with a smaller working set? Something between 1 and 2GB, possibly randomly-generated, and post both the generation script and the problematic script on the bug tracker?
On Sat, Dec 20, 2008 at 6:09 PM, Mike Coleman <tutufan@gmail.com> wrote:
On Sat, Dec 20, 2008 at 5:40 PM, Alexandre Vassalotti
Have you seen any significant difference in the exit time when the cyclic GC is disabled or enabled?
Unfortunately, with GC enabled, the application is too slow to be useful, because of the greatly increased time for dict creation. I suppose it's theoretically possible that with this increased time, the long time for exit will look less bad by comparison, but I'd be surprised if it makes any difference at all. I'm confident that there are no loops in this dict, and nothing for cyclic gc to collect.
Try putting an explicit gc.collect() at the end, with the usual timestamps before and after. After that try deleting your dict, then calling gc.collect(), with timestamps throughout. -- Adam Olsen, aka Rhamphoryncus
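(For concreteness, a minimal sketch of the timing Adam suggests; `build_huge_dict` is a hypothetical stand-in for whatever builds the real data structure:)

import gc
import time

gc.disable()

h = build_huge_dict()          # hypothetical: replace with the real workload

t0 = time.time()
gc.collect()                   # should be quick if there really are no cycles
t1 = time.time()
print "gc.collect() took %.1f s" % (t1 - t0)

h.clear()                      # or `del h`: this is where the dict is torn down
t2 = time.time()
print "clearing the dict took %.1f s" % (t2 - t1)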
Thanks for all of the useful suggestions. Here are some preliminary results.

With still gc.disable(), at the end of the program I first did a gc.collect(), which took about five minutes. (So, reason enough not to gc.enable(), at least without Antoine's patch.)

After that, I did a .clear() on the huge dict. That's where the time is being spent. Doing the suggested "poor man's profiling" (repeated backtraces via gdb), for 20 or so samples, one is within libc free, but all of the rest are in the same place (same source line) within PyObject_Free (see below), sometimes within list_dealloc and sometimes within tuple_dealloc. So, apparently a lot of time is being spent in this loop:

    /* Case 3: We have to move the arena towards the end
     * of the list, because it has more free pools than
     * the arena to its right.
    ...
    /* Locate the new insertion point by iterating over
     * the list, using our nextarena pointer.
     */
    while (ao->nextarena != NULL &&
                nf > ao->nextarena->nfreepools) {
        ao->prevarena = ao->nextarena;
        ao->nextarena = ao->nextarena->nextarena;
    }

Investigating further, from one stop, I used gdb to follow the chain of pointers in the nextarena and prevarena directions. There were 5449 and 112765 links, respectively. maxarenas is 131072.

Sampling nf at different breaks gives values in the range(10,20).

This loop looks like an insertion sort. If it's the case that only a "few" iterations are ever needed for any given free, this might be okay--if not, it would seem that this must be quadratic.

I attempted to look further by setting a silent break with counter within the loop and another break after the loop to inspect the counter, but gdb's been buzzing away on that for 40 minutes without coming back. That might mean that there are a lot of passes through this loop per free (i.e., that gdb is taking a long time to process 100,000 silent breaks), or perhaps I've made a mistake, or gdb isn't handling this well.

In any case, this looks like the problem locus. It's tempting to say "don't do this arena ordering optimization if we're doing final cleanup", but really the program could have done this .clear() at any point. Maybe there needs to be a flag to disable it altogether? Or perhaps there's a smarter way to manage the list of arena/free pool info.

Mike

Program received signal SIGINT, Interrupt.
0x00000000004461dc in PyObject_Free (p=0x5ec043db0) at Objects/obmalloc.c:1064
1064            while (ao->nextarena != NULL &&
(gdb) bt
#0  0x00000000004461dc in PyObject_Free (p=0x5ec043db0) at Objects/obmalloc.c:1064
#1  0x0000000000433478 in list_dealloc (op=0x5ec043dd0) at Objects/listobject.c:281
#2  0x000000000044075b in PyDict_Clear (op=0x74c7cd0) at Objects/dictobject.c:757
#3  0x00000000004407b9 in dict_clear (mp=0x5ec043db0) at Objects/dictobject.c:1776
#4  0x0000000000485905 in PyEval_EvalFrameEx (f=0x746ca50, throwflag=<value optimized out>) at Python/ceval.c:3557
#5  0x000000000048725f in PyEval_EvalCodeEx (co=0x72643f0, globals=<value optimized out>, locals=<value optimized out>, args=0x1, argcount=0, kws=0x72a5770, kwcount=0, defs=0x743eba8, defcount=1, closure=0x0) at Python/ceval.c:2836
#6  0x00000000004855bc in PyEval_EvalFrameEx (f=0x72a55f0, throwflag=<value optimized out>) at Python/ceval.c:3669
#7  0x000000000048725f in PyEval_EvalCodeEx (co=0x72644e0, globals=<value optimized out>, locals=<value optimized out>, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2836
#8  0x00000000004872a2 in PyEval_EvalCode (co=0x5ec043db0, globals=0x543e41f10, locals=0x543b969c0) at Python/ceval.c:494
#9  0x00000000004a844e in PyRun_FileExFlags (fp=0x7171010, filename=0x7ffffaf6b419 "/home/mkc/greylag/main/greylag_reannotate.py", start=<value optimized out>, globals=0x7194510, locals=0x7194510, closeit=1, flags=0x7ffffaf69080) at Python/pythonrun.c:1273
#10 0x00000000004a86e0 in PyRun_SimpleFileExFlags (fp=0x7171010, filename=0x7ffffaf6b419 "/home/mkc/greylag/main/greylag_reannotate.py", closeit=1, flags=0x7ffffaf69080) at Python/pythonrun.c:879
#11 0x0000000000412275 in Py_Main (argc=<value optimized out>, argv=0x7ffffaf691a8) at Modules/main.c:523
#12 0x00000030fea1d8b4 in __libc_start_main () from /lib64/libc.so.6
#13 0x0000000000411799 in _start ()

On Sun, Dec 21, 2008 at 12:44 PM, Adam Olsen <rhamph@gmail.com> wrote:
On Sat, Dec 20, 2008 at 6:09 PM, Mike Coleman <tutufan@gmail.com> wrote:
On Sat, Dec 20, 2008 at 5:40 PM, Alexandre Vassalotti
Have you seen any significant difference in the exit time when the cyclic GC is disabled or enabled?
Unfortunately, with GC enabled, the application is too slow to be useful, because of the greatly increased time for dict creation. I suppose it's theoretically possible that with this increased time, the long time for exit will look less bad by comparison, but I'd be surprised if it makes any difference at all. I'm confident that there are no loops in this dict, and nothing for cyclic gc to collect.
Try putting an explicit gc.collect() at the end, with the usual timestamps before and after.
After that try deleting your dict, then calling gc.collect(), with timestamps throughout.
-- Adam Olsen, aka Rhamphoryncus
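(To see why the quadratic worry above is plausible, here is a toy Python model of the repositioning loop Mike quotes; it is a sketch for intuition only, not the actual obmalloc code, and every name in it is invented:)

import random

class Arena(object):
    def __init__(self):
        self.nfreepools = 0

def count_repositioning_steps(narenas, frees_per_arena):
    # `usable` stands in for obmalloc's usable_arenas list, kept sorted by
    # nfreepools (ascending).  In C it is a doubly-linked list, so finding an
    # arena's position and unlinking it are O(1); the index()/del below are
    # only stand-ins and are not counted as work.
    arenas = [Arena() for _ in range(narenas)]   # all start fully allocated
    usable = []
    steps = 0
    for _ in range(narenas * frees_per_arena):
        a = random.choice(arenas)                # a pool becomes free in `a`
        if a.nfreepools:
            i = usable.index(a)
            del usable[i]
        else:
            i = 0                                # arena becomes usable: enters at the head
        a.nfreepools += 1
        # Analogue of the "Case 3" while loop: walk towards the tail past
        # arenas that still have fewer free pools than this one.
        while i < len(usable) and usable[i].nfreepools < a.nfreepools:
            i += 1
            steps += 1
        usable.insert(i, a)
    return steps

# When the free-pool counts of many arenas stay bunched together (as in the
# gdb samples, where nf was in range(10,20)), each free walks a long run of
# equal counts, so `steps` grows much faster than the number of frees.
# Increase narenas to see the superlinear growth.
print count_repositioning_steps(2000, 15)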
On Mon, Dec 22, 2008 at 11:01 AM, Mike Coleman <tutufan@gmail.com> wrote:
Thanks for all of the useful suggestions. Here are some preliminary results.
With still gc.disable(), at the end of the program I first did a gc.collect(), which took about five minutes. (So, reason enough not to gc.enable(), at least without Antoine's patch.)
After that, I did a .clear() on the huge dict. That's where the time is being spent. Doing the suggested "poor man's profiling" (repeated backtraces via gdb), for 20 or so samples, one is within libc free, but all of the rest are in the same place (same source line) within PyObject_Free (see below), sometimes within list_dealloc and sometimes within tuple_dealloc. So, apparently a lot of time is being spent in this loop:
/* Case 3: We have to move the arena towards the end
 * of the list, because it has more free pools than
 * the arena to its right.
...
/* Locate the new insertion point by iterating over
 * the list, using our nextarena pointer.
 */
while (ao->nextarena != NULL &&
            nf > ao->nextarena->nfreepools) {
    ao->prevarena = ao->nextarena;
    ao->nextarena = ao->nextarena->nextarena;
}
Investigating further, from one stop, I used gdb to follow the chain of pointers in the nextarena and prevarena directions. There were 5449 and 112765 links, respectively. maxarenas is 131072.
Sampling nf at different breaks gives values in the range(10,20).
This loop looks like an insertion sort. If it's the case that only a "few" iterations are ever needed for any given free, this might be okay--if not, it would seem that this must be quadratic.
I attempted to look further by setting a silent break with counter within the loop and another break after the loop to inspect the counter, but gdb's been buzzing away on that for 40 minutes without coming back. That might mean that there are a lot of passes through this loop per free (i.e., that gdb is taking a long time to process 100,000 silent breaks), or perhaps I've made a mistake, or gdb isn't handling this well.
To make sure that's the correct line please recompile python without optimizations. GCC happily reorders and merges different parts of a function. Adding a counter in C and recompiling would be a lot faster than using a gdb hook. -- Adam Olsen, aka Rhamphoryncus
On Mon, Dec 22, 2008 at 2:22 PM, Adam Olsen <rhamph@gmail.com> wrote:
To make sure that's the correct line please recompile python without optimizations. GCC happily reorders and merges different parts of a function.
Adding a counter in C and recompiling would be a lot faster than using a gdb hook.
Okay, I did this. The results are the same, except that now sampling selects the different source statements within this loop, instead of just the top of the loop (which makes sense).

I added a counter (static volatile long) as suggested, and a breakpoint to sample it. Not every pass through PyObject_Free takes case 3, but for those that do, this loop runs around 100-25000 times. I didn't try to graph it, but based on a quick sample, it looks like more than 5000 iterations on most occasions. The total counter is 12.4 billion at the moment, and still growing. That seems high, but I'm not sure what would be expected or hoped for.

I have a script that demonstrates the problem, but unfortunately the behavior isn't clearly bad until large amounts of memory are used. I don't think it shows at 2G, for example. (A 32G machine is sufficient.)

Here is a log of running the program at different sizes ($1); the three columns are megs, fill time, and clear time, in seconds:

1     4.04686999321   0.696660041809
2     8.1575551033    1.46393489838
3     12.6426320076   2.30558800697
4     16.471298933    3.80377006531
5     20.1461620331   4.96685886383
6     25.150053978    5.48230814934
7     28.9099609852   7.41244196892
8     32.283219099    6.31711483002
9     36.6974511147   7.40236377716
10    40.3126089573   9.01174497604
20    81.7559120655   20.3317198753
30    123.67071104    31.4815018177
40    161.935647011   61.4484620094
50    210.610441923   88.6161060333
60    248.89805007    118.821491003
70    288.944771051   194.166989088
80    329.93295002    262.14109993
90    396.209988832   454.317914009
100   435.610564947   564.191882133

If you plot this, it is clearly quadratic (or worse). Here is the script:

#!/usr/bin/env python

"""
Try to trigger quadratic (?) behavior during .clear() of a large but
simple defaultdict.
"""

from collections import defaultdict
import time
import sys

import gc; gc.disable()

print >> sys.stderr, sys.version

h = defaultdict(list)
n = 0
lasttime = time.time()

megs = int(sys.argv[1])
print megs,
sys.stdout.flush()

# 100M iterations -> ~24GB? on my 64-bit host
for i in xrange(megs * 1024 * 1024):
    s = '%0.7d' % i
    h[s].append(('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 12345))
    h[s].append(('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 12345))
    h[s].append(('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 12345))
    # if (i % 1000000) == 0:
    #     t = time.time()
    #     print >> sys.stderr, t-lasttime
    #     lasttime = t

t = time.time()
print t-lasttime,
sys.stdout.flush()
lasttime = t

h.clear()

t = time.time()
print t-lasttime,
sys.stdout.flush()
lasttime = t

print
Mike Coleman wrote:
If you plot this, it is clearly quadratic (or worse).
Here's another comparison script that tries to probe the vagaries of the obmalloc implementation. It looks at the proportional increases in deallocation times for lists and dicts as the number of contained items increases when using a variety of deallocation orders:

- in hash order (dict)
- in reverse order of allocation (list)
- in order of allocation (list, reversed in place)
- in random order (list, shuffled in place using the random module)

I've included the final output from a run on my own machine below [1], but here are the main points I get out of it:

- at the sizes I can test (up to 20 million items in the containers), this version of the script doesn't show any particularly horrible non-linearity with deallocation of dicts, lists or reversed lists.
- when the items in a list are deallocated in *random* order, however, the deallocation times are highly non-linear - by the time we get to 20 million items, deallocating in random order takes nearly twice as long as deallocation in either order of allocation or in reverse order.
- after the list of items had been deallocated in random order, subsequent deallocation of a dict and the list took significantly longer than when those operations took place on a comparatively "clean" obmalloc state.

I'm going to try making a new version of the script that uses random integers with a consistent number of digits in place of the monotonically increasing values that are currently used and see what effect that has on the dict scaling (that's where I expect to see the greatest effect, since the hash ordering is the one which will be most affected by the change to the item contents).

Cheers, Nick.

[1] Full final results from local test run:

Dict:            (Baseline=0.003135 seconds) 100000=100.0% 1000000=1020.9% 2000000=2030.5% 5000000=5026.7% 10000000=10039.7% 20000000=20086.4%
List:            (Baseline=0.005764 seconds) 100000=100.0% 1000000=1043.7% 2000000=2090.1% 5000000=5227.2% 10000000=10458.1% 20000000=20942.7%
ReversedList:    (Baseline=0.005879 seconds) 100000=100.0% 1000000=1015.0% 2000000=2023.5% 5000000=5057.1% 10000000=10114.0% 20000000=20592.6%
ShuffledList:    (Baseline=0.028241 seconds) 100000=100.0% 1000000=1296.0% 2000000=2877.3% 5000000=7960.1% 10000000=17216.9% 20000000=37599.9%
PostShuffleDict: (Baseline=0.016229 seconds) 100000=100.0% 1000000=1007.9% 2000000=2018.4% 5000000=5075.3% 10000000=10217.5% 20000000=20873.1%
PostShuffleList: (Baseline=0.020551 seconds) 100000=100.0% 1000000=1021.9% 2000000=1978.2% 5000000=4953.6% 10000000=10262.3% 20000000=19854.0%

Baseline changes for Dict and List after deallocation of list in random order:
Dict: 517.7%
List: 356.5%
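(A compressed illustration of the shuffled-versus-ordered effect described above; this is not Nick's script, just a quick sketch along the same lines, with sizes small enough to run in a few seconds:)

import gc
import random
import time

gc.disable()
N = 1000000

def time_teardown(shuffle):
    # Build a list of freshly allocated small objects (each tuple holds a
    # unique float and a unique non-small int), then time its deallocation.
    items = [(float(i), i + 1000) for i in xrange(N)]
    if shuffle:
        random.shuffle(items)     # decouple deallocation order from allocation order
    t0 = time.time()
    del items                     # drops the list and, with it, every tuple
    return time.time() - t0

# Note: as observed above, the first teardown can leave obmalloc in a state
# that slows the second one down, so the order of these two calls matters.
print "list order:     %.3f s" % time_teardown(False)
print "shuffled order: %.3f s" % time_teardown(True)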
Or perhaps there's a smarter way to manage the list of arena/free pool info.
If that code is the real problem (in a reproducible test case), then this approach is the only acceptable solution. Disabling long-running code is not acceptable. Regards, Martin
On Mon, Dec 22, 2008 at 2:38 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Or perhaps there's a smarter way to manage the list of arena/free pool info.
If that code is the real problem (in a reproducible test case), then this approach is the only acceptable solution. Disabling long-running code is not acceptable.
By "disabling", I meant disabling the optimization that's trying to rearrange the arenas so that more memory can be returned to the OS. This presumably wouldn't be any worse than things were in Python 2.4, when memory was never returned to the OS. (I'm working on a test case.)
If that code is the real problem (in a reproducible test case), then this approach is the only acceptable solution. Disabling long-running code is not acceptable.
By "disabling", I meant disabling the optimization that's trying to rearrange the arenas so that more memory can be returned to the OS.
I meant the same thing - I'm opposed to giving up one feature or optimization in favor of a different feature or optimization.
This presumably wouldn't be any worse than things were in Python 2.4, when memory was never returned to the OS.
Going back to the state of Python 2.4 would not be acceptable. Regards, Martin
Investigating further, from one stop, I used gdb to follow the chain of pointers in the nextarena and prevarena directions. There were 5449 and 112765 links, respectively. maxarenas is 131072.
To reduce the time for keeping sorted lists of arenas, I was first thinking of a binheap. I had formulated it all, and don't want to waste that effort, so I attach it below in case my second idea (right below) is flawed.

It then occurred to me that there are only 64 different values for nfreepools, as ARENA_SIZE is 256kiB, and POOL_SIZE is 4kiB. So rather than keeping the list sorted, I now propose to maintain 64 lists, accessible in an array of doubly-linked lists indexed by nfreepools. Whenever nfreepools changes, the arena_object is unlinked from its current list, and linked into the new list. This should reduce the overhead for keeping the lists sorted down from O(n) to O(1), with a moderate overhead of 64 pointers (512 Bytes in your case).

Allocation of a new pool would have to do a linear search in these pointers (finding the arena with the least number of pools); this could be sped up with a finger pointing to the last index where a pool was found (-1, since that pool will have moved).

Regards, Martin

a) usable_arenas becomes an arena_object**, pointing to an array of maxarenas+1 arena*. A second variable max_usable_arenas is added. arena_object loses the prevarena pointer, and gains a usable_index value of type size_t (which is 0 for unused or completely allocated arena_objects). usable_arenas should stay heap-sorted, with the arena_object with the smallest nfreepools at index 1.

b) sink and swim operations are added, which keep usable_index intact whenever arena_object pointers get swapped.

c) whenever a pool is allocated in an arena, nfreepools decreases, and swim is called for the arena. whenever a pool becomes free, sink is called.

d) when the last pool was allocated in an arena, it is removed from the heap. likewise, when all pools are freed in an arena, it is removed from the heap and returned to the system.

e) when the first pool gets freed in an arena, it is added to the heap.

On each pool allocation/deallocation, this should get the O(n) complexity of keeping the arena list sorted down to O(log n).
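(A rough Python model may help make the bucketed scheme concrete; this is a sketch of the idea only, not a patch to obmalloc, and the class and method names are invented:)

POOLS_PER_ARENA = 64     # ARENA_SIZE (256 KiB) / POOL_SIZE (4 KiB)

class Arena(object):
    def __init__(self, nfreepools=0):
        self.nfreepools = nfreepools

class ArenaBuckets(object):
    """buckets[n] holds the arenas that currently have n free pools."""

    def __init__(self):
        # In C each bucket would be a doubly-linked list, making the unlink
        # below O(1); Python's list.remove() here is only a stand-in.
        self.buckets = [[] for _ in range(POOLS_PER_ARENA + 1)]

    def add_arena(self, arena):
        # newly allocated arenas start fully allocated (0 free pools)
        self.buckets[arena.nfreepools].append(arena)

    def _move(self, arena, delta):
        self.buckets[arena.nfreepools].remove(arena)
        arena.nfreepools += delta
        self.buckets[arena.nfreepools].append(arena)
        # An arena landing in buckets[POOLS_PER_ARENA] is entirely free and
        # could be returned to the OS; one landing in buckets[0] is full and
        # just sits out of the way until a pool is freed again.

    def pool_freed(self, arena):       # a pool in `arena` became empty
        self._move(arena, +1)

    def pool_allocated(self, arena):   # a pool in `arena` was handed out
        self._move(arena, -1)

    def arena_for_new_pool(self):
        # The linear search over at most 64 list heads: prefer the busiest
        # arena that still has a free pool, mirroring the current heuristic.
        for n in range(1, POOLS_PER_ARENA + 1):
            if self.buckets[n]:
                return self.buckets[n][-1]
        return None    # nothing usable; a fresh arena would be needed

if __name__ == "__main__":
    mgr = ArenaBuckets()
    a, b = Arena(), Arena()
    mgr.add_arena(a); mgr.add_arena(b)
    mgr.pool_freed(a)                      # a now has 1 free pool
    mgr.pool_freed(b); mgr.pool_freed(b)   # b now has 2
    assert mgr.arena_for_new_pool() is a   # the busier arena is preferred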
Martin v. Löwis <martin <at> v.loewis.de> writes:
It then occurred to me that there are only 64 different values for nfreepools, as ARENA_SIZE is 256kiB, and POOL_SIZE is 4kiB. So rather than keeping the list sorted, I now propose to maintain 64 lists, accessible in an array of doubly-linked lists indexed by nfreepools. Whenever nfreepools changes, the arena_object is unlinked from its current list, and linked into the new list. This should reduce the overhead for keeping the lists sorted down from O(n) to O(1), with a moderate overhead of 64 pointers (512 Bytes in your case).
Allocation of a new pool would have to do a linear search in these pointers (finding the arena with the least number of pools);
You mean the least number of free pools, right? IIUC, the heuristic is to favour a small number of busy arenas rather than a lot of sparse ones. And, by linear search in these pointers, do you mean just probe the 64 lists for the first non-NULL list head? If so, then it's likely fast enough for a rather infrequent operation. Now, we should find a way to benchmark this without having to steal Mike's machine and wait 30 minutes every time. Regards Antoine.
Allocation of a new pool would have to do a linear search in these pointers (finding the arena with the least number of pools);
You mean the least number of free pools, right?
Correct.
IIUC, the heuristic is to favour a small number of busy arenas rather than a lot of sparse ones.
Correct. Or, more precisely, the hope is indeed to make most arenas sparse, so that they eventually see all their pools freed.
And, by linear search in these pointers, do you mean just probe the 64 lists for the first non-NULL list head?
Correct.
If so, then it's likely fast enough for a rather infrequent operation.
I would hope so, yes. However, the same hope applied to the current code (how much time can it take to sink an arena in a linear list?), so if we have the prospect of using larger arenas some day, this might change.
Now, we should find a way to benchmark this without having to steal Mike's machine and wait 30 minutes every time.
I think this can be simulated by using just arena objects, with no associated arenas, and just adjusting pool counters. Allocate 100,000 arena objects, and start out with them all being completely allocated. Then randomly choose one arena to deallocate a pool from; from time to time, also allocate a new pool. Unfortunately, this will require some hacking of the code to take the measurements.

Alternatively, make the arena size 4k, and the pool size 32 bytes, and then come up with a pattern to allocate and deallocate 8 byte blocks. Not sure whether the code works for these parameters, though (but it might be useful to fix it for non-standard sizes). This would require only 400MiB of memory to run the test.

I think obmalloc is fairly independent from the rest of Python, so it should be possible to link it with a separate main() function, and nothing else of Python.

Regards, Martin
Now, we should find a way to benchmark this without having to steal Mike's machine and wait 30 minutes every time.
So, I seem to reproduce it. The following script takes about 15 seconds to run and allocates a 2 GB dict which it deletes at the end (gc disabled of course). With 2.4, deleting the dict takes ~1.2 seconds while with 2.5 and higher (including 3.0), deleting the dict takes ~3.5 seconds. Nothing spectacular but the difference is clear.

Also, after the dict is deleted and before the program exits, you can witness (with `ps` or `top`) that 2.5 and higher has reclaimed 1GB, while 2.4 has reclaimed nothing. There is a sleep() call at the end so that you have the time :-)

You can tune memory occupation at the beginning of the script, but the lower it is, the more difficult it will be to witness a difference.

Regards

Antoine.

#######

import random
import time
import gc
import itertools

# Adjust this parameter according to your system RAM!
target_size = int(2.0 * 1024**3)   # 2.0 GB

pool_size = 4 * 1024
# This is a ballpark estimate: 60 bytes overhead for each
# { dict entry struct + float object + tuple object header },
# 1.3 overallocation factor for the dict.
target_length = int(target_size / (1.3 * (pool_size + 60)))

def make_dict():
    print ("filling dict up to %d entries..." % target_length)

    # 1. Initialize the dict from a set of pre-computed random keys.
    keys = [random.random() for i in range(target_length)]
    d = dict.fromkeys(keys)

    # 2. Build the values that will constitute the dict. Each value will, as
    # far as possible, span a contiguous `pool_size` memory area.

    # Over 256 bytes per alloc, PyObject_Malloc defers to the system malloc()
    # We avoid that by allocating tuples of smaller longs.
    int_size = 200
    # 24 roughly accounts for the long object overhead (YMMV)
    int_start = 1 << ((int_size - 24) * 8 - 7)
    int_range = range(1, 1 + pool_size // int_size)
    values = [None] * target_length
    # Maximize allocation locality by pre-allocating the values
    for n in range(target_length):
        values[n] = tuple(int_start + j for j in int_range)
        if n % 10000 == 0:
            print ("  %d iterations" % n)

    # The keys are iterated over in their original order rather than in
    # dict order, so as to randomly spread the values in the internal dict
    # table wrt. allocation address.
    for n, k in enumerate(keys):
        d[k] = values[n]
    print ("dict filled!")
    return d

if __name__ == "__main__":
    gc.disable()
    t1 = time.time()
    d = make_dict()
    t2 = time.time()
    print (" -> %.3f s." % (t2 - t1))

    print ("deleting dict...")
    t2 = time.time()
    del d
    t3 = time.time()
    print (" -> %.3f s." % (t3 - t2))

    print ("Finished, you can press Ctrl+C.")
    time.sleep(10.0)
On Mon, Dec 22, 2008 at 7:34 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Now, we should find a way to benchmark this without having to steal Mike's machine and wait 30 minutes every time.
So, I seem to reproduce it. The following script takes about 15 seconds to run and allocates a 2 GB dict which it deletes at the end (gc disabled of course). With 2.4, deleting the dict takes ~1.2 seconds while with 2.5 and higher (including 3.0), deleting the dict takes ~3.5 seconds. Nothing spectacular but the difference is clear.
I modified your script to delete the dictionary without actually deallocating the items in it. You can speed up a dictionary deallocation significantly if you keep a reference to its items and delete the dictionary before deleting its items. In Python 2.4, the same behavior exists, but is not as strongly marked as in Python 2.6 with pymalloc enabled.

I can understand that deallocating the items in the order (or actually, the reverse order) they were allocated is faster than doing so in a rather haphazard manner (i.e., like dict). However, I am not sure why pymalloc accentuates this behavior.

-- Alexandre

Python 2.6 with pymalloc, without pydebug

alex@helios:~$ python2.6 dict_dealloc_test.py
creating 397476 items...
 -> 6.613 s.
building dict...
 -> 0.230 s.
deleting items...
 -> 0.059 s.
deleting dict...
 -> 2.299 s.
total deallocation time: 2.358 seconds.

alex@helios:~$ python2.6 dict_dealloc_test.py
creating 397476 items...
 -> 6.530 s.
building dict...
 -> 0.228 s.
deleting dict...
 -> 0.089 s.
deleting items...
 -> 0.971 s.
total deallocation time: 1.060 seconds.

Python 2.6 without pymalloc, without pydebug

alex@helios:release26-maint$ ./python /home/alex/dict_dealloc_test.py
creating 397476 items...
 -> 5.921 s.
building dict...
 -> 0.244 s.
deleting items...
 -> 0.073 s.
deleting dict...
 -> 1.502 s.
total deallocation time: 1.586 seconds.

alex@helios:release26-maint$ ./python /home/alex/dict_dealloc_test.py
creating 397476 items...
 -> 6.122 s.
building dict...
 -> 0.237 s.
deleting dict...
 -> 0.092 s.
deleting items...
 -> 1.238 s.
total deallocation time: 1.330 seconds.

alex@helios:~$ python2.4 dict_dealloc_test.py
creating 397476 items...
 -> 6.164 s.
building dict...
 -> 0.218 s.
deleting items...
 -> 0.057 s.
deleting dict...
 -> 1.185 s.
total deallocation time: 1.243 seconds.

alex@helios:~$ python2.4 dict_dealloc_test.py
creating 397476 items...
 -> 6.202 s.
building dict...
 -> 0.218 s.
deleting dict...
 -> 0.090 s.
deleting items...
 -> 0.852 s.
total deallocation time: 0.943 seconds.

######

import random
import time
import gc

# Adjust this parameter according to your system RAM!
target_size = int(2.0 * 1024**3)   # 2.0 GB

pool_size = 4 * 1024
# This is a ballpark estimate: 60 bytes overhead for each
# { dict entry struct + float object + tuple object header },
# 1.3 overallocation factor for the dict.
target_length = int(target_size / (1.3 * (pool_size + 60)))

def make_items():
    print ("creating %d items..." % target_length)

    # 1. Initialize a set of pre-computed random keys.
    keys = [random.random() for i in range(target_length)]

    # 2. Build the values that will constitute the dict. Each value will, as
    # far as possible, span a contiguous `pool_size` memory area.

    # Over 256 bytes per alloc, PyObject_Malloc defers to the system malloc()
    # We avoid that by allocating tuples of smaller longs.
    int_size = 200
    # 24 roughly accounts for the long object overhead (YMMV)
    int_start = 1 << ((int_size - 24) * 8 - 7)
    int_range = range(1, 1 + pool_size // int_size)
    values = [None] * target_length
    # Maximize allocation locality by pre-allocating the values
    for n in range(target_length):
        values[n] = tuple(int_start + j for j in int_range)

    return list(zip(keys, values))

if __name__ == "__main__":
    gc.disable()

    t1 = time.time()
    items = make_items()
    t2 = time.time()
    print " -> %.3f s." % (t2 - t1)

    print "building dict..."
    t1 = time.time()
    testdict = dict(items)
    t2 = time.time()
    print " -> %.3f s." % (t2 - t1)

    def delete_testdict():
        global testdict
        print "deleting dict..."
        t1 = time.time()
        del testdict
        t2 = time.time()
        print " -> %.3f s." % (t2 - t1)

    def delete_items():
        global items
        print "deleting items..."
        t1 = time.time()
        del items
        t2 = time.time()
        print " -> %.3f s." % (t2 - t1)

    t1 = time.time()
    # Swap these, and look at the total time
    delete_items()
    delete_testdict()
    t2 = time.time()
    print "total deallocation time: %.3f seconds." % (t2 - t1)
I'd like to suggest here, if you are giving this code a facelift, that on Windows you use VirtualAlloc and friends to allocate the arenas. This gives you the most direct access to the VM manager and makes sure that a released arena is immediately available to the rest of the system. It also makes sure that you don't mess with the regular heap and fragment it. Kristján -----Original Message----- From: python-dev-bounces+kristjan=ccpgames.com@python.org [mailto:python-dev-bounces+kristjan=ccpgames.com@python.org] On Behalf Of "Martin v. Löwis" Sent: 22. desember 2008 22:56 To: Antoine Pitrou Cc: python-dev@python.org Subject: Re: [Python-Dev] extremely slow exit for program having huge (45G) dict (python 2.5.2)
Allocation of a new pool would have to do a linear search in these pointers (finding the arena with the least number of pools);
You mean the least number of free pools, right?
Correct.
I'd like to suggest here, if you are giving this code a facelift, that on Windows you use VirtualAlloc and friends to allocate the arenas. This gives you the most direct access to the VM manager and makes sure that a released arena is immediately available to the rest of the system. It also makes sure that you don't mess with the regular heap and fragment it.
While I'd like to see this done myself, I believe it is independent from the problem at hand. Contributions are welcome. Regards, Martin
participants (19)
- "Martin v. Löwis"
- Adam Olsen
- Alexandre Vassalotti
- Andrew Bennetts
- Andrew MacIntyre
- Antoine Pitrou
- Ivan Krstić
- Jesus Cea
- Kristján Valur Jónsson
- Leif Walsh
- M.-A. Lemburg
- Michael Foord
- Mike Coleman
- Nick Coghlan
- Scott David Daniels
- skip@pobox.com
- Steve Holden
- Steven D'Aprano
- Tim Peters