[Python-Dev] extremely slow exit for program having huge (45G) dict (python 2.5.2)

Mike Coleman tutufan at gmail.com
Sat Dec 20 18:09:03 CET 2008


Andrew, this is on an (Intel) x86_64 box with 64GB of RAM.  I don't
recall the maker or details of the architecture off the top of my
head, but it would be something "off the rack" from Dell or maybe HP.
There were other users on the box at the time, but nothing heavy was
running, and nothing gave me reason to think it was affecting my
program.

It's running CentOS 5, I think, so the glibc might be several years
old.  Your malloc idea sounds plausible to me.  If it is a libc
problem, it would be nice if there were some way to tell malloc to
"live for today because there is no tomorrow" in the terminal phase
of the program.
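
In the meantime, the program-level version of that is what I'm
already doing: skip interpreter teardown entirely with os._exit().
A minimal sketch of the pattern (flushing by hand first, since
os._exit() bypasses atexit handlers, stdio flushing, and all object
deallocation):

    import os
    import sys

    # ... run the program, print the final results ...

    # os._exit() terminates the process immediately: no atexit
    # handlers, no stdio flushing, and no teardown of the huge dict.
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(0)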

I'm not sure exactly how to attack this.  Callgrind is cool, but
there's no way it will work on something this size.  Timed ltrace
output might be interesting.  Or maybe a gprof'ed Python, though
that's more work.
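
Actually, ctypes (new in 2.5) might let me try your "small C program"
check without leaving Python: hand arena-sized blocks straight to the
platform malloc()/free() and time the free phase.  A rough, untested
sketch -- the libc.so.6 soname and the block count are assumptions to
adjust:

    import ctypes
    import time

    libc = ctypes.CDLL("libc.so.6")   # assumes glibc's usual soname
    libc.malloc.restype = ctypes.c_void_p
    libc.malloc.argtypes = [ctypes.c_size_t]
    libc.free.argtypes = [ctypes.c_void_p]

    ARENA = 256 * 1024   # PyMalloc arena size in 2.5
    COUNT = 16 * 1024    # ~4GB total; push toward 45GB as RAM permits

    blocks = [libc.malloc(ARENA) for i in xrange(COUNT)]
    for p in blocks:
        ctypes.memset(p, 0, ARENA)   # touch pages so they're committed

    t0 = time.time()
    for p in blocks:              # free piecemeal, as arena teardown does
        libc.free(p)
    print "free()ed %d blocks in %.1f s" % (COUNT, time.time() - t0)

If the free loop alone takes minutes, that would point pretty
squarely at the platform malloc().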

Regarding interning, I thought that only worked for strings.  Is
there some way to intern integers?  I'm probably creating 300M
integers, more or less uniformly distributed across range(10000).
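
If there's no built-in way, I suppose I could roll my own along the
lines you suggest: keep one canonical int object per value.  A sketch
(the helper names are mine, just for illustration):

    # One canonical int object per value, so 300M logical ints
    # collapse to at most 10000 distinct objects.
    _interned = {}

    def intern_int(n):
        return _interned.setdefault(n, n)

    # Since my values all lie in range(10000), a flat table is even
    # simpler: in Python 2, range() already materializes the objects.
    _TABLE = range(10000)

    def intern_small(n):
        return _TABLE[n]

(Then I'd store intern_small(n) instead of n when building the dict
values.)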

Mike

On Sat, Dec 20, 2008 at 4:08 AM, Andrew MacIntyre
<andymac at bullseye.apana.org.au> wrote:
> Mike Coleman wrote:
>>
>> I have a program that creates a huge (45GB) defaultdict.  (The keys
>> are short strings; the values are short lists of (string, int)
>> pairs.)  Nothing is shared, except possibly the strings and ints.
>>
>> The program takes around 10 minutes to run, but longer than 20 minutes
>> to exit (I gave up at that point).  That is, after executing the final
>> statement (a print), it is apparently spending a huge amount of time
>> cleaning up before exiting.  I haven't installed any exit handlers or
>> anything like that, all files are already closed and stdout/stderr
>> flushed, and there's nothing special going on.  I have called
>> 'gc.disable()' for performance (which is hideous without it), since
>> I have no reason to think there are any reference cycles.
>>
>> Currently I am working around this by calling os._exit(), which is
>> immediate, but that seems like a bit of a hack.  Is this something
>> that needs fixing, or that has already been fixed?
>
> You don't mention the platform, but...
>
> This behaviour was not unknown in the distant past, with much smaller
> datasets.  Most of the problems then related to the platform malloc()
> doing funny things as stuff was free()ed, like coalescing free space.
>
> [I once sat and watched a Python script run in something like 30 seconds
>  and then take nearly 10 minutes to terminate, as you describe (Python
>  2.1/Solaris 2.5/Ultrasparc E3500)... and that was only a couple of
>  hundred MB of memory - the Solaris 2.5 malloc() had some undesirable
>  properties from Python's point of view]
>
> PyMalloc effectively removed this as an issue for most cases and platform
> malloc()s have also become considerably more sophisticated since then,
> but I wonder whether the sheer size of your dataset is unmasking related
> issues.
>
> Note that in Python 2.5 PyMalloc does free() unused arenas as a surplus
> accumulates (2.3 & 2.4 never free()ed arenas).  Your platform malloc()
> might have odd behaviour with 45GB of arenas returned to it piecemeal.
> This is something that could be checked with a small C program.
> Calling os._exit() circumvents the free()ing of the arenas.
>
> Also consider that, with the exception of small integers (-5..256),
> no interning of integers is done.  If your data contains large
> quantities of integers with non-unique values outside that small
> range, you may find it useful to do your own interning.
>
> --
> -------------------------------------------------------------------------
> Andrew I MacIntyre                     "These thoughts are mine alone..."
> E-mail: andymac at bullseye.apana.org.au  (pref) | Snail: PO Box 370
>       andymac at pcug.org.au             (alt) |        Belconnen ACT 2616
> Web:    http://www.andymac.org/               |        Australia
>

