Garbage collection

Steven D'Aprano steve at REMOVE.THIS.cybersource.com.au
Wed Mar 21 19:01:08 CET 2007


On Wed, 21 Mar 2007 17:19:23 +0000, Tom Wright wrote:

>> So what's your actual problem that you are trying to solve?
> 
> I have a program which reads a few thousand text files, converts each to a
> list (with readlines()), creates a short summary of the contents of each (a
> few floating point numbers) and stores this summary in a master list.  From
> the amount of memory it's using, I think that the lists containing the
> contents of each file are kept in memory, even after there are no
> references to them.  Also, if I tell it to discard the master list and
> re-read all the files, the memory use nearly doubles so I presume it's
> keeping the lot in memory.

Ah, now we're getting somewhere!

Python's caching behaviour with strings is almost certainly going to be
different to its caching behaviour with ints. (For example, Python caches
short strings that look like identifiers, but I don't believe it caches
great blocks of text or short strings which include whitespace.)
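
If you want to see what "caching" means here, this is a small sketch rather than a description of what the interpreter does behind your back. It builds equal strings at run time and compares them by identity. The helper names are made up for illustration. The only thing I'd rely on is that sys.intern() (a builtin called intern() in older versions) hands back one shared object for equal strings. Whether un-interned equal strings share storage is an implementation detail.

```python
import sys


def build_at_runtime(parts):
    """Join parts at run time so the result is not a compile-time constant."""
    return "".join(parts)


def share_after_intern(parts):
    """Intern two equal run-time strings and report whether they are one object."""
    a = sys.intern(build_at_runtime(parts))
    b = sys.intern(build_at_runtime(parts))
    return a is b


if __name__ == "__main__":
    x = build_at_runtime(["some text ", "with spaces"])
    y = build_at_runtime(["some text ", "with spaces"])
    # Equal values; whether they are the same object is up to the implementation.
    print(x == y, x is y)
    # Explicitly interned strings are guaranteed to be shared.
    print(share_after_intern(["some text ", "with spaces"]))
```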

But again, you haven't really described a problem, just a set of
circumstances. Yes, the memory usage doubles. *Is* that a problem in
practice? A few thousand 1KB files is one thing; a few thousand 1MB files
is an entirely different story.

Is the most cost-effective solution to the problem to buy another 512MB of
RAM? I don't say that it is. I just point out that you haven't given us
any reason to think it isn't.


> The program may run through several collections of files, but it only keeps
> a reference to the master list of the most recent collection it's looked
> at.  Obviously, it's not ideal if all the old collections hang around too,
> taking up space and causing the machine to swap.

Without knowing exactly what you're doing with the data, it's hard to tell
where the memory is going. I suppose if you are storing huge lists of
millions of short strings (words?), they might all be cached. Is there a
way you can avoid storing the hypothetical word-lists in RAM, perhaps by
writing them straight out to a disk file? That *might* make a
difference to the caching algorithm used.

Or you could just have an "object leak" somewhere. Do you have any
complicated circular references that the garbage collector can't resolve?
Lists-of-lists? Trees? Anything where objects aren't being freed when you
think they are? Are you holding on to references to lists? It's more
likely that your code simply isn't freeing lists you think are being freed
than it is that Python is holding on to tens of megabytes of random text.
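
One way to check whether an object really is freed, sketched under the assumption that you are running CPython: keep only a weak reference to it, drop your own reference, and see whether the weak reference goes dead. Plain lists can't be weakly referenced, so the sketch uses a made-up Record class as a stand-in for your per-file data. The function name is likewise hypothetical.

```python
import gc
import weakref


class Record(object):
    """Hypothetical stand-in for one file's data (lists can't be weakref'd)."""

    def __init__(self, lines):
        self.lines = lines
        self.partner = None


def check_freed(make_cycle):
    """Create a Record, drop it, and report when it actually dies.

    Returns (freed_by_refcount, freed_after_collect).
    """
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic cycle collector out of the way
    try:
        rec = Record(["some", "lines"])
        if make_cycle:
            other = Record([])
            rec.partner = other
            other.partner = rec  # circular reference
            del other
        probe = weakref.ref(rec)
        del rec
        freed_by_refcount = probe() is None
        gc.collect()  # an explicit collection still works while disabled
        freed_after_collect = probe() is None
    finally:
        if was_enabled:
            gc.enable()
    return freed_by_refcount, freed_after_collect


if __name__ == "__main__":
    print(check_freed(make_cycle=False))
    print(check_freed(make_cycle=True))
```

If the first value comes back False for an object you believe is unreferenced, something is still pointing at it, and gc.get_referrers() can help you find out what.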



-- 
Steven.



