Suitability for long-running text processing?

tsuraan tsuraan at gmail.com
Mon Jan 8 10:41:23 EST 2007


I have a pair of python programs that parse and index files on my computer
to make them searchable.  The problem is that they grow continually until my
system runs out of memory, and then things get ugly.  I remember reading,
when I was first learning python, that the interpreter doesn't gc small
strings, but I assumed that was outdated and sort of forgot about it.
Unfortunately, it seems this is still the case.  A sample program (to type
or paste into the python REPL):

import gc

# Build every 4-character string of printable ASCII (94**4, roughly
# 78 million small strings).
a = []
for i in xrange(33, 127):
    for j in xrange(33, 127):
        for k in xrange(33, 127):
            for l in xrange(33, 127):
                a.append(chr(i) + chr(j) + chr(k) + chr(l))

# Drop the only reference to the list and force a collection.
del a
gc.collect()

The loop is deep enough that I always interrupt it once python's size is
around 250 MB.  Once the gc.collect() call has finished, python's size has
not changed at all.  Even though there are no locals and no references left
to any of the strings that were created, python will not shrink.  This
example is obviously artificial, but I am getting exactly the same behaviour
in my real programs.  Is there some way to convince python to release the
data that is no longer referenced, or do I need to use a different language?
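
For what it's worth, I'm judging python's size from what top reports;
reading VmRSS out of /proc/self/status on the Linux box gives the same
numbers.  A rough helper I use to check it from inside the interpreter
(Linux only, just a sketch):

def rss_kb():
    # Linux-only: return the interpreter's resident set size in kB,
    # as read from /proc/self/status (matches what top shows).
    for line in open('/proc/self/status'):
        if line.startswith('VmRSS:'):
            return int(line.split()[1])
    return None

(On the OS X box I just watch top, since there's no /proc.)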

This has been tried under python 2.4.3 on Gentoo Linux and python 2.3 under
OS X 10.3.  Any suggestions/workarounds would be much appreciated.
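
The only workaround I've thought of so far is to push each memory-heavy
parse/index pass into a short-lived child process, so that whatever memory
it uses goes back to the OS when the child exits.  Something along these
lines (untested sketch; parse_and_index stands in for my real indexing code,
and os.fork limits it to the Linux/OS X boxes):

import os

def index_in_child(path):
    # Run the memory-hungry work in a child process; its memory is
    # returned to the OS when the child exits.
    pid = os.fork()
    if pid == 0:
        parse_and_index(path)   # placeholder for the real indexing step
        os._exit(0)             # exit the child without cleanup
    os.waitpid(pid, 0)          # parent waits for the child to finish

If there's a cleaner way to get the interpreter itself to give memory back,
I'd much rather do that.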