Program very slow to finish

Sun Nov 4 16:09:32 EST 2001

Paul Rubin wrote:

> Fred <fredNo at nospamco.com> writes:
> > Is this a garbage collection issue?  Is there a better way to count the
> > individual values than dictionaries?  I put the sys.exit call in while
> > trying to figure out what was happening but it didn't make a difference.
>
> There's only about 120,000 hash entries in your example run.  It
> shouldn't take anything like 1 minute for sys.exit to free them.
> Something else is going on.  I'd be interested to know what sys._exit
> does.  However, it's pretty bad if freeing the entries is that
> slow--bad enough that I'd probably open a bug.
>

    It does look like a bug.  os.exit(0) ends the program right after the
print statements, taking about 1/3 of the time.

    I see another respondent to this thread has done more tests and has opened
a bug report.

>
> Also, counting the entries with len(nacc.keys()) is pretty horrendous
> if the number of entries is very large (several million).  For 120,000
> it's probably tolerable.
>
> If you expect to get a much larger number of distinct values (like
> 10's or 100's of million distinct) in your 100 GB data set, you
> probably don't want to use Python dictionaries to remember them.  The
> overhead per value is rather large, like dozens of bytes.
>

    Well, it looks like I'm going to top out at about 600K values for the
three variables; so far the times look pretty good.  A first test had python
beating compiled fortran, SAS and perl.

>
> If you can convert the keys (the substrings line[8:14] and so forth)
> into integers (they're account numbers and such?) and they're not too
> sparse, you can use a bit map to record which ones you've seen, then
> iteratively count the set bits at the end.  If they're sparse but you
> don't mind slight inaccuracy in the final counts, you can hash them
> into integers and then use a bit map.
>

     Unfortunately they're not integers; they are a combination of digits and
letters. There may be few enough diferent letters that I could convert to hex
or such.

>
> If you need a precise count, and the values are sparse, and there's
> too many to fit it memory, you can divide the data into smaller
> pieces, count each piece, then merge the counts.  If you're just
> processing a single data set once, the simplest (but not fastest)
> thing to do is select out the keys and write them to a file.
> Then use the Unix sort utility to sort the file discarding
> duplicates (sort -u), then count the lines in the output.
> The sort utility knows how to handle files that don't fit in ram.

    I'm already dividing the data into pieces about 1.5Gb each (that's how we
want the data summarized).  I expect I will use sort at some point as well.

thanks