Program very slow to finish

Paul Rubin phr-n2001d at nightsong.com
Sat Nov 3 14:20:22 EST 2001


Fred <fredNo at nospamco.com> writes:
> Is this a garbage collection issue?  Is there a better way to count the
> individual values than dictionaries?  I put the sys.exit call in while
> trying to figure out what was happening but it didn't make a difference.

There are only about 120,000 hash entries in your example run.  It
shouldn't take anything like a minute for sys.exit to free them, so
something else is probably going on.  I'd be interested to know what
happens if you end the program with os._exit instead, which skips the
interpreter's cleanup entirely.  If freeing the entries really is that
slow, that's bad enough that I'd probably file a bug.
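
One way to test that, sketched below with a stand-in dictionary (the
size and keys are made up to roughly match your run): end the script
with os._exit(0) and compare the wall-clock time against the sys.exit
version.

    import os, time

    t0 = time.time()
    d = {}                                 # stand-in for your counting dicts
    for i in range(120000):
        d["%06d" % i] = 1
    print("build: %.2f seconds" % (time.time() - t0))

    # os._exit terminates the process with no interpreter cleanup, so
    # the dictionary is never freed.  If this exits instantly while the
    # sys.exit() version takes a minute, the time is going into
    # deallocation.
    os._exit(0)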

Also, counting the entries with len(nacc.keys()) is pretty horrendous
if the number of entries is very large (several million): keys()
builds a throwaway list of every key just so len() can measure it.
For 120,000 entries it's probably tolerable.
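
For what it's worth, len() on the dictionary itself gives the same
count without building anything.  A tiny demonstration, with nacc
standing in for the dictionary in your script:

    nacc = {}                        # stand-in dictionary
    for i in range(120000):
        nacc["%06d" % i] = 1

    slow = len(nacc.keys())          # in Python 2, keys() builds a list
                                     # of every key just to count them
    fast = len(nacc)                 # same number, no intermediate list
    print(slow == fast)              # True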

If you expect a much larger number of distinct values (tens or
hundreds of millions) in your 100 GB data set, you probably don't
want to use Python dictionaries to remember them.  The per-entry
overhead is rather large, on the order of dozens of bytes.
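
If you want a rough feel for that overhead, something like the
following gives a lower bound (it assumes a Python new enough to have
sys.getsizeof, and it measures only the dictionary's own table; the
key strings and values cost extra on top):

    import sys

    d = {}
    for i in range(1000000):
        d["%06d" % i] = 1

    # getsizeof reports only the dict's internal table, not the key
    # string objects or the values, so the true per-entry cost is higher.
    print("table bytes per entry: %.1f" % (sys.getsizeof(d) / float(len(d))))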

If you can convert the keys (the substrings line[8:14] and so forth)
into integers (they're account numbers and such?) and they're not too
sparse, you can use a bit map to record which ones you've seen, then
iteratively count the set bits at the end.  If they're sparse but you
don't mind slight inaccuracy in the final counts, you can hash them
into integers and then use a bit map.
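
Here's a minimal sketch of the bit-map approach.  The key range and
the int(line[8:14]) conversion are assumptions based on your
description; adjust MAX_KEY to whatever the real account numbers can
reach.

    MAX_KEY = 10 ** 8                    # assumed upper bound on the keys
    seen = bytearray(MAX_KEY // 8 + 1)   # one bit per possible key, ~12.5 MB

    def mark(key):
        # set the bit for this integer key
        seen[key >> 3] |= 1 << (key & 7)

    def count_seen():
        # count the set bits; bin().count() is slow but simple
        return sum(bin(b).count("1") for b in seen)

    # usage sketch:
    # for line in datafile:
    #     mark(int(line[8:14]))
    # print(count_seen())

For the sparse, slightly-inaccurate variant, replace int(line[8:14])
with hash(line[8:14]) % MAX_KEY and keep everything else the same.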

If you need a precise count, and the values are sparse, and there are
too many to fit in memory, you can divide the data into smaller
pieces, count each piece, then merge the counts.  If you're just
processing a single data set once, the simplest (but not fastest)
thing to do is to select out the keys and write them to a file.  Then
use the Unix sort utility to sort the file, discarding duplicates
(sort -u), and count the lines in the output.  The sort utility knows
how to handle files that don't fit in RAM.
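
A sketch of that pipeline in Python, assuming the key is the
line[8:14] field and a Unix sort is on the path (the file names are
placeholders):

    import subprocess

    # Pass 1: write just the key field, one per line.
    with open("bigdata.txt") as inp, open("keys.txt", "w") as out:
        for line in inp:
            out.write(line[8:14] + "\n")

    # Pass 2: sort -u spills to temporary files when the data doesn't
    # fit in RAM, so this works on arbitrarily large key files.
    n = subprocess.check_output("sort -u keys.txt | wc -l", shell=True)
    print("distinct keys: %d" % int(n))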


