Program very slow to finish
fredNo at nospamco.com
Sun Nov 4 22:09:32 CET 2001
Paul Rubin wrote:
> Fred <fredNo at nospamco.com> writes:
> > Is this a garbage collection issue? Is there a better way to count the
> > individual values than dictionaries? I put the sys.exit call in while
> > trying to figure out what was happening but it didn't make a difference.
> There's only about 120,000 hash entries in your example run. It
> shouldn't take anything like 1 minute for sys.exit to free them.
> Something else is going on. I'd be interested to know what sys._exit
> does. However, it's pretty bad if freeing the entries is that
> slow--bad enough that I'd probably open a bug.
It does look like a bug. os.exit(0) ends the program right after the
print statements, taking about 1/3 of the time.
I see another respondent to this thread has done more tests and has opened
a bug report.
> Also, counting the entries with len(nacc.keys()) is pretty horrendous
> if the number of entries is very large (several million). For 120,000
> it's probably tolerable.
> If you expect to get a much larger number of distinct values (like
> 10's or 100's of million distinct) in your 100 GB data set, you
> probably don't want to use Python dictionaries to remember them. The
> overhead per value is rather large, like dozens of bytes.
Well, it looks like I'm going to top out at about 600K values for the
three variables; so far the times look pretty good. A first test had python
beating compiled fortran, SAS and perl.
> If you can convert the keys (the substrings line[8:14] and so forth)
> into integers (they're account numbers and such?) and they're not too
> sparse, you can use a bit map to record which ones you've seen, then
> iteratively count the set bits at the end. If they're sparse but you
> don't mind slight inaccuracy in the final counts, you can hash them
> into integers and then use a bit map.
Unfortunately they're not integers; they are a combination of digits and
letters. There may be few enough diferent letters that I could convert to hex
> If you need a precise count, and the values are sparse, and there's
> too many to fit it memory, you can divide the data into smaller
> pieces, count each piece, then merge the counts. If you're just
> processing a single data set once, the simplest (but not fastest)
> thing to do is select out the keys and write them to a file.
> Then use the Unix sort utility to sort the file discarding
> duplicates (sort -u), then count the lines in the output.
> The sort utility knows how to handle files that don't fit in ram.
I'm already dividing the data into pieces about 1.5Gb each (that's how we
want the data summarized). I expect I will use sort at some point as well.
More information about the Python-list