Populating a dictionary, fast
Michael Bacarella
mbac at gpshopper.com
Sun Nov 11 11:25:15 EST 2007
Firstly, thank you for all of your help so far, I really appreciate it.
> > So, you think Python's dict implementation degrades towards O(N)
> > performance when it's fed millions of 64-bit pseudo-random longs?
>
> No.
Yes.
I tried your code (with one change: printing the time on the feedback lines)
and got the same terrible performance against my data set.
$ cat stevescode.py
#!/usr/bin/python
"""Read a big file into a dict."""
import gc
import time
print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        # Give feedback.
        print "Line %d" % n, time.asctime()
    id, name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name
print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
    gc.enable()
print "Finishing at %s" % time.asctime()
$ ./stevescode.py
Starting at Sun Nov 11 10:46:37 2007
Line 0 Sun Nov 11 10:46:37 2007
Line 1000000 Sun Nov 11 10:48:22 2007
Line 2000000 Sun Nov 11 10:51:08 2007
Line 3000000 Sun Nov 11 10:56:16 2007
Line 4000000 Sun Nov 11 10:58:41 2007
Line 5000000 Sun Nov 11 11:03:26 2007
^C
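(A side note for anyone who wants to take the file I/O out of the picture entirely: the sketch below is mine, not one of the scripts from this thread, and it times nothing but the insertion of pseudo-random 64-bit longs into a dict. The randint keys only approximate my real key distribution.)

#!/usr/bin/python
"""Time pure dict insertion of pseudo-random 64-bit long keys.
A diagnostic sketch only -- not one of the scripts from this thread."""
import random
import time

N = 8191180                                             # same count as the sample file
keys = [random.randint(0, 2**64) for i in xrange(N)]    # build keys up front so
                                                        # key generation isn't timed
d = {}
start = time.time()
for k in keys:
    d[k] = 'x'                                          # constant value; only the key
                                                        # hashing/insertion matters here
print "Inserted %d keys in %.1f seconds" % (len(d), time.time() - start)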
To prove that my machine is sane, I ran the same script against your generated
sample file and got _excellent_ performance: start to finish in under a minute.
$ cat steves-makedict.py
#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""
bits64 = 2**64
import random
template = '%d:This is a bunch of text...\n'
fp = open('id2name.txt', 'w')
for i in xrange(8191180):
    fp.write(template % random.randint(0, bits64))
fp.close()
$ ./steves-makedict.py
$ ./stevescode.py
Starting at Sun Nov 11 11:15:31 2007
Line 0 Sun Nov 11 11:15:31 2007
Line 1000000 Sun Nov 11 11:15:37 2007
Line 2000000 Sun Nov 11 11:15:43 2007
Line 3000000 Sun Nov 11 11:15:49 2007
Line 4000000 Sun Nov 11 11:15:54 2007
Line 5000000 Sun Nov 11 11:16:00 2007
Line 6000000 Sun Nov 11 11:16:07 2007
Line 7000000 Sun Nov 11 11:16:12 2007
Line 8000000 Sun Nov 11 11:16:18 2007
Items in dict: 8191180
Completed import at Sun Nov 11 11:16:19 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 11:16:23 2007
Finishing at Sun Nov 11 11:16:23 2007
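(Since the very same script is fast on the generated file and crawls on mine, the obvious variable is the keys themselves. Here is a rough sketch of my own, assuming the same id:name format, for eyeballing duplicates and key ranges in either file; note that filling the set exercises the same hashing as the dict, so even its running time is informative.)

#!/usr/bin/python
"""Rough key diagnostics for an id:name file -- a sketch, not from the thread."""
import sys

seen = set()
dupes = 0
smallest = None
biggest = None
for n, line in enumerate(open(sys.argv[1], 'r')):
    key = long(line.split(':', 1)[0])
    if key in seen:
        dupes += 1
    else:
        seen.add(key)
    if smallest is None or key < smallest:
        smallest = key
    if biggest is None or key > biggest:
        biggest = key
print "lines: %d  unique keys: %d  duplicates: %d" % (n + 1, len(seen), dupes)
print "smallest key: %d  largest key: %d" % (smallest, biggest)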
> Notice that the dict is completely read into memory in just two and a
> half minutes. The script then tries to delete the dict, and 32 minutes
> later is still struggling. That's the point I got sick of waiting and
> interrupted the script.
>
> Conclusion: it's a memory issue, or maybe a garbage collection issue, not
> a problem with dicts.
As you can see, that's not the case at all against my data set.
> (1) Presumably you don't want to run your app with the garbage collector
> turned off. You might still want to play around with the gc module to see
> what you can learn.
As you can see, your version did no better. :(
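(For reference, the gc knobs being talked about amount to roughly the following; gc.disable(), the bluntest of them, is already in the script above and made no difference here. A sketch only.)

import gc

# See whether cyclic collection even runs during the load:
gc.set_debug(gc.DEBUG_STATS)      # print a summary line for each collection pass
print gc.get_threshold()          # defaults to (700, 10, 10)
print gc.get_count()              # allocations since the last collection

# Alternative to disabling gc outright: make collections much rarer.
gc.set_threshold(100000, 10, 10)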
> (2) More memory will help avoid paging. If you can't get more memory, try
> more virtual memory. It will still be slow, but at least the operating
> system doesn't have to try moving blocks around as much.
The machine has 8GB of RAM and is not doing anything else when I run this.
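(For what it's worth, here is roughly how I'd confirm the process isn't paging while the load runs: a sketch assuming Linux, since it reads /proc. Calling it from the feedback branch of the loop prints the resident size alongside each timestamp.)

def rss_mb():
    """Resident set size of the current process in MB (Linux only)."""
    for line in open('/proc/self/status'):
        if line.startswith('VmRSS:'):
            return int(line.split()[1]) / 1024.0   # /proc reports kB
    return 0.0

# e.g. inside the feedback branch:
#     print "Line %d" % n, time.asctime(), "%.0f MB" % rss_mb()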
> (3) Are you sure you need all eight-million-plus items in the cache all
> at once?
Yes.
> (4) There are lots of algorithms out there for dealing with data too big
> to fit into main memory. Do some research.
It DOES fit into main memory and a dictionary is exactly the right way
to do this.
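(A back-of-envelope check of that claim; the per-object sizes below are my own rough assumptions for a 64-bit CPython 2.x build, not measured numbers.)

# Rough memory estimate for ~8.2 million long->str entries.
entries = 8191180
slots = 2 ** 24           # dict keeps its table under 2/3 full, so ~16.8M slots
slot_bytes = 24           # cached hash + key pointer + value pointer
long_bytes = 40           # one small long object, roughly (assumption)
str_bytes = 80            # one short value string, roughly (assumption)

total = slots * slot_bytes + entries * (long_bytes + str_bytes)
print "~%.1f GB" % (total / 1024.0 ** 3)    # prints ~1.3 GB -- far below 8GB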
> (5) There is a data structure designed for dealing with tens of millions
> of records at once. It is called "a database". If you can't find a better
> algorithm, and refuse to use an existing RDBMS, I suspect you're going to
> end up inventing a primitive, inefficient, buggy database which is no
> faster than existing systems out there.
I've tried three disk-based implementations already (BerkeleyDB, cdb, and an RDBMS).
Performance is terrible because they end up doing too many random disk seeks.
Pre-caching all of the data ahead of time has offered us the best performance so far,
but it's still slower than it ought to be.
Creating a HEAP TABLE in the RDBMS is an idea, but moving all of this really simple
code into the RDBMS just to find a hashing algorithm that doesn't choke on my keys
sounds pretty lame.
A hash cached in main memory is the right way to do this.
The Perl version runs *very* fast, after all.
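(One cheap experiment that would separate "hashing 64-bit longs" from everything else: keep the keys as the raw strings from the file. A sketch, assuming the same id:name format. If this version is just as slow, the long keys aren't the culprit; if it's fast, they are.)

#!/usr/bin/python
"""Same load, but with ids kept as string keys -- a diagnostic sketch."""
import time

id2name = {}
start = time.time()
for line in open('id2name.txt', 'r'):
    id, name = line.strip().split(':', 1)
    id2name[id] = name        # no long() conversion; string hashing instead
print "%d items in %.1f seconds" % (len(id2name), time.time() - start)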