Populating a dictionary, fast
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Sat Nov 10 23:32:44 EST 2007
On Sat, 10 Nov 2007 17:18:37 -0800, Michael Bacarella wrote:
> So, you think the Python's dict implementation degrades towards O(N)
> performance when it's fed millions of 64-bit pseudo-random longs?
No.
Here's my sample file:
$ wc -l id2name.txt
8191180 id2name.txt
$ ls -lh id2name.txt
-rw-rw-r-- 1 steve steve 371M 2007-11-11 14:00 id2name.txt
And the results of reading it into a dict (code follows below):
$ time ./slurp_dict.py
Starting at Sun Nov 11 14:26:51 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 14:29:31 2007
Starting to delete dict...
Traceback (most recent call last):
File "./slurp_dict.py", line 20, in <module>
del id2name
KeyboardInterrupt
real 35m52.334s
user 1m17.663s
sys 0m16.758s
Notice that the dict is completely read into memory in just two and a
half minutes. The script then tries to delete the dict, and 32 minutes
later it is still struggling. That's the point at which I got sick of
waiting and interrupted the script.
Conclusion: it's a memory issue, or maybe a garbage collection issue, not
a problem with dicts.
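For what it's worth, if you want to check the dict machinery in
isolation, time insertions alone, with no file I/O and no deletion.
Here's a minimal sketch of that kind of test (the one-million-key count
and the use of time.time() are just my choices for illustration):

#!/usr/bin/python
"""Rough timing of dict insertion alone: no file I/O, no deletion."""
import random
import time

N = 1000000  # smaller than the real data set; adjust to taste
keys = [random.randint(0, 2**64) for i in xrange(N)]

d = {}
start = time.time()
for k in keys:
    d[k] = 'some value'
print "Inserted %d keys in %.2f seconds" % (len(d), time.time() - start)
###

If that stays fast as you scale N up (and everything fits in RAM), the
dict itself isn't where the time is going.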
Here's my code for creating the key:value file in the first place:
#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""
bits64 = 2**64
import random
template = '%d:This is a bunch of text...\n'
fp = open('id2name.txt', 'w')
for i in xrange(8191180):
    fp.write(template % random.randint(0, bits64))
fp.close()
###
And here's my code for slurping it in to a dict:
#!/usr/bin/python
"""Read a big file into a dict."""
import gc
import time
print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        # Give feedback.
        print "Line %d" % n
    id, name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name
print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
    gc.enable()
print "Finishing at %s" % time.asctime()
###
So, what can you do? Given that I'm not willing to do any more unpaid
experimentation for you, here are my suggestions, in no particular order:
(1) Presumably you don't want to run your app with the garbage collector
turned off. You might still want to play around with the gc module to see
what you can learn (a rough sketch follows after this list).
(2) More memory will help avoid paging. If you can't get more RAM, try
adding more swap space. It will still be slow, but at least the operating
system won't have to work as hard shuffling pages in and out.
(3) Are you sure you need all eight-million-plus items in the cache all
at once?
(4) There are lots of algorithms out there for dealing with data too big
to fit into main memory. Do some research.
(5) There is a data structure designed for dealing with tens of millions
of records at once. It is called "a database". If you can't find a better
algorithm, and refuse to use an existing RDBMS, I suspect you're going to
end up inventing a primitive, inefficient, buggy database which is no
faster than the existing systems out there. (A minimal sqlite3 sketch also
follows below.)
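For (1), here's the sort of gc experimentation I have in mind. This is
only a sketch; the threshold values are arbitrary examples, not
recommendations:

import gc

print gc.isenabled()       # is automatic collection currently on?
print gc.get_threshold()   # defaults to (700, 10, 10)

gc.disable()               # switch off automatic collection during a bulk load
# ... build the big dict here ...
gc.enable()

# Or leave collection on, but make it much less eager:
gc.set_threshold(100000, 10, 10)   # arbitrary example values

print gc.collect()         # force a full collection; returns the number
                           # of unreachable objects found
###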
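And for (5), a minimal sketch of pushing the id:name mapping into SQLite
instead of a dict (sqlite3 is in the standard library from Python 2.5;
the database and table names here are just examples):

#!/usr/bin/python
"""Load id:name pairs into an on-disk SQLite table instead of a dict."""
import sqlite3

conn = sqlite3.connect('id2name.db')
# SQLite integers are signed 64-bit, so store the key as text to be safe
# with ids that might exceed 2**63 - 1.
conn.execute(
    "CREATE TABLE IF NOT EXISTS id2name (id TEXT PRIMARY KEY, name TEXT)")

def rows():
    for line in open('id2name.txt', 'r'):
        id, name = line.strip().split(':', 1)
        yield id, name

conn.executemany("INSERT OR REPLACE INTO id2name VALUES (?, ?)", rows())
conn.commit()

# Lookups then happen on disk, without holding everything in RAM:
cursor = conn.execute(
    "SELECT name FROM id2name WHERE id = ?", ('1234567890',))
print cursor.fetchone()
###

Lookups come back as one-element tuples (or None if the id isn't there),
and nothing ever needs eight million objects live in memory at once.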
--
Steven.