Populating a dictionary, fast
Michael Bacarella
mbac at gpshopper.com
Sat Nov 10 20:18:37 EST 2007
> That's an awfully complicated way to iterate over a file. Try this
> instead:
>
> id2name = {}
> for line in open('id2name.txt'):
>     id,name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name
>
> > This takes about 45 *minutes*
> >
> On my system, it takes about a minute and a half to produce a
> dictionary with 8191180 entries.
Doing something similar on my system is very fast as well.
$ cat dict-8191180.py
#!/usr/bin/python
v = {}
for i in xrange(8191180):
    v[i] = i

$ time ./dict-8191180.py

real    0m5.877s
user    0m4.953s
sys     0m0.924s
But...
> > If I comment out the last line in the loop body it takes only
> > about 30 _seconds_ to run. This would seem to implicate the line
> > id2name[id] = name as being excruciatingly slow.
>
> No, dictionary access is one of the most highly-optimized, fastest,
> most efficient parts of Python. What it indicates to me is that your
> system is running low on memory, and is struggling to find room for
> 517MB worth of data.
If only it were so easy.
$ free
             total       used       free     shared    buffers     cached
Mem:       7390244    2103448    5286796          0      38996    1982756
-/+ buffers/cache:      81696    7308548
Swap:      2096472      10280    2086192
Here's your Python implementation running as badly as mine did.
$ wc -l id2name.txt
8191180 id2name.txt
$ cat cache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':',1)
    id = long(id)
    id2name[id] = name
$ time ./cache-id2name.py
^C
I let it go 30 minutes before killing it, since I had to leave. Here it
is in top just before I did the deed:

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+   COMMAND
18802 root   25  0 1301m 1.2g 1148 R 99.9 17.1 36:05.99 cache-id2name.p

36 minutes, 05.99 seconds of CPU time.
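For anyone who wants to reproduce this without my data, here's a rough sketch of a generator that writes a file in the same id:name format with pseudo-random 64-bit ids. The file name, name format, and line count here are hypothetical (and it's Python 3 syntax, so plain int rather than long); scale the count up to 8191180 to match the real file.

```python
#!/usr/bin/env python3
import random

def write_id2name(path, count):
    # Hypothetical generator: one "id:name" pair per line, with
    # pseudo-random 64-bit ids, mimicking the layout of id2name.txt.
    with open(path, 'w') as f:
        for i in range(count):
            id = random.getrandbits(64)
            f.write('%d:name-%d\n' % (id, i))

if __name__ == '__main__':
    random.seed(0)  # deterministic output for repeatability
    write_id2name('id2name-sample.txt', 1000)
```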
To rule out the file reading/parsing logic as the culprit, here's the
same thing, but with the dictionary insertion removed.
$ cat nocache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':',1)
    id = long(id)
$ time ./nocache-id2name.py

real    0m33.518s
user    0m33.070s
sys     0m0.415s
Here's a Perl implementation running very fast.
$ cat cache-id2name.pl
#!/usr/bin/perl
my %id2name = ();
my $line;
my $id;
my $name;
open(F, "<id2name.txt");
foreach $line (<F>) {
    chomp($line);
    # Perl's split LIMIT is a maximum field count, so the limit must
    # be 2 here to match Python's split(':', 1).
    ($id, $name) = split(/:/, $line, 2);
    $id = int($id);
    $id2name{$id} = $name;
}
$ time ./cache-id2name.pl

real    0m46.363s
user    0m43.730s
sys     0m2.611s
So, do you think Python's dict implementation degrades toward O(N)
performance when it's fed millions of 64-bit pseudo-random longs?
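If anyone wants to poke at that hypothesis directly, here's a rough sketch (Python 3 syntax; the sizes are illustrative, far smaller than my 8-million-row case) that times dict population with sequential versus pseudo-random 64-bit keys, so the per-key cost can be compared as n grows:

```python
#!/usr/bin/env python3
import random
import time

def time_inserts(keys):
    """Return the seconds taken to populate a fresh dict from keys."""
    d = {}
    t0 = time.perf_counter()
    for k in keys:
        d[k] = None
    return time.perf_counter() - t0

if __name__ == '__main__':
    random.seed(0)
    for n in (100000, 400000):
        seq = list(range(n))
        rnd = [random.getrandbits(64) for _ in range(n)]
        # If insertion were degrading toward O(N) per key, the per-key
        # cost for the random 64-bit keys would grow sharply with n.
        print('n=%d  seq=%.1f ns/key  rnd=%.1f ns/key' % (
            n, time_inserts(seq) / n * 1e9, time_inserts(rnd) / n * 1e9))
```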