Populating a dictionary, fast
Michael Bacarella
mbac at gpshopper.com
Sat Nov 10 20:18:37 EST 2007
> That's an awfully complicated way to iterate over a file. Try this
> instead:
>
> id2name = {}
> for line in open('id2name.txt'):
>     id,name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name
>
> > This takes about 45 *minutes*
> >
> On my system, it takes about a minute and a half to produce a
> dictionary with 8191180 entries.
Doing something similar on my system is very fast as well.
$ cat dict-8191180.py
#!/usr/bin/python
v = {}
for i in xrange(8191180):
    v[i] = i

$ time ./dict-8191180.py

real    0m5.877s
user    0m4.953s
sys     0m0.924s
But...
> > If I comment out the last line in the loop body it takes only
> > about 30 _seconds_ to run. This would seem to implicate the line
> > id2name[id] = name as being excruciatingly slow.
>
> No, dictionary access is one of the most highly-optimized, fastest,
> most efficient parts of Python. What it indicates to me is that your
> system is running low on memory, and is struggling to find room for
> 517MB worth of data.
If only it were so easy.
$ free
             total       used       free     shared    buffers     cached
Mem:       7390244    2103448    5286796          0      38996    1982756
-/+ buffers/cache:      81696    7308548
Swap:      2096472      10280    2086192
Here's your Python implementation running as badly as mine did.
$ wc -l id2name.txt
8191180 id2name.txt
$ cat cache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':',1)
    id = long(id)
    id2name[id] = name
$ time ./cache-id2name.py
^C
I let it go 30 minutes before killing it, since I had to leave. Here it
is in top just before I did the deed:

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+   COMMAND
18802 root   25  0 1301m 1.2g 1148 R 99.9 17.1 36:05.99 cache-id2name.p

36 minutes, 05.99 seconds of CPU time.
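For anyone who wants to reproduce this without my data, here's a rough sketch of a generator that writes a file in the same id:name format with pseudo-random 64-bit ids. The file name, name format, and line count here are hypothetical (and it's Python 3 syntax, so plain int rather than long); scale the count up to 8191180 to match the real file.

```python
#!/usr/bin/env python3
import random

def write_id2name(path, count):
    # Hypothetical generator: one "id:name" pair per line, with
    # pseudo-random 64-bit ids, mimicking the layout of id2name.txt.
    with open(path, 'w') as f:
        for i in range(count):
            id = random.getrandbits(64)
            f.write('%d:name-%d\n' % (id, i))

if __name__ == '__main__':
    random.seed(0)  # deterministic output for repeatability
    write_id2name('id2name-sample.txt', 1000)
```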
To rule out the file reading/parsing logic as the culprit, here's the
same thing, but with the dictionary insertion removed.
$ cat nocache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':',1)
    id = long(id)
$ time ./nocache-id2name.py

real    0m33.518s
user    0m33.070s
sys     0m0.415s
Here's a Perl implementation running very fast.
$ cat cache-id2name.pl
#!/usr/bin/perl
my %id2name = ();
my $line;
my $id;
my $name;
open(F, "<id2name.txt");
foreach $line (<F>) {
    chomp($line);
    # Perl's split LIMIT is a maximum field count, so the limit must
    # be 2 here to match Python's split(':', 1).
    ($id, $name) = split(/:/, $line, 2);
    $id = int($id);
    $id2name{$id} = $name;
}
$ time ./cache-id2name.pl

real    0m46.363s
user    0m43.730s
sys     0m2.611s
So, do you think Python's dict implementation degrades toward O(N)
performance when it's fed millions of 64-bit pseudo-random longs?
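If anyone wants to poke at that hypothesis directly, here's a rough sketch (Python 3 syntax; the sizes are illustrative, far smaller than my 8-million-row case) that times dict population with sequential versus pseudo-random 64-bit keys, so the per-key cost can be compared as n grows:

```python
#!/usr/bin/env python3
import random
import time

def time_inserts(keys):
    """Return the seconds taken to populate a fresh dict from keys."""
    d = {}
    t0 = time.perf_counter()
    for k in keys:
        d[k] = None
    return time.perf_counter() - t0

if __name__ == '__main__':
    random.seed(0)
    for n in (100000, 400000):
        seq = list(range(n))
        rnd = [random.getrandbits(64) for _ in range(n)]
        # If insertion were degrading toward O(N) per key, the per-key
        # cost for the random 64-bit keys would grow sharply with n.
        print('n=%d  seq=%.1f ns/key  rnd=%.1f ns/key' % (
            n, time_inserts(seq) / n * 1e9, time_inserts(rnd) / n * 1e9))
```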