Populating a dictionary, fast
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Mon Nov 12 07:52:24 EST 2007
On Sun, 11 Nov 2007 08:25:15 -0800, Michael Bacarella wrote:
> Firstly, thank you for all of your help so far, I really appreciate it.
>
>> > So, you think Python's dict implementation degrades towards O(N)
>> > performance when it's fed millions of 64-bit pseudo-random longs?
>>
>> No.
>
> Yes.
>
> I tried your code (with one change, adding the time to the feedback lines)
> and got the same terrible performance against my data set.
Hmmm... you're getting even worse performance than I do. I read the dict
in pretty quickly considering the limited memory on my system, and only
run into trouble when I try deleting the dict.
But still... no, Python dicts shouldn't degrade that badly. Whatever is
going on, it almost certainly isn't a problem with the implementation of
dicts.
Here's a thought... try reading the data into one big list of key-value
pairs instead of a dict. If you still get terrible performance, I think
that pretty much eliminates the dict as the culprit.
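A quick sketch of what I mean (Python 2, to match the versions we're both
running; the file name 'id2name.txt' and the one-"key value"-pair-per-line
format are just guesses on my part, so adjust to suit your data):

import time

def load_as_list(filename):
    # Append (key, value) tuples to a plain list instead of inserting
    # into a dict, so any slowdown can't be blamed on hashing or on the
    # dict resizing itself.
    pairs = []
    start = time.time()
    f = open(filename)
    for n, line in enumerate(f):
        key, value = line.split()
        pairs.append((long(key), value))
        if n % 1000000 == 0:
            print n, time.time() - start
    f.close()
    print 'total:', time.time() - start
    return pairs

pairs = load_as_list('id2name.txt')  # hypothetical file name

If that flies while the dict version crawls, we're back to suspecting the
dict; if both crawl, the dict is off the hook.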
> To prove that my machine is sane, I ran the same against your generated
> sample file and got _excellent_ performance. Start to finish in under a
> minute.
Not a fair comparison, because it is doing something completely
different. But still valuable to see that your install isn't *completely*
broken.
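(For anyone else who wants to play along: the sample file was nothing fancy.
Something like the following sketch will do, although the count, file name
and dummy values here are made up for illustration rather than being exactly
what I used:)

import random

def make_sample(filename, n=8000000):
    # Write n pseudo-random 64-bit keys, one "key value" pair per line.
    out = open(filename, 'w')
    for i in xrange(n):
        key = random.getrandbits(64)  # Python 2.4+; on 2.3, glue two 32-bit randints together
        out.write('%d dummy%d\n' % (key, i))
    out.close()

make_sample('sample.txt')  # hypothetical file name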
>> Notice that the dict is completely read into memory in just two and a
>> half minutes. The script then tries to delete the dict, and 32 minutes
>> later is still struggling. That's the point I got sick of waiting and
>> interrupted the script.
>>
>> Conclusion: it's a memory issue, or maybe a garbage collection issue,
>> not a problem with dicts.
>
> As you can see, not the case at all against my data set.
Yes, I see now that your problems are worse than my problems.
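If you want to see exactly where the time goes, it's worth timing the
populate phase and the delete phase separately, something like this rough
sketch (again, the file name and line format are assumptions):

import time

def time_build_and_delete(filename):
    # Time the dict population and the dict deletion separately, to see
    # which phase the machine is actually choking on.
    start = time.time()
    cache = {}
    f = open(filename)
    for line in f:
        key, value = line.split()
        cache[long(key)] = value
    f.close()
    built = time.time()
    print 'populate: %.1f seconds' % (built - start)
    del cache  # this is where my own run stalled for over half an hour
    print 'delete:   %.1f seconds' % (time.time() - built)

time_build_and_delete('id2name.txt')  # hypothetical file name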
>> (1) Presumably you don't want to run your app with the garbage
>> collector turned off. You might still want to play around with the gc
>> module to see what you can learn.
>
> As you can see, your version did no better. :(
Somewhat better, but still struggling.
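To be concrete about what I meant by playing with the gc module: the
simplest experiment is just to switch the cyclic collector off for the
duration of the load, e.g. (same assumed file name and format as before):

import gc
import time

def load_with_gc_disabled(filename):
    # Keep the cyclic garbage collector out of the way while the dict is
    # built; long keys and string values can't form reference cycles, so
    # the collector's periodic passes during the load buy us nothing.
    gc.disable()
    try:
        cache = {}
        start = time.time()
        f = open(filename)
        for line in f:
            key, value = line.split()
            cache[long(key)] = value
        f.close()
        print 'populate with gc disabled: %.1f seconds' % (time.time() - start)
    finally:
        gc.enable()
    return cache

cache = load_with_gc_disabled('id2name.txt')  # hypothetical file name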
>> (2) More memory will help avoid paging. If you can't get more memory,
>> try more virtual memory. It will still be slow, but at least the
>> operating system doesn't have to try moving blocks around as much.
>
> The machine has 8GB, and is not doing anything else when I run this.
Yes, fair enough.
>> (3) Are you sure you need all eight-million-plus items in the cache
>> all at once?
>
> Yes.
I remain skeptical, but what do I know, I don't even know what you're
doing with the data once you have it :-)
>> (4) There are lots of algorithms out there for dealing with data too
>> big to fit into main memory. Do some research.
>
> It DOES fit into main memory and a dictionary is exactly the right way
> to do this.
I agree now.
I think that you and I are experiencing anomalous results. I'm not even
certain that it is specifically a Python problem -- I'd be a lot happier
if somebody got the same terrible performance on a different OS.
What do we have in common? We're using different versions of Python (you
2.3, me 2.5) and different versions of Linux.
I wonder if we happen to have the same hardware... what's your CPU?
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping : 1
cpu MHz : 1000.000
cache size : 512 KB
...
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping : 1
cpu MHz : 1000.000
cache size : 512 KB
...
--
Steven.