dict is really slow for big truck

Bruno Desthuilliers bruno.42.desthuilliers at websiteburo.invalid
Thu Apr 30 09:43:18 CEST 2009


forrest yang wrote:
> I am trying to load a big file, about 9,000,000 lines, into a dict,
> something like
> 1 2 3 4
> 2 2 3 4
> 3 4 5 6

How "like" is it ?-)

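> code
> data = {}  # avoid shadowing the builtin name 'dict'
> for line in open(file):
>     arr = line.strip().split('\t')
>     data[arr[0]] = arr  # key on the first field, keep the whole row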
> 
> but the dict gets really slow as I load more data into memory,

Looks like your system is starting to swap. Use 'top' or any other 
system monitor to check it out.
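
As a complement to 'top', memory growth can also be watched from inside 
the script. A minimal sketch, assuming Python 3.4 or later for the 
tracemalloc module; the helper name load_with_report and its 
report_every parameter are made up for illustration. It only counts 
memory allocated by Python itself, so it hints at when RAM runs out but 
does not detect swapping as such:

```python
import tracemalloc

def load_with_report(lines, report_every=1000000):
    """Build the dict while printing Python-level memory use periodically.

    `lines` is any iterable of tab-separated text lines (hypothetical input).
    """
    tracemalloc.start()
    data = {}
    for count, line in enumerate(lines, 1):
        arr = line.strip().split('\t')
        data[arr[0]] = arr
        if count % report_every == 0:
            # get_traced_memory() returns (current, peak) in bytes
            current, peak = tracemalloc.get_traced_memory()
            print("%d lines: %.1f MiB in use (peak %.1f MiB)"
                  % (count, current / 2**20, peak / 2**20))
    tracemalloc.stop()
    return data
```

Tracing slows the load down noticeably, so this is for diagnosis only.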

> by
> the way, the Mac I use has 16G of memory.
> Is this caused by poor performance when the dict has to grow in memory

dicts are Python's central data type (objects are based on dicts, all 
non-local namespaces are based on dicts, etc), so you can safely assume 
they are highly optimized.

> or
> is there some other reason?

FWIW, a very loose (and partially wrong, see below) estimate based on 
wild guesses: each line yields roughly five objects (four strings plus 
the list holding them), so assuming an average size of 512 bytes per 
object (remember that Python doesn't have 'primitive' types), 9,000,000 
lines would use about 22G.

Fortunately, CPython does some caching for some values of some immutable 
types (specifically, small ints and some short strings; note that 
strings that respect the grammar for Python identifiers are interned 
mainly when they appear as literals in source code, not when they are 
read from a file), so depending on your real data, you might need a 
bit less RAM. Also, the 512 bytes per object is really more of a wild 
guess than anything else (but given the internal structure of a CPython 
object, I think it's about that order - please someone correct me if I'm 
plain wrong).
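
Rather than guessing, the per-object cost can be measured. A rough 
sketch using sys.getsizeof from the standard library; the function name 
estimate_entry_size is made up for illustration, and the sample line is 
taken from the example data above. The figure ignores the dict's own 
hash table and any sharing of cached strings, so treat it as a ballpark 
only:

```python
import sys

def estimate_entry_size(line):
    """Rough size in bytes of one dict entry built from `line`.

    Counts the list object plus each field string; ignores the dict's
    own hash-table slot and any sharing of cached strings.
    """
    arr = line.strip().split('\t')
    return sys.getsizeof(arr) + sum(sys.getsizeof(field) for field in arr)

per_entry = estimate_entry_size("1\t2\t3\t4\n")
total_gib = per_entry * 9000000 / 2**30
print("about %d bytes per entry, %.1f GiB for 9,000,000 lines"
      % (per_entry, total_gib))
```

The exact numbers depend on the Python version and platform, and real 
lines with longer fields will cost more than this four-field sample.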

Anyway: I'm afraid the problem has more to do with your design than with 
your code or Python's dict implementation itself.

> Can anyone suggest a better solution?

Use a DBMS. They are designed - and highly optimised - for fast lookup 
over huge data sets.
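
To make that concrete, here is a minimal sketch using SQLite through the 
standard library's sqlite3 module, as one possible DBMS among many. The 
table name "records", the column names, and the helpers build_index and 
lookup are all made up for illustration; for real data you would pass a 
file path instead of ":memory:" so the data lives on disk rather than in 
RAM:

```python
import sqlite3

def build_index(lines, db_path=":memory:"):
    """Load tab-separated lines into an SQLite table keyed on the first field."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, line TEXT)")
    stripped = (line.strip() for line in lines)
    rows = ((s.split('\t', 1)[0], s) for s in stripped)
    # INSERT OR REPLACE mirrors the dict behaviour: a later duplicate key wins.
    with conn:  # one transaction for the whole load, committed on success
        conn.executemany("INSERT OR REPLACE INTO records VALUES (?, ?)", rows)
    return conn

def lookup(conn, key):
    """Return the fields for `key` as a list, or None if the key is absent."""
    row = conn.execute(
        "SELECT line FROM records WHERE key = ?", (key,)).fetchone()
    return row[0].split('\t') if row else None

conn = build_index(["1\t2\t3\t4\n", "2\t2\t3\t4\n", "3\t4\t5\t6\n"])
print(lookup(conn, "3"))
```

The primary key gives an indexed lookup on the first field, which is the 
same access pattern as the dict, without holding every row in memory.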

My 2 cents.


