dict is really slow for big truck

Bruno Desthuilliers bruno.42.desthuilliers at websiteburo.invalid
Wed Apr 29 17:07:54 CEST 2009


bearophileHUGS at lycos.com wrote:
> On Apr 28, 2:54 pm, forrest yang <Gforrest.y... at gmail.com> wrote:
>> i try to load a big file into a dict, which is about 9,000,000 lines,
>> something like
>> 1 2 3 4
>> 2 2 3 4
>> 3 4 5 6
>>
>> code
>> for line in open(file)
>>    arr=line.strip().split('\t')
>>    dict[line.split(None, 1)[0]]=arr
>>
>> but, the dict is really slow as i load more data into the memory, by
>> the way the mac i use have 16G memory.
>> is this cased by the low performace for dict to extend memory or
>> something other reason.
>> is there any one can provide a better solution
> 
> Keys are integers,

Actually strings. But this is probably not the problem here.

> so they are very efficiently managed by the dict.
> If I do this:
> d = dict.fromkeys(xrange(9000000))
> It takes only a little more than a second on my normal PC.
> So probably the problem isn't in the dict, it's the I/O

If the OP experiences a noticeable slowdown during the process then I 
doubt the problem is with IO. If he finds the process to be slow but of 
constant slowness, then it may or may not have to do with IO, but 
possibly not as the single factor.
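
One way to tell the two cases apart is to time the load batch by batch. 
This is only a rough sketch: the function name, the batch size and the 
synthetic input are my own assumptions, not the OP's actual data.

```python
import time

def load_with_batch_timings(lines, batch_size=100000):
    """Build the dict and record how long each batch of insertions takes."""
    d = {}
    timings = []
    start = time.perf_counter()
    for count, line in enumerate(lines, 1):
        arr = line.strip().split()
        d[arr[0]] = arr
        if count % batch_size == 0:
            now = time.perf_counter()
            timings.append(now - start)
            start = now
    return d, timings

# Synthetic stand-in for the OP's file (hypothetical data, not the real input).
sample = ("%d\t2\t3\t4\n" % i for i in range(500000))
d, timings = load_with_batch_timings(sample)
for n, t in enumerate(timings, 1):
    print("batch %d: %.3f s" % (n, t))
```

Roughly flat batch times would point to a constant per-line cost; batch 
times that keep growing would mean something gets worse as the dict fills up.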

Hint: don't guess, profile.
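
For what it's worth, here is a minimal sketch of what profiling could look 
like with the standard cProfile and pstats modules. The load() function and 
the small temporary file are assumptions made up for the example; the real 
run would of course use the OP's file.

```python
import cProfile
import io
import os
import pstats
import tempfile

def load(path):
    """Load the file into a dict keyed on the first field of each line."""
    d = {}
    with open(path) as f:
        for line in f:
            arr = line.strip().split()
            d[arr[0]] = arr
    return d

# Write a small synthetic file (hypothetical data, far smaller than the OP's).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    for i in range(200000):
        f.write("%d\t2\t3\t4\n" % i)

# Run the loader under the profiler, then clean up the temporary file.
profiler = cProfile.Profile()
result = profiler.runcall(load, path)
os.remove(path)

# Show the ten most expensive entries by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
report = out.getvalue()
print(report)
```

The report shows how the time divides between load() itself and the calls 
it makes, which says more than any guess about where the time goes.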

> and/or the
> list allocation. A possible suggestion is to not split the arrays,

The OP is actually splitting a string.

> but
> keep it as strings, and split them only when you use them:
> 
> d = {}
> for line in open(file):
>   line = line.strip()
>   d[line.split(None, 1)[0]] = line

You still split the string - but only once, which is indeed better !-)

But you can have your cake and eat it too:

d = {}
for line in open(thefile):
    arr = line.strip().split()
    d[arr[0]] = arr


> if that's not fast enough you can simplify it:
> 
> d = {}
> for line in open(file):
>   d[line.split(None, 1)[0]] = line

I doubt this will save that much processing time...
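
This is easy enough to check with timeit. The sample line below is a 
made-up one in the OP's format, so take the numbers as indicative only.

```python
import timeit

line = "12345\t2\t3\t4\n"  # hypothetical sample line in the OP's format

# Key extraction with the extra strip() call...
with_strip = timeit.timeit(
    "line.strip().split(None, 1)[0]", globals={"line": line}, number=200000)
# ...and without it.
without_strip = timeit.timeit(
    "line.split(None, 1)[0]", globals={"line": line}, number=200000)

print("with strip:    %.3f s" % with_strip)
print("without strip: %.3f s" % without_strip)
```

Note that split() with no separator already ignores leading and trailing 
whitespace, so both variants yield the same key; only the stored value 
keeps its trailing newline when strip() is dropped.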

> If you have memory problems still, then you can only keep the line
> number as dict values, of even absolute file positions, to seek later.
> You can also use memory mapped files.
> 
> Tell us how is the performance now.

IMHO, not much better...


