writing large dictionaries to file using cPickle

John Machin sjmachin at lexicon.net
Wed Jan 28 18:46:11 EST 2009


On Jan 29, 9:43 am, perfr... at gmail.com wrote:
> On Jan 28, 5:14 pm, John Machin <sjmac... at lexicon.net> wrote:
>
>
>
> > On Jan 29, 3:13 am, perfr... at gmail.com wrote:
>
> > > hello all,
>
> > > i have a large dictionary which contains about 10 keys, each key has a
> > > value which is a list containing about 1 to 5 million (small)
> > > dictionaries. for example,
>
> > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
> > > 'world'}, ...],
> > >           key2: [...]}
>
> > > in total there are about 10 to 15 million of these small dictionaries
> > > if we concatenate together all the values of every key in 'mydict'.
> > > mydict is a structure that represents data in a very large file
> > > (about 800 megabytes).
>
> > > what is the fastest way to pickle 'mydict' into a file? right now i am
> > > experiencing a lot of difficulties with cPickle when using it like
> > > this:
>
> > > import cPickle as pickle
> > > pfile = open(my_file, 'w')
> > > pickle.dump(mydict, pfile)
> > > pfile.close()
>
> > > this creates extremely large files (~ 300 MB) though it does so
> > > *extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
> > > it gets slower and slower. it takes almost an hour if not more to
> > > write this pickle object to file.
>
> > > is there any way to speed this up? i don't mind the large file... after
> > > all, the text file with the data used to make the dictionary was larger
> > > (~ 800 MB) than the 300 MB file it eventually creates.  but
> > > i do care about speed...
>
> > > i have tried optimizing this by using this:
>
> > > s = pickle.dumps(mydict, 2)
> > > pfile.write(s)
>
> > > but this takes just as long... any ideas ? is there a different module
> > > i could use that's more suitable for large dictionaries ?
> > > thank you very much.
>
> > Pardon me if I'm asking the "bleedin' obvious", but have you checked
> > how much virtual memory this is taking up compared to how much real
> > memory you have? If the slowness is due to pagefile I/O, consider
> > doing "about 10" separate pickles (one for each key in your top-level
> > dictionary).
>
> the slowness is due to CPU when i profile my program using the unix
> program 'top'... i think all the work is in the file I/O. the machine
> i am using has several GB of ram, and ram is not heavily taxed at
> all. do you know how file I/O can be sped up?
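
If you do try the separate-pickles idea, an untested sketch might look
like the following -- the file names are made up just for illustration,
and it uses binary protocol 2 plus 'wb' mode, which is usually much
faster and more compact than the default text protocol 0:

import cPickle

def dump_per_key(mydict, prefix):
    # write one pickle file per top-level key, binary protocol 2
    for i, (key, value) in enumerate(mydict.iteritems()):
        f = open('%s_%d.pkl' % (prefix, i), 'wb')
        try:
            cPickle.dump((key, value), f, 2)
        finally:
            f.close()

def load_per_key(prefix, nkeys):
    # rebuild the dictionary from the per-key pickle files
    result = {}
    for i in range(nkeys):
        f = open('%s_%d.pkl' % (prefix, i), 'rb')
        try:
            key, value = cPickle.load(f)
        finally:
            f.close()
        result[key] = value
    return result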

More quick silly questions:

(1) How long does it take to load that 300MB pickle back into memory
using:
(a) cPickle.load(f)
(b) f.read()
?
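
A rough way to time both, assuming the pickle file is called 'my_file':

import time, cPickle

f = open('my_file', 'rb')
t0 = time.time()
obj = cPickle.load(f)          # (a) unpickle the whole thing
print 'cPickle.load took %.1f seconds' % (time.time() - t0)
f.close()

f = open('my_file', 'rb')
t0 = time.time()
raw = f.read()                 # (b) raw read of the same bytes, for comparison
print 'f.read took %.1f seconds' % (time.time() - t0)
f.close()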

What else is happening on the machine while you are creating the
pickle?

(2) How does



