pickle performance on larger objects

Kevin Altis altis at semi-retired.com
Wed Jul 17 22:08:59 EDT 2002


I'm quite surprised by your numbers, but perhaps it is because you didn't
use a binary pickle? The companies sample included with PythonCard uses the
flatfileDatabase module in the PythonCard framework to load and save a
binary pickle file that is around 1.2MB: a little over 6,600 records stored
as one big list of dictionaries, so single load and dump calls are used to
get and save the data. On my machine, loading takes a fraction of a second.
The original data is in XML format, and another version of the data was
output via pprint; both of those take dramatically longer to convert back
to a usable list of dictionaries in memory.

If you are interested in exploring this further, contact me directly and I
can tell you how to run some additional experiments; then maybe we can
summarize for the list. If you want to install MachoPython, wxPython Mac,
and PythonCard on your system, you can run the companies sample directly
with the -l command-line option to enable logging. Alternatively, we can
get the relevant data-centric parts of the flatfileDatabase module and the
companies data set onto your box so you can play with them in a non-GUI
script.
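If you just want a rough sense of the difference without installing
anything, a variation on Sam's loop below that times loading the same data
in both the text and binary formats would look something like this (a
sketch; the exact numbers will of course vary by machine):

import time
from cPickle import dump, load

# same shape as Sam's data: 10000 dicts of 25 short string pairs
data = [dict([('key%d' % k, 'value%d' % k) for k in range(25)])
        for j in range(10000)]

for bin, mode in [(0, ''), (1, 'b')]:
    f = open('deleteme.pkl', 'w' + mode)
    dump(data, f, bin)
    f.close()
    f = open('deleteme.pkl', 'r' + mode)
    start = time.time()
    load(f)
    f.close()
    print 'bin=%d: loaded in %.2f seconds' % (bin, time.time() - start)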

ka
---
Kevin Altis
altis at semi-retired.com
http://www.pythoncard.org/


"Sam Penrose" <spenrose at intersight.com> wrote in message
news:mailman.1026940226.16076.python-list at python.org...
> On a recent project we decided to use pickle for some quick-and-dirty
> object persistence. The object in question is a list of 3,000
> dictionaries
> whose keys and values are short (< 100 character) strings--about 1.5
> megs worth of character data in total. Loading this object from a pickle
> using cPickle took so long we assumed something was broken.
>
> In fact, loading is just slow. A list of 10,000 identical dictionaries
> whose keys and values are short strings takes many seconds to load on
> modern hardware. Some details:
>      i.  A Python process that is loading a pickle will use a lot of RAM
>          relative to the pickle's size on disk, roughly an order of
>          magnitude more on Mac OS X.
>      ii. Performance appears to scale linearly with changes in the size of
>          the list or its dicts until you run out of RAM.
>      iii. Python pickle is only about 5x slower than cPickle as the list
>          gets long, except that it uses more RAM and therefore hits a big
>          RAM-to-disk-swap performance falloff sooner.
>      iv. You *can* tell a Mac's performance by its MHz. An 800 MHz PIII
>          running Windows is almost exactly twice as fast as a 400 MHz G4
>          running Mac OS X, both executing the following code from the
>          command line. With 25 items in the dictionaries and 10K dicts
>          used, the former took just under a minute using cPickle, the
>          latter two minutes.
>      v.  Generating a list of 3K heterogeneous dicts of 25 items (our real
>          data) by reading in a 750k text file and splitting it up takes on
>          the order of a second.
>
> Sample run on 400 MHz G4, 448 megs of RAM:
>
>  >>> time_cPickle_Load()
> dumping list of 10 dicts:
> dumped: 0.00518298149109
> loading list of 10 dicts:
> loaded: 0.1170129776
> dumping list of 100 dicts:
> dumped: 0.0329120159149
> loading list of 100 dicts:
> loaded: 0.849031090736
> dumping list of 1000 dicts:
> dumped: 0.397919893265
> loading list of 1000 dicts:
> loaded: 8.18722295761
> dumping list of 10000 dicts:
> dumped: 4.42434895039
> loading list of 10000 dicts:
> loaded: 133.906162977
>
> #---code follows----------------//
> def makeDict(numItems=25):
>      d = {}
>      for i in range(numItems):
>          k = 'key%s' % i
>          v = 'value%s' % i
>          d[k] = v
>      return d
>
> def time_cPickle_Load():
>      import time
>      now = time.time
>      from cPickle import dump, load
>      filename = 'deleteme.pkl'
>
>      for i in (10, 100, 1000, 10000):
>          data = [makeDict() for j in range(i)]
>          output = open(filename, 'w')
>          startDump = now()
>          print "dumping list of %s dicts:" % i
>          dump(data, output)
>          print "dumped:", now() - startDump
>          output.close()
>          input = open(filename)
>          startLoad = now()
>          print "loading list of %s dicts:" % i
>          x = load(input)
>          print "loaded:", now() - startLoad
>          input.close()