[ python-Bugs-849662 ] reading shelves is really slow

SourceForge.net noreply at sourceforge.net
Fri Nov 28 16:57:03 EST 2003


Bugs item #849662, was opened at 2003-11-26 09:06
Message generated for change (Comment added) made by rhettinger
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=849662&group_id=5470

Category: Extension Modules
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Gottfried Ganßauge (ganssauge)
Assigned to: Raymond Hettinger (rhettinger)
Summary: reading shelves is really slow

Initial Comment:
My application uses a shelve-file which is created by 
another process using the same python version.
Before python2.3 using this shelve with the exact same 
application was almost twice as fast as a binary pickle 
containing the same data.
Now with python2.3 the same application is suddenly 
about 150 times slower than using the binary pickle.

The usage is as follows:
   idx_dict = shelve.open (idx_dict_name, "r")
   ...
   while not infile.eof:
      index = get_index_from_somewhere_else()
      if not idx_dict.has_key (index):
          do_something(index)
      else:
          do_something_else(index)

   idx_dict.close()
   
Profiling revealed that most of the time is spent within 
UserDict.

----------------------------------------------------------------------

>Comment By: Raymond Hettinger (rhettinger)
Date: 2003-11-28 16:57

Message:
Logged In: YES 
user_id=80475

Yes, that was the culprit.

I'll look for a way to make __cmp__ a bit smarter.  In the
meantime, the proper way to check for None is always:  if
dict is None.
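
A minimal sketch of why the identity test is cheaper, written for
current Python 3 rather than the 2.3 interpreter discussed here. The
class CountingMapping is hypothetical and only counts how often its
comparison method runs; it is not the shelve or UserDict code.

```python
class CountingMapping:
    """Hypothetical mapping-like class that counts equality comparisons."""

    def __init__(self, data):
        self._data = dict(data)
        self.comparisons = 0  # number of times __eq__ has run

    def __eq__(self, other):
        self.comparisons += 1
        if not isinstance(other, CountingMapping):
            return False
        return self._data == other._data


m = CountingMapping({"HI568817": None})

equality_result = (m != None)      # noqa: E711 - dispatches to __eq__
after_equality = m.comparisons

identity_result = (m is not None)  # identity check, no method is called
after_identity = m.comparisons

print(equality_result, after_equality)
print(identity_result, after_identity)
```

Both tests give the same answer, but only the first one runs user-level
comparison code, which is where any expensive work would happen.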

----------------------------------------------------------------------

Comment By: Gottfried Ganßauge (ganssauge)
Date: 2003-11-28 11:01

Message:
Logged In: YES 
user_id=792746

I think I found the answer:

Apart from has_key() I'm using "dict != None".
If I leave that out in my test program both python variants 
run with the same speed.

The dict != None condition seems to trigger len(dict.keys()) 
and that seems to be way slower than before.
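
To illustrate the suspicion above, here is a rough sketch in current
Python 3. KeyScanMixin and FakeShelf are hypothetical stand-ins, not the
real UserDict.DictMixin or shelve code; the assumption is only that
comparison is derived from a full listing of the keys, while a direct
membership test is not.

```python
class KeyScanMixin:
    """Hypothetical mixin: derives len() and == from keys() and __getitem__."""

    def __len__(self):
        return len(self.keys())          # needs a full listing of the keys

    def __eq__(self, other):
        # Materialise every item before comparing, even against None.
        mine = {k: self[k] for k in self.keys()}
        return mine == other


class FakeShelf(KeyScanMixin):
    """Hypothetical shelf-like object that counts full key listings."""

    def __init__(self, data):
        self._data = dict(data)
        self.key_scans = 0

    def keys(self):
        self.key_scans += 1              # count each full key listing
        return list(self._data)

    def __getitem__(self, key):
        return self._data[key]

    def __contains__(self, key):
        return key in self._data         # direct lookup, no listing


shelf = FakeShelf(("HI%06d" % i, None) for i in range(1000))

found = "HI000001" in shelf
scans_after_lookup = shelf.key_scans     # membership test listed no keys

is_set = shelf != None                   # noqa: E711
scans_after_compare = shelf.key_scans    # comparison listed every key once

print(found, scans_after_lookup, is_set, scans_after_compare)
```

With a disk-backed store, each such listing means reading every key from
the database, so doing it once per loop iteration would dominate the run
time.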

I definitely didn't time different scripts: the script is part of 
our CDROM production system and the only variables I had 
during my tests were python itself and the python path.

Find my test script attached...


----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2003-11-27 12:55

Message:
Logged In: YES 
user_id=80475

The fragment in the original posting showed the only
inner-loop shelve access was through has_key().   The
tracebacks show that UserDict is nowhere in the traceback
chain.  I conclude that the fragment does not represent what
is really going on in the problematic script.  So, please
attach the profiled script, Konvertierung/entsch_pass2.py.

The attached profile indicates that somewhere, there is a
line like:   for k,v in idx_dict.iteritems().  This is
surprising because shelves did not support iteritems() in
Py2.2.  That would mean that you've timed and compared
two different pieces of code.

Please show the shortest script with data that runs at
radically different speeds on Py2.2 vs Py2.3.
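
One possible shape for such a script, as a sketch only. It is written
for current Python 3 (so it will not run unchanged on Py2.2 or Py2.3),
and the file name idx_demo.shelve, the key format and the sizes are made
up for illustration. The values are None, as in the shelf described in
this report.

```python
import os
import shelve
import tempfile
import time


def build_shelf(path, n_keys):
    """Create a shelf whose keys all map to None."""
    with shelve.open(path, "c") as db:
        for i in range(n_keys):
            db["HI%06d" % i] = None


def time_lookups(path, keys):
    """Return (hits, elapsed seconds) for membership tests on the shelf."""
    with shelve.open(path, "r") as db:
        start = time.perf_counter()
        hits = sum(1 for key in keys if key in db)
        elapsed = time.perf_counter() - start
    return hits, elapsed


with tempfile.TemporaryDirectory() as tmp:
    shelf_path = os.path.join(tmp, "idx_demo.shelve")
    build_shelf(shelf_path, 500)
    # Half of the probed keys exist in the shelf, half do not.
    probe_keys = ["HI%06d" % i for i in range(1000)]
    hits, elapsed = time_lookups(shelf_path, probe_keys)
    print("hits:", hits, "seconds: %.4f" % elapsed)
```

Running the same file under two interpreters and comparing the reported
seconds keeps the script and the data identical, so only Python itself
varies.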



----------------------------------------------------------------------

Comment By: Gottfried Ganßauge (ganssauge)
Date: 2003-11-27 05:42

Message:
Logged In: YES 
user_id=792746

What the heck ... here is the shelve in question

----------------------------------------------------------------------

Comment By: Gottfried Ganßauge (ganssauge)
Date: 2003-11-27 05:32

Message:
Logged In: YES 
user_id=792746

I uploaded my profiling data, maybe it will help you ...
Here is the information you requested:
----------------><------------------------><------------
(gotti at gglinux 534) PYTHONPATH=../../../COMMON.DEVEL/Tools/python/lib.linux-i686-2.3 python Konvertierung/entsch_pass2.py HI69228 x HR all_idx2.shelve <hi69228.sgml
Traceback (most recent call last):
  File "Konvertierung/entsch_pass2.py", line 1026, in ?
    init_idx_dict (idx_dict_name)
  File "../../COMMON/lib/EDB.py", line 54, in init_idx_dict
    idx_dict.has_key([])
  File "/usr/lib/python2.3/shelve.py", line 104, in has_key
    return self.dict.has_key(key)
  File "/usr/lib/python2.3/bsddb/__init__.py", line 142, in has_key
    return self.db.has_key(key)
TypeError: String or Integer object expected for key, list found
(gotti at gglinux 535) PYTHONPATH=../../../COMMON.DEVEL/Tools/python/lib.linux-i686-2.2 python2.2 Konvertierung/entsch_pass2.py HI69228 x HR all_idx2.shelve <hi69228.sgml
Traceback (most recent call last):
  File "Konvertierung/entsch_pass2.py", line 1026, in ?
    init_idx_dict (idx_dict_name)
  File "../../COMMON/lib/EDB.py", line 54, in init_idx_dict
    idx_dict.has_key([])
  File "/usr/lib/python2.2/shelve.py", line 62, in has_key
    return self.dict.has_key(key)
TypeError: key type must be string
(gotti at gglinux 536) python -V
Python 2.3.2
(gotti at gglinux 537) python2.2 -V
Python 2.2.3
(gotti at gglinux 538) uname -a
Linux gglinux 2.4.22 #1 SMP Mon Nov 3 11:40:28 CET 2003 i686 unknown unknown GNU/Linux
(gotti at gglinux 538) cat /etc/debian_version
testing/unstable
(gotti at gglinux 539) python2.2 -c 'import shelve ; d = shelve.open("all_idx2.shelve", "r"); print len (d.keys()) ; print d.keys()[0], d [d.keys()[0]]'
34983
HI568817 None
(gotti at gglinux 540)  python2.3 -c 'import shelve ; d = shelve.open("all_idx2.shelve", "r"); print "# items in shelve:", len (d.keys()) ; print "Items look like: index", d.keys()[0], "value", d [d.keys()[0]]'
# items in shelve: 34983
Items look like: index HI568817 value None


----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2003-11-27 04:17

Message:
Logged In: YES 
user_id=80475

I can reproduce a four-fold slowdown that persists even
after the UserDict.DictMixin lines are commented out of
shelve.py and bsddb.__init__.py.  For me, the only thing
that has changed is the underlying bsddb implementation.

Let's see if your system is going somewhere else to get its
shelving done.  After the first line, add:  idx_dict.has_key([])
Then post the traceback here.

Do that for both Py2.2 and for Py2.3.  Thank you.
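
A different way to get at the same question on a current Python 3,
sketched here as an alternative to forcing a traceback: inspect the
object the shelf wraps. The sketch relies on the Shelf object keeping
its backend in the dict attribute and on dbm.whichdb(); the file name
probe.shelve is made up for illustration.

```python
import dbm
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "probe.shelve")
    with shelve.open(path, "c") as db:
        db["HI568817"] = None
        # The shelf keeps the underlying dbm object in its .dict attribute.
        backend_type = type(db.dict)
        backend_module = backend_type.__module__
    # Ask dbm which of its modules recognises the file that was written.
    detected = dbm.whichdb(path)

print("shelf is backed by:", backend_module, backend_type.__name__)
print("dbm.whichdb reports:", detected)
```

The printed names depend on which dbm modules the interpreter was built
with, so the output is expected to differ between installations.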

Also, post what a typical record in the index looks like and
tell me how many entries are typically in idx_dict.  That way,
I can try to reproduce your timings with greater fidelity.

Which OS are you using, and what are the minor bugfix version
numbers of the Py2.2 and Py2.3 you are using?

----------------------------------------------------------------------
