[ python-Bugs-881522 ] Shelve slow after 7/8000 key

SourceForge.net noreply at sourceforge.net
Fri Jan 23 05:03:54 EST 2004


Bugs item #881522, was opened at 2004-01-21 17:09
Message generated for change (Comment added) made by marcoberi
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=881522&group_id=5470

Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Marco Beri (marcoberi)
Assigned to: Gregory P. Smith (greg)
Summary: Shelve slow after 7/8000 key

Initial Comment:
After about 8,000 insertions shelve becomes really, really 
slow.
This happens only with 2.3.3 #51 on Windows, not with 
2.2 nor with 2.3 on Linux.
I tried with writeback True and False: same problem.
Help! :-))
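
(For reference, a minimal sketch of the kind of loop that triggers the 
report; the filename and key count here are made up for illustration and 
are not the submitter's actual script.)

import time
import shelve

# Hypothetical repro: time inserting a few thousand keys into a shelf.
start = time.time()
sh = shelve.open("test.db")
for i in range(10000):
    sh[str(i)] = i          # small values; the cost is in the db layer
sh.close()
print "inserted 10000 keys in %.2f seconds" % (time.time() - start)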


----------------------------------------------------------------------

>Comment By: Marco Beri (marcoberi)
Date: 2004-01-23 10:03

Message:
Logged In: YES 
user_id=588604

I mean: I didn't try with python 2.3 on linux (just with python 
2.2)

----------------------------------------------------------------------

Comment By: Marco Beri (marcoberi)
Date: 2004-01-23 10:01

Message:
Logged In: YES 
user_id=588604

I gave wrong info: I didn't try it on Linux, so I'm not so sure 
it's a Windows-specific problem.
Besides this, looking at greg's 2004-01-22 18:32 comment, it 
seems that the Linux-alpha build also has this problem.
Probably it's better to change the category to "Python library"?



----------------------------------------------------------------------

Comment By: Marco Beri (marcoberi)
Date: 2004-01-23 00:44

Message:
Logged In: YES 
user_id=588604

jkew,
I also got a bit of a headache. I was pretty sure performance 
would improve with Python 2.3.3, but instead it got incredibly 
worse.
I know perhaps this is a third-party issue, but I use a Python 
feature (shelve), and I think it would be better to either remove 
it or note this problem in the documentation.
We are talking about a few thousand keys, not billions!

BTW I didn't post the previous message twice.


----------------------------------------------------------------------

Comment By: James Kew (jkew)
Date: 2004-01-23 00:16

Message:
Logged In: YES 
user_id=598066

FWIW2, on skip's "miserable hack" comment below, vis-a-vis 
running shelve on btree: isn't this exactly the sort of thing 
shelve.Shelf is intended for?

import bsddb
import shelve

db = bsddb.btopen("temp.db")
sh = shelve.Shelf(db)
# do stuff with sh
sh.close()
# automatically calls close() on the underlying db

(Not sure why Shelf and friends are documented under 
shelve's "Restrictions" subsection...)



----------------------------------------------------------------------

Comment By: Marco Beri (marcoberi)
Date: 2004-01-23 00:08

Message:
Logged In: YES 
user_id=588604

I get the same results as you under a normal cmd: 7.07 seconds vs 
0.46.

[c:\tmp]timer & \python23\python test3skip.py hashopen & 
timer
Timer 1 on: 19.13.22
Timer 1 off: 19.13.29  Elapsed: 0.00.07,07

[c:\tmp]timer & \python23\python test3skip.py btopen & timer
Timer 1 on: 19.13.45
Timer 1 off: 19.13.45  Elapsed: 0.00.00,46


----------------------------------------------------------------------

Comment By: James Kew (jkew)
Date: 2004-01-22 23:53

Message:
Logged In: YES 
user_id=598066

FWIW, to throw another use case into the pot: I (used to) 
run Roundup (roundup.sf.net) trackers on anydbm/Win2K and 
experienced a significant drop in performance between 2.2.x 
(bsddb185) and 2.3.x (dbhash).

I understand that this is a third-party issue, and that there 
were significant known problems with bsddb 1.85, but it did 
cause me a bit of a double-take after having heard so much 
about Python 2.3 being faster...

I say "used to" because the slowdown prompted me to 
migrate to Roundup's sqlite backend, solving my problem.


----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-01-22 21:11

Message:
Logged In: YES 
user_id=44345

If we wanted speed and didn't care about corruption, my vote 
would be bsddb185. ;-)


----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-01-22 20:36

Message:
Logged In: YES 
user_id=31435

Greg, I didn't expect you to fix it <wink>, I just didn't want 
the bug report closed based on misunderstanding what it was 
about.

I've unassigned this item, and if nobody volunteers to dig into 
it within a few weeks, it should indeed be closed as "3rd 
Party" and "Wont Fix".

Skip, maybe we should try to force spambayes to use a btree 
mapping too -- then maybe we could get a whole new class 
of intractable corruption errors <wink -- but it might be a lot 
faster>.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-01-22 20:28

Message:
Logged In: YES 
user_id=44345

Whoops, sorry about polluting the waters with the btree stuff.  
Dang time lag.

Looking at just the hashopen times between 2.2, 2.3 and 2.4 does 
show that hash file times have gotten worse since the Berkeley DB 
1.85 days.

Whether or not btree times muddy these particular waters, 
figuring out a way to switch to a different db type and still use the 
shelve module may be Marco's best bet for a short term 
performance improvement.


----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-01-22 20:22

Message:
Logged In: YES 
user_id=44345

I guess I get similar results on Mac OS X after looking at it a bit.  
The differences are just not as dramatic (or disappointing) as they 
are on Windows.  Here's the output of a little shell script which 
runs test3skip.py with various Python interpreters and Berkeley 
DB versions:

Python version: (2, 4, 0, 'alpha', 0)
Berkeley DB version: 4.2.4
hashopen: 0m1.621s
btopen:   0m0.608s

Python version: (2, 3, 3, 'final', 0)
Berkeley DB version: 4.2.0
hashopen: 0m1.359s
btopen:   0m0.450s

Python version: (2, 2, 0, 'final', 0)
Berkeley DB version: ???
hashopen: 0m0.514s
btopen:   0m0.202s

Only real (wall clock) times are displayed.

Marco,

Unfortunately, there doesn't seem to be much we can do at this
end to remedy the situation with hash files.  If you want to use 
shelve but switch to bsddb.btopen as the underlying db file open 
call, try posting to comp.lang.python.  Anything you do will 
probably be a miserable hack, but we can probably figure 
something out.



----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2004-01-22 19:12

Message:
Logged In: YES 
user_id=413

python 2.2 and earlier on windows linked against some form
of bsddb 1.85.

python 2.3 and later link against modern BerkeleyDB (not
really related to bsddb 1.85 much at all other than by name
and a legacy api).  They are very different libraries with
very different capabilities and performance.

regardless, i don't have a windows development platform
anymore.  someone who does, please take this.

i suspect this is not something we can fix.  try asking
sleepycat why modern DB_HASH databases might be slower than
bsddb 1.85 hash databases on windows and see what they say.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-01-22 18:56

Message:
Logged In: YES 
user_id=31435

The original question is why a BDB hash is some 30x slower 
under 2.3 than under 2.2 or 2.1, and that does appear 
specific to Windows.

Skip threw btrees into this too, but that complication doesn't 
appear relevant to the original report (despite marcoberi's 
hearsay 2004-01-21 18:57 comment -- others posted actual 
output, making clear that dbhash is used under all Python 
versions in test1skip).

I'll note in passing that the test case inserts keys in already-
mostly-sorted order, which is a friendly order for a btree-
based mapping.  To get back to the original report, ignore 
everything here concerning test3skip and btrees.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2004-01-22 18:32

Message:
Logged In: YES 
user_id=413

This problem is not specific to windows.  hashopen in the
test3skip.py test case is 10x slower than btopen on my
linux-alpha system.

I don't know why BerkeleyDB hash databases are so much
slower than B-Tree ones.  My best suggestion is:  if it
hurts, don't do that.  Use a btree rather than a hash database.

Running the python process under strace on linux reveals
nothing obvious (no system calls are being made during the
time the hash open is consuming lots of cpu).

You'll have to ask sleepycat themselves if you want a real
answer as to why hash databases don't perform well.

----------------------------------------------------------------------

Comment By: Marco Beri (marcoberi)
Date: 2004-01-22 18:16

Message:
Logged In: YES 
user_id=588604

I get the same results as you under a normal cmd: 7.07 seconds vs 
0.46.

[c:\tmp]timer & \python23\python test3skip.py hashopen & 
timer
Timer 1 on: 19.13.22
Timer 1 off: 19.13.29  Elapsed: 0.00.07,07

[c:\tmp]timer & \python23\python test3skip.py btopen & timer
Timer 1 on: 19.13.45
Timer 1 off: 19.13.45  Elapsed: 0.00.00,46


----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-01-22 18:02

Message:
Logged In: YES 
user_id=44345

Try test3skip.py.  You run it like this:

    python test3skip.py hashopen
    python test3skip.py btopen

I ran it on win2k under cygwin so I could use the time command 
(but ran the Windows version of Python).  Using btopen was much 
faster.  I got rid of shelve to eliminate it and pickle as possible 
sources of problems.
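
(The attached test3skip.py itself isn't reproduced in this thread. A 
rough sketch of a benchmark along those lines, using bsddb directly 
with the backend chosen on the command line, might look like the 
following; the filename and key count are guesses, not the actual 
script.)

import sys
import bsddb

# Pick bsddb.hashopen or bsddb.btopen from the command line,
# e.g. "python test3skip.py hashopen".
opener = getattr(bsddb, sys.argv[1])
db = opener("test3.db", "n")      # "n" creates a fresh file each run
for i in range(8000):
    db[str(i)] = str(i)           # plain string keys/values, no pickle
db.close()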

$ time /cygdrive/c/Python23/python test3skip.py hashopen

real    0m6.801s
user    0m0.015s
sys     0m0.000s

Administrator at CYCLOPS ~/tmp
$ time /cygdrive/c/Python23/python test3skip.py btopen

real    0m0.345s
user    0m0.015s
sys     0m0.015s

I don't know if the relationship between real, user and sys time 
means anything on cygwin, but the reported real times are very 
repeatable and match my subjective feel of the elapsed time.  This 
suggests there's something fishy with either the underlying library 
or with __setitem__ when using hash files.

I'm assigning to Greg so he can take a peek.  As the bsddb/
pybsddb guy he might have some better insight (certainly better 
than me).

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-01-22 18:01

Message:
Logged In: YES 
user_id=44345

Try test3skip.py.  You run it like this:

    python test3skip.py hashopen
    python test3skip.py btopen

I ran it on win2k under cygwin so I could use the time command 
(but ran the Windows version of Python).  Using btopen was much 
faster.  I got rid of shelve to eliminate it and pickle as possible 
sources of problems.

$ time /cygdrive/c/Python23/python test3skip.py hashopen

real    0m6.801s
user    0m0.015s
sys     0m0.000s

Administrator at CYCLOPS ~/tmp
$ time /cygdrive/c/Python23/python test3skip.py btopen

real    0m0.345s
user    0m0.015s
sys     0m0.015s

I don't know if the relationship between real, user and sys time 
means anything on cygwin, but the reported real times are very 
repeatable and match my subjective feel of the elapsed time.  This 
suggests there's something fishy with either the underlying library 
or with __setitem__ when using hash files.

I'm assigning to Greg so he can take a peek.  As the bsddb/
pybsddb guy he might have some better insight (certainly better 
than me).

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-01-22 17:29

Message:
Logged In: YES 
user_id=31435

FYI, on a Win98SE box, test1skip.py took about 30 seconds 
under 2.3.3, and about 1 second under both 2.2.3 and 2.1.3.  
Under 2.3.3, no significant time is taken by a.close(), so it's 
all in the loop.  It prints "dbhash" under all versions.

----------------------------------------------------------------------

Comment By: Marco Beri (marcoberi)
Date: 2004-01-22 07:30

Message:
Logged In: YES 
user_id=588604

I tried your version: 31.36 seconds vs 0.65.
Just to be sure I tried on three different computers with 
Windows 2000: same gap.

[c:\tmp]timer & \Python23\python test1skip.py & timer
Timer 1 on:  8.21.58
dbhash
Timer 1 off:  8.22.29  Elapsed: 0.00.31,36

[c:\tmp]timer & \Python22\python test1skip.py & timer
Timer 1 on:  8.22.40
dbhash
Timer 1 off:  8.22.41  Elapsed: 0.00.00,65


----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2004-01-22 00:28

Message:
Logged In: YES 
user_id=44345

Can't reproduce on Mac OS X.  I tried with 2.2, 2.3 and CVS using
attached test1skip.py (no writeback - 2.2 doesn't support it, no
import pickle - not used, no key prints - just muddies the water,
print whichdb's result).

The times are close enough to not worry me:

montanaro:tmp% time python2.3 test1.py
dbhash

real    0m1.927s
user    0m1.720s
sys     0m0.080s
montanaro:tmp% time python2.2 test1.py
dbhash

real    0m1.250s
user    0m0.850s
sys     0m0.360s
montanaro:tmp% time python test1.py
dbhash

real    0m2.179s
user    0m1.950s
sys     0m0.120s

Please try this modified version just to make sure we are both
looking at the same thing.



----------------------------------------------------------------------

Comment By: Marco Beri (marcoberi)
Date: 2004-01-21 23:57

Message:
Logged In: YES 
user_id=588604

Skip Montanaro discovered that whichdb reports bsddb185 
with python 2.2 and dbhash with 2.3.3.
So why is it so slow after a few thousand keys?
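
(For anyone following along, a minimal sketch of that check; the 
filename is arbitrary, and the backend names are the ones quoted 
above.)

import shelve
from whichdb import whichdb

# Create a small shelf, then ask whichdb which dbm flavour backs it.
# Per the comments above, 2.2 on Windows reports "bsddb185" while
# 2.3.3 reports "dbhash".
sh = shelve.open("probe.db")
sh["key"] = "value"
sh.close()
print whichdb("probe.db")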

----------------------------------------------------------------------

Comment By: Thomas Heller (theller)
Date: 2004-01-21 18:24

Message:
Logged In: YES 
user_id=11105

Hm, are windows bugs automatically assigned to me ;-)??

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=881522&group_id=5470


