[Python-bugs-list] [ python-Bugs-775414 ] bsddb3 hash craps out with threads

Mon Sep 29 03:42:45 EDT 2003

Bugs item #775414, was opened at 2003-07-22 12:29
Message generated for change (Comment added) made by anthonybaxter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=775414&group_id=5470

Category: Extension Modules
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Tim Peters (tim_one)
Assigned to: Gregory P. Smith (greg)
Summary: bsddb3 hash craps out with threads

Initial Comment:
Richie Hindle presented something like the attached (hammer.py) on the
spambayes-dev mailing list.  On Win98SE and Win2K w/ Python 2.3c1 I
usually see this death pretty quickly:

Traceback (most recent call last):
  File "hammer.py", line 36, in ?
    main()
  File "hammer.py", line 33, in main
    hammer(db)
  File "hammer.py", line 15, in hammer
    x = db[str(int(random.random() * 100000))]
  File "C:\CODE\PYTHON\lib\bsddb\__init__.py", line 86, in __getitem__
    return self.db[key]
bsddb._db.DBRunRecoveryError: (-30982, 'DB_RUNRECOVERY: Fatal error, run database recovery -- fatal region error detected; run recovery')

Richie also reported "illegal operation" crashes on Win98SE.

It's not clear whether a bsddb3 hash *can* be used with threads like
this.  If it can't, there's a doc bug.  If it should be able to, there's
a more serious problem.  Note that it looks like hashopen() always
merges DB_THREAD into the flags, so the absence of specifying DB_THREAD
probably isn't the problem.
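
For readers without the attachment, here is a minimal sketch of the kind of
stress harness described above. It is not the attached hammer.py: only the key
expression is taken from the traceback. The names `run_hammer`, `iterations`,
`keyspace` and `num_threads` are assumptions made for illustration. It is
written for current Python and accepts any mapping-like object, so it can be
tried against a plain dict before pointing it at a real database handle.

```python
import random
import threading


def hammer(db, iterations=1000, keyspace=100000):
    """Randomly write and read string keys on a mapping-like object."""
    for _ in range(iterations):
        # Same key expression as the line shown in the traceback.
        key = str(int(random.random() * keyspace))
        try:
            if random.random() < 0.5:
                db[key] = "x" * 10
            else:
                _ = db[key]
        except KeyError:
            # Reading a key that was never written is expected here.
            pass


def run_hammer(db, num_threads=4, iterations=1000):
    """Run hammer() from several threads; return any exceptions raised."""
    errors = []

    def worker():
        try:
            hammer(db, iterations)
        except Exception as exc:  # collect instead of killing the thread
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

Calling `run_hammer({})` exercises the harness with an in-memory dict. An empty
result only means no Python-level exception was raised; it would not catch a
hang or a hard crash inside an extension module.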

----------------------------------------------------------------------

>Comment By: Anthony Baxter (anthonybaxter)
Date: 2003-09-29 17:42

Message:
Logged In: YES 
user_id=29957

I'd be much happier with a documentation fix for 2.3.2.

Note that when I said "fails to complete" on Solaris, I meant that it
crashes out, not that it deadlocks. I can post the tracebacks here if
you'd like.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-29 17:02

Message:
Logged In: YES 
user_id=413

anthony - if we don't put this patch into python 2.3.2, the python
2.3.x bsddb module documentation should be updated to say that
multithreaded access is not supported and will cause problems,
possibly even python interpreter crashes.
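
If the documentation route is taken, applications that still need to share one
database object between threads would have to serialize access themselves. The
following is a minimal sketch of that idea, not the patch discussed in this
report and not part of the bsddb module. `LockedMapping` is a hypothetical
helper name, and the sketch assumes the wrapped object only needs the basic
mapping operations. It would not help with failures that occur without
threads, such as the single-threaded case reported elsewhere in this thread.

```python
import threading


class LockedMapping:
    """Serialize every operation on a mapping-like object with one lock."""

    def __init__(self, mapping):
        self._mapping = mapping
        self._lock = threading.RLock()

    def __getitem__(self, key):
        with self._lock:
            return self._mapping[key]

    def __setitem__(self, key, value):
        with self._lock:
            self._mapping[key] = value

    def __delitem__(self, key):
        with self._lock:
            del self._mapping[key]

    def __contains__(self, key):
        with self._lock:
            return key in self._mapping

    def __len__(self):
        with self._lock:
            return len(self._mapping)
```

The cost is that only one thread at a time is ever inside the database, which
gives up any concurrency the underlying library might offer.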

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-29 16:57

Message:
Logged In: YES 
user_id=413

Deadlocks only occurring under DOS-based "windows" (win95/98/me)
aren't something the python module can prevent.  I suggest submitting
the sample code and info from studly_hammer.py to sleepycat.  They're
usually very responsive to questions of that nature.

btw, i'll give things a go on solaris later this week.  if the test
suite never completes i again suspect it is a berkeleydb library issue
on that platform rather than the python module.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-09-29 11:38

Message:
Logged In: YES 
user_id=31435

Running the original hammer.py under current CVS Python freezes in the
same way (as in my immediately preceding note) now too; again Win98SE.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-09-29 11:28

Message:
Logged In: YES 
user_id=31435

About studly_hammer.py:

[Skip Montanaro]
> ...
> Attached is a modified version of the hammer.py script which seems to
> not fail for me on either Windows run from IDLE (Python 2.3, BDB
> 4.1.6) or Mac OS X (Python CVS, BDB 4.2.1).  The original script
> failed for me on Windows but not Mac OS X.  Can some other people for
> whom the original script fails please try it?  (I also attached it to
> bug #775414.)

On Win98SE with current Python 2.3.1, it doesn't fail, but it never
seemed to finish for me either.  Staring at WinTop showed that the
Python process stopped accumulating cycles.  Can't be killed with
Ctrl+C (no visible effect).  Can be killed with Ctrl+Break.

Dumping

        print "%s %s" % (thread.get_ident(), i)

at the top of the hammer loop showed that the threads get through
several hundred iterations, then all printing stops.

Attaching to a debug-build Python from the debugger when a freeze
occurs isn't terribly illuminating.  One thread's stack shows

_BSDDB_D! __db_win32_mutex_lock + 134 bytes
_BSDDB_D! __lock_get + 2264 bytes
_BSDDB_D! __lock_get + 197 bytes
_BSDDB_D! __ham_get_meta + 120 bytes
_BSDDB_D! __ham_c_dup + 4201 bytes
_BSDDB_D! __db_c_put + 2544 bytes
_BSDDB_D! __db_put + 507 bytes
_DB_put(DBObject * 0x016cff88, __db_txn * 0x016d0000, __db_dbt * 0x016cc000, __db_dbt * 0x50d751fe, int 0) line 562 + 35 bytes

The main thread's stack shows

_BSDDB_D! __db_win32_mutex_lock + 134 bytes
_BSDDB_D! __lock_get + 2264 bytes
_BSDDB_D! __lock_get + 197 bytes
_BSDDB_D! __db_lget + 365 bytes
_BSDDB_D! __ham_lock_bucket + 105 bytes
_BSDDB_D! __ham_get_cpage + 195 bytes
_BSDDB_D! __ham_item_next + 25 bytes
_BSDDB_D! __ham_call_hash + 2479 bytes
_BSDDB_D! __ham_c_dup + 4307 bytes
_BSDDB_D! __db_c_put + 2544 bytes
_BSDDB_D! __db_put + 507 bytes
_DB_put(DBObject * 0x008fe2e8, __db_txn * 0x00000000, __db_dbt * 0x0062f230, __db_dbt * 0x0062f248, int 0) line 562 + 35 bytes
DB_ass_sub(DBObject * 0x008fe2e8, _object * 0x00b83178, _object * 0x00b83370) line 2330 + 23 bytes
PyObject_SetItem(_object * 0x008fe2e8, _object * 0x00b83178, _object * 0x00b83370) line 123 + 18 bytes
eval_frame(_frame * 0x00984948) line 1448 + 17 bytes
...

The other threads are somewhere in the OS kernel and don't have useful
tracebacks.  This varies from run to run, but all threads with a useful
stack are always stuck at the same place in __db_win32_mutex_lock.

All in all, looks like it's simply deadlocked.

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2003-09-28 15:11

Message:
Logged In: YES 
user_id=29957

Could you check that it (and the test_bsddb3) works on Solaris? There's
a couple of solaris boxes on the SF compile farm (cf.sf.net). I was
unable to get test_bsddb3 to complete at all on Solaris 2.6, 7 or 8,
when using DB 4.1.25.

As far as 2.3.2, I really really don't think it's appropriate to throw
it in at this late point. Particularly given the 2.3.1 screwups, I
don't want to risk it.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-28 09:08

Message:
Logged In: YES 
user_id=413

I just committed a change to bsddb/__init__.py (file rev 1.10) that adds the creation of a thread-safe DBEnv object for each hashopen, btopen or rnopen database.  hammer.py has been running for 5 minutes on my linux/alpha system using BerkeleyDB 4.1.25.  (admittedly my test is running on python 2.2.2, but as this isn't a python core related change i doubt that matters).

After others have tested this on other platforms with success I believe we can close this bug.  This patch would probably be good for python 2.3.2.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-09-28 04:10

Message:
Logged In: YES 
user_id=44345

If hammer.py fails for you, please try this slightly modified version
(studly_hammer.py).

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-13 08:28

Message:
Logged In: YES 
user_id=413

I don't see any problem in _bsddb.c:_DB_put(), what memory are you
talking about?  All of the DBT key and data parameters are allocated on
the local stack on the various DB methods that call _DB_put.  What do
you see that could be clobbered?

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-09-13 05:52

Message:
Logged In: YES 
user_id=44345

The sleepycat mails (there are two of them - Keith's is second) are in
the attached sleepy.txt file.

----------------------------------------------------------------------

Comment By: Richie Hindle (richiehindle)
Date: 2003-09-13 05:25

Message:
Logged In: YES 
user_id=85414

Sorry to muddy the waters, but I'm 99% sure that this is not a
threading issue.  Today I had the same DBRunRecoveryError for my
Spambayes POP3 proxy classifier database, which only ever gets accessed
from the main program thread.

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2003-09-13 05:22

Message:
Logged In: YES 
user_id=31392

I don't want to sound like a broken record, but I will: Can anyone
comment on the lack of thread-safety in _DB_put()?  It appears that
there is nothing to prevent the memory used by one call from being
stomped on by another call in a different thread.  This problem would
exist even in an application using the modern interface and specifying
DB_THREAD.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-13 05:10

Message:
Logged In: YES 
user_id=413

Looking at bsddb/__init__.py (where the old bsddb compatibility
interface is implemented) I don't see why the hammer.py attached
below should cause a problem.  The database is opened with
DB_THREAD using a private environment (no DBEnv passed to DB()).

I definitely see potential threading problems with the _DBWithCursor
class defined there if any of the methods using a cursor are used (the
cursor could be shared across threads; that's a no-no).  But in the
context of hammer.py that doesn't happen so I wouldn't have expected
a problem.  Unless perhaps creating the DB without a DBEnv implies
that the DB_THREAD flag won't work.  The DB_RECOVER flag is only
useful for opening existing DBEnv's; we have none.

I've got to pop offline for a bit now but I'll try a hammer.py modified to
use direct DB calls (for easier playing around with and bug reporting to
sleepycat if it turns out to be a bug on their end) later tonight.

PS  Keith's response is in the sleepy.txt attachment if you open the
URL to this bug report on sourceforge.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-09-13 05:07

Message:
Logged In: YES 
user_id=31435

Jeremy, Keith's response is in the sleepy.txt file attached to the bug
report.

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2003-09-13 05:03

Message:
Logged In: YES 
user_id=31392

I don't see Keith's response anywhere in this thread.  Can you add it
for the record?  The only call to db->put() that I see is in
_DB_put().  It does not look thread-safe to me.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-09-13 05:00

Message:
Logged In: YES 
user_id=44345

    The bsddb module emulates the old bsddb module's 1.85-ish
    interface using modern DB/DBEnv objects underneath.  So his
    comments about that not being threadsafe don't apply here.

But the low-level open() call isn't made with a DBEnv argument, is it?
Nor is the DB_RECOVER flag set.  Would the compatibility interface be
able to do both things?

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-09-13 04:57

Message:
Logged In: YES 
user_id=44345

In theory, yes, we could special case the bsddb stuff.  However, the
code currently is run indirectly via the anydbm module.  It will take a
little effort on our part to do something special for bsddb.  It would
be nice if other apps using the naive interface were able to use
multiple threads.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-13 04:45

Message:
Logged In: YES 
user_id=413

ah, Keith's response from sleepycat assumed that we were using the 
DB 1.85 compatibility interface.  We do not.  The bsddb module 
emulates the old bsddb module's 1.85-ish interface using modern 
DB/DBEnv objects underneath.  So his comments about that not being 
threadsafe don't apply here. 

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2003-09-13 04:37

Message:
Logged In: YES 
user_id=31392

Are the DB_mapping methods only used by the old interface?  My question
is about those methods, which I assumed were used by the old and new
interfaces.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-09-13 04:30

Message:
Logged In: YES 
user_id=413

The old bsddb interface compatibility code could be modified to use a 
single DBEnv per process opened with the DB_SYSTEM_MEM flag.  Do 
we want to do this?  Shouldn't we encourage the use of the real 
pybsddb DB/DBEnv object interface for threads instead?  AFAIK the old 
bsddb module + libs were not thread safe. 

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-09-13 04:23

Message:
Logged In: YES 
user_id=44345

From what I got back from Sleepycat on this, I'm pretty sure the old
bsddb interface is not going to be thread safe.  Attached are two
messages from Sleepycat.

Is there some way for the old interface to create a default environment
shared by all the bsddb.*open() calls and then set the DB_RECOVER flag
in the low-level open() call?

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2003-09-13 03:14

Message:
Logged In: YES 
user_id=31392

How does the bsddb wrapper achieve thread safety?

I know very little about the wrapper or the underlying bsddb
libraries.  I found the following comment in the C API docs:

http://www.sleepycat.com/docs/ref/program/mt.html#2

> When using the non-cursor Berkeley DB calls to retrieve
> key/data items (for example, DB->get), the memory to which the
> pointer stored into the Dbt refers is valid only until the next call
> using the DB handle returned by DB->open. This includes any
> use of the returned DB handle, including by another thread
> within the process.

This suggests that a call to self->db->get() must process its results
(copy them into Python-owned memory) before any other operation on the
same db object can proceed.  Is that right?

The bsddb wrapper releases the GIL before calling the low-level DB API
functions and then reacquires it after the call returns.  Is there some
other lock that prevents multiple simultaneous calls from stomping on
each other?
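
The hazard described in the quoted passage can be illustrated with a toy model.
This is not Berkeley DB or _bsddb.c code, and it makes no claim about whether
the wrapper is actually affected. `ToyHandle`, `get_view` and `get_copy` are
hypothetical names for a handle that hands back a view into one reusable
internal buffer, which is the behaviour the quote warns about.

```python
class ToyHandle:
    """Toy stand-in for a handle that reuses one internal result buffer."""

    def __init__(self, data, bufsize=64):
        self._data = dict(data)
        self._buf = bytearray(bufsize)  # shared by every call on this handle

    def get_view(self, key):
        """Return a view into the shared buffer (valid only until the next call)."""
        value = self._data[key]
        view = memoryview(self._buf)[:len(value)]
        view[:] = value  # overwrite whatever the previous call left behind
        return view

    def get_copy(self, key):
        """Copy the result into caller-owned memory before returning it."""
        return bytes(self.get_view(key))
```

A result obtained through `get_view` changes underneath the caller as soon as
any other call is made on the same handle, from any thread. A result from
`get_copy` does not, provided nothing else touches the handle between the
lookup and the copy, which is exactly the window the question above is about.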

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2003-09-13 02:46

Message:
Logged In: YES 
user_id=31392

I'm running this test with CVS Python (built on 9/11/03) on RH Linux 9
with bsddb 4.1.25.  I see the same error although it takes a relatively
long time to provoke -- a minute or two.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-09-13 02:08

Message:
Logged In: YES 
user_id=31435

Greg, any luck?  We're starting to see the same error ("fatal region
error detected") in some ZODB tests using bsddb3, and that's an
infinitely more complicated setup than this little program.  Jeremy
Hylton also sees "fatal region" errors on Linux, in the ZODB context.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2003-08-14 09:26

Message:
Logged In: YES 
user_id=413

i'll try and reproduce this.

----------------------------------------------------------------------

Comment By: Richie Hindle (richiehindle)
Date: 2003-07-22 18:50

Message:
Logged In: YES 
user_id=85414

Minor correction: I'm on Plain Old Win98, not SE.

For what it's worth, the script seems more often than not to provoke an
application error when there's background load, and a
DBRunRecoveryError when there isn't.

----------------------------------------------------------------------
