Wasted two days debugging my RefCount error : read my tale of woe so you don't do the same (LONG)

Brad Clements bkc at Murkworks.com
Thu Jul 22 12:58:30 EDT 1999


Yes, it's true. I must hang my head in shame for not refcounting correctly,
thereby causing my Python module (DLL) to crash when it's unloaded.

After two days of tracing around with the Borland Debugger (also having
entirely recompiled Python to use CodeGuard under Borland), I have finally
found the source of my bug.

It all started so innocently. You see, I wanted to be able to call Novell
NDS APIs from Python, so I looked to SWIG for a solution. What a great
little program!

But my program kept crashing every time the DLL module got unloaded.

The crash I got was "code at offset 0x0000000 referenced memory at
0x000000". Pretty useful, eh?

Now, I know this means that I either overwrote the stack and returned to 0,
or I had a pointer to a function that got overwritten with 0. Naturally I
suspected that the Novell APIs were trashing memory (I have to handle a
lot of large, arbitrary data buffers).

After much fussing and tracing and debugging I just couldn't get a
breakpoint to capture a memory overwrite. Eventually I whittled it down to a
dictionary cleanup, and finally got a breakpoint set in dictobject.c:

static void
dict_dealloc(mp)
 register dictobject *mp;
{
 register int i;
 register dictentry *ep;
 for (i = 0, ep = mp->ma_table; i < mp->ma_size; i++, ep++) {
  if (ep->me_key != NULL) {
   Py_DECREF(ep->me_key);
  }
  if (ep->me_value != NULL) {
   Py_DECREF(ep->me_value);   /* <-------- DIES HERE */
  }
 }


The DecRef of the value would eventually make a call to 0x00000000.

Aha, I thought, something is overwriting a type definition structure with
NULLs! Now, if I could find out what object type it was I could set a data
write breakpoint on the structure and I could catch the culprit function.

Let's see... run/break, not null, OK. run/break, not null, OK. run/break,
not null (keep this up about 250 times).

run/break, it's null. Yeah! Let's see... the object -> Py_None

Hmm. Where is Py_None's type? It's actually PyNothing_Type. Find the
structure address and set a breakpoint on it. Where's that global stored...
Ah, I see it.

What the f***

Looky here:

static PyTypeObject PyNothing_Type = {
 PyObject_HEAD_INIT(&PyType_Type)
 0,
 "None",
 0,
 0,
 0,  /*tp_dealloc*/ /*never called*/
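
To spell out why that zero matters, here is a toy sketch in plain C (not the
real CPython structs or macros; every toy_* name is made up for this
example). It mirrors the general shape of a DECREF: drop one count, and when
the count reaches zero, call through the type's dealloc slot. With an empty
slot, that call goes to address 0. The NULL check below is an addition so
the demo reports the problem instead of crashing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-ins for PyObject / PyTypeObject; NOT the real CPython structs. */
typedef struct toy_object toy_object;

typedef struct {
    const char *name;
    void (*dealloc)(toy_object *);   /* plays the role of tp_dealloc */
} toy_type;

struct toy_object {
    long refcnt;
    const toy_type *type;
};

/* What a toy DECREF did, so the example can report instead of crashing. */
enum decref_result { STILL_ALIVE, DEALLOCATED, WOULD_CALL_NULL };

/* Drop one reference; at zero, go through the dealloc slot.
   The NULL check exists only in this demo. */
static enum decref_result toy_decref(toy_object *op)
{
    assert(op->refcnt > 0);
    if (--op->refcnt != 0)
        return STILL_ALIVE;
    if (op->type->dealloc == NULL) {
        fprintf(stderr, "refcount of '%s' hit 0, dealloc slot is NULL\n",
                op->type->name);
        return WOULD_CALL_NULL;
    }
    op->type->dealloc(op);
    return DEALLOCATED;
}

/* A type whose dealloc slot is empty, like the excerpt above. */
static const toy_type toy_nothing_type = { "None", NULL };

/* Give an object `initial` references, release them all, and report
   what the last release did. */
static enum decref_result drain(long initial)
{
    toy_object none = { initial, &toy_nothing_type };
    enum decref_result r = STILL_ALIVE;
    for (long i = 0; i < initial; i++)
        r = toy_decref(&none);
    return r;
}
```

Calling drain(250) reports WOULD_CALL_NULL on the last release, which is the
toy version of "code at offset 0 referenced memory at 0".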


Suddenly, bells and whistles start going off.. Hmm. I recall something about
Py_None.. Must think.. lessee, a week ago, I said to myself "I don't think
this is quite right".. Now, what was that exactly... "yeah, refcount Py_None
in that swig typemap"..

Which brings me to the source of the problem, to wit:

Given an NDS function like this:

%apply STRINGBUF {nptr attrVal};

NWDSCCODE
NWDSGetAttrVal
(
   NWDSContextHandle context,
   pBuf_T            buf,
   nuint32           syntaxID,
   nptr              attrVal
);
%clear STRINGBUF;

SWIG creates a nice Python function to call NWDSGetAttrVal and return the
error code (NWDSCCODE) as an integer.

I thought, "hey, all NDS functions return 0 on success and non-zero on
error, so let's throw an exception if the return code is not 0".

So I wrote a typemap like this:

%typemap(python,out) NWDSCCODE  {
  if(0 != $source) {
    ThrowException($source,"$name",NULL);
      return NULL;
   } else {
    $target = Py_None;
   }
 }

The function ThrowException() sets a module global (argh! wish this was
per-thread, but I don't unlock the interpreter in there anyway).

So, given a function that has output args, SWIG produces code that looks
somewhat like this:

(the NDS call goes here)

  if(0 != _result) {
    ThrowException(_result,"NWDSGetAttrName",NULL);
      return NULL;
   } else {
    _resultobj = Py_None;
   }
 }
{
    PyObject *o;

    o = Py_BuildValue("s",_arg2);
    if ((!_resultobj) || (_resultobj == Py_None)) {
      _resultobj = o;
    } else {
  _resultobj = t_output_helper(_resultobj, o);
    }
}

In the above case, the function always returns an output variable, so the
Py_None in _resultobj is always replaced by the output value.

But what happens when the NDS API function only returns NWDSCCODE? Then
SWIG produces this:

    _result = (NWDSCCODE )NWDSGetAttrVal(_arg0,_arg1,_arg2,_arg3);
{
  if(0 != _result) {
    ThrowException(_result,"NWDSGetAttrVal",NULL);
      return NULL;
   } else {
    _resultobj = Py_None;
   }
 }
    return _resultobj;
}

Notice something funny?

THERE'S NO INCREF! This line is missing:

  Py_INCREF(Py_None);


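Without the INCREF, every call hands back Py_None without owning a
reference to it, and the interpreter DECREFs the result when it's done with
it. So each call drains one count, and Py_None eventually gets deallocated
when its refcount reaches 0 (which it normally never would, because it's
artificially referenced somewhere, I suspect).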
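
A toy model of the bookkeeping, in plain C with made-up names (nothing here
is the real Python API): the caller always releases whatever the wrapper
returns, so a wrapper that skips the INCREF costs the shared object one
count per call, while a wrapper that takes its own reference comes out even.
The starting count of 1 is just an assumption for the demo.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object with only a reference count; NOT the real PyObject. */
typedef struct { long refcnt; } toy_object;

/* The shared "None". Starting at 1 is an assumption for this demo. */
static toy_object toy_none = { 1 };

/* Buggy wrapper: hands out toy_none without taking a reference. */
static toy_object *wrapper_without_incref(void)
{
    return &toy_none;
}

/* Correct wrapper: the caller receives a reference it owns. */
static toy_object *wrapper_with_incref(void)
{
    toy_none.refcnt++;
    return &toy_none;
}

/* The caller's side: call, use the result, release it. */
static void call_and_discard(toy_object *(*wrapper)(void))
{
    toy_object *result = wrapper();
    assert(result != NULL);
    result->refcnt--;
}

/* Net change in toy_none's count after `calls` calls through `wrapper`. */
static long net_change(toy_object *(*wrapper)(void), long calls)
{
    long before = toy_none.refcnt;
    for (long i = 0; i < calls; i++)
        call_and_discard(wrapper);
    return toy_none.refcnt - before;
}
```

Here net_change(wrapper_with_incref, 100) is 0, while
net_change(wrapper_without_incref, 100) is -100. In the real interpreter the
count starts much higher, which is why the crash only shows up later, at
cleanup time.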

It seems that the simple solution is to change the typemap to this:

%typemap(python,out) NWDSCCODE  {
  if(0 != $source) {
    ThrowException($source,"$name",NULL);
      return NULL;
   } else {
      Py_INCREF(Py_None);
    $target = Py_None;
   }
 }


But then, I'll get code like this:

  if(0 != _result) {
    ThrowException(_result,"NWDSGetAttrName",NULL);
      return NULL;
   } else {
    Py_INCREF(Py_None);
    _resultobj = Py_None;
   }
 }
{
    PyObject *o;

    o = Py_BuildValue("s",_arg2);
    if ((!_resultobj) || (_resultobj == Py_None)) {
      _resultobj = o;
    } else {
  _resultobj = t_output_helper(_resultobj, o);
    }
}

See the problem? I'm incrementing Py_None's refcount, but then _resultobj
gets overwritten and that reference is never released. So the count keeps
going up. Eventually it'll wrap (heh, well, probably not for a long time).

I suspect that never freeing Py_None is OK, since the source says its
dealloc is never called, but I'm unhappy about incrementing endlessly.

Anyone have any suggestions on how to fix this with better typemaps? What I
need is a typemap for "functions with argout" and a different one for
"functions without argout".

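One possible way out, sketched as a toy model rather than real SWIG output:
keep the INCREF in the out typemap, and have the argout code give that
reference back whenever it throws the Py_None placeholder away (in the
generated code above, that would mean a Py_DECREF(Py_None) just before the
"_resultobj = o;" line). This assumes the argout typemap is yours to edit.
All names below are made up, and the t_output_helper merge case is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object with only a reference count; NOT the real PyObject. */
typedef struct { long refcnt; } toy_object;

static toy_object toy_none = { 1 };   /* starting count is arbitrary */

/* "out" step: always return an owned reference to None. */
static toy_object *out_step(void)
{
    toy_none.refcnt++;
    return &toy_none;
}

/* "argout" step: replace a None placeholder with the output value `o`.
   If release_placeholder is nonzero, the placeholder's reference is
   given back first (the proposed fix). */
static toy_object *argout_step(toy_object *result, toy_object *o,
                               int release_placeholder)
{
    assert(o != NULL);
    if (result == &toy_none) {
        if (release_placeholder)
            result->refcnt--;
        return o;
    }
    if (result == NULL)
        return o;
    /* A real result is already present. The generated code would merge
       both via t_output_helper; that case is outside this sketch. */
    return result;
}

/* Net change in toy_none's count after `calls` simulated wrapper calls. */
static long net_change(int has_argout, int release_placeholder, long calls)
{
    long before = toy_none.refcnt;
    for (long i = 0; i < calls; i++) {
        toy_object value = { 1 };        /* stands in for the built value */
        toy_object *result = out_step();
        if (has_argout)
            result = argout_step(result, &value, release_placeholder);
        result->refcnt--;                /* caller releases the result */
    }
    return toy_none.refcnt - before;
}
```

With the release in place, net_change(1, 1, 1000) and net_change(0, 1, 1000)
are both 0, so one out typemap serves both kinds of function. Without it,
net_change(1, 0, 1000) is 1000, which is the endless increment described
above.
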
NOTES AND COMMENTS:

I would have saved a lot of time if Py_None's dealloc function pointed to a
real function that would print a message to stderr saying "You dummy, you're
trying to garbage collect Py_None".
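
Roughly what I have in mind, as a toy sketch in plain C (made-up names, not
the real type structures): a dealloc that complains on stderr instead of
being an empty slot. This version also revives the object so the demo can
keep running; a real one might reasonably abort instead.

```c
#include <assert.h>
#include <stdio.h>

/* Toy stand-ins for PyObject / PyTypeObject; NOT the real CPython structs. */
typedef struct toy_object toy_object;

typedef struct {
    const char *name;
    void (*dealloc)(toy_object *);   /* plays the role of tp_dealloc */
} toy_type;

struct toy_object {
    long refcnt;
    const toy_type *type;
};

static int guard_hits = 0;   /* how many times the guard has fired */

/* Guard dealloc: report the refcount bug loudly instead of jumping to 0. */
static void none_guard_dealloc(toy_object *op)
{
    guard_hits++;
    fprintf(stderr, "refcount bug: attempt to deallocate %s\n",
            op->type->name);
    op->refcnt = 1;   /* revive the singleton so debugging can continue */
}

static const toy_type guarded_none_type = { "None", none_guard_dealloc };
static toy_object guarded_none = { 1, &guarded_none_type };

/* Drop one reference; at zero, call the type's dealloc. */
static void toy_decref(toy_object *op)
{
    assert(op->refcnt > 0);
    if (--op->refcnt == 0)
        op->type->dealloc(op);
}
```

One stray toy_decref(&guarded_none) now produces a message at the moment
the count hits zero, rather than a jump to address 0 much later.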


If you're compiling Python under Borland C, you can pretty much copy the VC
options in config.h, but you need to add this:

#define MALLOC_ZERO_RETURNS_NULL 1

Enjoy the rest of the week, I've wasted mine :-(

--
Brad Clements,
bkc at murkworks.com





