[Python-Dev] Problems with Python's default dlopen flags

David Abrahams <david.abrahams@rcn.com>
Sun, 5 May 2002 08:33:29 -0500


From: "Martin v. Loewis" <martin@v.loewis.de>
> "David Abrahams" <david.abrahams@rcn.com> writes:
>
> > Did you misread my suggestion? I didn't say that RTLD_GLOBAL should be
> > the default way to load an extension module, only that there should be a
> > way for the module itself to determine how it's loaded.
>
> I dismissed your suggestion as being too complex.

Fair enough [explicit dismissal helps to reduce confusion].

> There are a number
> of questions involved which I cannot answer that may affect usability
> of this approach; most of them have to do with dlclosing the library:
>
> 1. If the extension module is C++ code, dlopening the module will run
>    constructors for global objects. dlclosing it will run destructors.
>    So the dlopen/dlclose/dlopen cycle might have side effects; that
>    might be confusing.

and quite possibly unacceptable to users; granted. I hadn't thought of
that.
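
A minimal sketch of the side effect described in point 1. It does not call
dlopen at all: it assumes that a scoped object lifetime is a fair stand-in
for one load/unload pair, and the names ModuleGlobal, event_log and
reload_cycle are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order in which constructors and destructors run.
std::vector<std::string> event_log;

// Stand-in for a global object defined inside an extension module.
struct ModuleGlobal {
    ModuleGlobal()  { event_log.push_back("ctor"); }
    ~ModuleGlobal() { event_log.push_back("dtor"); }
};

// Simulates dlopen / dlclose / dlopen. This is an assumption made for
// illustration: each scope below plays the role of one loaded lifetime.
void reload_cycle() {
    {
        ModuleGlobal first_load;   // "dlopen": constructor runs
    }                              // "dlclose": destructor runs
    ModuleGlobal second_load;      // "dlopen" again: constructor runs again
}                                  // destructor runs when the function returns
```

Any state the constructor sets up (open files, registered callbacks) is
therefore torn down and rebuilt once per cycle, which is the surprise a user
of the module would see.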

> 2. Again, with C++ code, on Linux, with gcc 2.95.x, a block-local
>    static object will register its destructor with atexit(3). When
>    the module is dlclosed, the code to be called at exit goes away;
>    then the program crashes at exit. This is undesirable.

and absolutely unacceptable to me, so I guess that approach is out.
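
For reference, this is the kind of object point 2 is about. The sketch is a
single translation unit, not a shared library, and Logger, get_logger and
constructions are hypothetical names; it only shows that a block-local
static is constructed once, on first use, and that its destructor is
deferred until the process exits.

```cpp
#include <cassert>
#include <cstdio>

static int constructions = 0;  // how many times Logger has been built

struct Logger {
    Logger()  { ++constructions; std::puts("Logger constructed"); }
    ~Logger() { std::puts("Logger destroyed at exit"); }
};

// Hypothetical function standing in for code inside an extension module.
Logger& get_logger() {
    static Logger instance;  // built on the first call only; the runtime
                             // arranges for ~Logger() to run at process exit
    return instance;
}
```

If the code for ~Logger() lived in a module that had been unloaded before
exit, the deferred call would jump into unmapped memory, which is the crash
described above.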

> 3. If the module is also used as a library that some other module
>    links against, I'm not sure what the semantics of dlclose is.  I'd
>    feel uncomfortable with such a feature if I don't know precisely
>    how it acts in boundary cases.

I would hope that it would be nicely reference-counted, but if *you*
don't know that answer I'm guessing it's undocumented and I wouldn't
want to count on that either.

> > And in fact, I expect to ask users to do something special, like
> > explicitly linking between extension modules
>
> Indeed, that should also work fine - if users explicitly link
> extension modules against each other, they should be able to share
> symbols. The need for RTLD_GLOBAL only occurs when they want to share
> symbols, but don't want to link the modules against each other.

Heh, that's what I'd have thought, too.

> > However, this is what I didn't expect: the lack of RTLD_GLOBAL flags
> > interferes with the ability for ext1.so to catch C++ exceptions
> > thrown by libboost_python.so!
>
> That is surprising indeed, and hard to believe. Can you demonstrate
> that in a small example?

Unfortunately, the only reproducible case we have is not exactly small.
However, we can give anyone interested full access to the machine and
test case where it's occurring [details appended at the bottom of this
message].

> > Are you suggesting that in order to do this, my users need to add
> > yet another .so, a thin layer between Python and the guts of their
> > extension modules?
>
> Originally, that's what I suggested. I now think that, for symbol
> sharing, linking the modules against each other should be sufficient.
>
> > > Now, people still want to share symbols across modules. For that, you
> > > can use CObjects: Export a CObject with an array of function pointers
> > > in module A (e.g. as A.API), and import that C object in module B's
> > > initialization code. See cStringIO and Numeric for examples.
> >
> > Of course you realize that won't help with C++ exception tables...
>
> Actually, I don't: I can't see what C++ exception tables have to do
> with it - the exception regions are local in any case.

You're right, I mis-spoke: it has nothing to do with exception tables.
It's an RTTI problem: the type-specific catch clause exception-handler
doesn't catch the thrown exception.
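
A sketch of the symptom only, not of the confirmed cause. It assumes that
the runtime ends up treating the thrown type and the type named in the catch
clause as two different types; the two namespaces below are a hypothetical
stand-in for that situation.

```cpp
#include <cassert>
#include <string>

// Two unrelated types that merely share a name. Assumption for
// illustration: this models the thrower and the catcher disagreeing
// about the identity of the exception type.
namespace lib_copy { struct argument_error {}; }  // as seen by the thrower
namespace ext_copy { struct argument_error {}; }  // as seen by the catcher

std::string dispatch() {
    try {
        throw lib_copy::argument_error();
    } catch (const ext_copy::argument_error&) {
        return "typed handler";      // skipped: the types do not match
    } catch (...) {
        return "catch-all handler";  // the exception lands here instead
    }
}
```

Handler selection is driven by type identity, so once the two sides disagree
about the type, only a catch-all clause can intercept the exception.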

> > ...which leads us back to the fact that the smarts are in the wrong
> > place. The extension module writer knows that this particular
> > extension needs to share symbols, and once the module is loaded it's
> > too late.
>
> The extension module writer can't possibly have this knowledge - to
> know whether it is _safe_ to share symbols, you have to know the
> complete set of extension modules in the application.

Yes, but if the module /requires/ symbol sharing it sort of doesn't
matter whether it's safe. If you don't share symbols, the module won't
work (and will probably crash), just as when you share symbols and
there's a clash you'll probably get a crash.

> If a single
> module uses your proposed feature, it would export its symbols to all
> other extensions - whether they want those symbols or not. Hence you
> might still end up with a situation where you can't use two extensions
> in a single application because of module clashes.

Yep. The existence of namespaces in the C++ case may mitigate the
situation somewhat, but I understand the problem.

> > So give setdlopenflags a "force" option which overrides the setting
> > designated by the extension module. I realize it's messy (probably too
> > messy). If I could think of some non-messy advice for my users that
> > avoids a language change, I'd like that just as well.
>
> For that, I'd need to understand the problem of your users first. I'm
> unhappy to introduce work-arounds for incompletely-understood
> problems.

I agree with that attitude. And after all, my suggestion was really just
a straw-man.

---------

Environment:
  RedHat 7.1 on Intel
  gcc 3.0.4 used for everything, incl. Python compilation.
  All our code (but not Python) is compiled with -g.
  Example compile and link lines:


  g++ -fPIC -ftemplate-depth-50 -DNDEBUG -w -g \
      -I"/net/cci/rwgk/phenix/include" -I"/net/cci/rwgk/cctbx/include" \
      -I"/net/cci/rwgk/boost" \
      -I/usr/local_cci/Python-2.2.1_gcc304/include/python2.2 \
      -c refinementtbxmodule.cpp
  g++ -shared -Wl,-E -o refinementtbx.so refinementtbxmodule.o error.o \
      -L/net/taipan/scratch1/rwgk/py221gcc304/lib -lboost_python -lm

Here is the core of where things go wrong (part of the Boost.Python
library):

        try
        {
            PyObject* const result = f->do_call(args, keywords);
            if (result != 0)
                return result;
        }
        catch(const argument_error&)
        {
// SHOULD COME BACK HERE
        }
        catch(...) // this catch clause inserted for debugging
        {
// BUT COMES BACK HERE
          throw;
        }

The exact same source works with Compaq cxx on Alpha (and also Visual
C++ 6 & 7, Win32 CodeWarrior 7.0). I put in print statements and
compared the output from the Linux machine with the output from the
Alpha. This shows that under Linux the identity of the exception is
somehow lost:  Alpha comes back where it says "SHOULD COME BACK HERE",
Linux comes back where it says "BUT COMES BACK HERE".  This is after
many successful passes through "SHOULD COME BACK HERE".

The problem is very elusive. Most of the time the exception handling
works correctly. For example, if we change small details about the order
in which exceptions are thrown, gcc works just fine (execution flow still
passes repeatedly through "SHOULD COME BACK HERE"). Also, changing the
dlopen flags used when loading extension modules to include RTLD_GLOBAL
makes the problem disappear.
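
At the dlopen level the change amounts to one flag. The sketch below is not
Python's actual loader: module_dlopen_flags and open_module are hypothetical
helpers, and the choice of RTLD_NOW for binding is an assumption. From a
Python script the same flags are adjusted with sys.setdlopenflags before the
extension module is imported.

```cpp
#include <cassert>
#include <dlfcn.h>

// Hypothetical helper: pick the dlopen flags for an extension module.
// With RTLD_GLOBAL the module's symbols become available for resolving
// references in libraries loaded later; with RTLD_LOCAL they do not.
inline int module_dlopen_flags(bool share_symbols) {
    return RTLD_NOW | (share_symbols ? RTLD_GLOBAL : RTLD_LOCAL);
}

// Hypothetical loader: returns a null pointer on failure, in which case
// dlerror() describes what went wrong.
inline void* open_module(const char* path, bool share_symbols) {
    return dlopen(path, module_dlopen_flags(share_symbols));
}
```

Calling open_module("./refinementtbx.so", true) would correspond to the
configuration in which the problem disappears, and passing false to the one
in which it shows up.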

It is very difficult to generate a simple test case (our application
requires both Python and Boost). I have spent days
trying to isolate what is going wrong, unfortunately to no avail.

Suggestion:

I (Ralf) could set up an account for you on our machines. Using this
account you would have direct access to all the source code files and
compiled binaries that are needed to reproduce our problem. You could
directly enter gdb to investigate.