[C++-sig] RTTI across shared library boundaries

Mon May 25 01:24:06 CEST 2009

Dear cplusplus-sig Folks:

I'm the maintainer of pycryptopp [1], a library whose main but not
sole user is Tahoe-LAFS [2].  I've recently stumbled across the
problem of RTTI crossing shared library boundaries, which seems to be
a well-known problem (see e.g. [3]) but without, as far as I can tell,
a well-known solution.

Pycryptopp is mostly just Python wrappers for the Crypto++ library
[4].  The current status is that pycryptopp builds and passes all of
its unit tests [*, **] by building Python modules such as aes.so and
rsa.so from a combination of Crypto++ object files and pycryptopp
object files.  However, we're in the process of getting pycryptopp and
Tahoe-LAFS included in Debian and Fedora, and those two Linux
distributions have a policy that code which re-uses a separate library
has to dynamically link to the distribution-provided library instead
of bundling a copy of that library.  This is so that the distribution
maintainers can easily control the combination of libraries included
in their distribution -- for example if they want to upgrade Crypto++
or apply a patch to Crypto++ (such as a security patch), they need do
so only for the one copy of the shared library, and not for each
package which uses it.

So, I changed the pycryptopp setup.py so that if you pass the option
"--disable-embedded-cryptopp" to the "build" command it will stop
using its own internal copy of Crypto++ and instead simply link to
-lcryptopp.  Now the trouble starts.  An exception thrown by
libcryptopp.so cannot be caught by its specific type
(CryptoPP::InvalidKeyLength) in aes.so.  Investigating this leads me
to the well-known problem of RTTI comparison across shared library
boundaries, and the potential work-around of passing the RTLD_GLOBAL
flag to dlopen().  Trying that work-around makes this problem go away, but
then if I load more than one .so which dynamically links to
libcryptopp.so, the second and later ones that get loaded are messed
up in a way that quickly leads to a crash (see the valgrind-generated
stack trace in [5] to see what I mean).
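
For reference, the RTLD_GLOBAL work-around can be applied from the
Python side roughly as follows.  This is a minimal sketch, assuming a
modern CPython on a POSIX platform where sys.setdlopenflags() and
os.RTLD_GLOBAL are available (older Pythons exposed the constant
elsewhere).  The helper name global_dlopen_flags is made up for
illustration, and the commented-out import is only a placeholder.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def global_dlopen_flags():
    """Temporarily add RTLD_GLOBAL to the flags CPython passes to dlopen().

    Hypothetical helper: extension modules imported inside the "with"
    block are loaded with their symbols made globally visible.
    """
    old_flags = sys.getdlopenflags()
    sys.setdlopenflags(old_flags | os.RTLD_GLOBAL)
    try:
        yield
    finally:
        # Restore the previous flags so that unrelated imports are unaffected.
        sys.setdlopenflags(old_flags)


if __name__ == "__main__":
    with global_dlopen_flags():
        # Placeholder for the real import, e.g.:
        # from pycryptopp.cipher import aes
        pass
```

Restoring the flags afterwards keeps the change scoped to the one
import, but it does nothing about the crash described above when a
second .so linked to libcryptopp.so is loaded.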

There is another problem with the same root cause, which is that
Crypto++ uses RTTI for a named-argument feature, see [6] for details.

I'm now considering a few ways forward:

1.  Persuade Debian and Fedora to accept pycryptopp and Tahoe-LAFS
using Crypto++ code compiled directly into the pycryptopp .so files
instead of dynamically linked.

2.  Refactor pycryptopp so that there is only one .so file, named for
example _pycryptopp.so, which is dynamically linked to libcryptopp.so,
and the separate modules for aes, sha256, rsa, ecdsa, etc. would each
import a subset of the Python names defined by _pycryptopp.so, and
then use RTLD_GLOBAL to load _pycryptopp.so.  This would, I think,
solve all currently known issues, but it does mean, for example, that
if anybody ever imports both pycryptopp and another Python module that
links to libcryptopp.so into the same Python process, one of them
will be screwed up and the process will quickly crash.

3.  Resign myself to working around the lack of portable RTTI crossing
shared library boundaries in the pycryptopp source code.  Brian Warner
has already submitted patches for pycryptopp (see [6]) to work around
the two known problems by (a) not catching CryptoPP::InvalidKeyLength
exception by its specific type and instead catching any type of
exception, and (b) not providing the hex-encoding feature which
happens to exercise Crypto++'s named-arguments feature.  I could
accept those two patches and resign myself to a fate of being unable
to safely use some ill-understood subset of the Crypto++ API.

4.  Figure out how to build an aes.so that has the relevant RTTI
symbols marked as "these must be satisfied by some other dynamic
library".  I'm not sure if this is possible, but it seems to be how
things are done on Windows.  I read this page from the gcc wiki [7]
and experimented quite a bit with it.  When I started, using "nm" on
libcryptopp.so would show this:

$ nm -C /usr/local/lib/libcryptopp.so | grep "typeinfo for CryptoPP::InvalidKeyLength"
00000000008747b0 V typeinfo for CryptoPP::InvalidKeyLength

And on my aes.so, it would show this:

$ nm -C ./pycryptopp/cipher/_aes.so | grep "typeinfo for CryptoPP::InvalidKeyLength"
0000000000214cf0 V typeinfo for CryptoPP::InvalidKeyLength

After extensive exploration of the new gcc visibility features, I
finally managed to build an aes.so like this:

$ nm -C ./pycryptopp/cipher/aes.so | grep "typeinfo for CryptoPP::InvalidKeyLength"
0000000000214cf0 d typeinfo for CryptoPP::InvalidKeyLength

Oops!  In other words, I managed to make the typeinfo symbol private
to aes.so instead of dynamic, thus guaranteeing that the exception
won't be caught even if I *do* specify RTLD_GLOBAL.  It sort of seems
like gcc offers the rough equivalent of Microsoft's "dllexport"
attribute, but not the rough equivalent of Microsoft's "dllimport"
attribute -- something that would, for example, force the symbol to
appear as "U" -- undefined -- in the .so's symbol table so that the
symbol's value would *have* to be provided by another DSO (in this
case libcryptopp.so) at load-time.  On the other hand, if I changed
libcryptopp.so so that the symbol was marked as non-weak, such as "T",
instead of its current weak type, "V", then maybe the loader would
have rewritten the value of the weak symbol in aes.so and the
exception would have been caught.  I don't see how to do that,
either.
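
To make the symbol-type letters above easier to compare, here is a
small sketch that parses lines of "nm -C" output and labels the
letters mentioned in this message.  The helper names and the table of
meanings are illustrative only (the meanings paraphrase the nm man
page and are not exhaustive), and the "U" sample line is hypothetical:
it shows the desired result, not real output.

```python
import re

# Meanings of the nm symbol-type letters discussed in this message
# (paraphrased from the nm man page; not an exhaustive table).
NM_TYPES = {
    "V": "weak object, global binding",
    "d": "initialized data, local binding",
    "U": "undefined, must be supplied by another object at link or load time",
    "T": "text (code) section, global binding",
}

# An nm line is "<hex address> <type letter> <symbol name>"; the address
# column is blank for undefined symbols.
NM_LINE = re.compile(r"^([0-9a-fA-F]*)\s+([A-Za-z?-])\s+(.+)$")


def parse_nm_line(line):
    """Return (type_letter, symbol_name), or None if the line does not parse."""
    match = NM_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return match.group(2), match.group(3)


def describe(line):
    """Return a one-line, human-readable summary of an nm output line."""
    parsed = parse_nm_line(line)
    if parsed is None:
        return "unparsed: " + line.strip()
    letter, name = parsed
    meaning = NM_TYPES.get(letter, "not covered by this sketch")
    return "%s [%s]: %s" % (name, letter, meaning)


if __name__ == "__main__":
    samples = [
        "00000000008747b0 V typeinfo for CryptoPP::InvalidKeyLength",
        "0000000000214cf0 d typeinfo for CryptoPP::InvalidKeyLength",
        # Hypothetical: the "imported" form that option 4 is after.
        "                 U typeinfo for CryptoPP::InvalidKeyLength",
    ]
    for sample in samples:
        print(describe(sample))
```

Piping the real nm output for libcryptopp.so and aes.so through
describe() gives a quick side-by-side view of which object defines the
typeinfo symbol and with what binding.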

Okay, here's the question: do you know of any alternative besides
these four, and if not, which of these four do you recommend?

Thank you very much.

Regards,

Zooko Wilcox-O'Hearn

[*] Actually it fails one of the unit tests consistently on Mac OS
10.5/Intel, but not on Mac OS 10.4 or on any of the other platforms.
The failure *does* have something to do with RTTI since it is a
failure to downcast, but other than that I don't have any reason to
believe that it is related to the rest of this message, and I haven't
investigated it yet.  See the pycryptopp buildbot on Mac OS 10.5:
http://allmydata.org/buildbot-pycryptopp/builders/mac-i386-osx-10.5-faust

[**] Oh, and there's a mysterious problem on an ARMv5 CPU in which a
memory buffer seems to be shifted by one byte, also probably
unrelated: http://allmydata.org/buildbot-pycryptopp/builders/zandr-linkstation

[1] http://allmydata.org/trac/pycryptopp
[2] http://allmydata.org/trac/tahoe
[3] http://mail.python.org/pipermail/python-dev/2002-May/024075.html
[4] http://cryptopp.com
[5] http://groups.google.com/group/cryptopp-users/browse_thread/thread/eb815f228db50380
[6] http://allmydata.org/trac/pycryptopp/ticket/9
[7] http://gcc.gnu.org/wiki/Visibility