[pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans

Nathan Hurst njh at njhurst.com
Sun Mar 30 17:29:05 CEST 2014

On Sun, Mar 30, 2014 at 03:36:08PM +0300, anatoly techtonik wrote:
> On Sun, Mar 30, 2014 at 2:40 PM, Kenny Lasse Hoff Levinsen
> <kennylevinsen at gmail.com> wrote:
> > Okay, just to get things right: What you want is an only-ABI solution, which abstracts completely away from technical details, in a nice pythonic wrapper?
> Not really. I want a language independent ABI solution, yes. ABI-only
> implies that there is some alternative to that. I don't see any
> alternative - for me ABI is the necessary basis for everything else on
> top. So doing ABI solution design without considering these use cases
> is impossible. I want a decoupled ABI level.

If you want to handle purely binary data in Python there are already
some good options: struct and numpy.  The first is optimised for short
records, the latter for multidimensional arrays of binary data.

I've used both on occasion for binary files and cross-language communication.
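
As a minimal sketch of the struct approach (the record layout and the
function names are made up for illustration), the format string states
byte order and field sizes explicitly, so the resulting bytes do not
depend on the machine that produced them:

```python
import struct

# Hypothetical record: a 16-bit id, a 32-bit counter and a 64-bit float.
# "<" selects little-endian byte order with standard sizes and no
# padding, so the layout is the same on every platform.
RECORD = struct.Struct("<HId")

def encode_record(ident, count, value):
    """Pack one record into its 14-byte binary form."""
    return RECORD.pack(ident, count, value)

def decode_record(data):
    """Unpack a 14-byte record back into a tuple of Python values."""
    return RECORD.unpack(data)

packed = encode_record(7, 100000, 1.5)
```

numpy offers the same kind of explicit layout for whole arrays through
dtype strings such as "<u2" or "<f8".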

> Nice pythonic wrapper that abstracts completely from technical details
> is not the goal. The goal is to provide practical defaults for
> language and hardware independent abstraction. The primary object that
> abstraction should work with is "platform-independent binary data",
> the method is "to be readable by humans". On implementation level that
> means that by default there is no ambiguity in syntax that defines
> binary data (size or endianness), and if there is a dependency on
> platform (CPU bitness etc.) it should be explicit, so that the
> behavior of structure should be clear (self-describing type names +
> type docs that list relevant platforms and effects on every platform).
> This approach inverts existing practice of using platform dependent
> binary structures by default.

So struct and numpy are existing solutions to this problem, but I
think you are thinking too low level for most problems.  It sounds
like what you really want is a schema based serialisation protocol.
There are a million of these, all alike, but the two I've used the
most are msgpack and thrift.  Generally you want to be specifying
statistical distributions rather than the specific encoding scheme;
the choice between 4 byte and 8 byte ints is not particularly useful
at the Python programmer level, but knowing whether the number is a
small int or a large one is.  Thrift is the best implementation of a
remote call I've used, but it still leaves a lot of room for improvement.
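
To make the small-int versus large-int point concrete, here is a
minimal sketch of a base-128 variable-length integer encoding, the
general kind of scheme serialisation formats use so that small values
take fewer bytes.  The function names are illustrative and this is not
the msgpack or thrift wire format:

```python
def encode_varint(n):
    """Encode a non-negative int using 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("only non-negative integers are handled here")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)

def decode_varint(data):
    """Decode the integer produced by encode_varint."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")
```

The caller never chooses between a 4 byte and an 8 byte field; the
encoded size simply follows the magnitude of the value.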

If you are actually talking about building something that will talk to
any existing code in any existing language, then you will need
something more like CFFI.  However, I don't think you want to take the
C out of CFFI.  The reason is that the packing of the data is actually
one of the least important parts of that interface.  As you've read on
pypy-dev recently, reference counting vs gc is hard to get right.  But
there are many other problems which you have to address for a truly
language independent ABI:

Stack convention: in what order are things pushed onto the stack?
(IIRC the C and Pascal calling conventions push their arguments in
opposite orders.)  Are things even pushed onto a stack?  (LISP stacks
are implemented with linked lists, and some languages don't even bother
with stack frames when they perform tail recursion removal, using
conditional gotos instead.)

Packing conventions: different CPUs like pointers to be even, or
aligned to 8 bytes.  Your code won't be portable if you don't handle
this.
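
A small sketch of how alignment changes a layout, using the struct
module's native ("@") and standard ("=") modes.  The two-field layout
is an arbitrary example, and the native size depends on the platform
the code runs on:

```python
import struct

# One signed char followed by one int.
FIELDS = "bi"

# "@" (native): the platform's own sizes and alignment, so the compiler's
# padding between the char and the int is included.
native_size = struct.calcsize("@" + FIELDS)

# "=" (standard): fixed sizes and no padding, the same on every platform.
standard_size = struct.calcsize("=" + FIELDS)

print("native:", native_size, "standard:", standard_size)
```

On common platforms the native size is larger than the standard one,
because padding is inserted so that the int starts on an aligned
address.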

Exception handling: There are as many ways to handle exceptions as
there are compilers, all of them with subtle rules around lifetimes of
all the objects that are being excepted over.

Virtual methods and the like: In most languages (C is actually
somewhat unusual here) methods or functions are called via a
dereference or two rather than at a direct memory location.  For C++
virtual methods and in Java this looks like a field lookup to find the
vtable, then a jump to a memory location from that table.  This is
tedious and error prone to implement correctly.
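
The double dereference can be mimicked in Python as a rough
illustration.  This is only a model of the idea, not how any real C++
or Java implementation lays out its objects; the slot numbers, the
dict-based objects and all names are invented for the example:

```python
# Slot numbers are agreed in advance; callers only know the index.
SLOT_AREA = 0
SLOT_NAME = 1

def square_area(obj):
    return obj["side"] ** 2

def square_name(obj):
    return "square"

def rect_area(obj):
    return obj["width"] * obj["height"]

def rect_name(obj):
    return "rect"

# One table per "class", shared by all of its instances.
SQUARE_VTABLE = [square_area, square_name]
RECT_VTABLE = [rect_area, rect_name]

def make_square(side):
    return {"vtable": SQUARE_VTABLE, "side": side}

def make_rect(width, height):
    return {"vtable": RECT_VTABLE, "width": width, "height": height}

def call_virtual(obj, slot, *args):
    vtable = obj["vtable"]     # first dereference: find the table
    method = vtable[slot]      # second dereference: find the code
    return method(obj, *args)  # jump, passing the object as "this"
```

A foreign caller has to know the table's position inside the object,
the slot order and the convention for passing the object itself, none
of which is visible in a plain C function signature.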

Generics and types: just working out which function to call is
difficult once you have C++ templates, and performing the correct type
checking is difficult for Java.  Haskell's internal type specification
alone is probably larger than all of the CFFI interfaces put together.

There are no doubt whole areas I've missed here, but I hope this gives
you a taste for why language developers love CFFI - it's easy to
implement, easy enough to use and hides all of these complexities by
pushing everything through the bottleneck which is the C ABI.
Language developers would rather make their own language wonderful;
compiler writers would rather make their own language efficient.
Nobody really cares enough to make supporting other languages directly
a goal.  The biggest and best effort in this regard was .NET, and it
also cheated by folding everything through a small bottleneck (albeit
a more expressive one).  This folding scars the languages around it:
F# is still less pure than Haskell, J# is incompatible with Java.

If you want to tackle something in this area I would encourage you to
look at the various serialisation tools out there and work out why
they are not as good as they could be.  This is probably not the right
mailing list for this discussion though.

