[pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans

Thu Apr 3 10:33:35 CEST 2014

On Sun, Mar 30, 2014 at 6:29 PM, Nathan Hurst <njh at njhurst.com> wrote:
>
> If you want purely binary data in python there are already some
> good options: struct and numpy.  The first is optimised for short
> records, the latter for multidimensional arrays of binary data.

numpy is not Python. As for struct, I covered it in the first post: it is
a legacy interface that was not designed to be human-friendly by the
usability standards of 2014.
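
As a rough illustration of the readability gap (a sketch, not an existing
library): the type names below ("u16be", "u32le", ...) and the
unpack_fields helper are hypothetical, layered on top of the real struct
module so that size and byte order are always visible in the name.

```python
import struct

# Hypothetical self-describing type names mapped onto struct format codes.
# Both the size and the byte order are spelled out in every multi-byte name.
TYPES = {
    "u8": "<B",
    "u16le": "<H", "u16be": ">H",
    "u32le": "<I", "u32be": ">I",
    "u64le": "<Q", "u64be": ">Q",
}

def unpack_fields(layout, data, offset=0):
    """Decode (name, type_name) pairs from data and return a dict of values."""
    values = {}
    for name, type_name in layout:
        fmt = TYPES[type_name]
        (values[name],) = struct.unpack_from(fmt, data, offset)
        offset += struct.calcsize(fmt)
    return values

record = unpack_fields(
    [("version", "u8"), ("length", "u16be"), ("checksum", "u32le")],
    bytes([0x01, 0x00, 0x10, 0x78, 0x56, 0x34, 0x12]),
)
```

The same record in plain struct notation would be two calls with the
format strings ">BH" and "<I", which says nothing about what the fields
mean.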

> I've used both on occasion for binary file and cross language communications.
>
>> Nice pythonic wrapper that abstracts completely from technical details
>> is not the goal. The goal is to provide practical defaults for
>> language and hardware independent abstraction. The primary object that
>> abstraction should work with is "platform-independent binary data",
>> the method is "to be readable by humans". On implementation level that
>> means that by default there is no ambiguity in syntax that defines
>> binary data (size or endianness), and if there is a dependency on
>> platform (CPU bitness etc.) it should be explicit, so that the
>> behavior of structure should be clear (self-describing type names +
>> type docs that list relevant platforms and effects on every platform).
>> This approach inverts existing practice of using platform dependent
>> binary structures by default.
>
> So struct and numpy are existing solutions to this problem, but I
> think you are thinking too low level for most problems.  It sounds
> like what you really want is a schema based serialisation protocol.

If you can assemble an ELF binary in that protocol, then yes. My goal is
to work with low-level binary data in Python using convenient tools. Not
necessarily high-level tools, just ones convenient for messing with chunks.
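
To make the ELF case concrete, here is a minimal sketch that assembles
and decodes only the first 24 bytes of an ELF header using the standard
struct module. The function name parse_elf_ident and the returned dict
keys are made up for this example; the field order and the magic number
follow the ELF format itself.

```python
import struct

def parse_elf_ident(data):
    """Decode the first 24 bytes of an ELF header into a readable dict."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = data[4], data[5]
    bits = {1: 32, 2: 64}[ei_class]          # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    order = {1: "<", 2: ">"}[ei_data]        # EI_DATA: 1 = little, 2 = big endian
    # e_type and e_machine (16-bit each) and e_version (32-bit) follow the
    # 16-byte e_ident block, stored in the byte order that e_ident declares.
    e_type, e_machine, e_version = struct.unpack_from(order + "HHI", data, 16)
    return {
        "bits": bits,
        "byte_order": "little" if order == "<" else "big",
        "type": e_type,
        "machine": e_machine,
        "version": e_version,
    }

# Hand-assembled sample: 64-bit, little endian, ET_EXEC (2), EM_X86_64 (62).
sample = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<HHI", 2, 62, 1)
header = parse_elf_ident(sample)
```

Note that the byte order of everything after e_ident depends on a byte
inside e_ident, which is exactly the kind of dependency that should be
explicit in the description of the structure.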

> There are a million of these, all alike, but the two I've used the
> most are msgpack and thrift.  Generally you want to be specifying
> statistical distributions rather than the specific encoding scheme;
> the choice between 4 byte and 8 byte ints is not particularly useful
> at the python programmer level, but knowing whether the number is a
> small int or a large one is.  Thrift is the best implementation of a
> remote call I've used, but it still leaves a lot of room for improvement.

Engineering should start with the task at hand. Statistical distribution
is too high-level a task. It sits above the binary manipulation logic,
whereas choosing the size of one piece of data based on other data is a
cornerstone of all binary format processing.
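
A small sketch of what "size chosen by other data" means in practice:
each chunk carries its own length, and the reader cannot know where the
next chunk starts without decoding it. The chunk layout here (4-byte tag,
32-bit big-endian length, payload) is an assumption for the example, not
a format discussed in this thread.

```python
import struct

def read_chunks(data):
    """Yield (tag, payload) pairs from a buffer of length-prefixed chunks.

    Assumed layout per chunk: 4-byte tag, 32-bit big-endian length, payload.
    """
    header = struct.Struct(">4sI")
    offset = 0
    while offset < len(data):
        tag, length = header.unpack_from(data, offset)
        offset += header.size
        payload = data[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated chunk %r" % tag)
        yield tag, payload
        offset += length

stream = (
    struct.pack(">4sI", b"HEAD", 2) + b"\x01\x02"
    + struct.pack(">4sI", b"DATA", 3) + b"abc"
)
chunks = list(read_chunks(stream))
```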

> If you are actually talking about building something that will talk to
> any existing code in any existing language, then you will need
> something more like CFFI.  However, I don't think you want to take the
> C out of CFFI.  The reason is that the packing of the data is actually
> one of the least important parts of that interface.  As you've read on
> pypy-dev recently, reference counting vs gc is hard to get right.  But
> there are many other problems which you have to address for a truly
> language independent ABI:
>
> Stack convention: what order are things pushed onto the stack (IIRC C
> and pascal grow their heaps in opposite directions in memory).  Are
> things even pushed onto a stack? (LISP stacks are implemented with
> linked lists, some languages don't even bother with stack frames when
> they perform tail recursion removal, using conditional gotos instead)

Right. You will not be able to operate on stack registers, but if you
have a clean interface to construct the contents of a binary stack and
read it back in a friendly form, working with these structures becomes
easier. Maybe one day it will then be possible to write clean code that
calls different libs with different calling conventions from Python.
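
A sketch of the "construct and read back" idea for one specific case,
assuming 32-bit x86 with the cdecl convention and integer arguments only.
The helper names build_cdecl_frame and read_cdecl_frame are hypothetical,
and this only builds the bytes; it does not call anything.

```python
import struct

WORD = struct.Struct("<I")  # one 32-bit little-endian stack slot (assumption)

def build_cdecl_frame(return_address, args):
    """Return the bytes a callee would see at the stack pointer on entry.

    With cdecl on 32-bit x86 the arguments are pushed right to left and the
    return address is pushed last, so memory at the stack pointer reads:
    return address, argument 1, argument 2, ...
    """
    return b"".join(WORD.pack(value) for value in [return_address] + list(args))

def read_cdecl_frame(frame, arg_count):
    """Decode a frame image back into (return_address, [args])."""
    words = [WORD.unpack_from(frame, i * WORD.size)[0] for i in range(arg_count + 1)]
    return words[0], words[1:]

frame = build_cdecl_frame(0x08048000, [1, 2])
decoded = read_cdecl_frame(frame, 2)
```

A different convention would need a different builder, but the readable
round trip is the point.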

> Packing conventions: different cpus like pointers to be even, or
> aligned to 8 bytes.  Your code won't be portable if you don't handle
> this.

It is good to know about that, so this should be exposed, of course.
There is also room for a high-level abstraction once you've covered the
basics.
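
One way to expose alignment rather than hide it is to compute the layout
explicitly. The sketch below applies the usual C-style rule (each field
starts at a multiple of its alignment, total size is rounded up to the
largest alignment). The layout function and the field list are invented
for illustration; real alignments come from the target platform's ABI.

```python
def layout(fields):
    """Compute offsets for (name, size, alignment) fields, C-style.

    Each field starts at the next multiple of its alignment, and the total
    size is rounded up to the largest alignment seen.
    """
    offsets, offset, max_align = {}, 0, 1
    for name, size, align in fields:
        offset = (offset + align - 1) // align * align  # round up to alignment
        offsets[name] = offset
        offset += size
        max_align = max(max_align, align)
    total = (offset + max_align - 1) // max_align * max_align
    return offsets, total

offsets, total = layout([("flag", 1, 1), ("value", 4, 4), ("tag", 2, 2)])
```

The three fields hold 7 bytes of data but occupy 12 with padding, and
that difference is what should be visible to the person reading the
structure definition.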

> Exception handling: There are as many ways to handle exceptions as
> there are compilers, all of them with subtle rules around lifetimes of
> all the objects that are being excepted over.

I would be interested to see this at the binary level, but I doubt such
papers exist. A library that can produce diagrams of such structures
would be a good starting point for grokking the more serious performance
problems that appear when you switch CPUs.

> Virtual methods and the like: In most languages (C is actually
> somewhat unusual here) methods or functions are actually called via a
> dereference or two rather than a direct memory location.  In C++
> virtual methods and Java this looks like a field lookup to find the
> vtable then a jump to a memory location from that table.  This is
> tedious and error prone to implement correctly.

It is tedious and error-prone only if it is not obvious. Some complex
solutions in this world are impossible to avoid, but it is possible to
deal with the complexity.

If a JIT can detect that the vtable is not modified, it can shorten the
call chain. Of course you won't be able to define every kind of
lookup logic in declarative form, but making this code readable
is a step forward. If not for Eve Online, I would never have known
about Stackless and why it is awesome. Eve Online provided a
user-friendly interface that introduced the technology by showing
its power through the medium of game design. I think that the
power of alternative low-level manipulation within the
"data + processor" paradigm can be expressed even better in the
discipline of language design, which is what PyPy/RPython is all
about, IMHO.
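
As a toy model of the vtable lookup mentioned above: the object's first
word points to a table, and the call target is read from a slot in that
table. The flat bytearray standing in for memory, the 64-bit
little-endian pointer size, and the function names are all assumptions
for this sketch; real layouts are defined by each compiler's ABI.

```python
import struct

PTR = struct.Struct("<Q")  # 64-bit little-endian pointer (assumption)

def read_ptr(memory, address):
    """Read one pointer-sized value from the fake memory image."""
    return PTR.unpack_from(memory, address)[0]

def resolve_virtual(memory, object_address, slot):
    """Follow object -> vtable -> slot and return the target address."""
    vtable_address = read_ptr(memory, object_address)          # first dereference
    return read_ptr(memory, vtable_address + slot * PTR.size)  # second dereference

# Fake memory: an object at address 0 whose vtable lives at address 16.
memory = bytearray(32)
PTR.pack_into(memory, 0, 16)        # object's vtable pointer
PTR.pack_into(memory, 16, 0x1000)   # slot 0: address of first method
PTR.pack_into(memory, 24, 0x2000)   # slot 1: address of second method

target = resolve_virtual(memory, 0, 1)
```

If the table is known not to change, the two reads collapse into a
constant, which is the shortening of the call chain described above.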

> Generics and types: just working out which function to call is
> difficult once you have C++ templates, performing the correct type
> checking is difficult for Java.  Haskell's internal type specification
> alone is probably larger than all of the CFFI interfaces put together.

These are high-level APIs, and I am very interested in how they map
to the low level. Knowing that would let me design a better CPU for
these languages (or an LPU, FWIW), or at least see where the current
limitations are.

> There are no doubt whole areas I've missed here, but I hope this gives
> you a taste for why language developers love CFFI - it's easy to
> implement, easy enough to use and hides all of these complexities by
> pushing everything through the bottleneck which is the C ABI.
> Language developers would rather make their own language wonderful,
> compiler writers would rather make their own language efficient.

I never argued that CFFI is not loved. But I really think that it is loved
only by people who know C or have had to face it. I do not agree that
language developers should not think about efficiency. As much as I
want to simplify the problem too, in the real world they have to think
about many things, find compromises and trade them off. One of the
reasons I am so pushy about process and tools is that language design
requires more scaffolding and collaborative tooling than the usual project
issue tracker allows. Python 3 is living proof: a better, greater language,
but the adoption rate is very low. I am not discussing the reasons, I am
just stating the fact. I think that designers of executable languages
should think about the low level.

http://www.joelonsoftware.com/articles/fog0000000319.html

> Nobody really cares enough to make supporting other languages directly
> a goal, the biggest and best effort in this regard was .net and it
> also cheated by folding everything through a small bottleneck (albeit
> a more expressive one).  This folding scars the languages around it -
> F# is still less pure than Haskell, J is incompatible with Java.
>
> If you want to tackle something in this area I would encourage you to
> look at the various serialisation tools out there and work out why
> they are not as good as they could be.  This is probably not the right
> mailing list for this discussion though.

I'm interested in seeing a list of user stories and the conflicts between
them: which choice was made in every conflict and (if possible) why. The
problem with modern development is that people face the same
problems over and over again, unable to draw on previous
knowledge. Tools that bring greater visibility into the domain
can significantly help progress. In the context of binary manipulation
I really want to see more interchangeable graphical interfaces and
automated visualization tools built on top of them, without the
overengineering introduced by various international controlling bodies.
By that I mean they complicate things in order to solve some problem no
matter what. I think the approach should be more relaxed: there are
problems that can't be solved easily, but getting to the root of a
problem and playing with it should be easy.
-- 
anatoly t.