[pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans

anatoly techtonik techtonik at gmail.com
Sat Mar 29 12:49:00 CET 2014


I know that the C in CFFI stands for the C way of doing
things, so I hope people won't try to defend that position
and will instead think about what we would do if we had to
re-engineer ABI access from scratch, as an explicit and
obvious binary interface that is easy to debug.


CFFI is not useful for Python programmers and here is why.

The primary reason is that it requires you to know C.
And knowing C requires you to know about OS architecture.
And knowing about OS architecture requires you to know
about ABI, which is:

http://stackoverflow.com/a/3784697

    This is how the compiler builds an application.
    It defines things (but is not limited to):

    How parameters are passed to functions (registers/stack).
    Who cleans parameters from the stack (caller/callee).
    Where the return value is placed for return.
    How exceptions propagate.


The problematic part of it is that you need to think of the
OS ABI in terms of unusual C abstractions, going through
several levels of them. Suppose you know the OS ABI and
you know that you need direct physical memory access to
set the bytes for a certain call in this way:

    0024: 00 00 00 6C 33 33 74 00

How would you do this in Python? The most obvious way
is with a byte string - \x00\x00\x00\x6c\x33\x33\x74\x00 -
but that's not how you prepare the data for the call if, for
example, 00 6C means anything to you.

What is the Python way to convert 00 6C to a convenient
Python data structure and back, and is it Pythonic (user
friendly and intuitive)?

    import struct
    struct.unpack('wtf?', '\x00\x6C')

If you try to look up the magic string in the struct docs:
http://docs.python.org/2/library/struct.html#format-characters
you'll notice that the mapping from the possible
combinations of these 2 bytes to some Python type is
very mystical. First it requires you to choose either "short"
or "unsigned short", but that's not enough for parsing
binary data - you need to figure out the proper
"endianness" and make up a magic string for it. This is
just for two bytes. Imagine a definition for a binary protocol
with variable message size and nested data structures.
You won't be able to understand it by reading Python
code. More than that - Python's struct *by default* uses
platform-specific "endianness" and is implicit about it, so
not only do you have to care about "endianness", you also
have to be an expert to find out which setting is correct
for you.
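
To make the point concrete, here is a minimal sketch of
what the "magic string" ends up looking like, assuming the
two bytes are meant as an unsigned 16-bit integer. The
same bytes give different numbers depending on the byte
order character you pick:

```python
import struct

raw = b'\x00\x6c'

# '>' = big-endian, '<' = little-endian, 'H' = unsigned short (2 bytes)
big = struct.unpack('>H', raw)[0]
little = struct.unpack('<H', raw)[0]

print(big)     # 108   (0x006C)
print(little)  # 27648 (0x6C00)

# Packing with the same format string gives the bytes back.
assert struct.pack('>H', big) == raw
```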

Look at this:

    0024: 00 00 00 6C 33 33 74 00

Where are the "endianness", "alignment" and "size" from this
doc in those bytes?
http://docs.python.org/2/library/struct.html#byte-order-size-and-alignment
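
A short sketch of why the bytes alone do not answer this.
The split into two 32-bit integers and the 'BI' layout are
assumptions made for illustration, not a known layout of
this data:

```python
import struct

data = bytes.fromhex('0000006C33337400')

# The same 8 bytes read as two unsigned 32-bit integers.
as_big = struct.unpack('>II', data)     # (0x0000006C, 0x33337400)
as_little = struct.unpack('<II', data)  # (0x6C000000, 0x00743333)

# Bytes 3..6 are also valid ASCII text.
as_text = data[3:7]                     # b'l33t'

# Size and alignment are equally invisible. With standard sizes
# ('=', '<' or '>') one byte plus one 32-bit int takes 5 bytes.
# In native mode ('@', the default) padding may be inserted, so
# the result depends on the platform.
standard_size = struct.calcsize('=BI')  # 5
native_size = struct.calcsize('@BI')    # platform-dependent
```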

People need to *start* with this base and this concept, and
that's why it is harmful. CFFI proposes to provide a better
interface that skips this complexity by getting back to the
roots and using the C level. That's a pretty nice hack for C
guys, and I am sure it makes them completely happy. But
for the academic side of the PyPy project, for the Python
interpreter and for other projects built on RPython, it is
important to have a tool that allows experimenting with
binary interfaces in a convenient, readable and direct way,
and that makes it easier for humans to understand (by
reading Python code) how Python instructions are
translated by the JIT into binary pieces in computer
memory, pieces that will be processed by the operating
system as a system function call at the ABI level.


But let's not digress, and get back to the point that the
struct module doesn't let you work with structured data. In
Python the only alternative standard way to define a binary
structure is ctypes.

The ctypes documentation is no better for a binary guy:
http://docs.python.org/2/library/ctypes.html#fundamental-data-types

See how that binary guy suffered to map binary data to
Python structures through ctypes:
https://bitbucket.org/techtonik/discovery/src/eacd864e6542f14039c9b31eecf94302f3ef99ec/graphics/gfxtablet/gfxtablet.py?at=default

And I am saying that this is the best way available in the
standard library. It is pretty close to Django models, but
for binary data. ctypes is still worse than struct in one
thing - looking at the docs, there are no size specifiers for
any kind of C type, so there is no guarantee that 2 bytes
are not read as 4 bytes, or worse. By looking at ctypes
code it is hard to figure out the size of a structure and
when it may change.
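
For what it is worth, ctypes does ship fixed-width aliases
(c_uint8, c_uint16, c_uint32, ...) and sizeof(), which make
the size visible in code if you know to look for them. A
minimal sketch, where the Header layout and its field
names are hypothetical and chosen only to match the 8
bytes from the hex dump earlier in this message:

```python
import ctypes

class Header(ctypes.BigEndianStructure):
    # Hypothetical layout: every field has an explicit width.
    _fields_ = [
        ("reserved", ctypes.c_uint16),  # bytes 0-1
        ("length", ctypes.c_uint16),    # bytes 2-3 (00 6C)
        ("payload", ctypes.c_uint32),   # bytes 4-7
    ]

data = bytes.fromhex('0000006C33337400')
header = Header.from_buffer_copy(data)

print(ctypes.sizeof(Header))  # 8
print(header.length)          # 108
print(hex(header.payload))    # 0x33337400
```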

I can hardly call the ctypes mapping process user
friendly or the resulting code intuitive. Probably nobody
could, and that's why CFFI was born.


But CFFI took a different route - instead of trying to map
C types to binary data (ABI level), it decided to go to the
API level. While it exposes many better tools, it basically
means you are dealing with a C interface again - not with
a Pythonic interface for binary data.


I am not saying that CFFI is bad - I am saying that it is
good, but not enough, and that it can be fixed with a
cleanroom engineering approach aimed at a broader scope
of modern usage patterns for binary data than just calling
the OS API in the C way.

Why do we need it? I frankly think that the Stackless way
of doing things without a C stack is the future, and the
problem is that not many people can see how it works or
build an alternative system without the classic C stack in
(R)Python. Can CFFI help with this? I doubt that.


So, what am I proposing? Just an idea. Given the fact that
I am mentally incapable of filling in the 100 sheets required
to get funding under H2020, the fact that no existing
commercial body is likely to be interested in supporting the
development as an open source project, and the fact that
hacking on it alone might become boring, giving this idea
is the least I can do.


Cleanroom engineering.
http://en.wikipedia.org/wiki/Cleanroom_software_engineering
"The focus of the Cleanroom process is on defect prevention,
rather than defect removal."

When we talk about the Pythonic way of doing things, how
can we define "a defect"? Basically, we are talking about
user experience - the emotions that a user experiences
when he uses Python for the given task. What is the task at
hand? For me it is working with binary data in Python - not
just parsing save games, but creating binary commands
such as OS system calls that are executed by a certain
CPU, GPU or whatever is on the receiving end of whatever
communication interface is used. This is a hardware
independent and platform neutral way of doing things.

So, UX is the key, but the properties of an engineered
product are not limited to a single task. The cleanroom
approach allows us to concentrate on the defect - the point
where user experience starts to suffer because of conflicts
between the tasks that users are trying to accomplish.

For the PyPy project I see the value of a library for
composing binary structures in that these operations can
be pipelined and optimized at run-time in a highly effective
fashion.
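
To give a rough feeling for what such a library could look
like from the user side, here is a minimal sketch. Field,
parse() and build() are hypothetical names made up for this
illustration, not an existing library, and the layout is an
assumed one for the 8 bytes from the hex dump earlier in
this message:

```python
from collections import namedtuple

# Hypothetical helper: every field states its name, its size in
# bytes and its byte order explicitly, with no format characters.
Field = namedtuple("Field", "name size byteorder")

def parse(fields, data):
    """Read consecutive unsigned integers from data, one per field."""
    result, offset = {}, 0
    for field in fields:
        chunk = data[offset:offset + field.size]
        result[field.name] = int.from_bytes(chunk, field.byteorder)
        offset += field.size
    return result

def build(fields, values):
    """Inverse of parse(): turn named integers back into bytes."""
    return b"".join(
        values[f.name].to_bytes(f.size, f.byteorder) for f in fields
    )

# Assumed layout, for illustration only.
LAYOUT = [
    Field("reserved", 2, "big"),
    Field("length", 2, "big"),
    Field("payload", 4, "big"),
]

data = bytes.fromhex("0000006C33337400")
message = parse(LAYOUT, data)
print(message["length"])  # 108
```

The layout is plain data, so it can be read, printed and
composed like any other Python list.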

I think that a convenient binary tool is the missing brick in
the foundation of the academic PyPy infrastructure, one
that would enable universal interoperability between
(R)Python and other digital systems by providing a direct
interface to the binary world. I think that the 1973 views on
"high level" and "low level" systems are a little bit outdated
now that we have Python, Ruby, Erlang, etc. Now C is just
not a very good intermediary for "low level" access. But
frankly, I do not think that, with the advent of networking,
binary can be called low level anymore. It is just another
data format that can be as readable for humans as a
program structure written in Python.


P.S. I have some design ideas on how to make an attractive
gameplay out of binary data by "coloring" regions and
adding "multi-level context" to hex dumps. This falls out of
the scope of this issue, and requires more drawing than
texting, but if somebody wants to help me with sharing the
vision - I would not object. It will help to make the binary
world more accessible, especially for new people, who
start coding with JavaScript and Python.
-- 
anatoly t.

