[Cython] RFC: an inline_ function that dumps c/c++ code to the code emitter

Mon Aug 22 15:26:35 EDT 2016

On Sun, Aug 21, 2016 at 7:26 AM, Jason Newton <nevion at gmail.com> wrote:
>
> On Sun, Aug 21, 2016 at 5:30 AM, Robert Bradshaw <robertwb at gmail.com> wrote:
>>
>> In my experience Cython has generally been fairly easy to pick up for
>> people who already know Python. And Python is often easy to pick up for
>> people who already know C/C++. Of course for many wrappings it often
>> takes non-trivial knowledge of the wrapped library itself too, but
>> typically at the same level as would be required to grok code written
>> against that same library directly from C/C++.
>
> Is your experience drawn from binding moderately complex libraries or play
> code (of complexity like from a tutorial)?  Bottom up or top down?  Sorry if
> this sounds asinine to you.

A whole variety: from simple string manipulation libraries to complex
ML and big data processing frameworks. And second hand with complete
amateurs and students to senior software engineers.

> To clarify, I'm coming from the case where I
> didn't read the whole tutorial/docs before being faced with pyx in the
> projects I previously mentioned, while on tight turn around time - I was not
> able to grok in that context.

Understood--I am short on time too. Though your self-admitted lack of
familiarity doesn't bolster your argument that Cython's doing it
wrong and that we need this new feature.

> For myself and any other ML user who comes across this thread - can you list
> a few libraries that do things the right way?
>
>> Yes, there's bad code out there in any language (no offense meant
>> towards h5py--I haven't looked at that project myself). Much of it due
>> to cargo-cult perpetuations or archaic (or simply flat-out-wrong)
>> contortions due to historical limitations (e.g. creating a Python
>> module to wrap an _extension module, avoiding all C++ features with
>> extensive C wrappers, ...). (You're familiar with C++, so likely no
>> stranger to this effect.)
>>
>> > These projects complied with Cython's current philosophy
>> > to the degradation of clarity, context, and overall idea of how code was
>> > hooked up.  Perhaps Cython should take the lessons learned from its
>> > inception, time, and the results of the state of the c-python userbase to
>> > guide us into a new philosophy.
>>
>> I fail to see how "staying close to Python" caused "degradation of
>> clarity, context, etc." If anything, the lessons learned over time
>> have validated this philosophy. More on this later.
>
>
> My point was that multifile multi-level wrapper that I mentioned earlier -
> if you're saying that those projects did Cython extensions wrong, then I'm
> incorrect at faulting Cython and should fault the libraries using it.  I
> didn't say staying close to python caused $blurb.
>
> I don't know in a situation as confusing as to all the binding projects if
> this should be taken as validation of philosophy either - I think it is
> reasonable to consider the attrition of these projects as a function of
> manpower, number of early on project supporters/authors, and if a project
> (like sage) indirectly, through dependency,  kept the project alive.  And
> good old fashioned luck.  I noted most of them don't use distutils and
> something custom but less capable instead which maybe plays a role in how
> mature/usable/smalltime they were/are.

Certainly the success of a project depends on many external factors,
and even raw luck plays a part. But the approach and philosophy taken
to attack a problem and guard its API (and the users/contributors that
such decisions attract or repel) can't be discounted either,
especially when taken over a long timeframe.

But if anyone wants to believe that Cython's become popular because of
pure luck despite wrongheaded guiding principles or philosophies,
it'll be difficult to persuade them otherwise.

>> I agree that any effort to parse C++ without building on
>> an actual compiler is fraught with danger. That's not the case with
>> generating C++ code, which is the direction we're going. In
>> particular, our goal is to understand C++ enough to invoke it, which
>> allows us to be much less pedantic.
>
> I understand and agree with the logic in stating it's a less complicated
> goal but what comparable success stories exist?  I strongly think "devils in
> the details" in correctly making that work and that they will be tough
> solvable problems.  And then you're going and promising on unfamiliar
> territory.  But what the ultimate takeaway for me is that you won't have it
> ready in any near term.  Do you have the skills and resources to implement
> this in under 2 years?  And then the other question is are you and the team
> reasonably confident you will have it working and usable by then.  Otherwise
> you are not being pragmatic.
>
> On the other hand, if it was reasonably simple as many of your other points
> in future emails point out, I'd really like to know why you hadn't addressed
> them earlier.

Other higher priority items for limited resources. And non-type
template args are not necessarily that simple given the way things are
structured now.

>> >> > The idea is that Cython glue makes the playing field for extracting data
>> >> > easy, but that once it's extracted to a cdef variable for instance, cython
>> >> > doesn't need to know what happens.  Maybe in a way sort of like the GCC asm
>> >> > extension.  Hopefully simpler variable passing though.
>> >>
>> >> Cython uses "mangled" names (e.g. with a __pyx prefix) to avoid any
>> >> possible conflicts. Specifying what/how to mangle could get as ugly as
>> >> GCC's asm variable passing. And embedded variable declarations, let
>> >> alone control flow statements (especially return, break, ...) could
>> >> get really messy. It obscures analysis Cython can do on the code, such
>> >> as whether variables are used or what values they may take. Little
>> >> code snippets are not always local either, e.g. they often need to
>> >> refer to variables (or headers) referenced elsewhere. And they must
>> >> all be mutually compatible.
>> >
>> > Like gcc's asm, let's let adults do what they want and let them worry about
>> > the consequences of flow control/stray includes. I'm not even sure how most
>> > of this would be an issue (switch/break/if) if you are properly nesting pyxd
>> > output.  The only thing I think is an issue here is mangled names.  I
>> > haven't yet figured out why (cdef) variable names must be mangled.  Can you
>> > explain?  Maybe we add an option to allow it to be unmangled in their
>> > declaration? C++ has extern "C" for example.
>>
>> Name mangling is done for the standard reasons--to avoid possible
>> conflicts with all other symbols that may be defined. E.g. We don't
>> want things to suddenly break if I happen to create a variable called
>> "Py_None." Or "__pyx_something_we_defined_implicitly." And of course we
>> want to mangle globals, function names, etc. lest they conflict with
>> some otherwise irrelevant symbol defined in some (possibly
>> recursively) included header somewhere.
>>
>> Again, you could just say "Don't name things like that." This exposes
>> some more guiding principles. (1) If it's valid Python, it should be
>> valid Cython and (2) we always try to produce valid C code--if you
>> haven't lied to us (too much) about your external declarations, a
>> successful Cython compilation results in a valid C/C++ output. Also
>> (3) you shouldn't have to read or understand the generated C and the
>> Python/C API to use, let alone debug, Cython (though you're happy to
>> do so if you want, like Java developers sometimes read bytecodes, but
>> not usually, though understanding implementation can sometimes be
>> helpful when chasing performance (for all languages)).
>>
>> There's an obvious tension between giving users all the rope they want
>> vs. providing an API that is possibly more restrictive, but inherently
>> correct by construction. I'll concede that Cython necessarily has
>> pointers, so I'll give that there's plenty of room for foot-shooting
>> (and better interfacing with modern C++ would be good help there), but
>> the kind of errors one runs into by injecting arbitrary code snippets
>> take things to a whole new level (and specifically violate (3) when
>> developing and debugging).
>
> I think injecting arbitrary code snippets has a reasonably good probability of
> not breaking 3 in your above, provided we have a way to get at unmangled
> identifiers (*or* document and stick to the mangling strategy, assuming it's
> easy) - that or we scan the snippet code and replace identifiers
> (significantly more complex, instincts make me think fragile for a while
> until it's gotten right - esp without LLVM).  Perhaps syntax errors would be
> an issue if you're just coding things up... but again there's the opt-in to
> this construct and we could make life easier by annotating the snippet in
> the output - to help localize the user.

You're missing the point of (3). The fact that we're generating C
code, and not Fortran or assembly directly, should mostly be an
implementation detail. It's not realistic to embed snippets without
caring about the surrounding context in all but the simplest of cases.
And there'd be feature creep here--you're in the middle of a C snippet
and want to report an error, or access a Python object, or ...
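
To make the mangling concern from earlier in the thread concrete, here is a minimal Python sketch of a toy code generator. The `__pyx_v_` prefix and the helper names (`mangle`, `emit_declarations`, `unresolved_names`) are assumptions made up for illustration, not Cython's actual internals. The only point is that a raw snippet written against user-level names no longer lines up with the declarations that get emitted.

```python
import re

# Hypothetical prefix, for illustration only; the real scheme is an
# implementation detail of the code generator.
PREFIX = "__pyx_v_"


def mangle(name):
    """Return the C identifier a toy code generator would emit for `name`."""
    return PREFIX + name


def emit_declarations(variables):
    """Emit C declarations for a dict of {user_level_name: c_type}."""
    return [f"{ctype} {mangle(name)};" for name, ctype in variables.items()]


def unresolved_names(snippet, variables):
    """List user-level names that a raw C snippet mentions but that no
    longer exist under that spelling once declarations are mangled."""
    tokens = set(re.findall(r"[A-Za-z_]\w*", snippet))
    return sorted(tokens & set(variables))


# A user is free to pick a name that collides with a C-level symbol;
# mangling keeps the emitted declaration distinct from it.
variables = {"count": "int", "Py_None": "double"}
decls = emit_declarations(variables)

# A raw snippet pasted into the output refers to `count`, which the
# generated C file never declares under that name.
snippet = "count += 1;"
missing = unresolved_names(snippet, variables)

print(decls)
print(missing)
```

Either the snippet author has to know the mangling scheme, or the generator has to rewrite identifiers inside the snippet; both options are the ones discussed above.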

>> The escape hatch is to wrap the C++ in an actual C++ file and invoke
>> the wrapping. Typically this is the minority of one's code--if it's
>> the "whole library" then you probably have an API that's only
>> understandable to someone well versed in C++ anyways. You've given a
>> single example (non-type template arguments) that we would like to
>> support that's blocking you.
>
> My lack of examples is due to insufficient time playing with Cython - I hit
> nonstarters so I stop and abandon; as I said, to date, Cython has never been
> able to solve my C++ problems and none of them seem extraordinary.  I think
> you've got a lot more unknown-unknowns here than you give credit to but we
> can't discover that until you at least fix that template bug (properly).

It would be helpful for you to enumerate the (implied many)
nonstarters and problems you've had, to at least get the
known-unknowns out on the table.

> I'm still not looking forward to forward declaring every
> identifier/function/whatever from C++ land in Cython though

This is largely a separate issue, and I agree it's a big pain point.

> and I still
> strongly dislike that there's no single source way of doing Cython with
> something like a kernel/ufunc that needs to escape to C/C++.  This makes
> doing something like the mako based templates I mentioned in the OP email
> much more cumbersome/hard and Cython would provide no built in mechanism
> (like inline/inline_module) for making that work.

Is the issue here that you want to inline C code snippets into your
inlined Cython code snippet?

>> > Why is allowing arbitrary code inside not a good idea?  We're not talking
>> > something necessarily like eval here and the reputation it got,
>>
>> Actually, I think it's a whole lot like eval. It's taking an opaque
>> (string) chunk of data and executing it as code. But potentially worse
>> as it's in a different language and evaluated in a transformed (even
>> if the names were unmangled) context.
>>
>> If we were to go this direction, I might go with a function call (like
>> weave, maybe even follow it) rather than a new statement as the latter
>> is difficult to extend with the myriad of optional configuration
>> parameters, etc. that would beg to follow.
>
> My point with eval is its bad reputation was mostly due to security
> vulnerabilities and it joined the league of evil like goto and other tools
> that are great in the right hands and times.  The string is known and fixed
> at pyx body cython compile time which is one of the things that made me
> think it had to be a statement rather than a function.  But again, I'm not a
> cython developer.  If you think an inlining function that takes in all the
> args it needs to inline successfully is the mechanism that achieves the
> effect on pyx compile, I don't think I'd mind.   It sure is a privileged and
> weird function though, to be able to emit code and not be a runtime
> statement.

My intuition is that the "interface" would need more complicated
specification which will grow over time--weave.inline itself takes 27
parameters.
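
One way to see the analysis problem mentioned earlier (whether variables are used) is a small Python sketch built on the standard `ast` module. `inline_c` is a hypothetical name borrowed from the proposal in this thread, and nothing here reflects how Cython itself analyses code; it only shows that names referenced inside an opaque string are invisible to a tool that walks the syntax tree.

```python
import ast


def names_read(source):
    """Collect every variable name that is read (loaded) in `source`."""
    tree = ast.parse(source)
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }


# The body is ordinary code, so the reads of `a` and `b` are visible.
structured = """
def f(a, b):
    return a + b
"""

# The body is an opaque string handed to a hypothetical inline_c();
# the source is only parsed here, never executed.
embedded = """
def f(a, b):
    return inline_c("return a + b;")
"""

print(sorted(names_read(structured)))
print(sorted(names_read(embedded)))
```

In the second function the analysis sees only a call to `inline_c`; that `a` and `b` are used is hidden inside the string, so they look unused.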

> 2 Examples I wanted to pull out real quick are:
>
> https://github.com/scipy/weave/blob/master/examples/wx_example.py
> https://github.com/scipy/weave/blob/master/examples/binary_search.py
>
> But I think I detect magic going on here for adding includes in the wx
> example and that is not usable/reliable approach.  The cython.inline
> implementation also did include magic for numpy variables... Just a warning.
> I don't think weave failed because of Python 3 support, I think it was
> because it was too limited to be useful because of that magic and the walls
> around getting something say like Eigen in, so nobody used it.
>
>>
>> > You must realize that almost any other python driven way to compile c-code
>> > in the spirit these projects do is deprecated/dead.  Cython has absorbed all
>> > the reputation and users that didn't go to pure-c/boost.python - pybind11 is
>> > the new kid on the block there so I'm not including it (I'm of the opinion
>> > that SWIG users stayed unchanged).  Community belief/QA/designers/google all
>> > think of Cython first.  Weave has effectively closed up its doors and I'm
>> > not even sure it had the power to do what I wanted anyway because Cython
>> > provides a language that eases the data-extraction/typecasting part of
>> > inlining C/C++.
>>
>> You seem to be repeatedly bringing up the points[:]
>>
>> * Many (most?) of these string-based approaches are essentially dead,
>> often pointing people to Cython instead, but
>> * Cython should adopt the string-embedding approach of these earlier
>> projects.
>
> Hoho - zing!  No that is not a conclusion you should be drawing.   Your
> faults here are to imply those projects failed because they used
> string-embedding approaches and to imply string-embedding based approaches
> are the approaches that failed - *most* have failed over a variety of
> implementations both Python driven and not.  I restricted to python-driven
> for the sake of brevity and mentioned the selection. As I tried to hint
> earlier, the several other projects' failure happened because of any number
> of unrelated reasons to it being a string based approach.  I believe
> additionally that there were too many options (confusion) and too small a
> potential userbase (at least in those years) to bolster and attract blood to
> each of the projects and make them thrive.  Probably something akin to the
> ton of orphaned projects on PyPI, it doesn't mean it happened because the
> approach was wrong.  One thing I did want you to take away is that Cython
> needs to absorb the responsibility of its reputation and status - the last
> survivor of a somewhat diverse class with different capabilities, if you
> will, that went outside of your original usage.

I'm not saying that these projects died because of the string-based
approach they took, rather that if this "embed strings" approach were
so critical, so superior, it should at least have kept one of them alive.
Or a new project could have formed around this approach (e.g. letting
all the executable code be C++, with Python syntax for the structure,
could be an interesting point in the design space). It has its pros
and cons. I think for Cython it's the wrong direction. But I'm in
favor of letting many flowers bloom--and we're in luck that these are
all open source to boot.

>> You ask at the beginning of the email whether time has vindicated our
>> philosophy. I think, based on the mindshare vs. these other attempts
>> at integrating with C, in large part it has. It has served us and our
>> users well; we will strive to stay close to Python.
>>
>> Tight interleaving of multiple languages is cute for making a
>> polyglot script, but I do not think it leads to legible code. An
>> "eval_cpp" operator would be a lot like the builtin eval--it'd be
>> really tempting to do the "quick and easy" hack of dropping in some
>> executable string instead of thinking how to structure things such
>> that that could be avoided, but putting in this effort leads to more
>> comprehensible code.
>
> It's served your *C* and faster-python users well.  If you had proper
> constructs, I'm sure people wouldn't choose to do it with inline_c unless
> there was a compelling solid reasoning.
>
> I'm not saying you can't make those constructs and that users wouldn't use
> them when they appear, but you are not being pragmatic again.  You currently
> don't have all capabilities and are on risky turf for supporting all c++
> standards for the rest of time.  I'm betting against that you will produce
> in a useful timeframe (maybe this is a 5 year scale?) the usable constructs
> needed - which I equate as turning your head away and giving the middle
> finger to C++ developers.  inline_c would allow a forwards compatible way to
> use anything the target c++ compiler allows with some very minimal
> guarantees on Cython's side.  It is a very elegant and capable solution to a
> hard problem.
>
>> It's hard to say "no" to features, but I think such an introduction
>> would fundamentally change Cython and how it's written for the worse.
>
> I agree with the statement but I don't think you've classified the feature
> correctly.

My conclusion is based on thinking a lot about where this feature
would lead us, not just the immediate possibilities it would open up.

> I watched one of your old talks for Sage Days 29 and at the bottom of a
> slide you have "Cython is a very pragmatic project, driven by user needs".
> I'm calling foul.  Go watch that video again and tell me what's changed
> since 2011.

Embedding code is not a need, it's a means to an end. This is why I've
been asking for concrete features that are the most important to try
to support.

- Robert