
Hi Pypy devs! I am a graduate student working on mining software repositories, specifically mining programming patterns (or rules) from code bases.[1] I would like to do some work mining patterns from Python programs and I could jump start that work if I could generate dependence graphs from python source files (or from the bytecode). So I was wondering if Pypy does this already. Actually it would be a big help even if pypy only generated a control flow graph. From my quick skim of some of the source code it seems like it does this for RPython but I wasn't sure about regular python. If I could get a CFG I could then annotate it with the data dependence edges. Any pointers to where I should look in the Pypy code would be appreciated! Additionally, if you feel having python pDG's would be helpful in Pypy's interpreter perhaps any work I do towards that end could be a starting point. Thanks for the excellent work on Pypy! [1] Here is some of my research group's previous work on pattern extraction http://doi.wiley.com/10.1002/smr.532 (if you can't get the paper and would like it let me know) Cheers --- Tim Henderson mail me: tim.tadh@gmail.com mail me: timothy.henderson@case.edu github: github.com/timtadh

On Wed, Aug 8, 2012 at 11:12 AM, Tim Henderson <tim.tadh@gmail.com> wrote:
Hi Pypy devs!
I am a graduate student working on mining software repositories, specifically mining programming patterns (or rules) from code bases.[1] I would like to do some work mining patterns from Python programs and I could jump start that work if I could generate dependence graphs from python source files (or from the bytecode).
So I was wondering if Pypy does this already. Actually it would be a big help even if pypy only generated a control flow graph. From my quick skim of some of the source code it seems like it does this for RPython but I wasn't sure about regular python. If I could get a CFG I could then annotate it with the data dependence edges.
Any pointers to where I should look in the Pypy code would be appreciated! Additionally, if you feel having python pDG's would be helpful in Pypy's interpreter perhaps any work I do towards that end could be a starting point.
Thanks for the excellent work on Pypy!
[1] Here is some of my research group's previous work on pattern extraction http://doi.wiley.com/10.1002/smr.532 (if you can't get the paper and would like it let me know)
Cheers --- Tim Henderson mail me: tim.tadh@gmail.com mail me: timothy.henderson@case.edu github: github.com/timtadh
_______________________________________________ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev
You're correct, we only generate full CFGs for RPython. The Python bytecode compiler for PyPy (and CPython) does have a CFG, but they're just local to functions so I don't think they're what you're looking for. Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire) "The people's good is the highest law." -- Cicero

Hi Tim, An additional remark: if you're looking for a tool that is able to extract a complete call graph from a random Python program, then it's impossible. Only approximations can be done, like e.g. done in pylint, I believe. Such tools are pointless in PyPy but could be useful in other projects. For example I can imagine a tool that would statically compile some version of a random Python program, including all necessary guards to check at run-time that the assumptions made are correct. When these guards fail, it would fall back to a regular interpreter. The difference with PyPy is that the latter uses a tracing JIT compiler to do at run-time (partially) the same job as I describe above. It does it by observation of the run-time behavior, which is a rather better estimator of the program's general behavior than a complex static analysis. And it dispenses us from writing any such analysis. This gives us a result that is not specific to a particular version of Python (or to Python at all). A bientôt, Armin.

On Wed, Aug 8, 2012 at 3:05 PM, Armin Rigo <arigo@tunes.org> wrote:
Hi Tim,
An additional remark: if you're looking for a tool that is able to extract a complete call graph from a random Python program, then it's impossible. Only approximations can be done, like e.g. done in pylint, I believe.
Such tools are pointless in PyPy but could be useful in other projects. For example I can imagine a tool that would statically compile some version of a random Python program, including all necessary guards to check at run-time that the assumptions made are correct. When these guards fail, it would fall back to a regular interpreter.
The difference with PyPy is that the latter uses a tracing JIT compiler to do at run-time (partially) the same job as I describe above. It does it by observation of the run-time behavior, which is a rather better estimator of the program's general behavior than a complex static analysis. And it dispenses us from writing any such analysis. This gives us a result that is not specific to a particular version of Python (or to Python at all).
A bientôt,
Armin.
Hi Armin, Thanks for the lengthy reply. Fortunately, I don't need a complete call graph. To get started I really only need procedure level dependency graphs (with control and data-dependencies). This graph is essentially a CFG with extra edges for different types of data dependencies. Thanks to Alex's previous reply I am taking a look at the CPython bytecode compiler to see if I can extract the local CFG's it produces. I could see a static approach potentially beating the dynamic approach prior to the jit getting *hot* but after that it seems like the dynamic approach would win in the long run. Luckily, I don't have to worry about such things as I have no intention of writing a compiler (well for python anyway you guys seem to have that under control), my main interest lies in studying the evolution of the structure of the programs themselves. Thanks for the pointers! -Tim

On 09/08/2012, Tim Henderson <tim.tadh@gmail.com> wrote:
Fortunately, I don't need a complete call graph. To get started I really only need procedure level dependency graphs (with control and data-dependencies). This graph is essentially a CFG with extra edges for different types of data dependencies. Thanks to Alex's previous reply I am taking a look at the CPython bytecode compiler to see if I can extract the local CFG's it produces.
Although pypy's own compiler creates a block structure and allows you to do post-order traversal of the blocks, I've found the easiest way to get the CFG of a single code object to be to use the logic found within the dis module. Here's an example of some old code that does that, somewhat stripped down and provided without tests I'm afraid because I kind of got bored of it. It's slightly more involved than dis.findlabels, part of a large piece of nonsense for deriving dataflow equations from python at app-level. https://gist.github.com/3300866 Functions OP_FN are not hard to specify but I've got them intermingled with other, much more complicated code here, they use the EffectContext to store the residual behaviour. Stack depth is completely statically determinable, and in 2.5 or later, it's determinable without looking at other code objects. I think other people have managed to do this, by the way - I have the vague recollection that Prof. Yanhong Liu at Stony Brook wrote a tool to extract datalog facts from python, although with hand-coded flow for built-ins; if you're only interested in intraprocedural analysis (often next-to-useless in python) I guess that would be fine. But I can't seem to find that, now. -- William Leslie
participants (4)
-
Alex Gaynor
-
Armin Rigo
-
Tim Henderson
-
William ML Leslie