Re: [pypy-dev] Re: psyco in Python was: Minimal Python project
Hi Michael, I'm going to copy this to pypy-dev so that other people can see this discussion.
I just don't see how to implement psyco in Python in a production package without using the C interpreter as a bootstrap! And yes, I suspect that a C language version of psyco will be faster than psyco in Python.
Even after you've run psyco over itself?
Yes, maybe even then. C compilers should do better _on the same algorithm_ on hand-written code than can any GIT.
Weeeeellll, this is the thing that gets me jumping-up-and-down-excited about things like psyco: pysco has MORE information than the C compiler to work with: while the C compiler only knows the algorithm to run, psyco knows both the algorithm *and the data it is to run on*. Obviously, this approach only wins if what you do next resembles what you did last, but I think CPU caches have proved there is some mileage in that idea :)
I don't think CPU caches have much bearing on this discussion. Caches speed up frequently used values. The problems here are not the same. The more I study psyco, the more I am convinced that there is _no way_ that psyco will ever produce better code than a C compiler. In brief, my reasons are as follows: 1. psyco has, in fact, no better information than does a C compiler. Remember that the "values" that psyco specializes upon are in fact _interpreter_ vars, so knowing those values is equivalent to knowing the types of operands. This does indeed allow psyco to emit machine code that is better than the equivalent C interp generic code. In fact, though, most of what psyco is doing is quite routine: replacing bytecode by assembly language. This will speed up the code a _lot_, but it is doing just like a C compiler does. 2. I have looked again at the code generators of the optimizing C compiler I wrote about 10 years ago. I can find no instance where knowing the value of an operand, rather than just its type, would produce better code. Yes, my compiler did constant folding (which eliminates code rather than improves it), but I do not believe doing constant folding at runtime is going to improve matters much, and is in fact very likely to slow down the running times of typical programs, for reasons given below. 3. It's easy to get seduced by the "psyco has more knowledge" mantra. In fact, one mustn't lose track of the obstacles psyco faces in outperforming C compilers. First, and very important, everything that psyco does must be done at run time. This means that even very effective optimizations like peephole optimizations become dubious for psyco. Second, and also very important, is that fact that psyco (or rather the code that psyco produces) must determine whether the code that psyco has already produced is suitable for the task at hand. The "Structure of Psyco" paper says this: "Dispatching is the set of techniques used to store a trace of the compiled code buffers and _find if one already matches the current compiler state_" (emphasis mine) In other words, it's not enough to specialize code! Psyco must determine whether the types used to compile a function match the types presently given to psyco. This searching is done in the "compatible" routine, and no matter how it is done is _pure overhead_ compared to the code generated by a C compiler. As I see it, both these issues are "gotchas". Neither can be eliminated completely, and each introduces overhead that will make the code emitted by psyco clearly inferior to the code produced by any decent C compiler. In particular, the more psyco specializes on runtime values of operands the bigger the code bloat and the more severe the problem of finding out what already-compiled code to use. 4. I have seen _nothing_ in psyco's algorithms or the code emitted by psyco that leads me to believe that psyco is "doing enough work" to outperform a C compiler. When I was a mathematics graduate student 30 years ago one of my professors made the remark that a mathematician need to develop a feel for the level of work required to do a new proof. Deep insights are not going to come from elementary methods, and there is no use in using deep methods to prove elementary results. I believe the same kind of analysis can (and should!) be done here. Yes, psyco is clever. In some ways it is _very_ clever, especially the idea of modeling the compiler on the structure on the original C interp. Yes, it is easy to seen how psyco can improve the running speed of the present C interpreter: _any_ compiling will do so! But with respect, I dispute the assertion that knowing actual values of operands is going to produce faster code that that produce by a decent C compiler applied to an _equivalent_ algorithm done in C. Please note: it is _precisely_ this extremely strong assertion that must be true for there to be any speedup gained in translating Python's C libraries into Python! We already have C libraries that work very well. I would be _very_ leery of messing with them... 5. Another way to see that psyco isn't really doing much is the following. Basically, psyco emits calls to the same runtime helpers that the C interpreter does. Yes, psyco may improve the calling sequences to these helpers, but it does not eliminate calls to those helpers. But a C compiler will never do worse that the code in the helpers, and will often do _much_ better. I challenge this group to provide even one example of Python code, for which psyco will emit better code than that produced by a C compiler on a transliteration of that code into C. I believe no such counter-example will ever be found. I am not much interested in anecdotal reports of Python being faster than C, and I would be happy to study in depth any purported counterexample. If you believe you have a real counterexample, please first make sure that the algorithms appear identical. Second, please try to make the counterexample as small as possible. If I am wrong, the counter-example should provide a _clear reason_ why all my arguments above are wrong. And wouldn't that be great :-) I believe the arguments given should be given greater weight than the "psyco has more knowledge" mantra. You want to convince me otherwise? Show me the code, or provide a _detailed_ explanation for how my arguments do not apply. Edward P.S. I trust that this group will interpret my remarks as constructive. I believe it is important at the beginning of any project to have proper goals and expectations. Make no mistake: I believe this project has the potential to be extremely valuable, even if my arguments are correct. And I would indeed be happy to be _proved_ wrong. However, I think it a dubious policy to assume that Python can outperform C, for several reasons: 1. There is no need to burden this project with unrealistic expectations. Python doesn't have to beat C for Python to rule the world! :-) 2. I am concerned that assuming that we can do the impossible will skew our efforts. Yes, by all means, experiment using Python! Using C to do experiments would not be too swift :-) But lets not spend a lot of time worrying about bootstrapping until we are _sure_ that we won't be doing the final version in C! 3. There may be another way to speed up Python, namely replace (parts of) the byte code with compiled code. Armin mentioned in his papers that some ways of doing so might produce code bloat. But is that the end of the story? I think not. The same tricks that psyco uses in the "dispatcher" might well be applied to intermix machine code with byte code. I wouldn't want to focus _only_ on the "psyco way (tm)" in this project. Compile-time optimizations deserve at least some considerations, IMO. EKR -------------------------------------------------------------------- Edward K. Ream email: edream@tds.net Leo: Literate Editor with Outlines Leo: http://personalpages.tds.net/~edream/front.html --------------------------------------------------------------------
On Fri, 17 Jan 2003, Edward K. Ream wrote:
Hi Michael,
I'm going to copy this to pypy-dev so that other people can see this discussion.
OK.
I just don't see how to implement psyco in Python in a production package without using the C interpreter as a bootstrap! And yes, I suspect that a C language version of psyco will be faster than psyco in Python.
Even after you've run psyco over itself?
Yes, maybe even then. C compilers should do better _on the same algorithm_ on hand-written code than can any GIT.
Weeeeellll, this is the thing that gets me jumping-up-and-down-excited about things like psyco: pysco has MORE information than the C compiler to work with: while the C compiler only knows the algorithm to run, psyco knows both the algorithm *and the data it is to run on*. Obviously, this approach only wins if what you do next resembles what you did last, but I think CPU caches have proved there is some mileage in that idea :)
I don't think CPU caches have much bearing on this discussion. Caches speed up frequently used values. The problems here are not the same.
That wasn't my point. CPU caches derive their benefit (in part) from the assumption that what a program did recently, it will do again. psyco's specializations derive their benefit from the same assumption. Also, I'm wasn't so much talking about psyco-as-is, more psyco-as-may-be, which is one of the reasons I'm trying to stop posting stuff like this...
The more I study psyco, the more I am convinced that there is _no way_ that psyco will ever produce better code than a C compiler. In brief, my reasons are as follows:
1. psyco has, in fact, no better information than does a C compiler. Remember that the "values" that psyco specializes upon are in fact _interpreter_ vars, so knowing those values is equivalent to knowing the types of operands. This does indeed allow psyco to emit machine code that is better than the equivalent C interp generic code. In fact, though, most of what psyco is doing is quite routine: replacing bytecode by assembly language. This will speed up the code a _lot_, but it is doing just like a C compiler does.
You've clearly spent more time looking at pysco lately than me: I don't understand the first half of this paragraph! Please *don't* feel obliged to educate me, I wouldn't have enough time to do anything with the new knowledge...
2. I have looked again at the code generators of the optimizing C compiler I wrote about 10 years ago. I can find no instance where knowing the value of an operand, rather than just its type, would produce better code.
? Surely in compiling if (x < 10) { ... } else { ... } knowing that x is less than 10 only 5% of the time would help? I'm pretty sure there are compilers which allow you to hint the outcome of any given test, so there must be *some* use for such data.
Yes, my compiler did constant folding (which eliminates code rather than improves it), but I do not believe doing constant folding at runtime is going to improve matters much, and is in fact very likely to slow down the running times of typical programs, for reasons given below.
It seems constant folding isn't much of an issue in Python, any way.
3. It's easy to get seduced by the "psyco has more knowledge" mantra. In fact, one mustn't lose track of the obstacles psyco faces in outperforming C compilers.
I sincerely hope I never gave the idea that this stuff would be easy.
First, and very important, everything that psyco does must be done at run time.
Well, yeah.
This means that even very effective optimizations like peephole optimizations become dubious for psyco. Second, and also very important, is that fact that psyco (or rather the code that psyco produces) must determine whether the code that psyco has already produced is suitable for the task at hand.
This is a more subtle issue. OTOH, the seem to be benefits in this kind of caching from algorithms that people implement IN SILICON. I'm thinking of things like the P4's trace cache here. It staggers me that that thing does enough to help, but I gather it does.
The "Structure of Psyco" paper says this: "Dispatching is the set of techniques used to store a trace of the compiled code buffers and _find if one already matches the current compiler state_" (emphasis mine) In other words, it's not enough to specialize code! Psyco must determine whether the types used to compile a function match the types presently given to psyco. This searching is done in the "compatible" routine, and no matter how it is done is _pure overhead_ compared to the code generated by a C compiler.
Well, hopefully you can hoist these choices of specialization out of any core loops. 80/20 rules and so on. It seems fairly likely that psyco will be of most benefit when the dispatcher is invoked least often, i.e. psyco should work with "fairly" large lumps of code. Though of course there are trade-offs here, as everywhere.
As I see it, both these issues are "gotchas". Neither can be eliminated completely, and each introduces overhead that will make the code emitted by psyco clearly inferior to the code produced by any decent C compiler. In particular, the more psyco specializes on runtime values of operands the bigger the code bloat and the more severe the problem of finding out what already-compiled code to use.
Yes. Fun with the I-cache, too.
4. I have seen _nothing_ in psyco's algorithms or the code emitted by psyco that leads me to believe that psyco is "doing enough work" to outperform a C compiler.
Sorry, OF COURSE it's not doing enough work NOW. If it is from my postings that you have got the suggestion that anyone thinks that pysco can beat a decent C compiler today, I am sincerely sorry, and apologise to you and anyone else who might have gotten the same impression.
When I was a mathematics graduate student 30 years ago one of my professors made the remark that a mathematician need to develop a feel for the level of
Hey, I'm a mathematics graduate student *now*!
work required to do a new proof. Deep insights are not going to come from elementary methods,
Hmm.
and there is no use in using deep methods to prove elementary results.
More hmm. Not sure my disagreements with those statements are relavent here, though. [...]
5. Another way to see that psyco isn't really doing much is the following. Basically, psyco emits calls to the same runtime helpers that the C interpreter does. Yes, psyco may improve the calling sequences to these helpers, but it does not eliminate calls to those helpers.
I thought the goal of this project was to reduce the number of these helpers...
But a C compiler will never do worse that the code in the helpers, and will often do _much_ better.
I challenge this group to provide even one example of Python code, for which psyco will emit better code than that produced by a C compiler on a transliteration of that code into C.
I am confident that none such exists.
I believe no such counter-example will ever be found.
For today's psyco, yes. For tomorrow, who knows? It's not like C is an ideally designed language for writing high performance applications in (here I'm thinking of things like the need for the restrict keyword in C99). [...]
P.S. I trust that this group will interpret my remarks as constructive. I believe it is important at the beginning of any project to have proper goals and expectations. Make no mistake: I believe this project has the potential to be extremely valuable, even if my arguments are correct. And I would indeed be happy to be _proved_ wrong.
I also hope that my comments don't produce undue enthusiasm. We don't have a single line of code yet! [...] Something else, to finish. Do you know of HP's dynamo project? http://www.arstechnica.com/reviews/1q00/dynamo/dynamo-1.html is *fascinating* reading. They were investigating issues of dynamic binary translation and started just by "translating" PA-RISC code to PA-RISC code. Some optimizations later, and they noticed that the "translated" code ran sometimes ran 20% faster than the original! Basically because of trace caches. Cheers, M.
I think we are in fairly complete agreement about all issues, including the need to move on. I thought these issues needed to be addressed at the start, so that the project starts off in the right direction. I've gotten everything off my chest: thanks for listening :-) If Python becomes faster than C, great. If not, the project will still be a success, _provided_ it doesn't waste too much time trying to do the impossible :-) Edward -------------------------------------------------------------------- Edward K. Ream email: edream@tds.net Leo: Literate Editor with Outlines Leo: http://personalpages.tds.net/~edream/front.html --------------------------------------------------------------------
Edward K. Ream wrote: ...
If Python becomes faster than C, great. If not, the project will still be a success, _provided_ it doesn't waste too much time trying to do the impossible :-)
Please note that many of the people on this list make their living from doing the impossible all day. These joint deranged minds will do something incredible. The only impossible candidate so far seems getting you healed from your deadlock. :-) I'm sorry to say that, but you are stuck with the current Psyco implementation, instead of getting the ideas of Armin's claim, which I believe is absolutely true. I carried some of them in my heart for years, but way less consequently than him. The more I'm supporting this, since I know it is true. You want a mathematical proof? I will try it once, and then continue my work. Your's sincerely, and expecting-a-*very*-strong-sprint - ly y'rs - chris -- Christian Tismer :^) <mailto:tismer@tismer.com> Mission Impossible 5oftware : Have a break! Take a ride on Python's Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/ 14109 Berlin : PGP key -> http://wwwkeys.pgp.net/ work +49 30 89 09 53 34 home +49 30 802 86 56 pager +49 173 24 18 776 PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04 whom do you want to sponsor today? http://www.stackless.com/
Michael Hudson wrote:
2. I have looked again at the code generators of the optimizing C compiler I wrote about 10 years ago. I can find no instance where knowing the value of an operand, rather than just its type, would produce better code.
? Surely in compiling
if (x < 10) { ... } else { ... }
And don't forget the obvious replacement of a multiply by some shifts, so multiply by 5 can be replaced by a copy, a right shift 2 places and a add instead of a costly multiply ... --tf
Thomas Fanslau wrote: ...
And don't forget the obvious replacement of a multiply by some shifts, so multiply by 5 can be replaced by a copy, a right shift 2 places and a add instead of a costly multiply ...
Ok, my 2 Eurocent: Also don't forget that this is not always true. There is hardware that recognized such common multipliers and does an optimization in hardware. I once had such a chip for the Forth language. This is not about to say you're wrong! What I try to express is instead, that we need to be aware of very different hardware, where optimization can have very different paths, and some assumptions may drive you very wrong. While your optimization was absolutely worthy on an 8086 processor, it becomes questionable for a Pentium 4, since multiplication by an immediate has been optimized like hell, and it is not sure in the first place whether to use a single operation that does the multiply, or by replacing it by three operations which you propose. There also can be considerations of occupation of execution units, which can make a multiply cheaper, since it can be done in parallel, while something other is still happening. This is all not relevant in the bootstrap phase. But we will finally write optimized compilers. (Dunno if Armin wants to, but I do). What we need is an abstract model of processors, caching, pipelines, parallel execution lines, prefetches, all of that. This is why I looked into MMIX. I no longer think to use MMIX as a first micro engine target, since I want something now, not something perfect. But that MMIX engine has everything that I mentioned and more, it is very close to current hardware development. --- About my position in this project, I'm not completely settled. But regardless of my actual tasks, I will provide two things, as my personal pet projects: - a) The fastest possible interpreted micro engine, optimized for X86 - b) A very good code generator for X86 which produces better code than gcc, MS-VC++ and IBM's C compiler. These will both, of course, completely be done in Python. There will not be a single line edited with a C editor. Instead, I'm writing high-level Python code, which is able to a) emit a suitable C source, and b) emit optimized X86 assembly. This just as a note what I'm ging to contribute, anyway. maybe-I-spent-5-Eurocents-ly y'rs -- chris -- Christian Tismer :^) <mailto:tismer@tismer.com> Mission Impossible 5oftware : Have a break! Take a ride on Python's Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/ 14109 Berlin : PGP key -> http://wwwkeys.pgp.net/ work +49 30 89 09 53 34 home +49 30 802 86 56 pager +49 173 24 18 776 PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04 whom do you want to sponsor today? http://www.stackless.com/
Hello again, On Fri, Jan 17, 2003 at 09:32:40AM -0600, Edward K. Ream wrote:
I challenge this group to provide even one example of Python code, for which psyco will emit better code than that produced by a C compiler on a transliteration of that code into C. I believe no such counter-example will ever be found.
There are tons of other examples of this if one thinks "mini-language interpretation". For example, a program that asks the user to enter a simple mathematical function and plots its graph would be seriously faster in Python+Psyco than in ANSI C. And if you feel that using Python's compile() is cheating, then write a simple parser and interpreter for the user expressions (just as you would in C) and it could run faster than C if Psyco were advanced enough to specialize it correctly. Armin
participants (5)
-
Armin Rigo -
Christian Tismer -
Edward K. Ream -
Michael Hudson -
Thomas Fanslau