Python optimization (was Python's "only one way to do it" philosophy isn't good?)
Diez B. Roggisch
deets at nospam.web.de
Mon Jun 11 09:27:35 CEST 2007
> It's hard to optimize Python code well without global analysis.
> The problem is that you have to make sure that a long list of "wierd
> things", like modifying code or variables via getattr/setattr, aren't
> happening before doing significant optimizations. Without that,
> you're doomed to a slow implementation like CPython.
> ShedSkin, which imposes some restrictions, is on the right track here.
> The __slots__ feature is useful but doesn't go far enough.
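[For readers who haven't used `__slots__`: a minimal illustration of the restriction it imposes, with a made-up `Point` class:]

```python
class Point:
    # __slots__ replaces the per-instance __dict__ with fixed storage,
    # so only the listed attributes can ever be assigned
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
try:
    p.z = 3            # not declared in __slots__
except AttributeError as e:
    print("rejected:", e)
```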
> I'd suggest defining "simpleobject" as the base class, instead of
> "object", which would become a derived class of "simpleobject". Objects
> descended directly from "simpleobject" would have the following restrictions:
> - "getattr" and "setattr" are not available (as with __slots__)
> - All class member variables must be initialized in __init__, or
> in functions called by __init__. The effect is like __slots__,
> but you don't have to explicitly write declarations.
> - Class members are implicitly typed with the type of the first
> thing assigned to them. This is the ShedSkin rule. It might
> be useful to allow assignments like
> self.str = None(string)
> to indicate that a slot holds strings, but currently has the null value.
> - Function members cannot be modified after declaration. Subclassing
> is fine, but replacing a function member via assignment is not.
> This allows inlining of function calls to small functions, which
> is a big win.
> - Private function members (self._foo and self.__foo) really are
> private and are not callable outside the class definition.
> You get the idea. This basically means that "simpleobject" objects have
> roughly the same restrictions as C++ objects, for which heavy compile time
> optimization is possible. Most Python classes already qualify for
> "simpleobject". And this approach doesn't require un-Pythonic stuff like
> declarations or extra "decorators".
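[There is no "simpleobject" in Python; as a rough sketch only, some of the proposed rules -- no new attributes after __init__, and first-assignment typing -- can be approximated with existing machinery. All names here are hypothetical:]

```python
class SimpleObject:
    # Hypothetical approximation of two proposed "simpleobject" rules:
    # attributes may only be created during __init__, and each attribute
    # keeps the type of the first value assigned to it (the ShedSkin rule).
    def __init__(self):
        object.__setattr__(self, "_sealed", False)

    def _seal(self):
        # called at the end of a subclass's __init__
        object.__setattr__(self, "_sealed", True)

    def __setattr__(self, name, value):
        if self._sealed and not hasattr(self, name):
            raise AttributeError("cannot add %r after __init__" % name)
        if hasattr(self, name):
            old = getattr(self, name)
            if old is not None and not isinstance(value, type(old)):
                raise TypeError("%r is typed %s" % (name, type(old).__name__))
        object.__setattr__(self, name, value)

class Vec(SimpleObject):
    def __init__(self, x, y):
        SimpleObject.__init__(self)
        self.x = x
        self.y = y
        self._seal()
```

A real implementation would enforce this at compile time rather than with per-assignment runtime checks, which is the whole point of the proposal.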
> With this, the heavy optimizations are possible. Strength reduction.
> Hoisting common subexpressions out of loops. Hoisting reference count
> updates out of loops. Keeping frequently used variables in registers.
> And elimination of many unnecessary dictionary lookups.
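[The dictionary-lookup elimination in that list is something Python programmers already do by hand today; hoisting a method lookup out of a loop is the classic example:]

```python
def slow(items):
    out = []
    for x in items:
        out.append(x * 2)   # looks up "append" on out on every iteration
    return out

def fast(items):
    out = []
    append = out.append     # lookup hoisted out of the loop by hand
    for x in items:
        append(x * 2)
    return out

data = list(range(1000))
assert slow(data) == fast(data)
```

An optimizer armed with the proposed guarantees could do this hoisting automatically instead of relying on the programmer.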
I won't give you the "prove it by doing it" talk. It's too cheap.
Instead I'd like to say why I don't think that this will buy you much
performance-wise: it's a local optimization only. All it can and will do
is to optimize lookups and storage of attributes - either functions or
values - and calls to methods from within one simpleobject. As long as
expressions stay in their own "soup", things might be ok.
The very moment you mix this with "regular", no-strings-attached python
code, you have to have the full dynamic machinery in place, plus you need
tons of guarding statements in the optimized code to catch such access.
So in the end, I seriously doubt the performance gains are noticeable.
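[The dynamism that forces those guards is easy to demonstrate: any method can be rebound at runtime, so an optimizer that inlined a call would silently produce stale results. Class and method names below are illustrative only:]

```python
class Greeter:
    def greet(self):
        return "hello"

g = Greeter()
print(g.greet())                        # prints "hello"

# Any caller, anywhere, may rebind the method at runtime...
Greeter.greet = lambda self: "goodbye"

# ...so code that inlined the original body of greet() is now wrong.
print(g.greet())                        # prints "goodbye"
```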
Instead I'd rather take the Pyrex road, which can go even further
optimizing with some more declarations. But then I at least know exactly
where the boundaries are. As does the compiler.
> Python could get much, much faster. Right now CPython is said to be 60X slower
> than C. It should be possible to get at least an order of magnitude over that.
Regardless of the possibility of speeding it up - why should one want
this? Coding speed is more important than execution speed in 90%+ of all
cases. The other ones - well, if you _really_ want speed, assembler is
the way to go. I'm serious about that. There is one famous mathematical
library author that does code in assembler - because in the end, it's
all about processor architecture and careful optimization for that. 
The same is true for e.g. the new Cell architecture, or the
altivec-optimized code in Photoshop that still beats the crap out of
Intel processors on PPC-machines.
I'm all for making python faster if it doesn't suffer
functionality-wise. But until there is a proof that something really
speeds up python w/o crippling it, I'm more than skeptical.
To quote the ATLAS documentation on that author's hand-written kernels:
His ev5/ev6 GEMM is used directly by ATLAS if the user answers
"yes" to its use during the configuration procedure on an alpha
processor. This results in a significant speedup over ATLAS's own GEMM
codes, and is the fastest ev5/ev6 implementation we are aware of.