[Python-Dev] Joys of Optimization
raymond.hettinger at verizon.net
Fri Mar 19 00:21:33 EST 2004
> Yay! I'm a big fan of your series of optimizations.
> Great thanks! :)
> By the way, I think there're many junior python hacker wannabes
> like me to challenge objects/compiler/vm optimizations. Can you
> share your roadmap or work queues/plans, list of known bottle-necks
> for them?
Here are the optimizations on my todo list:
1) Objects can define equality tests through rich comparisons or
__cmp__. A few timings suggest that the latter runs 30% to 50% faster.
This may mean that PyObject_RichCompareBool(v, w, Py_EQ) has a potential
for a significant speedup. Python pervasively uses equality tests
(internally to dictionaries and just about everywhere else). So,
efforts to speed up rich comparisons (without changing semantics) would
result in an across-the-board improvement. As a proof of concept, I
replaced all instances of rich equality testing with a new function,
_PyObject_RichCompareBoolEQ(v, w) which tries PyObject_Compare() and if
there is an error, falls back to the regular rich comparison. This
gave the expected speedup and passed most, but not all, of the test
suite. The test failures occurred because the simplistic redefinition
slightly changed semantics when both __cmp__ and rich comparisons were
defined. The correct approach is to investigate rich
comparisons and find the bottleneck. At the root of it all, most
objects implement only one equality test and rich comparisons need to be
able to find it and run it as fast as __cmp__.
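Python 3 has since dropped __cmp__ entirely, but the cost gap described above can still be illustrated: equality on built-ins dispatches entirely in C, while a class defining __eq__ in Python re-enters the interpreter on every test. A rough timing sketch (absolute numbers are machine-dependent):

```python
import timeit

class Point:
    """Equality via a pure-Python rich comparison (__eq__)."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

p, q = Point(1, 2), Point(1, 2)
a, b = 1, 1

# ints compare entirely in C; Points re-enter the eval loop per test
t_int = timeit.timeit("a == b", globals={"a": a, "b": b}, number=100_000)
t_obj = timeit.timeit("p == q", globals={"p": p, "q": q}, number=100_000)
print(f"int equality: {t_int:.4f}s   __eq__ equality: {t_obj:.4f}s")
```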
2) Michael started an investigation into the costs of setting up new
stackframes. Standing on his shoulders, Armin proposed the concept of a
Zombie frame which goes beyond freelisting and takes advantage of the
contents of older, discarded frames. His implementation eliminated
freelists in favor of a single captive frame per code object. The timings
were impressive, but there were some negative consequences -- recursive
calls and multi-threaded apps needed more than one frame per code object
and they suffered miserably in the absence of a freelist -- also, there
was a ton of code that runs only once, which not only fails to benefit
from the zombie frame but also costs memory because the zombie never
gets freed. I
and one other respondent independently suggested keeping the freelist
(which would need to become doubly linked) and using a weak reference
from the code object to see if the matching pre-made frame is still on
the freelist. This would capture all the benefits and incur none of the
detriments. AFAICT, the only potential downside is that so little setup
time is saved by zombie frames that any significant tracking logic would
completely offset the gain. IOW, if you do this, the mechanism needs to
be dirt simple and fast.
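The weak-reference scheme can be sketched in Python as a toy model (the real change would live in C in frameobject.c; the Frame class and the acquire/release names are invented here for illustration):

```python
import weakref
from collections import deque

class Frame:
    """Toy stand-in for a C-level frame object."""
    def __init__(self, code):
        self.code = code

_freelist = deque()   # recycled frames (the real one would be doubly linked)
_premade = {}         # code object -> weakref to its pre-made frame

def release(frame):
    """Return a discarded frame to the freelist, remembering it by code object."""
    _freelist.append(frame)
    _premade[frame.code] = weakref.ref(frame)

def acquire(code):
    """Reuse this code object's own frame if it is still on the freelist."""
    ref = _premade.get(code)
    frame = ref() if ref is not None else None
    if frame is not None and frame in _freelist:
        _freelist.remove(frame)        # O(1) with a doubly linked freelist
        return frame
    if _freelist:
        return _freelist.pop()         # any recycled frame beats a fresh allocation
    return Frame(code)
```

The weakref is the "dirt simple" tracking: if the pre-made frame was already reused or collected, the lookup simply falls through to normal freelist behavior.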
3) I had re-coded several modules in C with the goal of making pure
python more viable for certain high performance apps (for instance,
implementing heapq and bisect in C made them more viable as components
of superfast code written in pure python). If the threading module were
re-coded in C, it would benefit a broad class of multi-threaded
applications, making it more likely that Python becomes the language of
choice for those apps.
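The heapq/bisect point is easy to see in use: both let pure-Python code avoid repeated full sorts, with the hot path running in C.

```python
import bisect
import heapq

# nsmallest avoids a full sort: O(n log k) instead of O(n log n)
data = [27, 3, 99, 14, 8, 56, 41]
print(heapq.nsmallest(3, data))   # [3, 8, 14]

# bisect keeps a list sorted under repeated insertion without re-sorting
scores = [10, 20, 30]
bisect.insort(scores, 25)
print(scores)                     # [10, 20, 25, 30]
```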
4) The peephole optimizer in compile.c is currently limited in scope
because it lacks a basic block checker and the ability to re-insert
opcodes of a different length than the original (which entails moving
all the jump targets and recomputing the line numbering). I've already
written the basic block checker, have a jump retargeter, and have
several peephole optimizations thoroughly tested; however, they can't go
in until the line renumberer is done. For example, the code a,b=b,a
currently runs slower than t=a;a=b;b=t but it could run several times
faster if the BUILD_TUPLE 2 UNPACK_SEQUENCE 2 were replaced by ROT_TWO.
This is an important optimization because the current behavior distorts
the way we write pure Python (people avoid multiple assignment because
it is slower).
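Later CPython releases did adopt exactly this peephole transformation, which dis can confirm (the precise opcode names, e.g. ROT_TWO vs. SWAP, vary by version):

```python
import dis

def swap(a, b):
    a, b = b, a        # in Py2.3: BUILD_TUPLE 2 / UNPACK_SEQUENCE 2
    return a, b

dis.dis(swap)          # modern CPython emits a rotation/swap instead
print(swap(1, 2))      # (2, 1)
```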
5) Python's startup time got much worse in Py2.3. Several people took a
look into it but there were no clear-cut results. This is an important
optimization because there are plenty of short running apps where python
is called to knock out a single basic task. Especially in web apps,
frequent starts and exits can be the norm and Python's startup time can
dominate total resource utilization and response time.
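The effect is easy to measure with nothing but the standard library (a rough sketch; absolute numbers depend heavily on the machine and the installed site packages):

```python
import subprocess
import sys
import time

# Launch a bare interpreter several times; for a trivial script,
# startup cost is essentially the whole runtime.
runs = 5
start = time.perf_counter()
for _ in range(runs):
    subprocess.run([sys.executable, "-c", "pass"], check=True)
elapsed = time.perf_counter() - start
print(f"average startup: {elapsed / runs * 1000:.1f} ms")
```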
6) The concept behind ceval.c's goto_fast_nextopcode is to bypass the
periodic checking for signals and occasional thread switching. Taking
this concept to the limit, those operations could be pulled completely
out of the eval loop and only called by a handful of opcodes such as
CALL_FUNCTION (which can be of arbitrarily long duration) and
JUMP_ABSOLUTE (which appears in any non-recursive repeating code block).
The goal is not to make error checking or thread switching less
frequent; rather, the idea is to increase the number of things that can
be written in pure python without incurring tons of eval loop overhead
(I think it is enough to check the first opcode, every function call, and
every loop iteration). This will increase the viability of writing
algorithms in pure python. Of course, this needs to be handled
carefully, many simple operations have the potential to trigger some
long running operations; however, some of those situations exist today
(e.g. map() can run a huge number of iterations with zero trips around
the eval loop for task switching or signal checking).
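The check frequency has remained a tunable knob: Python 2's sys.setcheckinterval counted opcodes between checks, and Python 3 replaced it with a time-based switch interval for thread switching.

```python
import sys

# Python 3 checks for thread switches on a time basis rather than
# counting opcodes; raising the interval means fewer eval-loop pauses.
default = sys.getswitchinterval()
print("default switch interval:", default, "seconds")

sys.setswitchinterval(0.05)      # switch threads less often
print("raised to:", sys.getswitchinterval())
sys.setswitchinterval(default)   # restore the default
```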
7) There are three PEPs for reducing the overhead associated with
attribute lookup. This, IMO, is the potentially largest remaining
optimization. More importantly, it will improve the way we code,
allowing more free use of method calls, smaller functions, and less
need to cache frequently used functions in local variables (I truly
hate seeing code like: _int=int # speed hack).
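The hack in question looks like this in practice: pre-binding a builtin into a local (here via a default argument) skips the global/builtin lookup on every call, which is exactly what faster attribute and name lookup would make unnecessary (timings are machine-dependent):

```python
import timeit

def convert_global(items):
    # "int" is re-looked-up as a global/builtin on every iteration
    return [int(x) for x in items]

def convert_local(items, _int=int):
    # the hack: pre-bind the builtin into a fast local (default argument)
    return [_int(x) for x in items]

items = ["1", "2", "3"] * 1000
t_g = timeit.timeit(lambda: convert_global(items), number=200)
t_l = timeit.timeit(lambda: convert_local(items), number=200)
print(f"global lookup: {t_g:.4f}s   local binding: {t_l:.4f}s")
```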
Aside from optimizations, there are some modules to be developed:
1) collections.heap(), a high performance Fibonacci heap implemented as
a type and taking sort style arguments (cmp=, key=, reverse=).
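A rough sketch of what such an API might look like, layered on heapq purely for illustration (the Heap class, the _ReverseKey helper, and the method set are all invented here; the proposal was for a C type):

```python
import heapq

class _ReverseKey:
    """Wrap a key so heapq's min-heap behaves as a max-heap."""
    def __init__(self, k):
        self.k = k
    def __lt__(self, other):
        return other.k < self.k

class Heap:
    """Hypothetical collections.heap() sketch with sort-style arguments."""
    def __init__(self, iterable=(), key=None, reverse=False):
        self._key = key if key is not None else (lambda x: x)
        self._reverse = reverse
        self._count = 0            # tie-breaker: payloads are never compared
        self._data = []
        for item in iterable:
            self.push(item)
    def push(self, item):
        k = self._key(item)
        if self._reverse:
            k = _ReverseKey(k)
        heapq.heappush(self._data, (k, self._count, item))
        self._count += 1
    def pop(self):
        return heapq.heappop(self._data)[-1]
    def __len__(self):
        return len(self._data)
```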
2) finish the statistics module (except for the docs, this is basically
done and awaiting collections.heap() to underlie nsmallest() and
3) write a calculator module - done right, this will take a long time, a
lot of research, and require much public discussion (everyone will have
an opinion on this).
4) explore whether the StringIO module would perform better if based on
array objects rather than strings.
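The underlying concern, how to accumulate text without quadratic copying, can be illustrated with today's io.StringIO (which replaced the old StringIO/cStringIO pair in Python 3): both the buffer approach and list-join append in amortized constant time, unlike repeated str concatenation.

```python
import io

# Two linear-time ways to accumulate text; repeated `s += part` on a
# plain str would copy the whole buffer on every append instead.
parts = ["chunk%d " % i for i in range(1000)]

buf = io.StringIO()
for p in parts:
    buf.write(p)
via_stringio = buf.getvalue()

via_join = "".join(parts)
print(via_stringio == via_join)
```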
Also, there are a couple of bugfixes in the queue:
1) fix operator.isMappingType() to exclude anything that passes
2) the logging module is in dire need of a quick start tutorial
(patterned after that for unittests which gives you the 90% of what you
need to know in one easy lesson) and a new method or two that
encapsulate standard usage patterns, boiling them down to
start_logging_stuff_here() and okay_log_this(). Right now, the module
suffers from a near vertical learning curve and even the simplest
logging applications require more instructions than you would expect.
Otherwise, the module is extremely well engineered. If the docs were
fixed up and a couple of methods added, it could be a runaway success
instead of a debacle.
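For what it's worth, the two helpers asked for above exist in spirit in the module as shipped: logging.basicConfig() covers the start_logging_stuff_here() step, and the logger methods cover okay_log_this() (a minimal sketch, not the tutorial itself):

```python
import logging

# one call configures a sensible default handler, level, and format
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

# after that, logging a message is a single method call
log = logging.getLogger("myapp")
log.info("application started")
log.warning("disk space low: %d%% used", 92)
```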
3) the Standard Library Tour in the tutorial was wildly successful and
demands a sequel. A part II should be written that covers more of what
you need to know to get productive from the start.
4) eval() only takes real dictionaries as arguments. A huge number of
possibilities open up if it could take dictionary look-alikes. Previous
proposals were rejected because they cost a percentage point or two of
execution time -- why should everyone pay the price for this? So, efforts
need to be directed at how to do this without a performance penalty -
several approaches were mentioned; none have been tried and proven.
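Later CPython versions did open part of this up: eval()'s locals argument may be any mapping, though globals must still be a real dict. A small illustration (the Env class is invented here):

```python
class Env:
    """A dictionary look-alike: just __getitem__, no real dict underneath."""
    def __init__(self, values):
        self._values = values
    def __getitem__(self, name):
        return self._values[name]    # raises KeyError for unknown names

# Python 3 accepts any mapping for eval's locals (globals must stay a dict)
result = eval("x * y + 1", {}, Env({"x": 6, "y": 7}))
print(result)   # 43
```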
Also, the generator expression patch is about ready for prime time. I
have it as a priority to get it reviewed and to fix up any outstanding
Okay, that's Raymond's list. Feel free to take one and run with it.