Piling more complexity on is likely to slow down the common case though.
Not at all. The common case (the macro) will not have a single instruction added. The rare case (the fallback function) will be slowed down by two instructions: decrement a global counter, jump if zero. The reshuffle will be triggered every 100th occurrence of the rare case (which may be every 10000th occurrence of the common case).
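In pseudo-Python the scheme looks something like this (a minimal sketch only; the patch does it in C inside the dict fallback path, and the names REFRESH_INTERVAL and reshuffle are made up here for illustration):

    REFRESH_INTERVAL = 100
    _countdown = REFRESH_INTERVAL

    def slow_lookup(namespace, name):
        # The rare fallback path: it pays only for one decrement
        # and one test per call.
        global _countdown
        _countdown -= 1
        if _countdown == 0:                 # every 100th rare case...
            _countdown = REFRESH_INTERVAL
            reshuffle(namespace)            # ...re-optimize the dict
        return namespace[name]

    def reshuffle(namespace):
        # Stand-in for the expensive re-optimization step.
        pass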
It's the same approach used when designing a CPU memory cache: it's OK if a cache miss is made slower by loading an entire cache line, as long as it improves the hit rate enough to justify it.
OK, OK. I timed pystone, and it's about 1.5% faster. If that's the best we can do (and I think that any of the PEPs trying to optimize global and builtin access probably offer about the same improvement as your code), I'm not sure we should worry about this. I'd rather try to make method calls and instance attribute access (both methods and instance variables) faster...
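(If you want to reproduce that number, one way -- assuming the stdlib test package is installed -- is to run the bundled benchmark under both interpreters and compare the reported rates:)

    # Run under both the stock and the patched interpreter;
    # higher pystones/second is better:
    from test import pystone
    pystone.main()    # prints the benchmark time and pystones/second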
For each dictionary, the number of negative entries is bounded by the number of global and builtin names in the co_names of all code objects that get executed with that dictionary as their namespace.
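To make that concrete, here is a rough way to compute the bound for a module from Python itself (a sketch, not part of the patch; it walks only top-level functions via the Python 2 func_code attribute, so it understates the true bound):

    def negative_entry_bound(module):
        # Upper bound on negative entries in module's dict: every
        # name in co_names of code executed against that namespace.
        names = {}
        for obj in vars(module).values():
            code = getattr(obj, 'func_code', None)   # __code__ in Py3
            if code is not None:
                for name in code.co_names:
                    names[name] = True
        return len(names)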
And that's my worry. Suppose a program has only one global but touches every builtin. Will the dictionary properly get resized to accommodate all those negative entries?
    >>> len(dir(__builtins__)) * 24    # sizeof(PyDictEntry) on i386
    2736
Caching string hashes also costs memory; speed optimizations often do. These costs need to be weighed against complexity, performance, and even code size: other alternatives may use up as much code space as this one uses data in this worst-case scenario.
I wasn't worried about size; I was worried about correctness. But it appears that your code is fine.

BTW, here's a benchmarking tip. Instead of timing this:

    for i in xrange(10000000):
        hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex
        hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex

you should time the for loop in this code:

    x = [0]*10000000
    for i in x:
        hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex
        hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex

The xrange iterator allocates (and implicitly deallocates) an int per iteration, and you don't use that int at all. (10 million pointers is 40 MB -- if that's too much, it's OK to test with 1 million iterations.)

Quick outcome:

    CVS python 2.3:    builtin: 8.480   global: 6.780   local: 5.080   fast: 2.890
    With your patch:   builtin: 4.660   global: 4.160   local: 4.020   fast: 3.000

Hm, this reports:

    0 inline dictionary lookups
    1568 fast dictionary lookups
    121 slow dictionary lookups
    0.00% inline
    7.16% slow
    created 308 string dicts
    converted 32 to normal dicts
    10.39% conversion rate

Somehow the counts seem to be off by several orders of magnitude. What *do* these numbers report?

Also note that your patch slows down fast access (by 3%)! How can it? Adding more code to the interpreter's inner loop changes the cache behavior, etc. Tim Peters can tell you more about this.

--Guido van Rossum (home page: http://www.python.org/~guido/)
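For completeness, the same xrange-vs-list comparison can be scripted with the stdlib timeit module (new in 2.3); this is only a sketch of the builtin-lookup case, not the harness that produced the numbers above:

    import timeit

    line = "    hex ; hex ; hex ; hex ; hex"
    body = "\n".join([line] * 4)            # 20 builtin lookups per pass

    xrange_loop = "for i in xrange(1000000):\n" + body
    list_loop   = "for i in x:\n" + body

    t1 = timeit.Timer(xrange_loop)
    t2 = timeit.Timer(list_loop, setup="x = [0]*1000000")
    print "xrange loop:", t1.timeit(1)
    print "list loop:  ", t2.timeit(1)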