Speed: bytecode vs C API calls

Jacek Generowicz jacek.generowicz at cern.ch
Wed Dec 10 04:32:26 EST 2003


"Francis Avila" <francisgavila at yahoo.com> writes:

> It is probably possible to custom-build a code object which makes _cache a
> local reference to the outer cache, and play with the bytecode to make the
> LOAD_DEREF into a LOAD_FAST, but it's not worth the effort.

I'm tempted to agree.
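
The cost being traded away here is easy to see with dis: a free
variable is fetched with LOAD_DEREF, while a default-argument local
is fetched with the cheaper LOAD_FAST. A minimal sketch (the factory
names here are illustrative, not from the real code):

  import dis

  def closure_version(cache):
      def lookup(key):
          return cache[key]         # free variable -> LOAD_DEREF
      return lookup

  def default_arg_version(cache):
      def lookup(key, _cache=cache):
          return _cache[key]        # local variable -> LOAD_FAST
      return lookup

  dis.dis(closure_version({}))      # disassembly shows LOAD_DEREF
  dis.dis(default_arg_version({}))  # disassembly shows LOAD_FAST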

> Jacek, I think this is as fast as pure Python will get you with this
> approach.

That's pretty much what I thought ... which is why I was (reluctantly)
investigating the C approach.

> If the speed is simply not acceptable, there are some things you
> need to consider. Is this the *real* bottleneck?

Well, when I profile the code, proxy comes out at the top in terms of
total time. Yes, there are other functions that are hotspots too. Yes,
maybe I'm being completely stupid, and 99% of those calls are unnecessary.

However, I simply thought that if I could increase the speed of "proxy"
by a factor of 3 or 7 or something, without too much pain, by recoding
it in C, then I'd get a noticeable speed gain overall. If it involves
re-hacking the interpreter, then I won't bother.

> Consider this before you delve into C solutions, which greatly increase
> headaches.

You don't need to tell me that :-)

> An easy way to check is to cut out the map and memoization and
> just use a dictionary preloaded with known argument:value pairs for a given
> set of test data.
> 
> e.g.,
> 
> def func(...):
>     ...
> 
> data = [foo, bar, ....]
> cache = dict([(i, func(i)) for i in data])
> 
> def timetrial(_data=data, _cache=cache):
>     return map(_cache.__getitem__, _data)
> # [operator.getitem(_cache, i) for i in _data] *might* be faster.
> 
> then timeit.py -s"import ttrial" "ttrial.timetrial()"

I tried something similar, namely inlining the 

  try: return cache[args]
  except KeyError: return cache.setdefault(args, callable(*args))
      
rather than using proxy. This gives a little over a factor of 3
speedup, but it's not something I can use everywhere, as often the
proxy is called inside map(...).
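
For concreteness, here is a minimal sketch of the two shapes being
compared -- the memoize/proxy pattern matches the snippet above, and
"compute" is just an illustrative stand-in for one of the real
functions:

  def memoize(callable):    # parameter shadows the builtin, as above
      cache = {}
      def proxy(*args):
          # Every cache hit still pays for one Python-level call.
          try: return cache[args]
          except KeyError: return cache.setdefault(args, callable(*args))
      return proxy

  def compute(x):           # illustrative stand-in
      return x * x

  cached_compute = memoize(compute)  # normal, proxied usage

  # Inlined variant: the same try/except written out at the call
  # site, avoiding the extra call -- but it needs direct access to
  # the cache, which rules out uses like map(cached_compute, data).
  cache = {}
  args = (42,)
  try: result = cache[args]
  except KeyError: result = cache.setdefault(args, compute(*args))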

> If it's still too slow, then either Python simply isn't for this task, or
> you need to optimize your functions.
> 
> If this isn't the real bottleneck:
> - Perhaps the functions I'm memoizing need optimization themselves?

But, to a pretty good approximation, they _never_ get called (the
value is (almost) always looked up). How would optimizing a function
that doesn't get called help?

> - Are there any modules out there which implement these in C
> already?

Other than the competing C++ implementation, no.

> - How does the competing C++ code do it?  Does it get its speed from some
> wicked data structure for the memoizing, or are the individual functions
> faster, or what?

I have about 10 functions, written in Python, which between them the
profiler highlights as taking up most of the time (with "proxy" the
main culprit). The C++ version doesn't have _any_
Python in its inner loops. I think it's as simple as that.

(Admittedly, a group of 4 of my functions was initially written with
development ease in mind, and should be replaced by a single mean and
lean C(++) function. That's probably where the real bottleneck is.)

> If this *is* the only significant bottleneck,

No, it's not the _only_ bottleneck.

Yes, I am pursuing other paths too. But I feel more confident that I
know what I am doing in the other cases, so I won't bother the list
with those. If you remember, my original question was about what I
could expect in terms of speedup when recoding a Python function in C,
which will make many calls to the Python C API. It's not that I'm
determined to fix my whole speed problem just in this one area, but I
would like to understand a bit more about what to expect when going to
C.

> but you want to try harder to speed it up, you can pursue the C
> closure path you were mentioning.

Yes, I think that this could be edifying. It's also likely to be a
huge pain, so I'm not that determined to pursue it, as I do have more
productive things to be getting on with.

> However, comparing your Python to a C++ version on the basis of speed is
> most certainly the wrong way to go.  If the C++ is well designed with good
> algorithms, it's going to be faster than Python. Period.

Not if my inner loops are C.

(BTW, I don't believe that the C++ version is all that wonderfully
designed, but enough about that :-)

> But if you need maximum speed at all costs, the best Python can do
> for you is act as a glue language to bits and pieces of hand-crafted
> C.

I think that it's the other way around, in this case. I build up a
whole structure in Python ... and just need C(++) to call a very small
subset of the functionality I provide, very efficiently.

> Do you really *need* the speed, or are you just bothered that it's not as
> fast as the C++ version?

Politically speaking, yes, I need the speed. If a silly benchmark (one
which completely misrepresents reality, BTW) shows that the Python
version is slower than the C++ one, then the Python one is terminated.

However, as you point out:

> The advantage of Python is that it's much faster and easier to
> *write* (and less likely to introduce bugs), easier to *read* and
> maintain, and can model your problem more intuitively and flexibly.

... and for these reasons I am very keen to ensure that the Python
version survives ... which, ironically enough, means that it must be
fast, even though I personally don't care that much about the speed.



