LOAD_SELF and SELF_ATTR opcodes
I ran across an interesting paper about some VM optimizations yesterday:

http://www.object-arts.com/Papers/TheInterpreterIsDead.PDF

One thing mentioned was that saving even one cycle in their 'PUSH_SELF' opcode improved interpreter performance by 5%. I thought that was pretty cool, and then I realized CPython doesn't even *have* a PUSH_SELF opcode. So, today, I took a stab at implementing one, by converting "LOAD_FAST 0" calls to a "LOAD_SELF" opcode. Pystone and Parrotbench improved by about 2% or so.

That wasn't great, so I added a "SELF_ATTR" opcode that combines a LOAD_SELF and a LOAD_ATTR in the same opcode while avoiding extra stack and refcount manipulation. This raised the total improvement for pystone to about 5%, but didn't seem to improve parrotbench any further. I guess parrotbench doesn't do much self.attr stuff in places that really count; looking at the code, it indeed seems that most self.* stuff is done at the higher levels of the parsing benchmark, not in the innermost loops.

Indeed, even pystone doesn't do much attribute access on the first argument of most of its functions, especially not those in inner loops. Only Proc1() and the Record.copy() method do anything that would be helped by SELF_ATTR. But it seems to me that this is very unusual for object-oriented code, and that more common uses of Python should be helped a lot more by this. Do we have any benchmarks that don't use 'foo = self.foo' type shortcuts in their inner loops?

Anyway, my main question is: do these sound like worthwhile optimizations? The code isn't that complex; the only tricky thing I did was having the opcodes' error case (unbound local) fall through to the LOAD_FAST opcode so as not to duplicate the error handling code, in the hopes of keeping the eval loop size down.
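[Editor's note: the LOAD_FAST/LOAD_ATTR pair that SELF_ATTR would fuse is easy to see with the `dis` module. This is only an illustration of the bytecode pattern under discussion, not the patch itself; exact opcode names vary across CPython versions, and the `Record` class below is a hypothetical stand-in for pystone's.]

```python
import dis

class Record:
    # Hypothetical stand-in for pystone's Record.copy(): a method whose
    # body is dominated by attribute access on its first argument.
    def copy(self):
        return self.value

# Every 'self.value' compiles to a LOAD_FAST of argument 0 ('self')
# immediately followed by a LOAD_ATTR -- the two-opcode sequence the
# proposed SELF_ATTR instruction would collapse into one dispatch.
ops = [ins.opname for ins in dis.get_instructions(Record.copy)]
print(ops)
```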
Phillip> Indeed, even pystone doesn't do much attribute access on the
Phillip> first argument of most of its functions, especially not those
Phillip> in inner loops.  Only Proc1() and the Record.copy() method do
Phillip> anything that would be helped by SELF_ATTR.  But it seems to me
Phillip> that this is very unusual for object-oriented code, and that
Phillip> more common uses of Python should be helped a lot more by this.
Phillip> Do we have any benchmarks that don't use 'foo = self.foo' type
Phillip> shortcuts in their inner loops?

(Just thinking out loud...) Maybe we should create an alternate "object-oriented" version of pystone as a way to inject more attribute access into a convenient benchmark. Even if it's completely artificial and has no connection to other versions of the Dhrystone benchmark, it might be useful for testing improvements to attribute access.

Skip
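[Editor's note: Skip's idea could be prototyped in a few lines — an entirely artificial class whose inner loop routes every operation through self attributes, so any speedup to self.attr access shows up directly. A minimal sketch; the `Node` class and loop count are invented for illustration.]

```python
import time

class Node:
    # Artificial "object-oriented pystone" fragment: all inner-loop state
    # lives in instance attributes, never in locals, so the timing is
    # deliberately dominated by self.attr loads and stores.
    def __init__(self):
        self.count = 0
        self.step = 1

    def run(self, loops):
        for _ in range(loops):
            self.count = self.count + self.step
        return self.count

start = time.perf_counter()
result = Node().run(100_000)
print(result, time.perf_counter() - start)
```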
Phillip J. Eby wrote:
Anyway, my main question is, do these sound like worthwhile optimizations?
In the past, I think the analysis was always "no". It adds an opcode, so it increases the size of the switch, causing more pressure on the cache, with an overall questionable effect.

As for measuring the effect of the change: how often does that pattern occur in the standard library, compared to the total number of LOAD_ATTR occurrences?

Regards,
Martin
At 12:33 AM 10/15/2005 +0200, Martin v. Löwis wrote:
Phillip J. Eby wrote:
Anyway, my main question is, do these sound like worthwhile optimizations?
In the past, I think the analysis was always "no". It adds an opcode, so increases the size of the switch, causing more pressure on the cache, with an overall questionable effect.
Hm. I'd have thought 5% pystone and 2% pybench is nothing to sneeze at, for such a minor change. I thought Skip's peephole optimizer originally only produced a 5% or so speedup.

In any case, in relation to this specific kind of optimization, this is the only thing I found:

http://mail.python.org/pipermail/python-dev/2002-February/019854.html

which is a proposal by Guido to do the same thing, but also speeding up the actual attribute lookup. I didn't find any follow-up suggesting that anybody tried this, but perhaps it was put on hold pending the AST branch? :)
As for measuring the effect of the change: how often does that pattern occur in the standard library? (compared to what total number of LOAD_ATTR)
[pje@ns src]$ grep 'self\.[A-Za-z_]' Lib/*.py | wc -l
9919
[pje@ns src]$ grep '[a-zA-Z_][a-zA-Z_0-9]*\.[a-zA-Z_]' Lib/*.py | wc -l
19804

So, something like 50% of lines doing an attribute access include a 'self' attribute access. This very rough estimate may be thrown off by:

* Import statements (causing an error in favor of more non-self attributes)
* Functions whose first argument isn't 'self' (error in favor of non-self attributes)
* Comments or docstrings talking about attributes or modules (could go either way)
* Multiple attribute accesses on the same line (could go either way)

The parrotbench code shows a similar ratio of self to non-self attribute usage, but nearly all of parrotbench's self-attribute usage is in b0.py, and not called in the innermost loop. That also suggests that the volume of usage of 'self.' isn't the best way to determine the performance impact, because pystone has almost no 'self.' usage at all, but still got a 5% total boost.
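[Editor's note: a grep over source lines is only a proxy; the pattern the opcode targets is visible directly in compiled bytecode. A rough sketch of counting it with the `dis` module follows — the helper name is made up, and it assumes current CPython, where the relevant opcodes are still named LOAD_FAST and LOAD_ATTR.]

```python
import dis

def self_attr_ratio(func):
    """Count LOAD_FAST-of-argument-0 immediately followed by LOAD_ATTR
    (the pair a SELF_ATTR opcode would fuse) versus all LOAD_ATTRs."""
    ins = list(dis.get_instructions(func))
    fused = total = 0
    for a, b in zip(ins, ins[1:]):
        if b.opname == "LOAD_ATTR":
            total += 1
            # Argument 0 of a method is 'self' by convention.
            if a.opname == "LOAD_FAST" and a.arg == 0:
                fused += 1
    return fused, total

class C:
    def m(self, other):
        return self.x + other.y + self.z

# Two of the three LOAD_ATTRs here follow a load of 'self'.
print(self_attr_ratio(C.m))
```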
>> Phillip J. Eby wrote:
>> > Anyway, my main question is, do these sound like worthwhile
>> > optimizations?
>>
>> In the past, I think the analysis was always "no". It adds an opcode,
>> so increases the size of the switch, causing more pressure on the
>> cache, with an overall questionable effect.

Phillip> Hm.  I'd have thought 5% pystone and 2% pybench is nothing to
Phillip> sneeze at, for such a minor change.

We've added lots of new opcodes over the years. CPU caches have grown steadily in that time as well, from maybe 128KB-256KB in the early 90's to around 1MB today. I suspect cache size has kept up with the growth of Python's VM inner loop.

At any rate, each change has to be judged on its own merits. If it speeds things up and is uncontroversial implementation-wise, I see no reason it should be rejected out-of-hand. (Send it to Raymond H. He'll probably sneak it in when Martin's not looking. <wink>)

Skip
"Phillip J. Eby" <pje@telecommunity.com> writes:
Indeed, even pystone doesn't do much attribute access on the first argument of most of its functions, especially not those in inner loops. Only Proc1() and the Record.copy() method do anything that would be helped by SELF_ATTR. But it seems to me that this is very unusual for object-oriented code, and that more common uses of Python should be helped a lot more by this.
Is it that unusual, though? I don't think it's that unreasonable to suppose that 'typical smalltalk code' sends messages to self a good deal more often than 'typical python code'. I'm not saying that this *is* the case, but my intuition is that it might be (not all Python code is that object oriented, after all).

Cheers,
mwh

--
The source passes lint without any complaint (if invoked with /dev/null).
  -- Daniel Fischer, http://www.ioccc.org/1998/df.hint
At 09:17 AM 10/15/2005 +0100, Michael Hudson wrote:
"Phillip J. Eby" <pje@telecommunity.com> writes:
Indeed, even pystone doesn't do much attribute access on the first argument of most of its functions, especially not those in inner loops. Only Proc1() and the Record.copy() method do anything that would be helped by SELF_ATTR. But it seems to me that this is very unusual for object-oriented code, and that more common uses of Python should be helped a lot more by this.
Is it that unusual though? I don't think it's that unreasonable to suppose that 'typical smalltalk code' sends messages to self a good deal more often than 'typical python code'. I'm not saying that this *is* the case, but my intuition is that it might be (not all Python code is that object oriented, after all).
Well, my greps on the stdlib suggest that about 50% of all lines doing attribute access include an attribute access on 'self'. So for the stdlib, it's darn common. Plus, all functions benefit a tiny bit from faster access to their first argument via the LOAD_SELF opcode, which is what produced the roughly 2% improvement of parrotbench.

The overall performance question has more to do with whether any of those accesses to self or self attributes are in loops. A person who's experienced at doing Python performance tuning will probably lift as many of them out of the innermost loops as possible, thereby lessening the impact of the change somewhat. But someone who doesn't know to do that, or just hasn't done it yet, will get more benefit from the change, though not as much as they'd get by lifting out the attribute access.

Thus my guess is that it'll speed up "typical", un-tuned Python code by a few percent, and is unlikely to slow anything down - even compilation. (The compiler changes are extremely minimal and localized to the relevant bytecode emission points.) Seems like a freebie, all in all.
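[Editor's note: the hand optimization being described — hoisting an attribute lookup out of an inner loop into a local — looks like the following. This is a generic sketch, not code from any of the benchmarks; the class and method names are invented.]

```python
class Accumulator:
    # Generic illustration of the 'foo = self.foo' shortcut discussed above.
    def __init__(self):
        self.items = []

    def fill_untuned(self, n):
        for i in range(n):
            self.items.append(i)    # self.items resolved on every iteration

    def fill_tuned(self, n):
        append = self.items.append  # lifted once into a local, outside the loop
        for i in range(n):
            append(i)               # inner loop touches only a fast local

a, b = Accumulator(), Accumulator()
a.fill_untuned(5)
b.fill_tuned(5)
print(a.items, b.items)
```

A SELF_ATTR opcode would mainly help the untuned version; the tuned version already bypasses the LOAD_FAST/LOAD_ATTR pair inside the loop.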
participants (4)

- "Martin v. Löwis"
- Michael Hudson
- Phillip J. Eby
- skip@pobox.com