PATCH: Fast globals/builtins lookups for 2.6
I've posted a patch here: http://bugs.python.org/issue1518 for 2.6a0 that speeds up LOAD_GLOBAL to where it's just barely slower than LOAD_FAST for both globals and builtins. It started life here, in Python-ideas: http://mail.python.org/pipermail/python-ideas/2007-November/001212.html

The idea is to cache pointers to dictionary entries rather than dictionary values, as is usually suggested for speeding up globals. In my original approach, every dictionary kept a list of "observers" that it would notify when an entry pointer became invalid. However, the surgery on PyDictObject became too invasive, to the point where the new code was affecting the performance of unrelated code paths (probably due to caching effects). It also made some very rare operations on builtins and globals very, very slow.

The new approach tags each PyDictObject with a "version": a 64-bit value that is incremented on every operation that invalidates at least one entry pointer. (Those are inserting set, delete, pop, clear and resize; non-inserting set and get are unaffected.) With this approach, the changes to PyDictObject are unintrusive and do not affect performance in any measurable way, as far as I can tell.

Every function has a PyFastGlobalsObject, which keeps entry pointers for everything in co_names and tracks its dicts' versions so it can update the pointers when one might have become invalid. LOAD_GLOBAL uses the PyFastGlobalsObject to get globals and builtins by name index rather than by string key.

With the patch, Python behaves differently in these two ways (as far as I can tell):

1. As before, a __builtins__ ({'None': None}) is created for frames whose globals do not have one. Unlike before, it's added to the globals dict rather than just kept internally. This made the implementation easier; I don't think it's a big deal (but I might be wrong).

2. A change of __builtins__ (its value, not its contents) always appears at the beginning of a stack frame. (Before, changing __builtins__ in a module would be undetectable if function calls were kept within that module.) This is likely of little importance.

I'd love to see it included. I have donned my asbestos underwear and my best chain mail shirt, and I'm begging for comments.

Neil
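The scheme Neil describes can be sketched in Python. This is only an illustration of the idea, not the C patch: `VersionedDict` and `NameCache` are invented names, and one-element lists stand in for the C-level dict entries that the real PyFastGlobalsObject keeps pointers to.

```python
class VersionedDict:
    """Sketch of a dict whose entries are mutable cells, with a version
    counter bumped by every operation that can invalidate a cached cell
    (inserting set, delete, and so on), per the approach above."""

    def __init__(self):
        self._entries = {}   # key -> [value] cell (stand-in for a C entry)
        self.version = 0

    def __setitem__(self, key, value):
        cell = self._entries.get(key)
        if cell is not None:
            cell[0] = value               # non-inserting set: updates in
        else:                             # place, cached cells stay valid
            self._entries[key] = [value]  # inserting set may resize/move
            self.version += 1             # entries, so bump the version

    def __delitem__(self, key):
        del self._entries[key]
        self.version += 1

    def get_entry(self, key):
        return self._entries.get(key)     # the "entry pointer"


class NameCache:
    """Sketch of PyFastGlobalsObject: caches entry cells for a fixed
    tuple of names (like co_names), refreshing only when the dict's
    version shows a cached cell may have become invalid."""

    def __init__(self, names, d):
        self.names = names
        self.d = d
        self.cached_version = -1          # force a refresh on first use
        self.cells = [None] * len(names)

    def load(self, index):                # what LOAD_GLOBAL would do
        if self.cached_version != self.d.version:
            self.cells = [self.d.get_entry(n) for n in self.names]
            self.cached_version = self.d.version
        cell = self.cells[index]
        if cell is None:
            raise NameError(self.names[index])
        return cell[0]                    # no string hashing or probing
```

In the common case a lookup through `NameCache.load` touches only an integer comparison and a list index; in the C patch the same trick lets LOAD_GLOBAL skip the string-keyed dict probe entirely.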
Cool! Can't wait to get my hands on it. How does it affect pystone?

What happens if the globals are not a real dict but an instance of another collections.MutableMapping (virtual or real) subclass? We've worked hard (well, some folks have) to enable this. It would be a show-stopper if this broke (or changed semantics, or became significantly slower).

--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
Cool! Can't wait to get my hands on it. How does it affect pystone?
Pystone likes it, according to my tests and Chris's. On his machine, if I'm reading these stones right, it takes about 7% off the run time on average. On mine it takes off 4%.
What happens if the globals are not a real dict but an instance of another collections.MutableMapping (virtual or real) subclass?
Globals has to be a real dict or a subclass, because otherwise PyFastGlobalsObject won't be able to point directly at its entries. This doesn't change anything, since globals had to be a real dict or subclass before.
We've worked hard (well, some folks have) to enable this. It would be a show-stopper if this broke (or changed semantics or became significantly slower).
Besides what I outlined about __builtins__ (which should be an arbitrary implementation detail), everything is exactly the same. The details of fast globals are completely transparent to everything but PyDictObject (in which they're nearly insignificant) and PyFastGlobalsObject. In other words, every other bit of code in the interpreter can happily do whatever the heck it wants with globals and builtins without having to worry about messing anything up. Since it's nearly transparent to the interpreter core, Python-the-language shouldn't see any differences at all.

But then, I know less about the rest of the core than just about everybody else here, so I may just be talking out of my rear. :)

Pystone and my microbenchmarks look good, but we don't know yet about Pybench. On my tests, it's nearly the same, with small variations in individual tests. On Chris's, there are large variations that appear (to me) to be random. Pybench does almost nothing with globals, though - the numbers on that really only need to stay put.

Neil
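Neil's point that globals must be a real dict (or a dict subclass) still holds in current CPython, and is easy to check with exec(). The snippet below reflects modern CPython behavior; the 2.6-era code under discussion may differ in details.

```python
from collections import UserDict

exec("x = 1", {})                 # a plain dict works

class MyDict(dict):               # a dict subclass works too
    pass

exec("y = 2", MyDict())

try:
    exec("z = 3", UserDict())     # a MutableMapping that isn't a dict
except TypeError as exc:
    print("rejected:", exc)
```

The locals argument to exec(), by contrast, may be any mapping; the dict requirement applies only to globals, which is exactly the object the entry-pointer cache needs to reach into.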
On Nov 29, 2007 1:07 PM, Neil Toronto <ntoronto@cs.byu.edu> wrote:
Guido van Rossum wrote:
Cool! Can't wait to get my hands on it. How does it affect pystone?
Pystone likes it, according to my tests and Chris's. On his machine, if I'm reading these stones right, it takes about 7% off the run time on average. On mine it takes off 4%.
Hm. On my Linux box, in the trunk:

Before the patch:
Pystone(1.1) time for 50000 passes = 1.16
This machine benchmarks at 43103.4 pystones/second

After the patch:
Pystone(1.1) time for 50000 passes = 1.14
This machine benchmarks at 43859.6 pystones/second

That's only about 1.75% faster. But pystone is a lousy benchmark.
What happens if the globals are not a real dict but an instance of another collections.MutableMapping (virtual or real) subclass?
Globals has to be a real dict or a subclass, because otherwise PyFastGlobalsObject won't be able to point directly at its entries. This doesn't change anything, since globals had to be a real dict or subclass before.
We've worked hard (well, some folks have) to enable this. It would be a show-stopper if this broke (or changed semantics or became significantly slower).
Besides what I outlined about __builtins__ (which should be an arbitrary implementation detail), everything is exactly the same. The details of fast globals are completely transparent to everything but PyDictObject (in which they're nearly insignificant) and PyFastGlobalsObject. In other words, every other bit of code in the interpreter can happily do whatever the heck it wants with globals and builtins without having to worry about messing anything up. Since it's nearly transparent to the interpreter core, Python-the-language shouldn't see any differences at all.
But then, I know less about the rest of the core than just about everybody else here, so I may just be talking out of my rear. :)
My apologies. I though I remembered we changed exec() and eval() to accept a totally arbitrary mapping for globals() -- but that's not the case, it must be a dict subclass.
Pystone and my microbenchmarks look good, but we don't know yet about Pybench. On my tests, it's nearly the same, with small variations in individual tests. On Chris's, there are large variations that appear (to me) to be random. Pybench does almost nothing with globals, though - the numbers on that really only need to stay put.
The only pybench tests that matter would be the ones checking dict performance.

--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
Hm.
On my Linux box, in the trunk:
Before the patch:
Pystone(1.1) time for 50000 passes = 1.16
This machine benchmarks at 43103.4 pystones/second

After the patch:
Pystone(1.1) time for 50000 passes = 1.14
This machine benchmarks at 43859.6 pystones/second
That's only about 1.75% faster. But pystone is a lousy benchmark.
I'm not aware of any benchmark that isn't. :)

Can you humor me and change the PY_LONG_LONG to Py_ssize_t in both PyDictObject and PyFastGlobalsObject and see if that helps? It does on one of my test machines.

Speaking of which, here's a question for everybody. I was wondering whether 64 bits is necessary. It takes an hour of concerted effort - nothing but "module.d = 1; del module.d" for an hour straight - to overflow a 32-bit version number. Is anybody going to actually get close to doing that in a global namespace?

I don't think a malicious user could exploit it. The most they could do is segfault by doing exactly 2**32 entry-invalidating operations and then one get or set. They've got better things to do if they're running code on your machine.

FWIW - and I wouldn't bother with this if I weren't mucking about with dict internals - with a 32-bit version number, I've verified that gcc emits only one extra instruction in dict functions that increment it. It's two for a 64-bit number. The version test in LOAD_GLOBAL does take a bit more time with 64 bits, though.

Neil
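As a sanity check on the "an hour" figure above: overflowing a 32-bit counter in an hour means sustaining over a million invalidating operations per second, which is plausible for a tight set/delete loop.

```python
# How fast must a loop run to overflow a 32-bit version counter in one
# hour?  (Each "module.d = 1; del module.d" pair performs two
# entry-invalidating operations.)
ops_per_second = 2**32 / 3600
print(f"{ops_per_second:,.0f} invalidating ops/second")  # about 1.2 million
```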
Neil Toronto wrote:
Guido van Rossum wrote:
Hm.
On my Linux box, in the trunk:
Before the patch:
Pystone(1.1) time for 50000 passes = 1.16
This machine benchmarks at 43103.4 pystones/second

After the patch:
Pystone(1.1) time for 50000 passes = 1.14
This machine benchmarks at 43859.6 pystones/second
That's only about 1.75% faster. But pystone is a lousy benchmark.
I'm not aware of any benchmark that isn't. :)
Can you humor me and change the PY_LONG_LONG to Py_ssize_t in both PyDictObject and PyFastGlobalsObject and see if that helps? It does on one of my test machines.
Speaking of which, here's a question for everybody. I was wondering whether 64 bits is necessary. It takes an hour of concerted effort - nothing but "module.d = 1; del module.d" for an hour straight - to overflow a 32-bit version number. Is anybody going to actually get close to doing that in a global namespace?
Of course not. And 640k is as much memory as anyone could reasonably need ...
I don't think a malicious user could exploit it. The most they could do is segfault by doing exactly 2**32 entry-invalidating operations and then one get or set. They've got better things to do if they're running code on your machine.
FWIW - and I wouldn't bother with this if I weren't mucking about with dict internals - with a 32-bit version number, I've verified that gcc emits only one extra instruction in dict functions that increment it. It's two for a 64-bit number. The version test in LOAD_GLOBAL does take a bit more time with 64 bits, though.
That's a good local optimization for today's conditions, probably. Who knows whether it will survive the next ten years? And decisions like that have a weird habit of running into pathological cases whose authors curse you when they find out where the problem arose.

regards
Steve

--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Steve Holden wrote:
Neil Toronto wrote:
Speaking of which, here's a question for everybody. I was wondering whether 64 bits is necessary. It takes an hour of concerted effort - nothing but "module.d = 1; del module.d" for an hour straight - to overflow a 32-bit version number. Is anybody going to actually get close to doing that in a global namespace?
Of course not. And 640k is as much memory as anyone could reasonably need ...
Point taken - forget I asked. Neil
Hi Neil,

On Fri, Nov 30, 2007 at 09:14:04AM -0700, Neil Toronto wrote:
whether 64 bits is necessary. It takes an hour of concerted effort - nothing but "module.d = 1; del module.d" for an hour straight - to overflow a 32-bit version number. Is anybody going to actually get close to doing that in a global namespace?
Of course not. And 640k is as much memory as anyone could reasonably need ...
Point taken - forget I asked.
How much time does it take if the loop is in a C extension module, e.g. like the following?

    while (1) {
        PyDict_SetItem(d, k, v);
        PyDict_DelItem(d, k);
    }

How much time would it take the same loop to overflow even the 64-bit version number? And how much will *this* take in ten years' time?

A bientot,

Armin
On Dec 27, 2007 4:22 AM, Armin Rigo <arigo@tunes.org> wrote:
How much time does it take if the loop is in a C extension module, e.g. like the following?
    while (1) {
        PyDict_SetItem(d, k, v);
        PyDict_DelItem(d, k);
    }
How much time would it take the same loop to overflow even the 64-bit version number?
On a modern computer, more than a century.
And how much will *this* take in ten year's time?
My rule of thumb is that a tight loop performing 2**32 operations takes around a minute on a modern computer. The PyDict operations presumably take somewhat longer than a few primitive operations, so this is a good lower bound. Assuming that processing speed doubles every 1.5 years, we have:

2**64 operations / 2**32 operations per minute / 2**(10 years / 1.5 years per speed doubling) = 42275935 minutes = about 80 years

of running that tight loop to get an overflow. I think 64 bits is pretty safe :-)

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises LLC
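The arithmetic can be verified directly; note that 42,275,935 minutes works out to roughly 80 years of continuous looping.

```python
# Daniel's estimate: 2**64 total operations, 2**32 operations per
# minute today, and speed doubling every 1.5 years over a 10-year span.
minutes = 2**64 / 2**32 / 2**(10 / 1.5)
years = minutes / (60 * 24 * 365)
print(f"{minutes:,.0f} minutes ~= {years:.0f} years")
```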
participants (5)

- Armin Rigo
- Daniel Stutzbach
- Guido van Rossum
- Neil Toronto
- Steve Holden