I've written a proof of concept patch to add a hook to PyDict_SetItem at http://bugs.python.org/issue5654 My motivation is to enable watchpoints in a python debugger that are called when an attribute or global changes. I know that this won't cover function locals and objects with slots (as Martin pointed out).

We talked about this at the sprints and a few issues came up:

* Is this worth it for debugger watchpoint support? This is a feature that probably wouldn't be used regularly but is extremely useful in some situations.

* Would it be better to create a namespace dict subclass of dict, use it for modules, classes, & instances, and only allow watches of the subclass instances?

* To what extent should non-debugger code use the hook? At one end of the spectrum, the hook could be made readily available for non-debug use and at the other end, it could be documented as being debug only, disabled in python -O, & not exposed in the stdlib to python code.

John
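A rough Python-level sketch of the "namespace dict subclass" idea from the second bullet above. This is not the patch attached to issue 5654; the names (WatchableDict, add_watch) are invented purely for illustration. Note that CPython generally performs attribute and global stores through PyDict_SetItem at the C level, which bypasses an overridden __setitem__, so something like this only becomes a real watchpoint mechanism if the interpreter itself uses such a subclass for module/class/instance namespaces -- which is exactly the trade-off the bullet raises.

    # Illustrative only: WatchableDict/add_watch are made-up names,
    # not part of the proposed patch.

    class WatchableDict(dict):
        """A dict that notifies registered callbacks whenever a key is set."""

        def __init__(self, *args, **kwargs):
            super(WatchableDict, self).__init__(*args, **kwargs)
            self._watchers = []

        def add_watch(self, callback):
            # callback(mapping, key, value) is invoked on every assignment
            self._watchers.append(callback)

        def __setitem__(self, key, value):
            for callback in self._watchers:
                callback(self, key, value)
            dict.__setitem__(self, key, value)


    def report(mapping, key, value):
        print("watchpoint hit: %r = %r" % (key, value))


    ns = WatchableDict()
    ns.add_watch(report)
    ns["x"] = 42    # prints: watchpoint hit: 'x' = 42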
On Wed, Apr 1, 2009 at 4:29 PM, John Ehresman
I've written a proof of concept patch to add a hook to PyDict_SetItem at http://bugs.python.org/issue5654 My motivation is to enable watchpoints in a python debugger that are called when an attribute or global changes. I know that this won't cover function locals and objects with slots (as Martin pointed out).
We talked about this at the sprints and a few issues came up:
* Is this worth it for debugger watchpoint support? This is a feature that probably wouldn't be used regularly but is extremely useful in some situations.
* Would it be better to create a namespace dict subclass of dict, use it for modules, classes, & instances, and only allow watches of the subclass instances?
* To what extent should non-debugger code use the hook? At one end of the spectrum, the hook could be made readily available for non-debug use and at the other end, it could be documented as being debug only, disabled in python -O, & not exposed in the stdlib to python code.
Have you measured the impact on performance? Collin
Collin Winter wrote:
Have you measured the impact on performance?
I've tried to test using pystone, but am seeing more differences between runs than there are between python w/ the patch and w/o when there is no hook installed. The highest pystone is actually from the binary w/ the patch, which I don't really believe unless it's some low-level code generation effect. The cost is one test of a global variable and then a switch to the branch that doesn't call the hooks.

I'd be happy to try to come up with better numbers next week after I get home from pycon.

John
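One way to get a more targeted number than pystone, for anyone who wants to reproduce this: time a tight loop of dict stores with timeit on the patched and unpatched binaries and compare the best runs. A minimal sketch, illustrative only; the exact numbers will depend heavily on compiler, CPU and cache behaviour:

    import timeit

    setup = "d = {}"
    stmt = """
    for i in range(1000):
        d[i] = i
    """

    # best-of-5 to reduce scheduling and warm-up noise
    best = min(timeit.repeat(stmt, setup=setup, repeat=5, number=10000))
    print("1000 dict stores x 10000 runs: %.3f s (best of 5)" % best)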
On Thu, Apr 2, 2009 at 04:16, John Ehresman
Collin Winter wrote:
Have you measured the impact on performance?
I've tried to test using pystone, but am seeing more differences between runs than there are between python w/ the patch and w/o when there is no hook installed. The highest pystone is actually from the binary w/ the patch, which I don't really believe unless it's some low-level code generation effect. The cost is one test of a global variable and then a switch to the branch that doesn't call the hooks.
I'd be happy to try to come up with better numbers next week after I get home from pycon.
Pystone is pretty much a useless benchmark. If it measures anything, it's the speed of the bytecode dispatcher (and it doesn't measure it particularly well.) PyBench isn't any better, in my experience. Collin has collected a set of reasonable benchmarks for Unladen Swallow, but they still leave a lot to be desired. From the discussions at the VM and Language summits before PyCon, I don't think anyone else has better benchmarks, though, so I would suggest using Unladen Swallow's:

http://code.google.com/p/unladen-swallow/wiki/Benchmarks
--
Thomas Wouters
The measurements are just a distractor. We all already know that the hook is being added to a critical path. Everyone will pay a cost for a feature that few people will use. This is a really bad idea. It is not part of a thorough, thought-out framework of container hooks (something that would need a PEP at the very least). The case for how it helps us is somewhat thin. The case for DTrace hooks was much stronger.
If something does go in, it should be #ifdef'd out by default. But then, I don't think it should go in at all.
Raymond
On Thu, Apr 2, 2009 at 04:16, John Ehresman
Wow. Can you possibly be more negative?
2009/4/2 Raymond Hettinger
The measurements are just a distractor. We all already know that the hook is being added to a critical path. Everyone will pay a cost for a feature that few people will use. This is a really bad idea. It is not part of a thorough, thought-out framework of container hooks (something that would need a PEP at the very least). The case for how it helps us is somewhat thin. The case for DTrace hooks was much stronger.
If something does go in, it should be #ifdef'd out by default. But then, I don't think it should go in at all.
Raymond
On Thu, Apr 2, 2009 at 04:16, John Ehresman
wrote: Collin Winter wrote:
Have you measured the impact on performance?
I've tried to test using pystone, but am seeing more differences between runs than there are between python w/ the patch and w/o when there is no hook installed. The highest pystone is actually from the binary w/ the patch, which I don't really believe unless it's some low-level code generation effect. The cost is one test of a global variable and then a switch to the branch that doesn't call the hooks.
I'd be happy to try to come up with better numbers next week after I get home from pycon.
Pystone is pretty much a useless benchmark. If it measures anything, it's the speed of the bytecode dispatcher (and it doesn't measure it particularly well.) PyBench isn't any better, in my experience. Collin has collected a set of reasonable benchmarks for Unladen Swallow, but they still leave a lot to be desired. From the discussions at the VM and Language summits before PyCon, I don't think anyone else has better benchmarks, though, so I would suggest using Unladen Swallow's: http://code.google.com/p/unladen-swallow/wiki/Benchmarks
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
Wow. Can you possibly be more negative?
I think it's worse to give the poor guy the run around by making him run lots of random benchmarks. In the end, someone will run a timeit or have a specific case that shows the full effect. All of the respondents so far seem to have a clear intuition that hook is right in the middle of a critical path. Their intuition matches what I learned by spending a month trying to find ways to optimize dictionaries.

Am surprised that there has been no discussion of why this should be in the default build (as opposed to a compile time option). AFAICT, users have not previously requested a hook like this.

Also, there has been no discussion for an overall strategy for monitoring containers in general. Lists and tuples will both defy this approach because there is so much code that accesses the arrays directly. Am not sure whether the setitem hook would work for other implementations either.

It seems weird to me that Collin's group can be working so hard just to get a percent or two improvement in specific cases for pickling while python-dev is readily entertaining a patch that slows down the entire language.

If my thoughts on the subject bug you, I'll happily withdraw from the thread. I don't aspire to be a source of negativity. I just happen to think this proposal isn't a good idea.
Raymond
----- Original Message -----
From: "Guido van Rossum"
The measurements are just a distractor. We all already know that the hook is being added to a critical path. Everyone will pay a cost for a feature that few people will use. This is a really bad idea. It is not part of a thorough, thought-out framework of container hooks (something that would need a PEP at the very least). The case for how it helps us is somewhat thin. The case for DTrace hooks was much stronger.
If something does go in, it should be #ifdef'd out by default. But then, I don't think it should go in at all.
Raymond
On Thu, Apr 2, 2009 at 04:16, John Ehresman
wrote: Collin Winter wrote:
Have you measured the impact on performance?
I've tried to test using pystone, but am seeing more differences between runs than there are between python w/ the patch and w/o when there is no hook installed. The highest pystone is actually from the binary w/ the patch, which I don't really believe unless it's some low-level code generation effect. The cost is one test of a global variable and then a switch to the branch that doesn't call the hooks.
I'd be happy to try to come up with better numbers next week after I get home from pycon.
Pystone is pretty much a useless benchmark. If it measures anything, it's the speed of the bytecode dispatcher (and it doesn't measure it particularly well.) PyBench isn't any better, in my experience. Collin has collected a set of reasonable benchmarks for Unladen Swallow, but they still leave a lot to be desired. From the discussions at the VM and Language summits before PyCon, I don't think anyone else has better benchmarks, though, so I would suggest using Unladen Swallow's: http://code.google.com/p/unladen-swallow/wiki/Benchmarks
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
Raymond Hettinger
It seems weird to me that Collin's group can be working so hard just to get a percent or two improvement in specific cases for pickling while python-dev is readily entertaining a patch that slows down the entire language.
I think it's really more than a percent or two: http://bugs.python.org/issue5670 Regards Antoine.
It seems weird to me that Collin's group can be working so hard just to get a percent or two improvement in specific cases for pickling while python-dev is readily entertaining a patch that slows down the entire language.
[Antoine Pitrou]
I think it's really more than a percent or two: http://bugs.python.org/issue5670
For lists, it was a percent or two: http://bugs.python.org/issue5671

I expect Collin's overall efforts to pay off nicely. I was just pointing out the contrast between module-specific optimization efforts versus anti-optimizations that affect the whole language.

Raymond
On Fri, Apr 3, 2009 at 00:07, Raymond Hettinger
It seems weird to me that Collin's group can be working so hard just to get a percent or two improvement in specific cases for pickling while python-dev is readily entertaining a patch that slows down the entire language.
Collin's group has unfortunately seen that you cannot know the actual impact of a change until you measure it. GCC performance, for instance, is extremely unpredictable, and I can easily see a change like this proving to have zero impact -- or even positive impact -- on most platforms because, say, it warms the cache for the common case. I doubt it will, but you can't *know* until you measure it.
--
Thomas Wouters
On Thu, Apr 2, 2009 at 3:07 PM, Raymond Hettinger
Wow. Can you possibly be more negative?
I think it's worse to give the poor guy the run around
Mind your words please.
by making him run lots of random benchmarks. In the end, someone will run a timeit or have a specific case that shows the full effect. All of the respondents so far seem to have a clear intuition that hook is right in the middle of a critical path. Their intuition matches what I learned by spending a month trying to find ways to optimize dictionaries.
Am surprised that there has been no discussion of why this should be in the default build (as opposed to a compile time option). AFAICT, users have not previously requested a hook like this.
I may be partially to blame for this. John and Stephan are requesting this because it would (mostly) fulfill one of the top wishes of the users of Wingware. So the use case is certainly real.
Also, there has been no discussion for an overall strategy for monitoring containers in general. Lists and tuples will both defy this approach because there is so much code that accesses the arrays directly. Am not sure whether the setitem hook would work for other implementations either.
The primary use case is some kind of trap on assignment. While this cannot cover all cases, most non-local variables are stored in dicts. List mutations are not in the same league, as use case.
It seems weird to me that Collin's group can be working so hard just to get a percent or two improvement in specific cases for pickling while python-dev is readily entertaining a patch that slows down the entire language.
I don't actually believe that you can know whether this affects performance at all without serious benchmarking. The patch amounts to a single global flag check as long as the feature is disabled, and that flag could be read from the L1 cache.
If my thoughts on the subject bug you, I'll happily withdraw from the thread. I don't aspire to be a source of negativity. I just happen to think this proposal isn't a good idea.
I think we need more proof either way.
Raymond
----- Original Message ----- From: "Guido van Rossum"
To: "Raymond Hettinger" Cc: "Thomas Wouters" ; "John Ehresman" ; Sent: Thursday, April 02, 2009 2:19 PM Subject: Re: [Python-Dev] PyDict_SetItem hook Wow. Can you possibly be more negative?
2009/4/2 Raymond Hettinger
: The measurements are just a distractor. We all already know that the hook is being added to a critical path. Everyone will pay a cost for a feature that few people will use. This is a really bad idea. It is not part of a thorough, thought-out framework of container hooks (something that would need a PEP at the very least). The case for how it helps us is somewhat thin. The case for DTrace hooks was much stronger.
If something does go in, it should be #ifdef'd out by default. But then, I don't think it should go in at all.
Raymond
On Thu, Apr 2, 2009 at 04:16, John Ehresman
wrote: Collin Winter wrote:
Have you measured the impact on performance?
I've tried to test using pystone, but am seeing more differences between runs than there are between python w/ the patch and w/o when there is no hook installed. The highest pystone is actually from the binary w/ the patch, which I don't really believe unless it's some low-level code generation effect. The cost is one test of a global variable and then a switch to the branch that doesn't call the hooks.
I'd be happy to try to come up with better numbers next week after I get home from pycon.
Pystone is pretty much a useless benchmark. If it measures anything, it's the speed of the bytecode dispatcher (and it doesn't measure it particularly well.) PyBench isn't any better, in my experience. Collin has collected a set of reasonable benchmarks for Unladen Swallow, but they still leave a lot to be desired. From the discussions at the VM and Language summits before PyCon, I don't think anyone else has better benchmarks, though, so I would suggest using Unladen Swallow's: http://code.google.com/p/unladen-swallow/wiki/Benchmarks
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Thu, Apr 2, 2009 at 5:57 PM, Guido van Rossum
On Thu, Apr 2, 2009 at 3:07 PM, Raymond Hettinger
wrote: Wow. Can you possibly be more negative?
I think it's worse to give the poor guy the run around
Mind your words please.
by making him run lots of random benchmarks. In the end, someone will run a timeit or have a specific case that shows the full effect. All of the respondents so far seem to have a clear intuition that hook is right in the middle of a critical path. Their intuition matches what I learned by spending a month trying to find ways to optimize dictionaries.
Am surprised that there has been no discussion of why this should be in the default build (as opposed to a compile time option). AFAICT, users have not previously requested a hook like this.
I may be partially to blame for this. John and Stephan are requesting this because it would (mostly) fulfill one of the top wishes of the users of Wingware. So the use case is certainly real.
Also, there has been no discussion for an overall strategy for monitoring containers in general. Lists and tuples will both defy this approach because there is so much code that accesses the arrays directly. Am not sure whether the setitem hook would work for other implementations either.
The primary use case is some kind of trap on assignment. While this cannot cover all cases, most non-local variables are stored in dicts. List mutations are not in the same league, as use case.
It seems weird to me that Collin's group can be working so hard just to get a percent or two improvement in specific cases for pickling while python-dev is readily entertaining a patch that slows down the entire language.
I don't actually believe that you can know whether this affects performance at all without serious benchmarking. The patch amounts to a single global flag check as long as the feature is disabled, and that flag could be read from the L1 cache.
When I was optimizing the tracing support in the eval loop, we started with two memory loads and an if test. Removing the whole thing saved about 3% of runtime, although I think that had been as high as 5% when Neal measured it a year before. (That indicates that the exact arrangement of the code can affect performance in subtle and annoying ways.) Removing one of the two loads saved about 2% of runtime. I don't remember exactly which benchmark that was; it may just have been pybench.

Here, we're talking about introducing a load+if in dicts, which is less critical than the eval loop, so I'd guess that the effect will be less than 2% overall. I do think the real-life benchmarks are worth getting for this, but they may not predict the effect after other code changes. And I don't really have an opinion on what performance hit for normal use is worth better debugging.
If my thoughts on the subject bug you, I'll happily withdraw from the thread. I don't aspire to be a source of negativity. I just happen to think this proposal isn't a good idea.
I think we need more proof either way.
Raymond
----- Original Message ----- From: "Guido van Rossum"
To: "Raymond Hettinger" Cc: "Thomas Wouters" ; "John Ehresman" ; Sent: Thursday, April 02, 2009 2:19 PM Subject: Re: [Python-Dev] PyDict_SetItem hook Wow. Can you possibly be more negative?
2009/4/2 Raymond Hettinger
: The measurements are just a distractor. We all already know that the hook is being added to a critical path. Everyone will pay a cost for a feature that few people will use. This is a really bad idea. It is not part of a thorough, thought-out framework of container hooks (something that would need a PEP at the very least). The case for how it helps us is somewhat thin. The case for DTrace hooks was much stronger.
If something does go in, it should be #ifdef'd out by default. But then, I don't think it should go in at all.
Raymond
On Thu, Apr 2, 2009 at 04:16, John Ehresman
wrote: Collin Winter wrote:
Have you measured the impact on performance?
I've tried to test using pystone, but am seeing more differences between runs than there are between python w/ the patch and w/o when there is no hook installed. The highest pystone is actually from the binary w/ the patch, which I don't really believe unless it's some low-level code generation effect. The cost is one test of a global variable and then a switch to the branch that doesn't call the hooks.
I'd be happy to try to come up with better numbers next week after I get home from pycon.
Pystone is pretty much a useless benchmark. If it measures anything, it's the speed of the bytecode dispatcher (and it doesn't measure it particularly well.) PyBench isn't any better, in my experience. Collin has collected a set of reasonable benchmarks for Unladen Swallow, but they still leave a lot to be desired. From the discussions at the VM and Language summits before PyCon, I don't think anyone else has better benchmarks, though, so I would suggest using Unladen Swallow's: http://code.google.com/p/unladen-swallow/wiki/Benchmarks
On 3 Apr, 2009, at 0:57, Guido van Rossum wrote:
The primary use case is some kind of trap on assignment. While this cannot cover all cases, most non-local variables are stored in dicts. List mutations are not in the same league, as use case.
I have a slightly different use case than a debugger, although it boils down to "some kind of trap on assignment": implementing Key-Value Observing support for Python objects in PyObjC. "Key-Value Observing" is a technique in Cocoa where you can get callbacks when a property of an object changes, and it is something I cannot support for plain Python objects at the moment due to the lack of a callback mechanism. A full implementation would require hooks for mutation of lists and sets as well.

The lack of mutation hooks is not a terrible problem for PyObjC (we can always use Cocoa data structures when using KVO), but it is somewhat annoying that Cocoa data structures leak into code that could be pure Python just because I want to use KVO.

Ronald
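For readers unfamiliar with the pattern, a minimal sketch of the kind of observation Ronald describes, using __setattr__. This is not PyObjC's KVO API (Observable and observe are invented names); it only works for classes you control, and it misses writes that go straight to instance.__dict__ -- which is the gap a dict-level hook would close:

    # Illustrative only: Observable/observe are made-up names, not PyObjC API.

    class Observable(object):
        def __init__(self):
            # bypass our own __setattr__ while initializing
            object.__setattr__(self, "_observers", {})

        def observe(self, name, callback):
            # callback(obj, name, old, new) fires after the attribute changes
            self._observers.setdefault(name, []).append(callback)

        def __setattr__(self, name, value):
            old = getattr(self, name, None)
            object.__setattr__(self, name, value)
            for callback in self._observers.get(name, ()):
                callback(self, name, old, value)


    def on_change(obj, name, old, new):
        print("%s changed from %r to %r" % (name, old, new))


    p = Observable()
    p.observe("x", on_change)
    p.x = 1              # prints: x changed from None to 1
    p.__dict__["x"] = 2  # silent: bypasses __setattr__, the case a dict hook would catch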
I think it's worse to give the poor guy the run around by making him run lots of random benchmarks.
"the poor guy" works for Wingware (a company you may have heard of) and has contributed to Python at several occasions. His name is John Ehresmann.
In the end, someone will run a timeit or have a specific case that shows the full effect. All of the respondents so far seem to have a clear intuition that hook is right in the middle of a critical path. Their intuition matches what I learned by spending a month trying to find ways to optimize dictionaries.
Ok, so add me as a respondent who thinks that this deserves to be added despite being in the critical path. I doubt it will be noticeable in practice.
Am surprised that there has been no discussion of why this should be in the default build (as opposed to a compile time option).
Because, as a compile time option, it will be useless. It's not targeted for people who want to work on the Python VM (who are the primary users of compile time options), but for people developing Python applications.
AFAICT, users have not previously requested a hook like this.
That's because debugging Python in general is in a sad state (which, in turn, is because you can get very far with just print calls).
Also, there has been no discussion for an overall strategy for monitoring containers in general. Lists and tuples will both defy this approach because there is so much code that accesses the arrays directly.
Dicts are special because they are used to implement namespaces. Watchpoints are an incredibly useful debugging aid.
Am not sure whether the setitem hook would work for other implementations either.
I can't see why it shouldn't.
If my thoughts on the subject bug you, I'll happily withdraw from the thread. I don't aspire to be a source of negativity. I just happen to think this proposal isn't a good idea.
As somebody who has worked a lot on performance, I'm puzzled by how easily you judge the performance impact of a patch without having seen any benchmarks. If I have learned anything about performance, it is this: never guess the performance aspects of code without benchmarking.

Regards,
Martin
Just want to reply quickly because I'm traveling -- I appreciate the feedback from Raymond and others. Part of the reason I created an issue with a proof of concept patch is to get this kind of feedback. I also agree that this shouldn't go in if it slows things down noticeably. I will do some benchmarking and look at the dtrace patches next week to see if there is some sort of more systematic way of adding these types of hooks. John
Thomas Wouters
Pystone is pretty much a useless benchmark. If it measures anything, it's the speed of the bytecode dispatcher (and it doesn't measure it particularly well.) PyBench isn't any better, in my experience.

I don't think pybench is useless. It gives a lot of performance data about crucial internal operations of the interpreter. It is of course very little real-world, but conversely makes you know immediately where a performance regression has happened. (by contrast, if you witness a regression in a high-level benchmark, you still have a lot of investigation to do to find out where exactly something bad happened)

Perhaps someone should start maintaining a suite of benchmarks, high-level and low-level; we currently have them all scattered around (pybench, pystone, stringbench, richard, iobench, and the various Unladen Swallow benchmarks; not to mention other third-party stuff that can be found in e.g. the Computer Language Shootout). I also know Gregory P. Smith had emitted the idea of plotting benchmark figures for each new revision of trunk or py3k (and, perhaps, other implementations), but I don't know if he's willing to do it himself :-)

Regards

Antoine.
On Fri, Apr 3, 2009 at 11:27, Antoine Pitrou
Thomas Wouters
writes: Pystone is pretty much a useless benchmark. If it measures anything, it's
the speed of the bytecode dispatcher (and it doesn't measure it particularly well.) PyBench isn't any better, in my experience.
I don't think pybench is useless. It gives a lot of performance data about crucial internal operations of the interpreter. It is of course very little real-world, but conversely makes you know immediately where a performance regression has happened. (by contrast, if you witness a regression in a high-level benchmark, you still have a lot of investigation to do to find out where exactly something bad happened)
Really? Have you tried it? I get at least 5% noise between runs without any changes. I have gotten results that include *negative* run times. And yes, I tried all the different settings for calibration runs and timing mechanisms.

The tests in PyBench are not micro-benchmarks (they do way too much for that), they don't try to minimize overhead or noise, but they are also not representative of real-world code. That doesn't just mean "you can't infer the affected operation from the test name", but "you can't infer anything." You can just be looking at differently borrowed runtime.

I have in the past written patches to Python that improved *every* micro-benchmark and *every* real-world measurement I made, except PyBench. Trying to pinpoint the slowdown invariably led to tests that did too much in the measurement loop, introduced too much noise in the "calibration" run, or just spent their time *in the measurement loop* on doing setup and teardown of the test. Collin and Jeffrey have seen the exact same thing since starting work on Unladen Swallow.

So, sure, it might be "useful" if you have 10% or more difference across the board, and if you don't have access to anything but pybench and pystone.
Perhaps someone should start maintaining a suite of benchmarks, high-level and low-level; we currently have them all scattered around (pybench, pystone, stringbench, richard, iobench, and the various Unladen Swallow benchmarks; not to mention other third-party stuff that can be found in e.g. the Computer Language Shootout).
That's exactly what Collin proposed at the summits last week. Have you seen http://code.google.com/p/unladen-swallow/wiki/Benchmarks ? Please feel free to suggest more benchmarks to add :)
--
Thomas Wouters
Thomas Wouters
Really? Have you tried it? I get at least 5% noise between runs without any changes. I have gotten results that include *negative* run times.

That's an implementation problem, not an issue with the tests themselves. Perhaps a better timing mechanism could be inspired from the timeit module. Perhaps the default numbers of iterations should be higher (many subtests run in less than 100ms on a modern CPU, which might be too low for accurate measurement). Perhaps the so-called "calibration" should just be disabled. etc.

The tests in PyBench are not micro-benchmarks (they do way too much for that),

Then I wonder what you call a micro-benchmark. Should it involve direct calls to low-level C API functions?
but they are also not representative of real-world code.
Representativity is not black or white. Is measuring Spitfire performance representative of the Genshi templating engine, or str.format-based templating? Regardless of the answer, it is still an interesting measurement.
That doesn't just mean "you can't infer the affected operation from the test name"
I'm not sure what you mean by that. If you introduce an optimization to make list comprehensions faster, it will certainly show up in the list comprehensions subtest, and probably in none of the other tests. Isn't it enough in terms of specificity? Of course, some optimizations are interpreter-wide, and then the breakdown into individual subtests is less relevant.
I have in the past written patches to Python that improved *every* micro-benchmark and *every* real-world measurement I made, except PyBench.
Well, I didn't claim that pybench measures /everything/. That's why we have other benchmarks as well (stringbench, iobench, whatever). It does test a bunch of very common operations which are important in daily use of Python. If some important operation is missing, it's possible to add a new test. Conversely, someone optimizing e.g. list comprehensions and trying to measure the impact using a set of so-called "real-world benchmarks" which don't involve any list comprehension in their critical path will not see any improvement in those "real-world benchmarks". Does it mean that the optimization is useless? No, certainly not. The world is not black and white.
That's exactly what Collin proposed at the summits last week. Have you seen http://code.google.com/p/unladen-swallow/wiki/Benchmarks
Yes, I've seen. I haven't tried it, I hope it can be run without installing the whole unladen-swallow suite? These are the benchmarks I've had a tendency to use depending on the issue at hand: pybench, richards, stringbench, iobench, binary-trees (from the Computer Language Shootout). And various custom timeit runs :-) Cheers Antoine.
On Fri, Apr 3, 2009 at 9:43 AM, Antoine Pitrou
Thomas Wouters
writes: Really? Have you tried it? I get at least 5% noise between runs without any
changes. I have gotten results that include *negative* run times.
That's an implementation problem, not an issue with the tests themselves. Perhaps a better timing mechanism could be inspired from the timeit module. Perhaps the default numbers of iterations should be higher (many subtests run in less than 100ms on a modern CPU, which might be too low for accurate measurement). Perhaps the so-called "calibration" should just be disabled. etc.
The tests in PyBench are not micro-benchmarks (they do way too much for that),
Then I wonder what you call a micro-benchmark. Should it involve direct calls to low-level C API functions?
I agree that a suite of microbenchmarks is supremely useful: I would very much like to be able to isolate, say, raise statement performance. PyBench suffers from implementation defects that in its current incarnation make it unsuitable for this, though:

- It does not effectively isolate component performance as it claims. When I was working on a change to BINARY_MODULO to make string formatting faster, PyBench would report that floating point math got slower, or that generator yields got slower. There is a lot of random noise in the results.

- We have observed overall performance swings of 10-15% between runs on the same machine, using the same Python binary. Using the same binary on the same unloaded machine should give as close an answer to 0% as possible.

- I wish PyBench actually did more isolation. Call.py:ComplexPythonFunctionCalls is on my mind right now; I wish it didn't put keyword arguments and **kwargs in the same microbenchmark.

- In experimenting with gcc 4.4's FDO support, I produced a training load that resulted in a 15-30% performance improvement (depending on benchmark) across all benchmarks. Using this trained binary, PyBench slowed down by 10%.

- I would like to see PyBench incorporate better statistics for indicating the significance of the observed performance difference.

I don't believe that these are insurmountable problems, though. A great contribution to Python performance work would be an improved version of PyBench that corrects these problems and offers more precise measurements. Is that something you might be interested in contributing to? As performance moves more into the wider consciousness, having good tools will become increasingly important.

Thanks,
Collin
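On the last point, a sketch of the kind of significance reporting this could mean: compare two sets of timings and report the relative difference with a rough confidence interval. The timings below are made up, and this is not perf.py's actual implementation, just an illustration of the idea:

    # Illustrative only; a crude normal-approximation interval,
    # assuming roughly normal noise and enough runs.

    import math

    def mean(xs):
        return sum(xs) / float(len(xs))

    def stddev(xs):
        m = mean(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

    def compare(base, new):
        mb, mn = mean(base), mean(new)
        # standard error of the difference of the two means
        se = math.sqrt(stddev(base) ** 2 / len(base) + stddev(new) ** 2 / len(new))
        diff = (mn - mb) / mb * 100.0
        margin = 1.96 * se / mb * 100.0  # ~95% interval
        print("change: %+.2f%% +/- %.2f%%" % (diff, margin))

    compare([1.02, 1.05, 1.01, 1.04, 1.03],   # made-up timings without the patch
            [1.04, 1.06, 1.03, 1.07, 1.05])   # made-up timings with the patch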
Collin Winter
- I wish PyBench actually did more isolation. Call.py:ComplexPythonFunctionCalls is on my mind right now; I wish it didn't put keyword arguments and **kwargs in the same microbenchmark.
Well, there is a balance to be found between having more subtests and keeping a reasonable total running time :-) (I have to plead guilty for ComplexPythonFunctionCalls, btw)
- I would like to see PyBench incorporate better statistics for indicating the significance of the observed performance difference.
I see you already have this kind of measurement in your perf.py script, would it be easy to port it? We could also discuss making individual tests longer (by changing the default "warp factor").
On Fri, Apr 3, 2009 at 10:50 AM, Antoine Pitrou
Collin Winter
writes: - I wish PyBench actually did more isolation. Call.py:ComplexPythonFunctionCalls is on my mind right now; I wish it didn't put keyword arguments and **kwargs in the same microbenchmark.
Well, there is a balance to be found between having more subtests and keeping a reasonable total running time :-) (I have to plead guilty for ComplexPythonFunctionCalls, btw)
Sure, there's definitely a balance to maintain. With perf.py, we're going down the road of having different tiers of benchmarks: the default set is the one we pay the most attention to, with other benchmarks available for benchmarking certain specific subsystems or workloads (like pickling list-heavy input data). Something similar could be done for PyBench, giving the user the option of increasing the level of detail (and run-time) as appropriate.
- I would like to see PyBench incorporate better statistics for indicating the significance of the observed performance difference.
I see you already have this kind of measurement in your perf.py script, would it be easy to port it?
Yes, it should be straightforward to incorporate these statistics into PyBench. In the same directory as perf.py, you'll find test_perf.py which includes tests for the stats functions we're using. Collin
On Fri, Apr 03, 2009, Collin Winter wrote:
I don't believe that these are insurmountable problems, though. A great contribution to Python performance work would be an improved version of PyBench that corrects these problems and offers more precise measurements. Is that something you might be interested in contributing to? As performance moves more into the wider consciousness, having good tools will become increasingly important.
GSoC work?

--
Aahz (aahz@pythoncraft.com)  <*>  http://www.pythoncraft.com/

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." --Brian W. Kernighan
On Sat, Apr 04, 2009 at 06:28:01AM -0700, Aahz wrote:
-> On Fri, Apr 03, 2009, Collin Winter wrote:
-> > I don't believe that these are insurmountable problems, though. A
-> > great contribution to Python performance work would be an improved
-> > version of PyBench that corrects these problems and offers more
-> > precise measurements. Is that something you might be interested in
-> > contributing to? As performance moves more into the wider
-> > consciousness, having good tools will become increasingly important.
->
-> GSoC work?

Alas, it's too late to submit new proposals; the deadline was yesterday. The next "Google gives us money to wrangle students into doing development" project will probably be GHOP for high school students, in the winter, although it has not been announced and may not happen.

cheers,
--titus

--
C. Titus Brown, ctb@msu.edu
On 2009-04-03 18:06, Thomas Wouters wrote:
On Fri, Apr 3, 2009 at 11:27, Antoine Pitrou
wrote: Thomas Wouters
writes: Pystone is pretty much a useless benchmark. If it measures anything, it's
the speed of the bytecode dispatcher (and it doesn't measure it particularly well.) PyBench isn't any better, in my experience.
I don't think pybench is useless. It gives a lot of performance data about crucial internal operations of the interpreter. It is of course very little real-world, but conversely makes you know immediately where a performance regression has happened. (by contrast, if you witness a regression in a high-level benchmark, you still have a lot of investigation to do to find out where exactly something bad happened)
Really? Have you tried it? I get at least 5% noise between runs without any changes. I have gotten results that include *negative* run times.
On which platform? pybench 2.0 works reasonably well on Linux and Windows, but of course can't do better than the timers available for those platforms. If you have e.g. NTP running and it uses wall clock timers, it is possible that you get negative round times. If you don't and still get negative round times, you have to change the test parameters (see below).
And yes, I tried all the different settings for calibration runs and timing mechanisms. The tests in PyBench are not micro-benchmarks (they do way too much for that), they don't try to minimize overhead or noise,
That is not true. They were written as micro-benchmarks and adjusted to have a high signal-noise ratio. For some operations this isn't easy to do, but I certainly tried hard to get the overhead low (note that the overhead is listed in the output). That said, please keep in mind that the settings in pybench were last adjusted some years ago to have the tests all run in more or less the same wall clock time. CPUs have evolved a lot since then and this shows.
but they are also not representative of real-world code.
True and they never were meant for that, since I was frustrated by other benchmarks at the time and the whole approach in general. Each of the tests checks one specific aspect of Python. If your application happens to use a lot of dictionary operations, you'll be mostly interested in those. If you do a lot of simple arithmetic, there's another test for that. On top of that the application is written to be easily extensible, so it's easy to add new tests specific to whatever application space you're after.
That doesn't just mean "you can't infer the affected operation from the test name", but "you can't infer anything." You can just be looking at differently borrowed runtime. I have in the past written patches to Python that improved *every* micro-benchmark and *every* real-world measurement I made, except PyBench. Trying to pinpoint the slowdown invariably lead to tests that did too much in the measurement loop, introduced too much noise in the "calibration" run or just spent their time *in the measurement loop* on doing setup and teardown of the test.
pybench calibrates itself to remove that kind of noise from the output. Each test has a .calibrate() method which does all the setup and teardown minus the actual benchmark operations. If you get wrong numbers, try adjusting the parameters and adding more "packets" of operations. Don't forget to adjust the version number so you don't compare apples and oranges, though.

Perhaps it's time to readjust the pybench parameters to today's CPUs.

--
Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Source (#1, Apr 03 2009)
On Fri, Apr 3, 2009 at 2:27 AM, Antoine Pitrou
Thomas Wouters
writes: Pystone is pretty much a useless benchmark. If it measures anything, it's the
speed of the bytecode dispatcher (and it doesn't measure it particularly well.) PyBench isn't any better, in my experience.
I don't think pybench is useless. It gives a lot of performance data about crucial internal operations of the interpreter. It is of course very little real-world, but conversely makes you know immediately where a performance regression has happened. (by contrast, if you witness a regression in a high-level benchmark, you still have a lot of investigation to do to find out where exactly something bad happened)
Perhaps someone should start maintaining a suite of benchmarks, high-level and low-level; we currently have them all scattered around (pybench, pystone, stringbench, richard, iobench, and the various Unladen Swallow benchmarks; not to mention other third-party stuff that can be found in e.g. the Computer Language Shootout).
Already in the works :) As part of the common standard library and test suite that we agreed on at the PyCon language summit last week, we're going to include a common benchmark suite that all Python implementations can share. This is still some months off, though, so there'll be plenty of time to bikeshed^Wrationally discuss which benchmarks should go in there. Collin
Collin Winter wrote:
On Fri, Apr 3, 2009 at 2:27 AM, Antoine Pitrou
wrote: Thomas Wouters
writes: Pystone is pretty much a useless benchmark. If it measures anything, it's the
speed of the bytecode dispatcher (and it doesn't measure it particularly well.) PyBench isn't any better, in my experience.
I don't think pybench is useless. It gives a lot of performance data about crucial internal operations of the interpreter. It is of course very little real-world, but conversely makes you know immediately where a performance regression has happened. (by contrast, if you witness a regression in a high-level benchmark, you still have a lot of investigation to do to find out where exactly something bad happened)
Perhaps someone should start maintaining a suite of benchmarks, high-level and low-level; we currently have them all scattered around (pybench, pystone, stringbench, richard, iobench, and the various Unladen Swallow benchmarks; not to mention other third-party stuff that can be found in e.g. the Computer Language Shootout).
Already in the works :)
As part of the common standard library and test suite that we agreed on at the PyCon language summit last week, we're going to include a common benchmark suite that all Python implementations can share. This is still some months off, though, so there'll be plenty of time to bikeshed^Wrationally discuss which benchmarks should go in there.
Where is the right place for us to discuss this common benchmark and test suite? As the benchmark is developed I would like to ensure it can run on IronPython. The test suite changes will need some discussion as well - Jython and IronPython (and probably PyPy) have almost identical changes to tests that currently rely on deterministic finalisation (reference counting) so it makes sense to test changes on both platforms and commit a single solution. Michael
Collin
-- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog
On Fri, Apr 3, 2009 at 10:28 AM, Michael Foord
Collin Winter wrote:
As part of the common standard library and test suite that we agreed on at the PyCon language summit last week, we're going to include a common benchmark suite that all Python implementations can share. This is still some months off, though, so there'll be plenty of time to bikeshed^Wrationally discuss which benchmarks should go in there.
Where is the right place for us to discuss this common benchmark and test suite?
As the benchmark is developed I would like to ensure it can run on IronPython.
The test suite changes will need some discussion as well - Jython and IronPython (and probably PyPy) have almost identical changes to tests that currently rely on deterministic finalisation (reference counting) so it makes sense to test changes on both platforms and commit a single solution.
I believe Brett Cannon is the best person to talk to about this kind of thing. I don't know that any common mailing list has been set up, though there may be and Brett just hasn't told anyone yet :) Collin
Collin Winter wrote:
On Fri, Apr 3, 2009 at 10:28 AM, Michael Foord
wrote: Collin Winter wrote:
As part of the common standard library and test suite that we agreed on at the PyCon language summit last week, we're going to include a common benchmark suite that all Python implementations can share. This is still some months off, though, so there'll be plenty of time to bikeshed^Wrationally discuss which benchmarks should go in there.
Where is the right place for us to discuss this common benchmark and test suite?
As the benchmark is developed I would like to ensure it can run on IronPython.
The test suite changes will need some discussion as well - Jython and IronPython (and probably PyPy) have almost identical changes to tests that currently rely on deterministic finalisation (reference counting) so it makes sense to test changes on both platforms and commit a single solution.
I believe Brett Cannon is the best person to talk to about this kind of thing. I don't know that any common mailing list has been set up, though there may be and Brett just hasn't told anyone yet :)
Collin
Which begs the question of whether we *should* have a separate mailing list. I don't think we discussed this specific point at the language summit - although it makes sense. Should we have a list specifically for the test / benchmarking work or would a more general implementations-sig be appropriate?

And is it really Brett who sets up mailing lists? My understanding is that he is pulling out of stuff for a while anyway, so that he can do Java / PhD type things... ;-)

Michael

--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog
On Fri, Apr 03, 2009 at 07:00:43PM +0100, Michael Foord wrote:
-> Collin Winter wrote:
-> > On Fri, Apr 3, 2009 at 10:28 AM, Michael Foord
-> > [...]

I vote for a separate mailing list -- 'python-tests'? -- but I don't know exactly how splintered to make the conversation. It probably belongs at python.org but if you want me to host it, I can.
C. Titus Brown wrote:
I vote for a separate mailing list -- 'python-tests'? -- but I don't know exactly how splintered to make the conversation. It probably belongs at python.org but if you want me to host it, I can.
If too many things get moved off to SIGs there won't be anything left for python-dev to talk about ;) (Although in this case it makes sense, as I expect there will be developers involved in alternate implementations that would like to be part of the test suite discussion without having to sign up for the rest of python-dev) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
Nick Coghlan
C. Titus Brown wrote:
I vote for a separate mailing list -- 'python-tests'? -- but I don't know exactly how splintered to make the conversation. It probably belongs at python.org but if you want me to host it, I can.
If too many things get moved off to SIGs there won't be anything left for python-dev to talk about ;)
There is already an stdlib-sig, which has been almost unused. Regards Antoine.
Antoine Pitrou wrote:
Nick Coghlan
writes: C. Titus Brown wrote:
I vote for a separate mailing list -- 'python-tests'? -- but I don't know exactly how splintered to make the conversation. It probably belongs at python.org but if you want me to host it, I can.
If too many things get moved off to SIGs there won't be anything left for python-dev to talk about ;)
There is already an stdlib-sig, which has been almost unused.
stdlib-sig isn't *quite* right (the testing and benchmarking are as much about core python as the stdlib) - although we could view the benchmarks and tests themselves as part of the standard library... Either way we should get it underway. Collin and Jeffrey - happy to use stdlib-sig? Michael
Regards
Antoine.
-- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog
On Sat, Apr 4, 2009 at 7:33 AM, Michael Foord
Antoine Pitrou wrote:
Nick Coghlan
writes: C. Titus Brown wrote:
I vote for a separate mailing list -- 'python-tests'? -- but I don't know exactly how splintered to make the conversation. It probably belongs at python.org but if you want me to host it, I can.
If too many things get moved off to SIGs there won't be anything left for python-dev to talk about ;)
There is already an stdlib-sig, which has been almost unused.
stdlib-sig isn't *quite* right (the testing and benchmarking are as much about core python as the stdlib) - although we could view the benchmarks and tests themselves as part of the standard library...
Either way we should get it underway. Collin and Jeffrey - happy to use stdlib-sig?
Works for me. Collin
On Fri, Apr 3, 2009 at 12:28 PM, Michael Foord
Collin Winter wrote:
On Fri, Apr 3, 2009 at 2:27 AM, Antoine Pitrou
wrote: Thomas Wouters
writes: Pystone is pretty much a useless benchmark. If it measures anything, it's the
speed of the bytecode dispatcher (and it doesn't measure it particularly well.) PyBench isn't any better, in my experience.
I don't think pybench is useless. It gives a lot of performance data about crucial internal operations of the interpreter. It is of course very little real-world, but conversely makes you know immediately where a performance regression has happened. (by contrast, if you witness a regression in a high-level benchmark, you still have a lot of investigation to do to find out where exactly something bad happened)
Perhaps someone should start maintaining a suite of benchmarks, high-level and low-level; we currently have them all scattered around (pybench, pystone, stringbench, richard, iobench, and the various Unladen Swallow benchmarks; not to mention other third-party stuff that can be found in e.g. the Computer Language Shootout).
Already in the works :)
As part of the common standard library and test suite that we agreed on at the PyCon language summit last week, we're going to include a common benchmark suite that all Python implementations can share. This is still some months off, though, so there'll be plenty of time to bikeshed^Wrationally discuss which benchmarks should go in there.
Where is the right place for us to discuss this common benchmark and test suite?
Dunno. Here, by default, but I'd subscribe to a tests-sig or commonlibrary-sig or benchmark-sig if one were created.
As the benchmark is developed I would like to ensure it can run on IronPython.
We want to ensure the same thing for the current unladen swallow suite. If you find ways it currently doesn't, send us patches (until we get it moved to the common library repository, at which point you'll be able to submit changes yourself). You should be able to check out http://unladen-swallow.googlecode.com/svn/tests independently of the rest of the repository. Follow the instructions at http://code.google.com/p/unladen-swallow/wiki/Benchmarks to run benchmarks through perf.py. You'll probably want to select benchmarks individually rather than accepting the default of "all" because it's currently not very resilient to tests that don't run on one of the comparison pythons. Personally, I'd be quite happy moving our performance tests into the main python repository before the big library+tests move, but I don't know what directory to put it in, and I don't know what Collin+Thomas think of that.
The test suite changes will need some discussion as well - Jython and IronPython (and probably PyPy) have almost identical changes to tests that currently rely on deterministic finalisation (reference counting) so it makes sense to test changes on both platforms and commit a single solution.
IMHO, any place in the test suite that relies on deterministic finalization but isn't explicitly testing that CPython-specific feature is a bug and should be fixed, even before we export it to the new repository. Jeffrey
John Ehresman wrote:
* To what extent should non-debugger code use the hook? At one end of the spectrum, the hook could be made readily available for non-debug use and at the other end, it could be documented as being debug only, disabled in python -O, & not exposed in the stdlib to python code.
To explain Collin's mail: Python's dict implementation is crucial to the performance of any Python program. Modules, types, instances all rely on the speed of Python's dict type because most of them use a dict to store their name space. Even the smallest change to the C code may lead to a severe performance penalty. This is especially true for set and get operations.
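A quick illustration of Christian's point (nothing here is specific to the patch; it just shows that the namespaces in question really are plain dicts, so most attribute and global assignments end up as dict stores):

    import types

    mod = types.ModuleType("example")

    class C(object):
        pass

    obj = C()

    print(type(mod.__dict__))    # dict: module globals live in a dict
    print(type(obj.__dict__))    # dict: instance attributes live in a dict

    obj.attr = 1                 # this attribute store ends up as a dict store
    print(obj.__dict__)          # {'attr': 1}

    globals()["g"] = 2           # module globals are just a dict, too
    print(g)                     # 2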
John Ehresman wrote:
* To what extent should non-debugger code use the hook? At one end of the spectrum, the hook could be made readily available for non-debug use and at the other end, it could be documented as being debug only, disabled in python -O, & not exposed in the stdlib to python code.
To explain Collin's mail: Python's dict implementation is crucial to the performance of any Python program. Modules, types, instances all rely on the speed of Python's dict type because most of them use a dict to store their name space. Even the smallest change to the C code may lead to a severe performance penalty. This is especially true for set and get operations.
See my comments in http://bugs.python.org/issue5654 Raymond
On Thu, Apr 2, 2009 at 03:23, Christian Heimes
John Ehresman wrote:
* To what extent should non-debugger code use the hook? At one end of the spectrum, the hook could be made readily available for non-debug use and at the other end, it could be documented as being debug only, disabled in python -O, & not exposed in the stdlib to python code.
To explain Collin's mail: Python's dict implementation is crucial to the performance of any Python program. Modules, types, instances all rely on the speed of Python's dict type because most of them use a dict to store their name space. Even the smallest change to the C code may lead to a severe performance penalty. This is especially true for set and get operations.
A change that would have no performance impact could be to set mp->ma_lookup to another function that calls all the hooks it wants before calling the "super()" method (lookdict). This ma_lookup is already an attribute of every dict, so a debugger could trace only the namespaces it monitors.

The only problem here is that ma_lookup is called with the key and its hash, but not with the value, and you cannot know whether you are reading or setting the dict. It is easy to add an argument and call ma_lookup with the value (or NULL, or -1 depending on the action: set, get or del), but this may have a slight impact (benchmark needed!) even if this argument is not used by the standard function.

--
Amaury Forgeot d'Arc
participants (16)

- "Martin v. Löwis"
- Aahz
- Amaury Forgeot d'Arc
- Antoine Pitrou
- C. Titus Brown
- Christian Heimes
- Collin Winter
- Guido van Rossum
- Jeffrey Yasskin
- John Ehresman
- M.-A. Lemburg
- Michael Foord
- Nick Coghlan
- Raymond Hettinger
- Ronald Oussoren
- Thomas Wouters