[Python-Dev] PyDict_SetItem hook

Fri Apr 3 18:43:58 CEST 2009

Thomas Wouters <thomas <at> python.org> writes:
> 
> Really? Have you tried it? I get at least 5% noise between runs without any
> changes. I have gotten results that include *negative* run times.

That's an implementation problem, not an issue with the tests themselves.
Perhaps a better timing mechanism could be borrowed from the timeit module.
Perhaps the default number of iterations should be higher (many subtests run
in less than 100ms on a modern CPU, which might be too low for accurate
measurement). Perhaps the so-called "calibration" should just be disabled.
And so on.
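
To make the idea concrete, here is a rough sketch of a timeit-style measurement.
It is not pybench's code; the measure() helper and its min_time and repeat
parameters are made up for illustration. It grows the loop count until a single
run lasts long enough to be measurable, then keeps the best of several runs.

```python
import timeit

def measure(stmt, setup="pass", min_time=0.2, repeat=5):
    """Return (loops, best seconds per loop) for stmt.

    Hypothetical helper, not part of pybench or timeit.
    """
    timer = timeit.Timer(stmt, setup=setup)
    number = 1
    # Grow the loop count until one run lasts at least min_time seconds,
    # so that the timer resolution no longer dominates the result.
    while timer.timeit(number) < min_time:
        number *= 10
    # Repeat the measurement and keep the minimum: the fastest run is the
    # one least disturbed by other activity on the machine.
    totals = timer.repeat(repeat=repeat, number=number)
    return number, min(totals) / number

if __name__ == "__main__":
    loops, per_loop = measure("[x * 2 for x in data]",
                              setup="data = list(range(100))")
    print("%d loops, best of 5: %.3f usec per loop" % (loops, per_loop * 1e6))
```

Taking the minimum rather than the mean is the same choice the timeit
documentation recommends, since higher values mostly reflect interference from
other processes.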

> The tests in PyBench are not micro-benchmarks (they do way too much for
> that),

Then I wonder what you call a micro-benchmark. Should it involve direct calls
to low-level C API functions?

> but they are also not representative of real-world code.

Representativeness is not black or white. Is measuring Spitfire performance
representative of the Genshi templating engine, or str.format-based templating?
Regardless of the answer, it is still an interesting measurement.

> That doesn't just mean "you can't infer the affected operation from the test
> name"

I'm not sure what you mean by that. If you introduce an optimization to make
list comprehensions faster, it will certainly show up in the list
comprehensions subtest, and probably in none of the other tests. Isn't that
specific enough?

Of course, some optimizations are interpreter-wide, and then the breakdown into
individual subtests is less relevant.

> I have in the past written patches to Python that improved *every*
> micro-benchmark and *every* real-world measurement I made, except PyBench.

Well, I didn't claim that pybench measures /everything/. That's why we have
other benchmarks as well (stringbench, iobench, whatever).
It does test a bunch of very common operations which are important in daily use
of Python. If some important operation is missing, it's possible to add a new
test.

Conversely, someone optimizing e.g. list comprehensions and trying to measure
the impact using a set of so-called "real-world benchmarks" which don't involve
any list comprehension in their critical path will not see any improvement in
those "real-world benchmarks". Does it mean that the optimization is useless?
No, certainly not. The world is not black and white.

> That's exactly what Collin proposed at the summits last week. Have you seen
> http://code.google.com/p/unladen-swallow/wiki/Benchmarks

Yes, I've seen it. I haven't tried it yet; I hope it can be run without
installing the whole unladen-swallow suite?

These are the benchmarks I've had a tendency to use depending on the issue at
hand: pybench, richards, stringbench, iobench, binary-trees (from the Computer
Language Shootout). And various custom timeit runs :-)
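
For what it's worth, a custom timeit run of that kind could look roughly like
the sketch below. The two statements being compared and the compare() helper
are only examples chosen for illustration, not something taken from any of the
benchmarks listed above.

```python
import timeit

# Illustrative setup and candidate statements (assumptions, pick your own).
SETUP = "data = list(range(1000))"
CANDIDATES = {
    "list comprehension": "[x + 1 for x in data]",
    "map with lambda": "list(map(lambda x: x + 1, data))",
}

def compare(candidates=CANDIDATES, setup=SETUP, number=2000, repeat=5):
    """Return {name: best seconds per loop} for each candidate statement."""
    results = {}
    for name, stmt in candidates.items():
        # One total time per repetition; keep the least disturbed one.
        totals = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
        results[name] = min(totals) / number
    return results

if __name__ == "__main__":
    # Print the candidates from fastest to slowest.
    for name, secs in sorted(compare().items(), key=lambda kv: kv[1]):
        print("%-20s %8.2f usec per loop" % (name, secs * 1e6))
```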

Cheers

Antoine.