Re: [pypy-dev] benchmarking input

Since everyone but me seemed to want to take the discussion on IRC instead, I propose we meet up on #pypy-sync on Monday at 14:00. /Anders

On Sep 25, 2009, at 12:23 PM, Anders Hammarquist wrote:
I want to know why PyPy doesn't use the Unladen Swallow benchmarks to complement the ones already there, and maybe reuse and extend their reporting tools. This could make comparing results easier and divide up the work of creating comprehensive benchmarks for Python. I could wait to ask this on Monday, but email is interesting in that people can take their time to answer things. -- Leonardo Santagada santagada at gmail.com

Hi Leonardo, On Fri, Sep 25, 2009 at 06:52:44PM -0300, Leonardo Santagada wrote:
A number of benchmarks are not applicable to us, or they are uninteresting at this point (e.g. pickling, regexp, or just microbenchmarks...). That would leave 2 usable benchmarks, at a first glance: 'ai', and possibly 'spitfire/slowspitfire'. (Btw, I wonder why they think that richards is "too artificial" when they include a number of microbenchmarks that look far more artificial to me...) A bientot, Armin.

On Sep 27, 2009, at 12:47 PM, Armin Rigo wrote:
Uninteresting for benchmarking the JIT, but important for Python users.
That would leave 2 usable benchmarks, at a first glance: 'ai', and possibly 'spitfire/slowspitfire'.
The Django one is also interesting.
I thought that too... maybe just adding richards is okay; they can discard the results if they want. I think the way to go is talking to them and adding to their benchmarks. Maybe creating a Python benchmark project on Google Code, to be moved to python.org together with the stdlib separation, would be a good way to bring the community together. Using the same benchmark framework could help both PyPy (they already process the benchmarks and do a form of reporting) and Unladen Swallow (probably all the benchmarks that PyPy adds can show possible problems for their JIT). If you would like to try this course I could talk to the guys there about making a separate project... maybe even start sharing stdlib tests like you talked about at PyCon 09. -- Leonardo Santagada santagada at gmail.com

Hi Benjamin, On Sun, Sep 27, 2009 at 09:49:28PM -0500, Benjamin Peterson wrote:
Well, at the moment it's not actually possible to use the Unladen Swallow benchmarks because the JIT has no thread support.
That can be a confusing statement. I added thread support to asmgcc a short while ago, so our JIT "mostly" supports threads, where "mostly" is defined as "yes it should support threads but we never really tried". A bientot, Armin.

And benchmarking the JIT is what we're actually doing.
The "django" benchmark is more of a dummy template-generation loop than real Django; you can probably reduce it to something as advanced as a dictionary lookup plus string concatenation in a loop.
I think richards does not reflect what they do at Google (like pickling :-)
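A rough sketch of the reduction described above - essentially dictionary lookups plus string concatenation in a loop. The function and data below are hypothetical, made up only to illustrate the shape of the workload, not taken from the spitfire benchmark itself:

    # Hypothetical reduction of a template-generation benchmark:
    # dictionary lookups plus string concatenation in a loop.
    import time

    def render(rows, columns):
        out = []
        for row in rows:
            parts = []
            for col in columns:
                parts.append(str(row[col]))   # dictionary lookup
            out.append(" ".join(parts))       # string concatenation
        return "\n".join(out)

    if __name__ == "__main__":
        columns = ["a", "b", "c"]
        rows = [{"a": i, "b": i * 2, "c": i * 3} for i in range(100000)]
        start = time.time()
        render(rows, columns)
        print("rendered in %.3fs" % (time.time() - start))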

On Mon, Sep 28, 2009 at 8:48 AM, Maciej Fijalkowski <fijall@gmail.com> wrote:
I think richards does not reflect what they do at Google (like pickling :-)
Hi, yeah, their project goals are to speed up things that are interesting for them, not to speed up things in general - they want to speed up Google projects running on Python, i.e. mostly their own code. They also run a number of Python projects' test suites and use them to benchmark too; I imagine they benchmark Google apps internally as well. Their recent talk describes their recent work - and also the work of speedups like the wpython project:
http://unladen-swallow.googlecode.com/files/Unladen_Swallow_PyCon.pdf
wpython: http://code.google.com/p/wpython/
http://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-bas...
The programming language shootout tests would also seem a useful set of benchmarks for comparing against other languages, e.g.
http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=pypy&lang2=python&box=1
Note that they have old Pythons there... that is, CPython 2.5.2 and PyPy 1.1 in their comparisons. Cheers. P.S. happy happy JIT!! great work.

Hi. On Tue, Sep 29, 2009 at 4:58 AM, René Dudfield <renesd@gmail.com> wrote:
Thanks for the links.
I also have no clue how they did those benchmarks. I tried to repeat them, but was unable to get even close to those numbers. Also, I failed to report it to them, since you need to create an account there and account creation did not work too well... Cheers, fijal

Hi, On Fri, Sep 25, 2009 at 05:23:25PM +0200, Anders Hammarquist wrote:
Since everyone but me seemed to want to take the discussion on IRC instead, I propose we meet up on #pypy-sync on Monday at 14:00
Outcome of this meeting: as a first step, we will set up a system that runs nightly. It builds a pypy-c-jit, runs richards (only), and stores the result in a database. Then it extracts the results from the database and writes them as a static text or HTML file for a web server. We will then see where to go from there. I think it's important to store the results in a DB instead of just as a static text or HTML file, for future refactorings; this issue blocked us when developing the tuatara benchmarks. A bientot, Armin.
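A minimal sketch of the nightly flow described above, assuming SQLite as the database; the table layout, helper names and file paths are hypothetical, not the actual PyPy tooling:

    # Minimal sketch: store each nightly result in SQLite, then render a
    # static HTML report in a separate step. Schema and names are made up.
    import sqlite3
    import time

    def store_result(db_path, revision, benchmark, seconds):
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS results "
                    "(timestamp REAL, revision TEXT, benchmark TEXT, seconds REAL)")
        con.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
                    (time.time(), revision, benchmark, seconds))
        con.commit()
        con.close()

    def write_report(db_path, html_path):
        con = sqlite3.connect(db_path)
        rows = con.execute("SELECT timestamp, revision, benchmark, seconds "
                           "FROM results ORDER BY timestamp").fetchall()
        con.close()
        f = open(html_path, "w")
        f.write("<html><body><table>\n")
        for ts, rev, bench, secs in rows:
            f.write("<tr><td>%s</td><td>%s</td><td>%s</td><td>%.3f</td></tr>\n"
                    % (time.ctime(ts), rev, bench, secs))
        f.write("</table></body></html>\n")
        f.close()

    # e.g. store_result("benchmarks.db", "r12345", "richards", 4.21)
    #      write_report("benchmarks.db", "results.html")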

Hi Holger, On Mon, Sep 28, 2009 at 04:38:43PM +0200, holger krekel wrote:
why is a DB table format better for refactoring than a text file?
I suppose it isn't intrinsically. The point is that, as far as I know, none of our previous PyPy benchmarks even produced a file containing just the results in a form that can be nicely re-read by a program and refactored; they produced text or HTML files directly, with the results embedded in whatever presentation was deemed best at the time. Beyond that I have no preference for a DB format versus an easily re-parsable text file format. A bientot, Armin.

Personally I think the best reason for using a DB is that there is a ton of software that will help you read it back in a nice, objective way. With text files you instead need to write a parser/dumper, hence more work. Cheers, fijal On Wed, Sep 30, 2009 at 2:31 AM, Armin Rigo <arigo@tunes.org> wrote:
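For example, once the results sit in a database, reading them back is a query instead of a hand-written parser; a hypothetical query against the results table sketched earlier:

    # Hypothetical: compare the richards timings of the two most recent runs.
    import sqlite3

    con = sqlite3.connect("benchmarks.db")
    query = ("SELECT revision, seconds FROM results "
             "WHERE benchmark = 'richards' ORDER BY timestamp DESC LIMIT 2")
    for rev, secs in con.execute(query):
        print("%s: %.3fs" % (rev, secs))
    con.close()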

Hi Maciej, Armin, all, On Wed, Sep 30, 2009 at 02:46 -0600, Maciej Fijalkowski wrote:
All fine. As discussed with Miquel, I think it's best to offer scripts for injecting/retrieving data remotely; then I don't care whether it is stored in a DB, a key-value store, or text files in a filesystem. best, holger
-- Metaprogramming, Python, Testing: http://tetamap.wordpress.com Python, PyPy, pytest contracting: http://merlinux.eu

participants (7)
- Anders Hammarquist
- Armin Rigo
- Benjamin Peterson
- holger krekel
- Leonardo Santagada
- Maciej Fijalkowski
- René Dudfield