On Tue, Jan 31, 2012 at 15:04, Maciej Fijalkowski <fijall(a)gmail.com> wrote:
> On Tue, Jan 31, 2012 at 9:55 PM, Brett Cannon <brett(a)python.org> wrote:
> >
> >
> > On Tue, Jan 31, 2012 at 13:44, Maciej Fijalkowski <fijall(a)gmail.com>
> wrote:
> >>
> >> On Tue, Jan 31, 2012 at 7:40 PM, Brett Cannon <brett(a)python.org> wrote:
> >> >
> >> >
> >> > On Tue, Jan 31, 2012 at 11:58, Paul Graydon <paul(a)paulgraydon.co.uk>
> >> > wrote:
> >> >>
> >> >>
> >> >>> And this is a fundamental issue with tying benchmarks to real
> >> >>> applications and libraries; if the code the benchmark relies on never
> >> >>> changes to Python 3, then the benchmark is dead in the water. As
> >> >>> Daniel pointed out, if spitfire simply never converts then either we
> >> >>> need to convert them ourselves *just* for the benchmark (yuck), live
> >> >>> w/o the benchmark (ok, but if this happens to a bunch of benchmarks
> >> >>> then we are not going to have a lot of data), or we look at making new
> >> >>> benchmarks based on apps/libraries that _have_ made the switch to
> >> >>> Python 3 (which means trying to agree on some new set of benchmarks
> >> >>> to add to the current set).
> >> >>>
> >> >>>
> >> >> What are the criteria by which the original benchmark sets were
> >> >> chosen? I'm assuming it was because they're generally popular libraries
> >> >> amongst developers across a variety of purposes, so speed.pypy would
> >> >> show the speed of regular tasks?
> >> >
> >> >
> >> > That's the reason unladen swallow chose them, yes. PyPy then adopted
> >> > them
> >> > and added in the Twisted benchmarks.
> >> >
> >> >>
> >> >> If so, presumably it shouldn't be too hard to find appropriate
> >> >> libraries
> >> >> for Python 3?
> >> >
> >> >
> >> > Perhaps, but someone has to put in the effort to find those benchmarks,
> >> > code them up, show how they are a reasonable workload, and then get them
> >> > accepted. Everyone likes the current set because the unladen team put in
> >> > a lot of time and effort into selecting and creating those benchmarks.
> >>
> >> I think we also spent a significant amount of time grabbing various
> >> benchmarks from various places (we = people who contributed to the
> >> speed.pypy.org benchmark suite, which is by far not a group consisting
> >> only of pypy devs).
> >
> >
> > Where does the PyPy benchmark code live, anyway?
>
> http://bitbucket.org/pypy/benchmarks
>
> >
> >>
> >>
> >> You might be surprised, but the criteria we used were mostly
> >> "contributed benchmarks showing some sort of real workload". I don't
> >> think we ever *rejected* a benchmark, barring one case that was very
> >> variable and not very interesting (depending on HD performance).
> >> Some benchmarks were developed from "we know pypy is slow on this"
> >> scenarios as well.
> >
> >
> > Yeah, you and Alex have told me that in-person before.
> >
> >>
> >>
> >> The important part is that we also want "interesting" benchmarks to be
> >> included. This mostly means "run by someone somewhere", which includes
> >> a very broad category of things but *excludes* fibonacci, richards,
> >> pystone and stuff like this. I think it's fine if we have a benchmark
> >> that runs the Python 3 version of whatever is there, but this requires
> >> work. Is there someone willing to do that work?
> >
> >
> > Right, I'm not suggesting something as silly as fibonacci.
> >
> > I think we need to first decide which set of benchmarks we are using
> > since there is already divergence between what is on hg.python.org and
> > what is measured at speed.pypy.org (e.g. hg.python.org tests 2to3 while
> > pypy.org does not, and the reverse goes for twisted). Once we know what
> > set of benchmarks we care about (it can be a cross-section), then we need
> > to take a hard look at where we are coming up short for Python 3. But
> > from a python-dev perspective, benchmarks running against Python 2 are
> > not interesting since we are simply no longer developing performance
> > improvements for Python 2.7.
>
> 2to3 is essentially an oversight on the pypy side; we'll integrate it back.
> Other than that I think the pypy benchmarks are mostly a superset (there
> is also pickle and a bunch of pointless microbenchmarks).
>
I think pickle was mostly for unladen's pickle performance patches (try
saying that three times fast =), so I don't really care about that one.
Would it make sense to change the pypy repo to make the unladen_swallow
directory an external repo from hg.python.org/benchmarks? As it stands
right now there are two mako benchmarks that are not identical.
Otherwise we should talk at PyCon and figure this all out before we end up
with two divergent benchmark suites that are being independently maintained
(since we are all going to be running the same benchmarks on
speed.python.org).
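
For readers who haven't looked inside these suites, the "real workload"
point made above (as opposed to fibonacci/pystone-style toys) boils down to
the rough shape sketched below. This is only an illustrative stand-in: the
workload, names and iteration counts are invented, not taken from the actual
benchmark suite, but it shows the general idea of running the interesting
work many times and recording per-iteration wall-clock times for a runner to
aggregate.

import time

def render_page(rows=500):
    # Stand-in workload: build an HTML table the straightforward way.
    cells = ''.join('<td>row %d</td>' % i for i in range(rows))
    return '<table><tr>%s</tr></table>' % cells

def bench(iterations=50):
    # Record one wall-clock measurement per iteration of the real work.
    times = []
    for _ in range(iterations):
        start = time.time()
        render_page()
        times.append(time.time() - start)
    return times

if __name__ == '__main__':
    results = bench()
    print('min %.6fs, avg %.6fs' % (min(results), sum(results) / len(results)))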
Brett Cannon wrote:
[snip]
>
> BTW, which benchmark set are we talking about? speed.pypy.org runs a
> different set of benchmarks than the ones at
> http://hg.python.org/benchmarks. What set are we worrying about
I think we should aim for supporting the same set of benchmarks as PyPy,
since they have a nice historical record.
It also means we can have a meaningful comparison of the latest
version of CPython with the latest version of PyPy
(even if it is only 2.7 ;) ).
> porting here? If it's the latter we will have to wait a while for at
> least 20% of the benchmarks since they rely on Twisted (which is only 50%
> done according to http://twistedmatrix.com/trac/milestone/Python-3.x).
At least they are working on it. We may just have to be patient
and add in benchmarks one by one.
Cheers,
Mark.
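
To make the comparison Mark mentions concrete: the idea is simply to drive
the same benchmark scripts under both interpreters and compare the numbers.
A minimal sketch follows; it is not the actual runner from
hg.python.org/benchmarks or the pypy suite, and the interpreter paths and
script name below are placeholders.

import subprocess
import time

INTERPRETERS = ['/usr/bin/python2.7', '/opt/pypy/bin/pypy']  # placeholder paths
BENCHMARK = 'bm_example.py'  # placeholder benchmark script

def time_runs(interpreter, script, runs=5):
    # Run the script several times under the given interpreter and
    # collect a wall-clock time for each run.
    times = []
    for _ in range(runs):
        start = time.time()
        subprocess.check_call([interpreter, script])
        times.append(time.time() - start)
    return times

for interp in INTERPRETERS:
    results = time_runs(interp, BENCHMARK)
    print('%s: best of %d runs = %.3fs' % (interp, len(results), min(results)))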
On Mon, Jan 30, 2012 at 13:28, Maciej Fijalkowski <fijall(a)gmail.com> wrote:
> On Mon, Jan 30, 2012 at 7:56 PM, Brett Cannon <brett(a)python.org> wrote:
> >
> >
> > On Thu, Jan 26, 2012 at 15:21, Carsten Senger <senger(a)rehfisch.de> wrote:
> >>
> >> Hi everybody
> >>
> >> With the help of Maciej I worked on the buildbot in the last days. It
> >> can build cpython, run the benchmarks and upload the results to one or
> >> more codespeed instances. Maciej will look at the changes so we will
> >> hopefully have a working buildbot for python 2.7 in the next days.
> >>
> >> This has a ticket in pypy's bugtracker: https://bugs.pypy.org/issue1015
> >>
> >> I also have a script we can use to run the benchmarks for parts of the
> >> history and get data for a year or so into codespeed. The question is if
> >> this data is interesting to anyone.
> >
> >
> > I would say "don't worry about it unless you have some personal
> > motivation to want to bother". While trending data is interesting, it
> > isn't critical and a year will eventually pass anyway. =)
> >
> >>
> >>
> >>
> >> What are the plans for benchmarking python 3?
> >> How much of the benchmark suite will work with python 3, or can be made
> >> to work without much effort? Porting the runner and the support code is
> >> easy, but directly porting the benchmarks, including the libraries they
> >> use, seems unrealistic.
> >>
> >> Can we replace them with newer versions that support python3 to get some
> >> benchmarks working? Or build a second set of python3 compatible
> >> benchmarks with these newer versions?
> >>
> >
> > That's an open question. Until the libraries the benchmarks rely on get
> > ported officially, it's up in the air when the pre-existing benchmarks can
> > move. We might have to look at pulling in a new set to start and then add
> > back in the old ones (possibly) as they get ported.
>
> Changing benchmarks is *never* a good idea. Note that we have quite
> some history of those benchmarks running on pypy and I would strongly
> object to changing them in any way. Adding python 3 versions next to them
> is much better. Also, porting the runner etc. is not a very good idea, I
> think.
>
> The problem really is that most of the interesting benchmarks don't work
> on python 3, only the uninteresting ones. What are we going to do about
> that?
>
And this is a fundamental issue with tying benchmarks to real applications
and libraries; if the code the benchmark relies on never changes to Python
3, then the benchmark is dead in the water. As Daniel pointed out, if
spitfire simply never converts then either we need to convert them
ourselves *just* for the benchmark (yuck), live w/o the benchmark (ok, but
if this happens to a bunch of benchmarks then we are not going to have a
lot of data), or we look at making new benchmarks based on apps/libraries
that _have_ made the switch to Python 3 (which means trying to agree on
some new set of benchmarks to add to the current set).
BTW, which benchmark set are we talking about? speed.pypy.org runs a
different set of benchmarks than the ones at
http://hg.python.org/benchmarks. What set are we worrying about porting
here? If it's the latter we will have to wait a while for at least 20% of
the benchmarks since they rely on Twisted (which is only 50% done according
to http://twistedmatrix.com/trac/milestone/Python-3.x).
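
On the porting question itself, many pure-Python benchmark bodies only need
trivial syntactic changes. Below is a hedged sketch of one option, a small
compatibility shim so the same file runs on 2.7 and 3.x; whether the suite
goes this route or keeps separate Python 3 copies is exactly what is being
debated in this thread, and the names and numbers here are illustrative only.

import sys
import time

if sys.version_info[0] >= 3:
    xrange = range  # let an existing benchmark body keep using xrange

def bench_once(loops=200000):
    # Deliberately tiny body; a real suite benchmark would exercise a
    # library workload here instead.
    start = time.time()
    total = 0
    for i in xrange(loops):
        total += i & 0xff
    return total, time.time() - start

if __name__ == '__main__':
    _, elapsed = bench_once()
    print('one iteration: %.4fs' % elapsed)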
On Mon, Jan 30, 2012 at 9:56 AM, Brett Cannon <brett(a)python.org> wrote:
> That's an open question. Until the libraries the benchmarks rely on get
> ported officially, it's up in the air when the pre-existing benchmarks can
> move. We might have to look at pulling in a new set to start and then add
> back in the old ones (possibly) as they get ported.
>
+1
In particular, I don't think the Spitfire authors are working on a Python 3
port at all. If we have to wait for Spitfire, we may be waiting a very
long time.
--
Daniel Stutzbach
On Mon, Jan 30, 2012 at 7:56 PM, Brett Cannon <brett(a)python.org> wrote:
>
>
> On Thu, Jan 26, 2012 at 15:21, Carsten Senger <senger(a)rehfisch.de> wrote:
>>
>> Hi everybody
>>
>> With the help of Maciej I worked on the buildbot in the last days. It
>> can build cpython, run the benchmarks and upload the results to one or
>> more codespeed instances. Maciej will look at the changes so we will
>> hopefully have a working buildbot for python 2.7 in the next days.
>>
>> This has a ticket in pypy's bugtracker: https://bugs.pypy.org/issue1015
>>
>> I also have a script we can use to run the benchmarks for parts of the
>> history and get data for a year or so into codespeed. The question is if
>> this data is interesting to anyone.
>
>
> I would say "don't worry about it unless you have some personal motivation
> to want to bother". While trending data is interesting, it isn't critical
> and a year will eventually pass anyway. =)
>
>>
>>
>>
>> What are the plans for benchmarking python 3?
>> How much of the benchmark suite will work with python 3, or can be made
>> to work without much effort? Porting the runner and the support code is
>> easy, but directly porting the benchmarks, including the libraries they
>> use, seems unrealistic.
>>
>> Can we replace them with newer versions that support python3 to get some
>> benchmarks working? Or build a second set of python3 compatible
>> benchmarks with these newer versions?
>>
>
> That's an open question. Until the libraries the benchmarks rely on get
> ported officially, it's up in the air when the pre-existing benchmarks can
> move. We might have to look at pulling in a new set to start and then add
> back in the old ones (possibly) as they get ported.
Changing benchmarks is *never* a good idea. Note that we have quite
some history of those benchmarks running on pypy and I would strongly
object to changing them in any way. Adding python 3 versions next to them
is much better. Also, porting the runner etc. is not a very good idea, I
think.
The problem really is that most of the interesting benchmarks don't work
on python 3, only the uninteresting ones. What are we going to do about
that?
PS. I pulled your changes
Hi everybody
With the help of Maciej I worked on the buildbot in the last days. It
can build cpython, run the benchmarks and upload the results to one or
more codespeed instances. Maciej will look at the changes so we will
hopefully have a working buildbot for python 2.7 in the next days.
This has a ticket in pypy's bugtracker: https://bugs.pypy.org/issue1015
I also have a script we can use to run the benchmarks for parts of the
history and get data for a year or so into codespeed. The question is if
this data is interesting to anyone.
What are the plans for benchmarking python 3?
How much of the benchmark suite will work with python 3, or can be made to
work without much effort? Porting the runner and the support code is
easy, but directly porting the benchmarks, including the libraries they
use, seems unrealistic.
Can we replace them with newer versions that support python3 to get some
benchmarks working? Or build a second set of python3 compatible
benchmarks with these newer versions?
Are there other tasks for speed.python.org atm?
Cheers,
..Carsten
--
Carsten Senger - Schumannstr. 38 - 65193 Wiesbaden
senger(a)rehfisch.de - (0611) 5324176
PGP: gpg --recv-keys --keyserver hkp://subkeys.pgp.net 0xE374C75A
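
For context on the codespeed upload step Carsten describes: a stock
codespeed instance accepts results as a simple HTTP POST. The sketch below
shows the general idea only; the host, commit id, project, executable and
environment names are invented for illustration, and the exact field names
should be checked against the codespeed instance actually deployed for
speed.python.org.

# Python 2 stdlib, matching the 2.7 buildbot discussed above.
import urllib
import urllib2

data = {
    'commitid': 'a7c8f2e1d9b3',         # hypothetical CPython changeset id
    'branch': 'default',
    'project': 'CPython',
    'executable': 'cpython-2.7',
    'benchmark': 'django',
    'environment': 'speed-python-bot',  # must already exist in codespeed
    'result_value': 0.9122,             # e.g. seconds per iteration
}

params = urllib.urlencode(data)
# POSTing to <codespeed-host>/result/add/ records a single result.
response = urllib2.urlopen('http://speed.example.org/result/add/', params)
print(response.read())
response.close()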