Hi all,
The health of our buildbot fleet is frankly a bit depressing at the moment. Of the 44 buildslaves [1], 25 (!) are currently down. Several of the ones that are up are routinely failing some step, which may or may not be the fault of the slave itself.
I just wanted to touch base with everybody and ask that you give your slaves a quick once-over to make sure they're working properly, or give an update on why they may be down and when (or if) they can be expected to be back up. In cases where a slave is down for an extended period of time, I'd like to clean up the waterfall view by temporarily removing those builders (and in cases where a slave is down for good, I'd like to clean up the list of slaves as well).
If there's anything that can be done to help on the master side, let me know!
Thanks,
Zach
On Thu, Jun 11, 2015 at 3:08 PM, Zachary Ware <zachary.ware+pydev@gmail.com> wrote:
Hi all,
The health of our buildbot fleet is frankly a bit depressing at the moment. Of the 44 buildslaves [1], 25 (!) are currently down. Several of the ones that are up are routinely failing some step, which may or may not be the fault of the slave itself.
Hi Zach! Thanks for setting up this list. If nothing else, it's the obvious place to ask questions like this...
In terms of monitoring our slaves, is there any easy way to say "show me all the ones on this hardware"? Currently, I have a bookmarked page that looks like this:
(And I think that's out of date now, since there would be a 3.5 buildbot as well as 3.x.)
There's this page:
http://buildbot.python.org/all/buildslaves/angelico-debian-amd64
but I don't know of a simple way to ask for the waterfall view of all of those slaves.
ChrisA
On Thu, Jun 11, 2015 at 12:17 AM, Chris Angelico <rosuav@gmail.com> wrote:
In terms of monitoring our slaves, is there any easy way to say "show me all the ones on this hardware"? Currently, I have a bookmarked page that looks like this:
(And I think that's out of date now, since there would be a 3.5 buildbot as well as 3.x.)
And no longer a 3.3 :)
There's this page:
http://buildbot.python.org/all/buildslaves/angelico-debian-amd64
but I don't know of a simple way to ask for the waterfall view of all of those slaves.
I'm not sure if there is one; your bookmark may be the best you can get for that.
-- Zach
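One possible workaround is a small Python 3 script that asks the master which builders a given slave is attached to and then builds a filtered waterfall URL from that. This is only a sketch: it assumes the master's Buildbot 0.8.x JSON status API is available under /json/ and that the waterfall page filters on repeated "builder=" query arguments, neither of which I have verified against this particular master; the slave name is just an example.

# Sketch only: list the builders attached to one slave via the JSON status
# API, then print a waterfall URL filtered to those builders.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MASTER = "http://buildbot.python.org/all"
SLAVE = "angelico-debian-amd64"  # example slave name

with urlopen(MASTER + "/json/builders") as resp:
    builders = json.loads(resp.read().decode("utf-8"))

# Each builder entry is assumed to carry a "slaves" list of attached slave names.
mine = sorted(name for name, info in builders.items()
              if SLAVE in info.get("slaves", []))

print(MASTER + "/waterfall?" + urlencode([("builder", name) for name in mine]))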
On 11/06/2015 3:08 PM, Zachary Ware wrote:
Hi all,
The health of our buildbot fleet is frankly a bit depressing at the moment. Of the 44 buildslaves [1], 25 (!) are currently down. Several of the ones that are up are routinely failing some step, which may or may not be the fault of the slave itself.
I just wanted to touch base with everybody and ask that you give your slaves a quick once-over to make sure they're working properly, or give an update on why they may be down and when (or if) they can be expected to be back up. In cases where a slave is down for an extended period of time, I'd like to clean up the waterfall view by temporarily removing those builders (and in cases where a slave is down for good, I'd like to clean up the list of slaves as well).
If there's anything that can be done to help on the master side, let me know!
Thanks,
Thanks for setting this up, Zach.
I've kept my slaves (koobs-*) updated since they were brought online, including running the latest buildbot releases (currently latest), and updating to the latest branch versions (read: future next release) of FreeBSD. There have been no issues doing so.
I think there are a few things that can be done to improve the situation, both in the short and the longer term:
- Progressively update all buildbots to the latest buildbot version. This allows any new features/configurations to be used, with less risk of incompatible changes.
- Recreate the 'stable' builders list to account for buildslave fleet changes since it was last modified.
- Use the (new) 'stable' buildbot list to block releases (if it's not being done now, or not being observed), forcing failing tests to be fixed. "All Green or No-Go". This is critical.
OR,
- Remove the distinction between stable/unstable builders, remove unconnected / long-time-flaky slaves. The definition of flaky should be that the slave is broken, not the builds on the slave.
- Block releases if !All-Green
I'd go for first prize in this regard and remove as many of the distinctions between buildslaves as possible. It's not surprising that certain (many?) buildbots are disregarded as unimportant and ignored.
tl;dr: All buildbots should either be critical to release engineering and quality assurance, or be removed. We as buildbot providers should be held accountable for our part in that. It is up to Python (Core) to set the standard for what the expectation is.
Additionally:
Right now each OS/arch combination is a standalone bot/config and highly static in nature.
The biggest gain I can see is to evolve the master build configuration to:
Segment/class build configurations on the master to gain greater coverage of under-tested components and new build types. Some examples are:
- --shared builds vs non-shared builds
- using system ffi vs not (this might even help de-vendor libffi!)
- compiler: gcc vs clang (FreeBSD has both on 9.x)
- architecture builders (x86_64, x86-32, mips, arm, blah)
Python would benefit by:
- Allowing each buildslave to be used in multiple build classes
- Gaining greater coverage of build-related infrastructure (notoriously problematic)
- Getting a 'build class'-oriented view of build results, rather than just by OS.
Once a new builder class is created, it is then just a matter of adding the buildslaves that support that build type or feature set.
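To make the idea concrete, here is a rough sketch of how the master side might express it, written against the Buildbot 0.8.x configuration API. The class names, configure flags, branch list, and slave assignments below are illustrative assumptions only, not a proposal for the actual master.cfg:

# Sketch only: define "build classes" once, then generate one builder per
# (class, branch) and attach every slave that has opted in to that class.
from buildbot.config import BuilderConfig
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import Compile, Configure, ShellCommand

# class name -> extra ./configure arguments (standard CPython configure options)
BUILD_CLASSES = {
    "shared": ["--enable-shared"],
    "nonshared": [],
    "system-ffi": ["--with-system-ffi"],
    "clang": ["CC=clang"],
}

# which slaves have opted in to which classes (example data only)
CLASS_SLAVES = {
    "shared": ["koobs-freebsd-9", "angelico-debian-amd64"],
    "nonshared": ["koobs-freebsd-9", "angelico-debian-amd64"],
    "system-ffi": ["koobs-freebsd-9"],
    "clang": ["koobs-freebsd-9"],
}

def make_factory(extra_configure_args):
    f = BuildFactory()
    # source checkout step omitted for brevity; it would select the hg branch
    f.addStep(Configure(command=["./configure", "--with-pydebug"] + extra_configure_args))
    f.addStep(Compile(command=["make", "-j2"]))
    f.addStep(ShellCommand(name="test", command=["make", "buildbottest"]))
    return f

c['builders'] = [
    BuilderConfig(name="%s %s" % (cls, branch),
                  slavenames=CLASS_SLAVES[cls],
                  factory=make_factory(args),
                  category=cls)
    for cls, args in sorted(BUILD_CLASSES.items())
    for branch in ("2.7", "3.5", "3.x")
]

Using the class name as the builder category should then allow a per-class view in the web UI (e.g. via the waterfall's category filter), though I haven't checked how well that scales on the current master.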
I'm on IRC (koobs @ #python-dev freenode) if anyone wants to chat further about these and other ideas.
-- Regards,
Kubilay FreeBSD/Python
On Thu, 11 Jun 2015 15:45:45 +1000, Kubilay Kocak <koobs@FreeBSD.org> wrote:
I've kept my slaves (koobs-*) updated since they were brought online, including running the latest buildbot releases (currently latest), and updating to the latest branch versions (read: future next release) of FreeBSD. There have been no issues doing so.
Yes, thank you for your great work with your slaves.
- Progressively update all buildbots to the latest buildbot version. This allows any new features/configurations to be used, with less risk of incompatible changes.
We should at least be coordinating on what minimum version is running on the stable set... on the other hand, I don't think we've upgraded the master in a while :)
- Recreate the 'stable' builders list to account for buildslave fleet changes since it was last modified.
- Use the (new) 'stable' buildbot list to block releases (if it's not being done now, or not being observed), forcing failing tests to be fixed. "All Green or No-Go". This is critical.
The release managers do pay attention to the stable set. A failing test doesn't necessarily block an alpha, beta, or even an early stage rc, depending on the nature of the failing test (the release manager uses their judgement).
There are a number of reasons to try to improve the state of the stable fleet, and this will require a multi-pronged effort, not the least of which is improvements in the flaky tests.
OR,
- Remove the distinction between stable/unstable builders, remove unconnected / long-time-flaky slaves. The definition of flaky should be that the slave is broken, not the builds on the slave.
I think the stable/unstable split is important. A buildbot can be unstable for two reasons: the buildbot itself is flaky, as you say, or the tests are failing because the platform (or whatever other factor the slave was set up to test) is not completely supported yet. Having buildbots for the latter category is important. They shouldn't block releases, but they should be available to facilitate working on making Python work better.
The snakebite hosts, for example, were in the latter category initially, though it is questionable whether anyone other than Trent was interested in working on getting the tests to pass :). Unfortunately snakebite is a lower priority for Trent now, and hardware issues have taken a number (most?) of them offline. They should probably be deleted or at least commented out, depending on what Trent plans to do with them in the future.
For buildbots that are flaky in your sense, we should have a conversation with the owner. The goal should be to either get them to be non-flaky or delete them.
Of course, almost all of this is volunteer work, so the timeframes over which this happens may be a bit longer than would be ideal :)
- Block releases if !All-Green
As I said above, this is the goal, but it is always the release managers' call.
I'd go for first prize in this regard and remove as many of the distinctions between buildslaves as possible. It's not surprising that certain (many?) buildbots are disregarded as unimportant and ignored.
Slaves that are in the unstable set *should* be ignored in general, except by those people interested in working on making them stable.
tl;dr: All buildbots should either be critical to release engineering and quality assurance, or be removed. We as buildbot providers should be held accountable for our part in that. It is up to Python (Core) to set the standard for what the expectation is.
As noted above, there is also the category of "being worked on", which is not critical to release engineering, but is the pathway to taking a buildslave from "not working yet" to being part of the stable set.
If no progress is being made over an extended period, though, we should indeed probably do a delete. Such a bot can be re-added when someone shows up with a renewed interest in whatever the project was :)
Additionally:
Right now each OS/arch combination is a standalone bot/config and highly static in nature.
The biggest gain I can see is to evolve the master build configuration to:
Segment/class build configurations on the master to gain greater coverage of under-tested components and new build types. Some examples are:
- --shared builds vs non-shared builds
- using system ffi vs not (this might even help de-vendor libffi!)
- compiler: gcc vs clang (FreeBSD has both on 9.x)
- architecture builders (x86_64, x86-32, mips, arm, blah)
Python would benefit by:
- Allowing each buildslave to be used in multiple build classes
- Gaining greater coverage of build-related infrastructure (notoriously problematic)
- Getting a 'build class'-oriented view of build results, rather than just by OS.
Once a new builder class is created, it is then just a matter of adding the buildslaves that support that build type or feature set.
This is an interesting idea. The big disadvantage is that right now each buildslave runs one build job per modified release. Under the above scenario they would need to run multiple builds per modified release, which results in a small combinatoric explosion (for example, a slave that opted into four build classes across three active branches would run twelve builds per push instead of three). Not all slave machines are up to that task. So that would be another consideration as to whether to include a particular machine in more than one column of the matrix.
However, we should clean up what we've got before we venture into that area, I think.
--David
On Thu, Jun 11, 2015 at 12:45 AM, Kubilay Kocak <koobs@freebsd.org> wrote:
Thanks for setting this up, Zach.
You're welcome :)
I've kept my slaves (koobs-*) updated since they were brought online, including running the latest buildbot releases (currently latest), and updating to the latest branch versions (read: future next release) of FreeBSD. There have been no issues doing so.
And thank you for that.
I think there are a few things that can be done to improve the situation, both in the short and the longer term:
- Progressively update all buildbots to the latest buildbot version. This allows any new features/configurations to be used, with less risk of incompatible changes.
I'm for that. I'm also for updating the master, but that's going to take some extra work (we have our own patches to the master). It'll take a while to get everything updated, though.
- Recreate the 'stable' builders list to account for buildslave fleet changes since it was last modified.
I also agree with this. I'll go through and make up a list at some point, which we can then discuss here (or if anybody else wants to make up such a list before I have a chance, please do :)).
Additionally:
Right now each OS/arch combination is a standalone bot/config and highly static in nature.
The biggest gain I can see is to evolve the master build configuration to:
Segment/class build configurations on the master to gain greater coverage of under-tested components and new build types. Some examples are:
- --shared builds vs non-shared builds
- using system ffi vs not (this might even help de-vendor libffi!)
- compiler: gcc vs clang (FreeBSD has both on 9.x)
- architecture builders (x86_64, x86-32, mips, arm, blah)
Python would benefit by:
- Allowing each buildslave to be used in multiple build classes
- Gaining greater coverage of build-related infrastructure (notoriously problematic)
- Getting a 'build class'-oriented view of build results, rather than just by OS.
Once a new builder class is created, it is then just a matter of adding the buildslaves that support that build type or feature set.
This sounds interesting, and I'd like to hear more about how exactly you would set this up. However, I agree with David that we need to be sure not to overload the less powerful slaves, and also that we should hold off on this kind of change until the other points mentioned above are addressed.
-- Zach
Hi,
thanks for taking care of this! My buildbots had been ultra-stable for several years until last year a certain group of people decided to attack individual infrastructure contributions very publicly.
As a consequence my bots can be deleted.
Stefan Krah
On Thu, Jun 11, 2015 at 2:08 PM, s.krah <stefan@bytereef.org> wrote:
Hi,
thanks for taking care of this! My buildbots had been ultra-stable for several years until last year a certain group of people decided to attack individual infrastructure contributions very publicly.
As a consequence my bots can be deleted.
I am sorry to hear that.
-- Zach
On 11 June 2015 at 15:08, Zachary Ware <zachary.ware+pydev@gmail.com> wrote:
Hi all,
The health of our buildbot fleet is frankly a bit depressing at the moment. Of the 44 buildslaves [1], 25 (!) are currently down. Several of the ones that are up are routinely failing some step, which may or may not be the fault of the slave itself.
I'd suggest dropping my current RHEL buildbot for the time being, and I'll look at setting up a better maintained replacement later (perhaps as part of the Fedora or CentOS QA infrastructure).
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, Jun 12, 2015 at 12:00 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 11 June 2015 at 15:08, Zachary Ware <zachary.ware+pydev@gmail.com> wrote:
Hi all,
The health of our buildbot fleet is frankly a bit depressing at the moment. Of the 44 buildslaves [1], 25 (!) are currently down. Several of the ones that are up are routinely failing some step, which may or may not be the fault of the slave itself.
I'd suggest dropping my current RHEL buildbot for the time being, and I'll look at setting up a better maintained replacement later (perhaps as part of the Fedora or CentOS QA infrastructure).
Ok, I've removed your RHEL slave. Thanks for letting me know!
-- Zach
On Thu, Jun 11, 2015 at 1:08 AM, Zachary Ware <zachary.ware+pydev@gmail.com> wrote:
Hi all,
The health of our buildbot fleet is frankly a bit depressing at the moment. Of the 44 buildslaves [1], 25 (!) are currently down. Several of the ones that are up are routinely failing some step, which may or may not be the fault of the slave itself.
I just wanted to touch base with everybody and ask that you give your slaves a quick once-over to make sure they're working properly, or give an update on why they may be down and when (or if) they can be expected to be back up. In cases where a slave is down for an extended period of time, I'd like to clean up the waterfall view by temporarily removing those builders (and in cases where a slave is down for good, I'd like to clean up the list of slaves as well).
If there's anything that can be done to help on the master side, let me know!
Hi, Zach
Thanks for setting up this discussion list.
Internally, IBM has been discussing how to improve its participation in open source software CI testing to ensure that IBM POWER and IBM System z have better coverage. IBM may be able to help with hosting an open build service for diverse systems. We don't have a complete solution, but maybe we can use the Python buildbot fleet concept and this group to develop one.
Thanks, David
On 12 June 2015 at 23:46, David Edelsohn <dje.gcc@gmail.com> wrote:
Internally, IBM has been discussing how to improve its participation in open source software CI testing to ensure that IBM POWER and IBM System z have better coverage. IBM may be able to help with hosting an open build service for diverse systems. We don't have a complete solution, but maybe we can use the Python buildbot fleet concept and this group to develop one.
That would be very handy, as there's a patch to add AF_IUCV support in http://bugs.python.org/issue23830, which we don't currently have a way to test upstream.
Neale suggested on that issue that the Linux Foundation might also be able to help out with s390x access, but having multiple test systems for any given architecture would be a good thing.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, Jun 12, 2015 at 10:41 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 12 June 2015 at 23:46, David Edelsohn <dje.gcc@gmail.com> wrote:
Internally, IBM has been discussing how to improve its participation in open source software CI testing to ensure that IBM POWER and IBM System z have better coverage. IBM may be able to help with hosting an open build service for diverse systems. We don't have a complete solution, but maybe we can use the Python buildbot fleet concept and this group to develop one.
That would be very handy, as there's a patch to add AF_IUCV support in http://bugs.python.org/issue23830, which we don't currently have a way to test upstream.
Neale suggested on that issue that the Linux Foundation might also be able to help out with s390x access, but having multiple test systems for any given architecture would be a good thing.
I am already running CPython buildbots on two separate zSeries Linux systems, two PPC64 Linux systems, and one PPC64 AIX system.
Thanks, David
participants (7):
- Chris Angelico
- David Edelsohn
- Kubilay Kocak
- Nick Coghlan
- R. David Murray
- s.krah
- Zachary Ware