
Antoine Pitrou wrote:
For a), I think we can solve this only by redundancy, i.e. create more build slaves, hoping that a sufficient number would be up at any point in time.
We are already doing this, aren't we? http://www.python.org/dev/buildbot/3.x/
It doesn't seem to work very well, it's a bit like a Danaides vessel.
Both true. However, it seems that Mark is unhappy with the current set of systems, so we probably need to do it again.
Well, to be fair, buildbots breaking also happens much more frequently (perhaps one or two orders of magnitude) than the SVN server or the Web site going down. Maintaining them looks like a Sisyphean task, and nobody wants that.
It only looks so. It is like any server management task: it takes constant effort. However, it is not Sisyphean (feeling Greek today, aren't you? :-), since you actually achieve something. It's not hard to restart a buildbot when it has crashed, and doing so gives you the warm feeling of having fixed something.
I don't know what kind of machines the current slaves are, but if they are 24/7 servers, isn't it a bit surprising that the slaves would go down so often? Is the buildbot software fragile?
Not really. It sometimes happens that the slaves don't reconnect after a master restart, but more often it is a change on the slave side that breaks it (such as the machine being rebooted without having been configured to restart the slave afterwards).
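As a rough illustration of the "restart the slave after the reboot" part, a small watchdog along the following lines could be run at boot or periodically on a Unix slave. This is only a sketch under assumptions: the helper names are hypothetical, it assumes the slave leaves a `twistd.pid` file in its base directory, and the start command is whatever your installed buildbot version uses, passed in by the caller rather than hard-coded.

```python
import os
import subprocess


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def slave_is_running(basedir, pidfile_name="twistd.pid"):
    """Check the pid file in the slave's base directory.

    The pid file name is an assumption; adjust it to your setup.
    """
    path = os.path.join(basedir, pidfile_name)
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # Missing or unreadable pid file: treat the slave as down.
        return False
    return pid_is_running(pid)


def ensure_slave_running(basedir, start_command):
    """Start the slave if it is not running.

    start_command is the argv prefix used to start a slave on this
    machine; basedir is appended to it. Returns True if a start was
    attempted, False if the slave was already up.
    """
    if slave_is_running(basedir):
        return False
    subprocess.check_call(list(start_command) + [basedir])
    return True
```

Hooking something like `ensure_slave_running("/path/to/slave", [...])` into an init script or a cron job would cover the reboot case; it does nothing about slaves that are running but failed to reconnect to the master.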
Does it require a lot of (maintenance, repair) work from the slave owners?
On Unix, not really. On Windows, there is still the issue that an error message sometimes pops up which you need to click away. Over several builds, you may find that you have to click away dozens of such messages. This could use some improvement.

Regards,
Martin