
For a), I think we can solve this only by redundancy, i.e. create more build slaves, hoping that a sufficient number would be up at any point in time.
We are already doing this, aren't we? http://www.python.org/dev/buildbot/3.x/ It doesn't seem to work very well, it's a bit like a Danaides vessel.
The source of the problem is that such a system can degrade without anybody taking action. If the web server's hard disk breaks down, people panic and look for a solution quickly. If the source control is down, somebody *will* "volunteer" to fix it. If the automated build system produces results less useful, people will worry, but not take action.
Well, to be fair, buildbots breaking also happens much more frequently (perhaps one or two orders of magnitude) than the SVN server or the Web site going down. Maintaining them looks like a Sisyphean task, and nobody wants that. I don't know what kind of machines are the current slaves, but if they are 24/7 servers, isn't it a bit surprising that the slaves would go down so often? Is the buildbot software fragile? Does it require a lot of (maintenance, repair) work from the slave owners?