IIRC I suggested earlier that buildbots should be integrated into the PR workflow so that fixing any breakages that result from a change becomes the contributor's burden rather than a core dev's.
On 22.02.2021 14:20, Victor Stinner wrote:
On Sun, Feb 21, 2021 at 8:57 PM Michał Górny firstname.lastname@example.org wrote:
The checker serves two purposes:
- It gives users an opportunity to provide full PEP 11 support
(buildbot, engineering time) for a platform.
Does that mean that if someone offers to run a buildbot for a minor platform and do the necessary maintenance to keep it working, the platform will be able to stay? How much maintenance is actually expected? That is, is it sufficient to keep CPython in a 'good enough' working state and resolve the major bugs blocking real usage on these platforms?
Maintaining a buildbot doesn't mean looking at it every 6 months. It means getting emails multiple times per month about real bugs which must be fixed. My main annoyance is that every single buildbot failure sends me an email, and I'm overwhelmed by emails (I'm not only getting emails from buildbots ;-)).
Python already has a long list of buildbot workers (between 75 and 100, I'm not sure of the exact number) and they require a lot of attention. Over the last 5 years, basically only Pablo Galindo and I have paid attention to them.
I fear that if more buildbots are added, Pablo and I will be the only ones looking at them. FYI, if nobody looks at buildbots, they are basically useless. They only waste resources.
To have an idea of the existing maintenance burden, look at emails sent to: https://email@example.com/
Every single email is basically a problem. There were around 110 emails over the last 30 days: 3.6 emails/day on average. When a bug is identified, it requires an investigation which takes between 5 minutes and 6 months depending on the bug; I would say 2 hours on average. Sometimes, if the investigation is too long, we simply revert the change.
The buildbot configuration also requires maintenance. For example, 14 commits have been pushed since January 1st: https://github.com/python/buildmaster-config/commits/master
Multiple buildbots are "unstable": tests fail randomly. Again, each failure means a new email. For example, test_asyncio likes to fail once every 10 runs (a coarse average, I didn't check exactly). multiprocessing tests, tests using the network like imaplib or nntplib, and some other tests fail randomly. Some tests fail simply because, just once, the buildbot became slower.
People have many ideas for automating bug triage from these emails, but so far nobody has come up with a concrete working solution, so emails are still read manually one by one. Also, almost nobody is trying to fix the tests which fail randomly.
For example, I have called multiple times for help to fix test_asyncio; so far it still fails randomly every day:
- 2020: https://firstname.lastname@example.org/message/Y7I5ADXA...
- 2019: https://email@example.com/message/R7X6NKGE...
By the way, these random failures are not only affecting buildbots, but also the CIs run on pull requests. It's *common* that these failures block merging a pull request and require manual action (usually, re-running all CIs) before the PR can be merged.
I'm not talking about exotic platforms with very slow hardware, but platforms like Linux/x86-64 with "fast" hardware.
I expect even more random errors on exotic platforms. For example, I reported a crash on AIX one year ago, and nobody has fixed it so far. I pushed a few fixes for that crash, but they are not enough to fully fix it: https://bugs.python.org/issue40068
I pushed those AIX fixes only because I was annoyed by getting buildbot emails about AIX failures. Sometimes, I just turn off emails from AIX.
Since there is no proactive work on fixing AIX issues, I would even prefer to *remove* the AIX buildbots.