Status of the Buildbot fleet and related bugs

The buildbot waterfall is much greener now. Thanks to all who have contributed to making it so (and it hasn't just been Mark and Antoine and me, though we've been the most directly active (and yes, Mark, you did contribute several fixes!)).

The 'stable builders' fleet is green now except for:

(1) issue 7269: occasional 2.6/trunk bsddb3 failures on Windows
(2) issue 6748: random 3.1/3.x failures on most buildbots
(3) the klose-debian-ppc builder being offline

Of these, (2) is by _far_ the biggest issue, and the one that causes the most flap (success-failure-success). And flap is the thing that most harms the buildbots' usefulness. Anyone who wants to debug this on a platform where it is consistently reproducible, please email me your public key and I'll give you a shell account on my buildbot (Antoine already has one).

In the 'unstable' builder fleet, Antoine's new builder seems to be stable across the board, while mine fails consistently on 3.1 and 3.x because of the test_telnetlib bug. Thomas Heller's OS X buildbot is consistently having lots of test failures (the same ones each time I've checked). The master claims the klose-debian-alpha buildbot doesn't know about branches, which is odd since it was working not too long ago. The remaining buildslaves appear to have been offline for some time. Open issues here are:

(1) issue 3864: FreeBSD testing hangs consistently. According to the ticket this is a FreeBSD bug fixed in 6.4, so an OS upgrade on the buildslave would probably solve it.
(2) issue 4970: consistent signal 32 error on the norwitz-x86 Gentoo buildslave in 3.1 and 3.x. This may be due to the box running an old threading library, but it does make one wonder what changed in 3.x that exposed it.

Another issue that I've seen on the buildbots but that doesn't seem to be showing up right now (is it fixed?) is issue 7251, which Mark is working on.

So, overall I think the buildbot fleet is in good shape, and if we can nail issue 6748 I think it will be back to being an important resource for sanity checking our checkins.

By the way, Georg set up the IRC interface on the #python-dev channel, so you can hang out there if you want to get realtime reports of which buildbots have gone from success to failure and vice versa.

--David

On Fri, Nov 6, 2009 at 3:53 AM, R. David Murray <rdmurray@bitdance.com> wrote:
(1) issue 3864: FreeBSD testing hangs consistently. According to the ticket this is a FreeBSD bug fixed in 6.4, so an OS upgrade on the buildslave would probably solve it.
I think the particular issue mentioned in 3864 is fixed, in some sense: test_signal used to hang, but now just plain fails in a reasonable amount of time (~15 seconds) instead. So at least the test_signal failure isn't preventing us from seeing the results of other tests. The big problem now on the FreeBSD buildbot is that test_multiprocessing reliably causes the whole test run to abort with 'Signal 12'. Solving this may be as simple as just getting someone to install a copy of FreeBSD 6.2 on an ssh-accessible machine so that the source of the error can be tracked down.
(2) issue 4970: consistent signal 32 error on the norwitz-x86 Gentoo buildslave in 3.1 and 3.x. This may be due to the box running an old threading library, but it does make one wonder what changed in 3.x that exposed it.
This error has been happening since well before 3.0 was released. Asking for access to Neal's machine is probably the only sensible way to diagnose it. (A less invasive but slower way to debug would be to create a branch especially for this bug and do repeated runs to figure out which part of test_os is causing the failure.)
Another issue that I've seen on the buildbots but that doesn't seem to be showing up right now (is it fixed?) is issue 7251, which Mark is working on.
It's not fixed, but I hope to have time to fix it this weekend. It's just not showing up on some runs because test_multiprocessing kills the buildslave first. :-)
So, overall I think the buildbot fleet is in good shape, and if we can nail issue 6748 I think it will be back to being an important resource for sanity checking our checkins.
Wholeheartedly agreed! Sorting out the FreeBSD test_multiprocessing Signal 12 would be nice, too. I'll open an issue for this, unless someone else gets there first. Mark

Mark Dickinson <dickinsm@gmail.com> writes:
On Fri, Nov 6, 2009 at 3:53 AM, R. David Murray <rdmurray@bitdance.com> wrote:
(1) issue 3864: FreeBSD testing hangs consistently. According to the ticket this is a FreeBSD bug fixed in 6.4, so an OS upgrade on the buildslave would probably solve it.
I think the particular issue mentioned in 3864 is fixed, in some sense: test_signal used to hang, but now just plain fails in a reasonable amount of time (~15 seconds) instead. So at least the test_signal failure isn't preventing us from seeing the results of other tests.
The big problem now on the FreeBSD buildbot is that test_multiprocessing reliably causes the whole test run to abort with 'Signal 12'. Solving this may be as simple as just getting someone to install a copy of FreeBSD 6.2 on an ssh-accessible machine so that the source of the error can be tracked down.
I could arrange ssh access to the build slave if that would help anyone who wants to look into that. Just contact me directly. In terms of the overall release, I'm also fine with upgrading the build slave to 6.4. Or I could jump up to 7.2 instead. When I first brought the build slave up, 7.x wasn't finalized yet - not sure now which release is more prevalent in use. -- David

On Fri, Nov 6, 2009 at 10:54 AM, David Bolen <db3l.net@gmail.com> wrote:
Mark Dickinson <dickinsm@gmail.com> writes:
The big problem now on the FreeBSD buildbot is that test_multiprocessing reliably causes the whole test run to abort with 'Signal 12'. Solving this may be as simple as just getting someone to install a copy of FreeBSD 6.2 on an ssh-accessible machine so that the source of the error can be tracked down.
I could arrange ssh access to the build slave if that would help anyone who wants to look into that.
I suspect it would help a lot! Thanks for the offer. Jesse would likely be able to pin down the cause of the failure faster than I would, but I don't know how many cycles he has available.
Just contact me directly.
I'll do that, if/when I find time.
In terms of the overall release, I'm also fine with upgrading the build slave to 6.4. Or I could jump up to 7.2 instead. When I first brought the build slave up, 7.x wasn't finalized yet - not sure now which release is more prevalent in use.
Not sure either, but it's certainly useful to be able to test on FreeBSD 6.x. Ideally, we'd have buildslaves for both... Mark

Mark Dickinson <dickinsm@gmail.com> writes:
Not sure either, but it's certainly useful to be able to test on FreeBSD 6.x. Ideally, we'd have buildslaves for both...
Well, let me plan on leaving the current slave at 6.x (but working on getting it to 6.4). I could probably see about providing an additional 7.x slave; the only real issue is that at the moment they'd both be VMs on the same physical host, which is already shared with my Windows slave. I think a recent posting by Martin mentioned being able to interlock one or more slaves so they don't execute in parallel, which could help clean up the contention. I must admit I don't know for sure whether serial execution would yield a faster overall result across all the slaves, though it feels likely. -- David
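
For reference, an interlock like the one mentioned is typically expressed as a lock in the master's configuration, shared by all builders that live on the same physical host. The fragment below is only a sketch under assumptions: it uses buildbot's classic locks API and dict-style builder configuration, and the lock, builder, slave, and directory names are invented for illustration, not taken from the actual python.org master.cfg.

    # master.cfg fragment -- a sketch only; all names below are hypothetical.
    from buildbot import locks
    from buildbot.process import factory
    from buildbot.steps.shell import ShellCommand

    # One token for the shared physical host: builders holding this lock never
    # run at the same time, so the two FreeBSD VMs stop competing for CPU and
    # disk while a build/test run is in progress.
    vm_host_lock = locks.MasterLock("bolen-vm-host", maxCount=1)

    f = factory.BuildFactory()
    f.addStep(ShellCommand(command=["./configure", "--with-pydebug"]))
    f.addStep(ShellCommand(command=["make", "test"]))

    c = BuildmasterConfig = {}   # already present in a real master.cfg
    c['builders'] = [
        {'name': 'x86 FreeBSD 6.4 trunk', 'slavename': 'bolen-freebsd64',
         'builddir': 'trunk.freebsd64', 'factory': f, 'locks': [vm_host_lock]},
        {'name': 'x86 FreeBSD 7.2 trunk', 'slavename': 'bolen-freebsd72',
         'builddir': 'trunk.freebsd72', 'factory': f, 'locks': [vm_host_lock]},
    ]

Serializing the builds this way doesn't make any single run faster, but it avoids two heavyweight test runs thrashing the same disk at once, which, as David suggests, is likely a net win in overall wall-clock time.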

Mark Dickinson wrote:
On Fri, Nov 6, 2009 at 10:54 AM, David Bolen <db3l.net@gmail.com> wrote:
In terms of the overall release, I'm also fine with upgrading the build slave to 6.4. Or I could jump up to 7.2 instead. When I first brought the build slave up, 7.x wasn't finalized yet - not sure now which release is more prevalent in use.
Not sure either, but it's certainly useful to be able to test on FreeBSD 6.x. Ideally, we'd have buildslaves for both...
http://security.freebsd.org/#sup lists the releases currently with active security updates and their EOL dates.

Upgrading the 6.2 buildbot to 6.4 would seem a good idea to me. Unfortunately FreeBSD's binary update capability only became available with 6.3... :-(

A separate 7.2 buildbot (which could then be binary updated to 7.3 on release) would also seem a good idea. 8.0 is at RC3 and could be expected to be finalised fairly soon. My understanding is there won't be another 6.x release, and I'm inferring that there will be one more 7.x release (7.3).

Andrew.
--
Andrew I MacIntyre                     "These thoughts are mine alone..."
E-mail: andymac@bullseye.apana.org.au (pref) | Snail: PO Box 370
        andymac@pcug.org.au (alt)            |        Belconnen ACT 2616
Web:    http://www.andymac.org/              |        Australia

Andrew MacIntyre <andymac@bullseye.apana.org.au> writes:
Upgrading the 6.2 buildbot to 6.4 would seem a good idea to me. Unfortunately FreeBSD's binary update capability only became available with 6.3... :-(
No biggie - I was just planning on installing a new VM from scratch anyway, and then just cutting over VMs to switch over. -- David

There are non-stable buildbots that are failing consistently, but this message is about something else. Now that the biggest stability issues have been addressed, some less-noisy stability issues are visible. The two that I have noticed most often are test_httpservers, which hangs occasionally, and test_multiprocessing, which fails in various assertions occasionally. Since the buildbots are often slow and/or heavily loaded, I tried increasing DELTA in test_multiprocessing, but while it did seem to help somewhat (judging by how many times it ran on my buildbot under -F before and after the change), it did not prevent failures. --David (RDM)
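
For context, the -F switch loops the selected tests until a failure occurs. The snippet below is a rough, hypothetical equivalent (not regrtest's actual implementation) that reruns one test module in a fresh interpreter until it fails, which is the kind of before/after comparison described above; the test name is the one used at the time and may differ in later CPython versions.

    import subprocess
    import sys

    TEST = "test_multiprocessing"   # test name as it existed at the time

    runs = 0
    while True:
        runs += 1
        # regrtest exits non-zero when the selected test fails
        rc = subprocess.call([sys.executable, "-m", "test.regrtest", TEST])
        if rc != 0:
            print("%s failed on run %d" % (TEST, runs))
            break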

On Fri, Nov 6, 2009 at 3:27 AM, Mark Dickinson <dickinsm@gmail.com> wrote:
On Fri, Nov 6, 2009 at 3:53 AM, R. David Murray <rdmurray@bitdance.com> wrote:
(1) issue 3864: FreeBSD testing hangs consistently. According to the ticket this is a FreeBSD bug fixed in 6.4, so an OS upgrade on the buildslave would probably solve it.
I think the particular issue mentioned in 3864 is fixed, in some sense: test_signal used to hang, but now just plain fails in a reasonable amount of time (~15 seconds) instead. So at least the test_signal failure isn't preventing us from seeing the results of other tests.
The big problem now on the FreeBSD buildbot is that test_multiprocessing reliably causes the whole test run to abort with 'Signal 12'. Solving this may be as simple as just getting someone to install a copy of FreeBSD 6.2 on an ssh-accessible machine so that the source of the error can be tracked down.
Sorry I haven't been watching the bots; the signal 12 is new, and I don't have direct access to a FreeBSD box to poke at it. I'm fairly alarmed that it's triggering that. Let me know what I can do to help - FreeBSD support for certain lower-level shared semaphore machinery has been spotty up until the most recent releases (see the probe sketch after this message).
Wholeheartedly agreed! Sorting out the FreeBSD test_multiprocessing Signal 12 would be nice, too. I'll open an issue for this, unless someone else gets there first.
If/when you do, please add me on the noisy list. I remember there being some patch(es) pending for FreeBSD around improved shared semaphore support, but I can't find the bug report right now. jesse
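
As background on the 'Signal 12': on FreeBSD, signal 12 is SIGSYS (a bad or unimplemented system call), and multiprocessing.Lock() is backed by a POSIX semaphore created with sem_open() on Unix, which is presumably the "lower level shared semaphore stuff" referred to above. The snippet below is a minimal diagnostic sketch, not something from the thread: it runs the semaphore-creating call in a child interpreter so a fatal signal kills only the child, then reports which signal (if any) terminated it.

    import subprocess
    import sys

    # multiprocessing.Lock() creates a POSIX semaphore (sem_open) on Unix; on a
    # system lacking that syscall the call dies with SIGSYS (signal 12 on BSD).
    probe = "import multiprocessing; multiprocessing.Lock()"

    proc = subprocess.Popen([sys.executable, "-c", probe])
    proc.wait()
    if proc.returncode < 0:
        print("probe killed by signal %d" % -proc.returncode)
    else:
        print("probe exited with status %d" % proc.returncode)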

On Fri, Nov 6, 2009 at 12:27 AM, Mark Dickinson <dickinsm@gmail.com> wrote:
On Fri, Nov 6, 2009 at 3:53 AM, R. David Murray <rdmurray@bitdance.com> wrote:
(2) issue 4970: consistent signal 32 error on the norwitz-x86 Gentoo buildslave in 3.1 and 3.x. This may be due to the box running an old threading library, but it does make one wonder what changed in 3.x that exposed it.
This error has been happening since well before 3.0 was released. Asking for access to Neal's machine is probably the only sensible way to diagnose it. (A less invasive but slower way to debug would be to create a branch especially for this bug and do repeated runs to figure out which part of test_os is causing the failure.)
IIRC, I spent quite a bit of time trying to nail this down. I don't remember finding any useful information on the cause (beyond narrowing it to some tests). As Mark said, this has been happening for a long time. I'm reluctant to provide access to the machine, as it's not really mine; I'm not even sure I still have access, since I haven't logged in for a long time. I'd just like to say thanks again to everyone for making the buildbots more green and also improving the general testing infrastructure for Python. n

Neal Norwitz wrote:
I'd just like to say thanks again to everyone for making the buildbots more green and also improving the general testing infrastructure for Python.
I'm *really* liking the new assertions in unittest. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

R. David Murray schrieb:
The buildbot waterfall is much greener now. Thanks to all who have contributed to making it so (and it hasn't just been Mark and Antoine and I, though we've been the most directly active (and yes, Mark, you did contribute several fixes!)). [...] In the 'unstable' builder fleet, Antoine's new builder seems to be stable across the board, while mine fails consistently on 3.1 and 3.x because of the test_telnetlib bug. Thomas Heller's OS X buildbot is consistently having lots of test failures (the same ones each time I've checked).
My buildbot is behind our company's firewall. Well, I was able to fix the test_smtpnet test by additionally opening port 465 in the firewall; however, I'm not really sure I should do that. I had to open another port already, although that one is probably less critical. For the other test failures, I have no idea where they come from. -- Thanks, Thomas

R. David Murray schrieb:
So, overall I think the buildbot fleet is in good shape, and if we can nail issue 6748 I think it will be back to being an important resource for sanity checking our checkins.
Yay! Thanks to all of you!
By the way, Georg set up the IRC interface on the #python-dev channel, so you can hang out there if you want to get realtime reports of which buildbots have going from success to failure and vice versa.
JFTR, I didn't set up the IRC bot (I assume that credit goes to Martin, even if it's only one line in the buildbot config :). I just tried to get it to say something :) Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
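
For anyone curious what that one line looks like, the fragment below is a hedged sketch using buildbot's stock IRC status target; the server and nick are placeholders rather than the actual python.org configuration, and where this plugs into the master config varies between buildbot versions.

    # master.cfg fragment -- hypothetical server/nick; only the channel name
    # comes from the thread.
    from buildbot.status import words

    c['status'].append(words.IRC(host="irc.freenode.net",
                                 nick="py-bb",
                                 channels=["#python-dev"]))

Once the bot is sitting in the channel, asking it (or configuring it) to notify on success/failure transitions is what produces the realtime reports David mentions; the exact notification options depend on the buildbot version in use.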

On Nov 6, 2009, at 6:34 PM, Georg Brandl wrote:
R. David Murray schrieb:
So, overall I think the buildbot fleet is in good shape, and if we can nail issue 6748 I think it will be back to being an important resource for sanity checking our checkins.
Yay! Thanks to all of you!
Indeed! It's great to see so much work going into build and test maintenance. Thanks a lot!

On Sun, 8 Nov 2009 at 19:44, "Martin v. Löwis" wrote:
JFTR, I didn't set up the IRC bot (I assume that credit goes to Martin, even if it's only one line in the buildbot config :). I just tried to get it to say something :)
Yes, it was always "on". I don't use IRC regularly, so I don't know whether it's useful.
I think it is. --David (RDM)

On Thu, 05 Nov 2009 22:53:27 -0500, R. David Murray wrote:
The buildbot waterfall is much greener now. Thanks to all who have contributed to making it so (and it hasn't just been Mark and Antoine and I, though we've been the most directly active (and yes, Mark, you did contribute several fixes!)).
The buildbots still show occasional oddities. For example, right now in the page "http://www.python.org/dev/buildbot/3.x/", some results have disappeared (the columns for "AMD64 Ubuntu" builders have become empty). Moreover, some buildslaves have gone back in time (they are building r76188 after having built and tested r76195)... I swear the new GIL doesn't include a time machine. Regards Antoine.

The buildbot waterfall is much greener now. Thanks to all who have contributed to making it so (and it hasn't just been Mark and Antoine and I, though we've been the most directly active (and yes, Mark, you did contribute several fixes!)).
The buildbots still show occasional oddities. For example, right now in the page "http://www.python.org/dev/buildbot/3.x/", some results have disappeared (the columns for "AMD64 Ubuntu" builders have become empty).
Yes, I noticed it too. It will go away after some page reloads.
Moreover, some buildslaves have gone back in time (they are building r76188 after having built and tested r76195)... I swear the new GIL doesn't include a time machine.
That's because I resubmitted these changes after restarting the master. Regards, Martin

Martin v. Löwis <martin <at> v.loewis.de> writes:
The buildbots still show occasional oddities. For example, right now in the page "http://www.python.org/dev/buildbot/3.x/", some results have disappeared (the columns for "AMD64 Ubuntu" builders have become empty).
Yes, I noticed it too. It will go away after some page reloads.
It is still happening more or less randomly unfortunately. http://www.python.org/dev/buildbot/3.x/

On Sat, 14 Nov 2009 at 00:09, Antoine Pitrou wrote:
Martin v. Löwis <martin <at> v.loewis.de> writes:
The buildbots still show occasional oddities. For example, right now in the page "http://www.python.org/dev/buildbot/3.x/", some results have disappeared (the columns for "AMD64 Ubuntu" builders have become empty).
Yes, I noticed it too. It will go away after some page reloads.
It is still happening more or less randomly unfortunately. http://www.python.org/dev/buildbot/3.x/
The buildbot pages appear to be pretty messed up now. I get many 404s (e.g. the above URL and the all-stable-builders page), although some pages seem to work (e.g. the all-builders page). If I stick an 'all' into the URL for my buildbot page I can get to it, though that is not the version of the URL linked from the 'all builders' table header. --David (RDM)

"R. David Murray" <rdmurray@bitdance.com> writes:
The buildbot pages appear to be pretty messed up now. I get many 404s (ex: the above url, the all stable builders page), although some seem to work (ex: the all builders page), and if I stick an 'all' into the URL for my buildbot page I can get to it, though that's is not the version of the URL linked from the 'all builders' table header.
Yes, I think this just started happening. I'm guessing that the main site proxies the buildbot URL requests to the buildbot master process, and when it's down you get the 404s from the main server. I figured someone might be working on the master, though perhaps it just burped on its own :-) -- David

Yes, I think this just started happening. I'm guessing that the main site proxies the buildbot URL requests to the buildbot master process, and when it's down you get the 404s from the main server.
I figured someone might be working on the master, though perhaps it just burped on its own :-)
It was actually an Apache misconfiguration (the wrong virtual host would pick up requests, missing the reverse proxy configuration). I have fixed that now. Regards, Martin

Martin v. Löwis schrieb:
Yes, I think this just started happening. I'm guessing that the main site proxies the buildbot URL requests to the buildbot master process, and when it's down you get the 404s from the main server.
I figured someone might be working on the master, though perhaps it just burped on its own :-)
It was actually an Apache misconfiguration (the wrong virtual host would pick up requests, missing the reverse proxy configuration). I have fixed that now.
BTW, I noticed that the "cancel build" and "cancel all builds" buttons result in unhandled exceptions. (One build is cancelled nevertheless, so for "cancel build" it's just an inconvenience.) Georg
participants (12)
- "Martin v. Löwis"
- Andrew MacIntyre
- Antoine Pitrou
- David Bolen
- Georg Brandl
- Glyph Lefkowitz
- Jesse Noller
- Mark Dickinson
- Neal Norwitz
- Nick Coghlan
- R. David Murray
- Thomas Heller