[Twisted-Python] buildbot.twistedmatrix.com is down a lot
In the past few days, buildbot.twistedmatrix.com seems to be down all the time, and requires manual restarts. As I write this, it is down right now. Is there something wrong with the hardware involved with buildbot.twistedmatrix.com? -- Craig
On 17 July 2016 at 06:11, Craig Rodrigues wrote:
Is there something wrong with the hardware involved with buildbot.twistedmatrix.com?
The hardware is fine. For some unknown reason the buildmaster process is terminated. I have restarted it again. -- Adi Roiban
It's OOMing -- I think the upgrade to Eight trunk introduced some sort of
memory usage regression or we've done something wrong -- I've unfortunately
not had time to investigate.
We could size up the RAM in the meantime I guess?
-Amber
_______________________________________________
Twisted-Python mailing list
Twisted-Python@twistedmatrix.com
http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
On 17 July 2016 at 07:21, Amber Brown wrote: It's OOMing (...)
I can try to revert the github webhooks + github status send and see if we still get these errors. I also don't have too much time to investigate, but I can revert things if it helps. -- Adi Roiban
Yeah, that's a good idea - disable them for now, and we'll see if the OOMs
happen. Then we can investigate them closer if it stops.
There is this ticket https://github.com/twisted-infra/braid/issues/216 to
On 17 July 2016 at 07:38, Amber Brown wrote: (...)
Have you considered something like monit[1] to detect & restart in cases like this? [1] https://mmonit.com/monit/
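For reference, a minimal monit stanza for this kind of supervision might look like the following. This is only a sketch; the pidfile and master directory paths are assumptions for illustration, not the actual twistedmatrix.com layout:

```
# /etc/monit/monitrc fragment (hypothetical paths)
check process buildmaster with pidfile /srv/bb-master/twistd.pid
  start program = "/usr/bin/buildbot start /srv/bb-master"
  stop program  = "/usr/bin/buildbot stop /srv/bb-master"
  # restart if the process disappears (e.g. after an OOM kill)
  if does not exist then restart
  # warn before memory approaches the 2GB ceiling
  if totalmem > 1536 MB for 3 cycles then alert
```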
On 18 July 2016 at 19:04, James Broadhead wrote: Have you considered something like monit[1] to detect & restart in cases like this? (...)
This might help, but it will not help us understand what we are doing wrong :) After disabling the GitHub webhooks, the buildbot looks stable... so we might have a clue about what goes wrong. Right now I don't have time to look into this issue, so the GitHub hooks are disabled for now from the GitHub UI. -- Adi Roiban
On Jul 20, 2016, at 6:31 AM, Adi Roiban wrote: (...)
Can someone who's had a direct look at the OOMing process (Adi? Amber?) report this upstream? It's a real pity that we won't get GitHub statuses for buildbot builds any more; that was a huge step in the right direction. -glyph
On 20 July 2016 at 17:51, Glyph Lefkowitz wrote: (...)
I don't know how to get a handle on this. By the time I was observing the issue, the buildbot process was already dead.

I have recently discovered the Rackspace monitoring capabilities for the VM and set up a memory notification... I am not sure who will receive the alerts.

I have re-enabled the GitHub hooks and will start taking a closer look at the buildmaster process... but maybe 2GB is just not enough for a buildmaster.

I have triggered the creation of an image for the current buildbot machine and will consider upgrading the buildbot to 4GB of memory to see if we still hit the ceiling.

For my project I have a buildmaster with a similar number of builders and slaves (without GitHub hooks and without linter factories), and after 2 weeks of uptime its virtual memory usage is 1.5GB... so maybe 2GB is just not enough for buildbot. -- Adi Roiban
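Since the process is already dead by the time anyone looks, one low-tech option is to sample its memory from the outside while it is still alive, so the log shows how usage grew before the next OOM kill. A minimal sketch, assuming a Linux /proc filesystem; the pidfile path is a guess, not the real master layout:

```python
# Hypothetical watchdog sketch: log the buildmaster's RSS once a minute.
import re
import time

def parse_vmrss_kb(status_text):
    """Extract VmRSS (resident set size, in kB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def watch(pidfile="/srv/buildmaster/twistd.pid", interval=60):
    """Poll the process named in pidfile and print timestamped RSS samples."""
    pid = int(open(pidfile).read().strip())
    while True:
        with open("/proc/%d/status" % pid) as f:
            rss = parse_vmrss_kb(f.read())
        print("%s VmRSS=%s kB" % (time.strftime("%Y-%m-%d %H:%M:%S"), rss))
        time.sleep(interval)
```

Running this under cron or a screen session would leave a trail that the post-mortem could use even after the kernel kills the master.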
On Jul 20, 2016, at 11:01 AM, Adi Roiban wrote:
I don't know how to grasp this. By the time I was observing the issue, the buildbot process was already dead.
Yeah, these types of issues are tricky to debug. Thanks for looking into it nonetheless; I was hoping you knew more, but if you don't, nothing to be done.
I have recently discovered the Rackspace monitoring capabilities for VM... and set up a memory notification... not sure who will receive the alerts.
I'll make sure that the relevant people are on the monitoring list.
I have re-enabled the GitHub hooks and will start taking a closer look at the buildmaster process... but maybe 2GB is just not enough for a buildmaster.
Thanks.
I have triggered the creation of an image for the current buildbot machine and will consider upgrading the buildbot to 4GB of memory to see if we still hit the ceiling.
For my project I have a buildmaster with a similar number of builders and slaves (without GitHub hooks and without linter factories), and after 2 weeks of uptime its virtual memory usage is 1.5GB... so maybe 2GB is just not enough for buildbot.
Bummer. It does seem like that's quite likely. -glyph
On Jul 20, 2016, at 2:31 PM, Glyph Lefkowitz wrote:
I'll make sure that the relevant people are on the monitoring list.
I created 'technical contact' users for you and Amber, with current email addresses, which you can use (and even log in as!) if you edit yourselves under 'user management'. I apparently had one already. You should both have a bogus alert about a MySQL server (since we don't run mysql it seemed a reasonable thing to test). Make sure that's not flagged as spam and we should all be set up to receive alerts :). I also added some basic HTTPS monitoring to it as well, so we should see if it goes down for reasons unrelated to memory. -glyph
On 21 July 2016 at 00:58, Glyph Lefkowitz wrote: (...)
OK. I have received the MySQL test alert.

I can see that when we get more builds there is a significant increase in memory usage... but it recovers once the master goes idle.

For now the VM still has 2GB... and the GitHub webhooks are still enabled.

Regards -- Adi Roiban
On Jul 21, 2016, at 5:49 AM, Adi Roiban wrote: I can see that when we get more builds there is a significant increase in memory usage... but it recovers once the master goes idle.
Cool. Is there something we can do to limit the global concurrency of the builds to preserve resources on the buildmaster, then? Or: perhaps we could move the buildbot to Carina, which has 4G of RAM and won't impact our hosting budget? -glyph
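On limiting global concurrency: Buildbot's master-side locks can cap how many builds run at once across every builder, which would bound peak memory on the master. A hedged master.cfg sketch only; the lock name, maxCount value, and the builder_specs list are made up for illustration:

```python
# master.cfg fragment (illustrative only) -- cap concurrent builds master-wide
from buildbot import locks
from buildbot.config import BuilderConfig

# At most 4 builds run anywhere at the same time; the rest queue up.
build_lock = locks.MasterLock("all-builds", maxCount=4)

builders = [
    BuilderConfig(name=name, slavenames=slavenames, factory=factory,
                  locks=[build_lock.access('counting')])
    for name, slavenames, factory in builder_specs  # hypothetical list
]
```

The 'counting' access mode lets up to maxCount holders proceed concurrently, as opposed to 'exclusive', which would serialize everything.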
participants (5)
- Adi Roiban
- Amber Brown
- Craig Rodrigues
- Glyph Lefkowitz
- James Broadhead