[Twisted-Python] buildbot.twistedmatrix.com is down a lot

In the past few days, buildbot.twistedmatrix.com seems to be down all the time, and requires manual restarts. As I write this, it is down right now.
Is there something wrong with the hardware involved with buildbot.twistedmatrix.com?
-- Craig

On 17 July 2016 at 06:11, Craig Rodrigues rodrigc@crodrigues.org wrote:
In the past few days, buildbot.twistedmatrix.com seems to be down all the time, and requires manual restarts. As I write this, it is down right now.
Is there something wrong with the hardware involved with buildbot.twistedmatrix.com?
The hardware is fine. For some unknown reason, the buildmaster process was terminated.
I have restarted it again.

It's OOMing -- I think the upgrade to Eight trunk introduced some sort of memory usage regression or we've done something wrong -- I've unfortunately not had time to investigate.
We could size up the RAM in the meantime I guess?
-Amber
On 17 Jul 2016 08:19, "Adi Roiban" adi@roiban.ro wrote:
On 17 July 2016 at 06:11, Craig Rodrigues rodrigc@crodrigues.org wrote:
In the past few days, buildbot.twistedmatrix.com seems to be down all the time, and requires manual restarts. As I write this, it is down right now.
Is there something wrong with the hardware involved with buildbot.twistedmatrix.com?
The hardware is fine. For some unknown reason, the buildmaster process was terminated.
I have restarted it again.
Adi Roiban

On 17 July 2016 at 07:21, Amber Brown hawkowl@atleastfornow.net wrote:
It's OOMing -- I think the upgrade to Eight trunk introduced some sort of memory usage regression or we've done something wrong -- I've unfortunately not had time to investigate.
We could size up the RAM in the meantime I guess?
-Amber
I can try to revert the github webhooks + github status send and see if we still get these errors.
I also don't have too much time to investigate, but I can revert things if it helps.
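For reference, the two GitHub-facing pieces being reverted look roughly like this in a stock Buildbot 0.8.x master.cfg. This is only a sketch: the real Twisted configuration lives in braid and may be wired differently, and the port, token, and repository values below are placeholders.

    # master.cfg fragment (sketch): the inbound GitHub webhook endpoint and the
    # outbound GitHub commit-status reporter, per stock Buildbot 0.8.x.
    from buildbot.status import html
    from buildbot.status.github import GitHubStatus

    c = BuildmasterConfig = {}   # other keys (slaves, schedulers, builders) omitted
    c['status'] = []

    # Inbound: exposes /change_hook/github for GitHub webhook POSTs; removing
    # the 'github' dialect disables webhook processing.
    c['status'].append(html.WebStatus(
        http_port=8010,
        change_hook_dialects={'github': True},
    ))

    # Outbound: pushes build results back to GitHub as commit statuses.
    c['status'].append(GitHubStatus(
        token='<github-api-token>',   # placeholder
        repoOwner='twisted',
        repoName='twisted',
    ))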

Yeah, that's a good idea - disable them for now and we'll see whether the OOMs keep happening. Then we can investigate more closely if they stop.
On 17 Jul 2016 08:37, "Adi Roiban" adi@roiban.ro wrote:
On 17 July 2016 at 07:21, Amber Brown hawkowl@atleastfornow.net wrote:
It's OOMing -- I think the upgrade to Eight trunk introduced some sort of memory usage regression or we've done something wrong -- I've unfortunately not had time to investigate.
We could size up the RAM in the meantime I guess?
-Amber
I can try to revert the github webhooks + github status send and see if we still get these errors.
I also don't have too much time to investigate, but I can revert things if it helps.
-- Adi Roiban

On 17 July 2016 at 07:38, Amber Brown hawkowl@atleastfornow.net wrote:
Yeah, that's a good idea - disable them for now and we'll see whether the OOMs keep happening. Then we can investigate more closely if they stop.
On 17 Jul 2016 08:37, "Adi Roiban" adi@roiban.ro wrote:
On 17 July 2016 at 07:21, Amber Brown hawkowl@atleastfornow.net wrote:
It's OOMing -- I think the upgrade to Eight trunk introduced some sort of memory usage regression or we've done something wrong -- I've unfortunately not had time to investigate.
We could size up the RAM in the meantime I guess?
-Amber
I can try to revert the github webhooks + github status send and see if we still get these errors.
I also don't have too much time to investigate, but I can revert things if it helps.
There is this ticket https://github.com/twisted-infra/braid/issues/216 to track the progress and changes.

On 17 July 2016 at 07:21, Amber Brown hawkowl@atleastfornow.net wrote:
It's OOMing (...)
Have you considered something like monit[1] to detect & restart in cases like this?
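To illustrate the detect-and-restart idea, here is a crude cron-driven watchdog written as plain Python rather than monit's own declarative config; the pid-file path, basedir, and memory threshold are invented for the example.

    # Crude detect-and-restart watchdog in the spirit of monit (illustration only).
    # All paths and thresholds are hypothetical; run it from cron every few minutes.
    import os
    import subprocess

    import psutil  # third-party: pip install psutil

    BASEDIR = '/srv/buildmaster'                  # hypothetical buildmaster basedir
    PIDFILE = os.path.join(BASEDIR, 'twistd.pid')
    RSS_LIMIT = 1500 * 1024 * 1024                # restart above ~1.5GB resident

    def master_process():
        """Return the buildmaster process, or None if it is not running."""
        try:
            with open(PIDFILE) as f:
                return psutil.Process(int(f.read().strip()))
        except (IOError, ValueError, psutil.NoSuchProcess):
            return None

    proc = master_process()
    if proc is None:
        # Process is gone (e.g. OOM-killed): bring it back.
        subprocess.check_call(['buildbot', 'start', BASEDIR])
    elif proc.memory_info().rss > RSS_LIMIT:
        # Still alive but ballooning: restart before the kernel OOM killer acts.
        subprocess.check_call(['buildbot', 'restart', BASEDIR])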

On 18 July 2016 at 19:04, James Broadhead jamesbroadhead@gmail.com wrote:
On 17 July 2016 at 07:21, Amber Brown hawkowl@atleastfornow.net wrote:
It's OOMing (...)
Have you considered something like monit[1] to detect & restart in cases like this?
This might help, but it will not help us understand what we are doing wrong :)
After disabling the GitHub webhooks, the buildbot looks stable... so we might have a clue about what is going wrong.
Right now I don't have time to look into this issue, so the GitHub hooks are disabled for now from the GitHub UI.

On Jul 20, 2016, at 6:31 AM, Adi Roiban adi@roiban.ro wrote:
On 18 July 2016 at 19:04, James Broadhead jamesbroadhead@gmail.com wrote:
On 17 July 2016 at 07:21, Amber Brown hawkowl@atleastfornow.net wrote:
It's OOMing (...)
Have you considered something like monit[1] to detect & restart in cases like this?
This might help, but it will not help us understand what we are doing wrong :)
After disabling the GitHub webhooks, the buildbot looks stable... so we might have a clue about what is going wrong.
Right now I don't have time to look into this issue, so the GitHub hooks are disabled for now from the GitHub UI.
Can someone who's had a direct look at the OOMing process (adi? amber?) report this upstream? It's a real pity that we won't get github statuses for buildbot builds any more; that was a huge step in the right direction.
-glyph

On 20 July 2016 at 17:51, Glyph Lefkowitz glyph@twistedmatrix.com wrote:
On Jul 20, 2016, at 6:31 AM, Adi Roiban adi@roiban.ro wrote:
On 18 July 2016 at 19:04, James Broadhead jamesbroadhead@gmail.com wrote:
On 17 July 2016 at 07:21, Amber Brown hawkowl@atleastfornow.net wrote:
It's OOMing (...)
Have you considered something like monit[1] to detect & restart in cases like this?
This might help, but it will not help us understand what we are doing wrong :)
After disabling the GitHub webhooks, the buildbot looks stable... so we might have a clue about what is going wrong.
Right now I don't have time to look into this issue, so the GitHub hooks are disabled for now from the GitHub UI.
Can someone who's had a direct look at the OOMing process (adi? amber?) report this upstream? It's a real pity that we won't get github statuses for buildbot builds any more; that was a huge step in the right direction.
I am not sure how to approach this. By the time I was observing the issue, the buildbot process was already dead.
I have recently discovered the Rackspace monitoring capabilities for VM... and set up a memory notification... not sure who will receive the alerts.
I have re-enabled the GitHub hooks and will start taking a closer look at the buildmaster process... but maybe 2GB is just not enough for a buildmaster.
I have triggered the creation of an image for the current buildbot machine and will consider upgrading the buildbot to 4GB of memory to see if we still hit the ceiling.
For my project I have a buildmaster of similar size in terms of the number of builders and slaves (without GitHub hooks and without linter factories), and after 2 weeks of uptime its virtual memory usage is 1.5GB... so maybe 2GB is just not enough for Buildbot.
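One low-effort way to take that closer look is to sample the master's memory from cron and see whether it climbs steadily (a leak) or only spikes while builds are running. A minimal sketch, with hypothetical paths:

    # Append one CSV line per run with the buildmaster's memory usage (sketch).
    # Run from cron every few minutes; the paths are hypothetical.
    import time

    import psutil  # third-party: pip install psutil

    PIDFILE = '/srv/buildmaster/twistd.pid'
    LOGFILE = '/srv/buildmaster/memory-usage.csv'

    with open(PIDFILE) as f:
        mem = psutil.Process(int(f.read().strip())).memory_info()

    with open(LOGFILE, 'a') as log:
        # timestamp, resident set size (bytes), virtual size (bytes)
        log.write('%s,%d,%d\n' % (time.strftime('%Y-%m-%dT%H:%M:%S'), mem.rss, mem.vms))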

On Jul 20, 2016, at 11:01 AM, Adi Roiban adi@roiban.ro wrote:
On 20 July 2016 at 17:51, Glyph Lefkowitz glyph@twistedmatrix.com wrote:
On Jul 20, 2016, at 6:31 AM, Adi Roiban adi@roiban.ro wrote:
On 18 July 2016 at 19:04, James Broadhead jamesbroadhead@gmail.com wrote:
On 17 July 2016 at 07:21, Amber Brown hawkowl@atleastfornow.net wrote:
It's OOMing (...)
Have you considered something like monit[1] to detect & restart in cases like this?
This might help, but it will not help us understand what we are doing wrong :)
After disabling the GitHub webhooks, the buildbot looks stable... so we might have a clue about what is going wrong.
Right now I don't have time to look into this issue, so the GitHub hooks are disabled for now from the GitHub UI.
Can someone who's had a direct look at the OOMing process (adi? amber?) report this upstream? It's a real pity that we won't get github statuses for buildbot builds any more; that was a huge step in the right direction.
I am not sure how to approach this. By the time I was observing the issue, the buildbot process was already dead.
Yeah, these types of issues are tricky to debug. Thanks for looking into it nonetheless; I was hoping you knew more, but if you don't, nothing to be done.
I have recently discovered the Rackspace monitoring capabilities for VM... and set up a memory notification... not sure who will receive the alerts.
I'll make sure that the relevant people are on the monitoring list.
I have re-enabled the GitHub hooks and will start taking a closer look at the buildmaster process... but maybe 2GB is just not enough for a buildmaster.
Thanks.
I have triggered the creation of an image for the current buildbot machine and will consider upgrading the buildbot to 4GB of memory to see if we still hit the ceiling.
For my project I have a buildmaster of similar size in terms of the number of builders and slaves (without GitHub hooks and without linter factories), and after 2 weeks of uptime its virtual memory usage is 1.5GB... so maybe 2GB is just not enough for Buildbot.
Bummer. It does seem like that's quite likely.
-glyph

On Jul 20, 2016, at 2:31 PM, Glyph Lefkowitz glyph@twistedmatrix.com wrote:
I have recently discovered the Rackspace monitoring capabilities for VM... and set up a memory notification... not sure who will receive the alerts.
I'll make sure that the relevant people are on the monitoring list.
I created 'technical contact' users for you and Amber, with current email addresses, which you can use (and even log in as!) if you edit yourselves under 'user management'. I apparently had one already. You should both have a bogus alert about a MySQL server (since we don't run mysql it seemed a reasonable thing to test). Make sure that's not flagged as spam and we should all be set up to receive alerts :).
I also added some basic HTTPS monitoring to it, so we should see if it goes down for reasons unrelated to memory.
-glyph

On 21 July 2016 at 00:58, Glyph Lefkowitz glyph@twistedmatrix.com wrote:
On Jul 20, 2016, at 2:31 PM, Glyph Lefkowitz glyph@twistedmatrix.com wrote:
I have recently discovered the Rackspace monitoring capabilities for VM... and set up a memory notification... not sure who will receive the alerts.
I'll make sure that the relevant people are on the monitoring list.
I created 'technical contact' users for you and Amber, with current email addresses, which you can use (and even log in as!) if you edit yourselves under 'user management'. I apparently had one already. You should both have a bogus alert about a MySQL server (since we don't run mysql it seemed a reasonable thing to test). Make sure that's not flagged as spam and we should all be set up to receive alerts :).
I also added some basic HTTPS monitoring to it, so we should see if it goes down for reasons unrelated to memory.
OK, I have received the MySQL test alert.
I can see that when we get more builds there is a significant increase in memory usage... but it recovers once the master goes idle.
For now the VM still has 2GB... and the GitHub webhooks are still enabled.
Regards

On Jul 21, 2016, at 5:49 AM, Adi Roiban adi@roiban.ro wrote:
I can see that when we get more builds there is a significant increase in memory usage... but it recovers once the master goes idle.
Cool. Is there something we can do to limit the global concurrency of the builds to preserve resources on the buildmaster, then?
Or: perhaps we could move the buildbot to Carina, which has 4G of RAM and won't impact our hosting budget?
-glyph
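On the concurrency question: Buildbot's 0.8.x series can cap how many builds run at once with a counting master lock shared by all builders. A minimal sketch follows; the lock name, count, and builder details are illustrative, not the actual Twisted configuration.

    # master.cfg fragment (sketch): cap global build concurrency with a master lock.
    from buildbot import locks
    from buildbot.config import BuilderConfig
    from buildbot.process.factory import BuildFactory

    c = BuildmasterConfig = {}        # other keys omitted
    factory = BuildFactory()          # build steps omitted for brevity

    # At most two builds may run anywhere on this master at the same time.
    build_slots = locks.MasterLock('build-slots', maxCount=2)

    c['builders'] = [
        BuilderConfig(
            name='example-builder',             # placeholder builder/slave names
            slavenames=['example-slave'],
            factory=factory,
            # Each build takes one counting slot; excess build requests queue up.
            locks=[build_slots.access('counting')],
        ),
    ]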
participants (5)
- Adi Roiban
- Amber Brown
- Craig Rodrigues
- Glyph Lefkowitz
- James Broadhead