Monit has served us pretty well for the last 7 months or so, keeping an eye on our mongrels and kicking them back in line if they act up. At the moment we run them in clusters of 3, with only 6 clusters in our current production setup, and monit restart all works fine, for the most part, when rolling them after a deploy. It is a completely different situation with 10 clusters, which we are experimenting with in a setup where Apache+mod_proxy sits on a separate server from the mongrels. It truly is wonderful to see Apache perform under load when it has all the resources it needs.
The problem seems to be how resource hungry and ponderous Rails can be when firing up a mongrel: it sucks up 25%+ user CPU and makes the system gobble up another 10%+. That is enough that when monit tries to bring 30 mongrels up or down, some of the pack gets left out, failing to either shut down or start up. There is a monkey patch out there that addresses this very issue, but I am a little wary of patching mongrel_cluster in our production environment, as it might cause me headaches later with upgrades. So what is my solution?
I was pressed for time, so it is a little kludgey and demonstrates some truly sloppy bash scripting, but… it works.
#!/bin/bash
monit stop all -g pack_01
echo "Stopping 8100-02"
sleep 12s
monit stop all -g pack_02
echo "Stopping 8103-05"
sleep 12s
monit stop all -g pack_03
echo "Stopping 8106-08"
sleep 12s
monit stop all -g pack_04
echo "Stopping 8109-11"
sleep 12s
monit stop all -g pack_05
echo "Stopping 8112-14"
sleep 12s
monit stop all -g pack_06
echo "Stopping 8115-17"
sleep 12s
monit stop all -g pack_07
echo "Stopping 8118-20"
sleep 12s
monit stop all -g pack_08
echo "Stopping 8121-23"
sleep 12s
monit stop all -g pack_09
echo "Stopping 8124-26"
sleep 12s
monit stop all -g pack_10
echo "Stopping 8127-29"
What I’ve done in the monitrc file is define each mongrel as belonging to a group that reflects its cluster. Then I issue a stop to each group, with a 12 second delay between groups, so that Rails and monit can navigate around each other without either flipping out and forgetting to do its job. The start script is the same as above, with start in place of stop.
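For anyone who has not used monit groups, a monitrc entry for one mongrel might look roughly like the fragment below. This is only a sketch: the process name, the /var/www/app path, the pidfile location and the exact mongrel_rails arguments are all assumptions for illustration, not our actual config. The part that matters is the group line, which is what the -g option selects on.

```
# Hypothetical entry for the mongrel on port 8100 (first member of pack_01).
# Paths and process names here are made up for illustration.
check process mongrel_8100 with pidfile /var/www/app/log/mongrel.8100.pid
    start program = "/usr/bin/mongrel_rails start -d -e production -p 8100 -P log/mongrel.8100.pid -c /var/www/app"
    stop program = "/usr/bin/mongrel_rails stop -P log/mongrel.8100.pid -c /var/www/app"
    group pack_01
```

The entries for 8101 and 8102 would carry the same group pack_01 line, the next three ports would use pack_02, and so on.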
Does it work? Well, I have tried it under load several times (siege with a concurrency of 30 and the mongrels bloated up to our obese resting state) and it seems to work like a charm. I added the echo lines when our dev team expressed concern that cap would time out while the script took its time to run all the way through.
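Since the stop and start scripts differ only in the action, the two could probably be folded into one loop. The following is a minimal sketch rather than what we actually run: the roll_packs function name and the MONIT, DELAY, PACKS and BASE_PORT variables are my own inventions for illustration, and the monit invocation simply mirrors the argument order used in the script above. The echo after each group is kept so cap still sees output while the script works through the packs.

```shell
#!/bin/bash
# Sketch: roll monit groups one at a time, pausing between groups.
# Assumptions (not part of the original script): the roll_packs name and
# the MONIT, DELAY, PACKS and BASE_PORT variables are illustrative only.

MONIT="${MONIT:-monit}"          # command to run; override for a dry run
DELAY="${DELAY:-12}"             # seconds to wait between groups
PACKS="${PACKS:-10}"             # number of clusters (pack_01 .. pack_10)
PER_PACK=3                       # mongrels per cluster
BASE_PORT="${BASE_PORT:-8100}"   # port of the first mongrel

roll_packs() {
    local action="$1" verb i group first last
    case "$action" in
        stop)  verb="Stopping" ;;
        start) verb="Starting" ;;
        *)     echo "usage: roll_packs stop|start" >&2; return 1 ;;
    esac

    i=1
    while [ "$i" -le "$PACKS" ]; do
        group=$(printf 'pack_%02d' "$i")
        first=$((BASE_PORT + (i - 1) * PER_PACK))
        last=$((first + PER_PACK - 1))

        # Same argument order as the hand-written script above.
        $MONIT "$action" all -g "$group"
        # Progress line in the same "8100-02" style as the original echoes.
        printf '%s %d-%02d\n' "$verb" "$first" "$((last % 100))"

        # Give Rails and monit room to breathe, except after the last group.
        if [ "$i" -lt "$PACKS" ]; then
            sleep "$DELAY"
        fi
        i=$((i + 1))
    done
}

# Only act when an action is given on the command line.
if [ "$#" -gt 0 ]; then
    roll_packs "$1"
fi
```

Saved under a name like roll_packs.sh (hypothetical), it would be called with stop or start as its only argument. Setting MONIT="echo monit" and DELAY=0 in the environment turns it into a dry run that just prints the commands it would issue.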
File under: Ugly, Functional.

