I could easily mark this as the worst morning in as far back as I can remember. Without my first cup of coffee, I sat down to scan our servers like I do every day, just looking for anything out of the ordinary, like services that failed to run. For the most part it is a ten-minute job that rarely varies from day to day. This morning was an exception.
Nearly every nightly job had failed. Worse than that, there was an hour-and-ten-minute hole in the logs: 0155 to 0305 was completely unaccounted for. I scanned every log, from authentication to our application logs, and every single one of them showed this hole, yet our external monitoring service showed that we had zero downtime. What the hell happened?
A cold hand of desperation and fear gripped my stomach, leaving me dizzy. I ran chkrootkit and it came up clean, so I mentally prepared myself to rebuild the server and possibly be eviscerated by my bosses. How would I explain this? How could I protect us from it happening again, assuming I still had my job?
Sitting there helpless, I realized: “Spring Ahead”.
Monit has been serving us pretty well for the last 7 months or so, keeping an eye on our mongrels and kicking them back in line if they act up. At the moment we run them in clusters of 3, with only 6 clusters in our current production setup, and monit restart all works fine, for the most part, when rolling them after a deploy. It is a completely different situation with 10 clusters, which we are experimenting with in a setup where Apache+mod_proxy sits on a separate server from the mongrels. It truly is wonderful to see Apache perform under load when it has all the resources it needs.
The problem seems to be how resource-hungry and ponderous Rails can be when firing up a mongrel, sucking up 25%+ user CPU and making the system gobble up another 10%+. That is enough that when monit tries to bring 30 mongrels up or down, some of the pack get left out, failing to either shut down or start up. There is a monkey patch out there that addresses this very issue, but I am a little wary of patching mongrel_cluster in our production environment, as it might cause me headaches later with upgrades. So what is my solution?
I was pressed for time, so it is a little kludgy and demonstrates some truly sloppy bash scripting, but… it works.
#!/bin/bash
monit stop all -g pack_01
echo "Stopping 8100-02"
sleep 12s
monit stop all -g pack_02
echo "Stopping 8103-05"
sleep 12s
monit stop all -g pack_03
echo "Stopping 8106-08"
sleep 12s
monit stop all -g pack_04
echo "Stopping 8109-11"
sleep 12s
monit stop all -g pack_05
echo "Stopping 8112-14"
sleep 12s
monit stop all -g pack_06
echo "Stopping 8115-17"
sleep 12s
monit stop all -g pack_07
echo "Stopping 8118-20"
sleep 12s
monit stop all -g pack_08
echo "Stopping 8121-23"
sleep 12s
monit stop all -g pack_09
echo "Stopping 8124-26"
sleep 12s
monit stop all -g pack_10
echo "Stopping 8127-29"
What I’ve done in the monitrc file is define each mongrel as belonging to a group that reflects its cluster. Then I issue a stop to each group, with a 12-second delay between groups, so that Rails and monit can navigate around each other without either flipping out and forgetting to do its job. The start script is the same as above, with start in place of stop.
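To make the grouping concrete, here is a rough sketch of what the monitrc entries could look like, written as a small bash generator so that all 30 stanzas come out consistent. The part that matters is the group line; everything else is an assumption for illustration: the process names, the APP_DIR and PID_DIR paths, the /usr/bin/mongrel_rails location, and starting each mongrel with plain mongrel_rails start rather than through mongrel_cluster. None of it is copied from a real config.

```bash
#!/bin/bash
# Sketch only: print one monit "check process" stanza per mongrel port,
# tagging each with the group (pack) of its three-port cluster.
# APP_DIR, PID_DIR and the mongrel_rails path are hypothetical placeholders.
APP_DIR="${APP_DIR:-/var/www/app/current}"
PID_DIR="${PID_DIR:-/var/run/mongrel}"

emit_monitrc() {
  local port pack
  for ((port = 8100; port <= 8129; port++)); do
    # Ports 8100-8102 map to pack_01, 8103-8105 to pack_02, and so on.
    pack=$(printf 'pack_%02d' $(( (port - 8100) / 3 + 1 )))
    cat <<EOF
check process mongrel_${port} with pidfile ${PID_DIR}/mongrel.${port}.pid
  start program = "/usr/bin/mongrel_rails start -d -e production -p ${port} -P ${PID_DIR}/mongrel.${port}.pid -c ${APP_DIR}"
  stop program = "/usr/bin/mongrel_rails stop -P ${PID_DIR}/mongrel.${port}.pid"
  group ${pack}

EOF
  done
}

# Print the stanzas; redirect into a file that monitrc includes.
emit_monitrc
```

Generating the stanzas keeps the port-to-pack mapping in one line of arithmetic instead of 30 hand-edited entries, which matters once a script depends on those group names.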
Does it work? Well, I have tried it under load several times (siege with a concurrency of 30 and the mongrels bloated up to our obese resting state) and it seems to work like a charm. I added the echoes when our dev team expressed concern that cap would time out while the script took its time running all the way through.
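For anyone who would rather not maintain ten copy-pasted stanzas, the same sequence collapses into a loop. This is a sketch, not a drop-in replacement: it assumes the pack_01 to pack_10 group names and the 8100 to 8129 port layout from the script above, and the MONIT and DELAY overrides are additions of this sketch so the thing can be dry-run without touching any mongrels.

```bash
#!/bin/bash
# Sketch only: loop form of the staggered stop/start script above.
# MONIT and DELAY are illustrative overrides, not part of the original script.
MONIT="${MONIT:-monit}"   # set MONIT=echo for a dry run
DELAY="${DELAY:-12}"      # seconds to wait between packs

roll_packs() {
  local action="$1" i group first last
  for ((i = 1; i <= 10; i++)); do
    group=$(printf 'pack_%02d' "$i")
    first=$((8100 + (i - 1) * 3))   # each pack covers three consecutive ports
    last=$((first + 2))
    # Same argument order as the original script.
    "$MONIT" "$action" all -g "$group"
    # Keep emitting output so a remote caller sees progress.
    echo "${action} issued for ${group} (ports ${first}-${last})"
    # No need to wait after the final pack.
    if ((i < 10)); then
      sleep "$DELAY"
    fi
  done
}

# Act only when an action is given: ./roll_packs.sh stop  or  ./roll_packs.sh start
if [ "$#" -gt 0 ]; then
  roll_packs "$1"
fi
```

Running it as MONIT=echo DELAY=0 ./roll_packs.sh stop prints the ten monit commands it would issue without running them. Taking the action as an argument also means stop and start share one script instead of two near-identical ones.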
File under: Ugly, Functional.