Archive for the 'Sysadmin' Category

Springtime Hack

I could easily mark this as the worst morning in as far back as I can remember. Without the first cup of coffee I sat down to scan our servers like I do every day, just looking for anything out of the ordinary, like services that failed to run. For the most part it is a ten-minute job that rarely varies day to day. This morning was an exception.

Nearly every nightly job failed. Worse than that, there was an hour-and-ten-minute hole in the logs: 0155 to 0305 was completely unaccounted for. I scanned every log, from authentication to our application logs, and every single one of them showed this hole, but checking our external monitoring service showed that we had zero downtime. What the hell happened?

A cold hand of desperation and fear gripped my stomach, leaving me dizzy. I ran chkrootkit but came up clean, so I mentally prepared myself to rebuild the server and possibly be eviscerated by my bosses. How would I explain this? How could I protect us from it happening again, that is, if I still had my job?

Sitting helpless I realized, “Spring Ahead”.

Of Monit and Mongrels, Quick Thoughts

Monit has been serving us pretty well for the last 7 months or so in terms of keeping an eye on our mongrels and kicking them back in line if they act up. At the moment, we are running them in clusters of 3 with only 6 clusters in our current production setup, and monit restart all, for the most part, works fine when rolling them after a deploy. It is a completely different situation with the 10 clusters we are experimenting with, in a setup where Apache+mod_proxy sits on a separate server from the mongrels. It truly is wonderful to see Apache perform under load when it has all the resources it needs.

The problem seems to be how resource hungry and ponderous Rails can be when firing up a mongrel, sucking up 25%+ user CPU and making the system gobble up another 10%+. This is enough that when monit is trying to bring up or down 30 mongrels, some of the pack gets left out, failing to either shut down or start up. Now, there is a monkey patch out there that addresses this very issue, but I am a little wary of patching mongrel_cluster in our production environment as it might cause me headaches later with upgrades. So what is my solution?

Pressed for time it is a little kludgey and demonstrates some truly sloppy bash scripting but…it works.

#!/bin/bash
monit stop all -g pack_01
echo "Stopping 8100-02"
sleep 12s
monit stop all -g pack_02
echo "Stopping 8103-05"
sleep 12s
monit stop all -g pack_03
echo "Stopping 8106-08"
sleep 12s
monit stop all -g pack_04
echo "Stopping 8109-11"
sleep 12s
monit stop all -g pack_05
echo "Stopping 8112-14"
sleep 12s
monit stop all -g pack_06
echo "Stopping 8115-17"
sleep 12s
monit stop all -g pack_07
echo "Stopping 8118-20"
sleep 12s
monit stop all -g pack_08
echo "Stopping 8121-23"
sleep 12s
monit stop all -g pack_09
echo "Stopping 8124-26"
sleep 12s
monit stop all -g pack_10
echo "Stopping 8127-29"

So what I’ve done in the monitrc file is define each mongrel as belonging to a group that reflects its cluster. Then I issue a stop to each group with a 12 second delay between each so that Rails and monit can navigate around each other without either flipping out and forgetting to do their respective jobs. Starting works the same way, with start in place of stop.
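
The same staggered roll could also be written as a loop, so that adding an eleventh cluster means changing one number. This is only a sketch under a few assumptions: the function name roll_packs and the MONIT and DELAY variables are made up for illustration, the port arithmetic assumes three mongrels per pack starting at 8100 as in the script above, and the monit invocation is copied from that script as-is.

```bash
#!/bin/bash
# Sketch: roll through the monit groups with a pause between each one.
# roll_packs, MONIT and DELAY are illustrative names, not part of the original script.
roll_packs() {
    local action=$1 packs=${2:-10} delay=${DELAY:-12}
    local monit_cmd=${MONIT:-monit}
    local i group first last
    for (( i = 1; i <= packs; i++ )); do
        group=$(printf 'pack_%02d' "$i")
        # Three mongrels per pack, starting at port 8100 (assumed from the script above).
        first=$(( 8100 + (i - 1) * 3 ))
        last=$(( first + 2 ))
        "$monit_cmd" "$action" all -g "$group"
        echo "${action} ${group}: ports ${first}-${last}"
        # No need to wait after the final group.
        if (( i < packs )); then
            sleep "$delay"
        fi
    done
}
```

Calling roll_packs stop and later roll_packs start would mirror the stop and start scripts. Running it with MONIT=echo DELAY=0 prints the monit commands instead of executing them, which is a cheap way to check the group names before touching production.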

Does it work? Well, I have tried it under load several times (siege with a concurrency of 30 and the mongrels bloated up to our obese resting state) and it seems to work like a charm. I added the echos when our dev team expressed concern cap would time out while the script took its time to run all the way through.

File under: Ugly, Functional.

EC2, MySQL, Replication, Recovery, and You! (Hammer Time!)

I finally cobbled together an incredibly ugly but functional script for recovering or setting up a slave. The pure hideousness stems from the brute force, lack of error checking, cram that data down the db’s throat method that I am leveraging. See, I know just enough to get the job done but not nearly enough to do it with any elegance, flair, or care and concern for stability. Running with scissors, at night, with a blindfold, through a roomful of children’s toys and cats is my style.

Anyways, here we go…

This script is executed on the slave instance and will fetch the most recent copy of the db from the master, stop the slave, drop the db, recreate the db, read in the backup, issue the change master command, start the slave, and then display the slave status after a minute.

#!/bin/bash
# Recover slave post crash

# run backup from master
# transfer it to the slave
echo "Getting backup, this may take a while."
ssh master "/scripts/slave_recovery.sh WHATSLAVE"

echo

# untar backup
echo "Expanding backup and getting ready to import."
cd /mnt/tmp/recovery || exit 1
recover=$(ls | grep yourdb)
tar -xf $recover

# set variables
recodir=${recover:0:21}
mastfle=$(ls $recodir/ | grep master)
fullbin=$(cat $recodir/$mastfle | grep A.)
binlog=${fullbin:2}
fullpos=$(cat $recodir/$mastfle | grep B.)
positn=${fullpos:2}

echo "Here's what I have..."
echo $recodir
echo $recover
echo $mastfle
echo $binlog
echo $positn

# stop slave
echo "Stopping slave..."
mysql -e "stop slave;"

# drop database
echo "Dropping the database..."
mysql -e "drop database yourdb;"

# recreate database
echo "Recreating the database..."
mysql -e "create database yourdb;"

# source database from backup
echo "Importing the database..."
mysql yourdb < $recodir/$recodir.sql

# issue change master command
echo "Issuing the change master command..."
mysql -e "CHANGE MASTER TO MASTER_HOST='master', MASTER_USER='USERNAME', MASTER_PASSWORD='PASSWORD', MASTER_LOG_FILE='$binlog', MASTER_LOG_POS=$positn;"

# start slave
echo "I am starting the slave..."
mysql -e "start slave;"

# clean up
rm -r *yourdb*

# check status
echo "I'm waiting one minute then checking the status of the slave..."
sleep 1m
mysql -e "show slave status \G;"

echo
echo "I am all done."

Now, you might have noticed that on the seventh line I call another script on the master, and you might have noticed a variable trailing it. WHATSLAVE is whatever you called your slaves in the hosts file on the master. In my unimaginative case it is slavea and slaveb, but you could have Tom, Dick, and Harry, or the names of your favorite Hostess snack cake characters.

#! /bin/bash
# This script runs on the master and is built off the backup script

# set date variables
DAYNOW=$(date +%j)
TIMENOW=$(date +%H%M)

# grab info about the binlog and position of the database

status1=$(mysql -e 'show master status \G' | grep mysql)
status2=$(mysql -e 'show master status \G' | grep Position)
sql=${status1:18}
posit=${status2:18}

mkdir /mnt/tmp/backup/slave-yourdb-$DAYNOW-$TIMENOW

echo A.$sql >> /mnt/tmp/backup/slave-yourdb-$DAYNOW-$TIMENOW/master-$DAYNOW-$TIMENOW.txt
echo B.$posit >> /mnt/tmp/backup/slave-yourdb-$DAYNOW-$TIMENOW/master-$DAYNOW-$TIMENOW.txt

# dump database
mysqldump yourdb > /mnt/tmp/backup/slave-yourdb-$DAYNOW-$TIMENOW/slave-yourdb-$DAYNOW-$TIMENOW.sql

# tar SQL dump
cd /mnt/tmp/backup

tar -chf - slave-yourdb-$DAYNOW-$TIMENOW | gzip - > slave-yourdb-$DAYNOW-$TIMENOW.tar.gz

rm -r /mnt/tmp/backup/slave-yourdb-$DAYNOW-$TIMENOW/

# copy tar to slaves
scp /mnt/tmp/backup/slave-yourdb-$DAYNOW-$TIMENOW.tar.gz root@$1:/mnt/tmp/recovery/slave-yourdb-$DAYNOW-$TIMENOW.tar.gz
#clean up
rm /mnt/tmp/backup/*.gz*
echo "I'm all done!"

This is just our basic backup script but rather than trying to pass all the variables through ssh I decided to be lazy and just execute the script remotely.

Some of the things I would like to add would be more flexibility in reading the backup name and error checking. Down the line I want to see if I can just back up the schema and import that into the slave db so that I don’t lose all that time reading the db back in (500MB+ can take a while), and it would help with rapid recovery from data migrations. If you have any comments or suggestions, particularly if they trip your “WTF is wrong with this guy?” sensor, I’m all ears.
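
As a first step toward that error checking, here is a rough sketch of how the slave script might read the A. and B. lines from the master file and refuse to continue if either is missing. The function name read_master_coords is invented for illustration; binlog and positn are the variable names the recovery script already uses.

```bash
#!/bin/bash
# Sketch: read the binlog name and position written by the master script
# (lines of the form "A.<binlog>" and "B.<position>") and fail loudly
# if either one is missing. read_master_coords is an illustrative name.
read_master_coords() {
    local file=$1
    [[ -r $file ]] || { echo "cannot read $file" >&2; return 1; }
    # Only lines that START with "A." or "B." count; strip the two-character tag.
    binlog=$(sed -n 's/^A\.//p' "$file" | head -n 1)
    positn=$(sed -n 's/^B\.//p' "$file" | head -n 1)
    [[ -n $binlog ]] || { echo "no binlog name in $file" >&2; return 1; }
    [[ $positn =~ ^[0-9]+$ ]] || { echo "bad position in $file" >&2; return 1; }
}
```

In the recovery script this could sit right before the slave is stopped, along the lines of read_master_coords "$recodir/$mastfle" || exit 1, so that a truncated backup never gets as far as dropping the database.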

EC2, MySQL, Backup Recovery, and You! (redux)

Here we go again…

On the heels of the replication monitor, I’ve gone back and fine-tuned the fetch script to let you look back two days in the archives. Now, it is a bit janky because I am setting the days for the first array rather than parsing the actual buckets in S3 but my sed/awk skills are less than none. However, I suppose that the next version could be set up to ask how many days you want to look back easily enough.

#!/bin/bash
# set the environment
export AWS_ACCESS_KEY_ID=xxxyyyzzz
export AWS_SECRET_ACCESS_KEY=xxxyyyzzz
export SSL_CERT_DIR=/opt/s3sync/certs

DAYLST[0]=$(date +%j --date='2 days ago')
DAYLST[1]=$(date +%j --date='1 days ago')
DAYLST[2]=$(date +%j)

DAYNUM=${#DAYLST[@]}

echo

echo "Here are the available days for backup recovery."

echo

# echo each element in array
# for loop
for (( i=0;i<$DAYNUM;i++)); do
echo $i -  ${DAYLST[${i}]}
done

echo

echo -e "What day did you want to parse? \c"
read selectday
listday=${DAYLST[$selectday]}

echo "Ok, I'm going to get the backups from $listday."
echo

echo -e "How many did you want? \c"
read count

echo

# Get the list of backups on the server using s3cmd
dbsets=$(ruby s3cmd.rb list your_db_backups:$listday | tail -n $count)
ARRAY=($dbsets)
# get number of elements in the array
ELEMENTS=${#ARRAY[@]}

# echo each element in array
# for loop
for (( i=0;i<$ELEMENTS;i++)); do
echo $i -  ${ARRAY[${i}]:4}
done

# Prompt user for which backup they want to recover
echo

echo -e "Which backup set would you like to recover? \c"

read numbackup
backup=${ARRAY[$numbackup]:4}

echo "I am fetching your backup $backup now..."
echo

ruby s3cmd.rb get your_db_backups/$listday:$backup /tmp/$backup
cd /tmp
tar -xf $backup
sqlset=${backup:0:14}
mv $sqlset /root

echo "Your backup can be found here /root/$sqlset"
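
Asking how many days to look back, rather than hard-coding three entries, might look something like the sketch below. It assumes GNU date (for --date), and the names build_day_list and lookback are made up for illustration. It fills the same DAYLST array the script above uses, oldest day first.

```bash
#!/bin/bash
# Sketch: build DAYLST for the last N days instead of hard-coding three entries.
# Requires GNU date for --date. build_day_list and lookback are illustrative names.
build_day_list() {
    local lookback=$1 i
    # Reject anything that is not a plain non-negative number.
    [[ $lookback =~ ^[0-9]+$ ]] || { echo "need a number of days" >&2; return 1; }
    DAYLST=()
    for (( i = lookback; i >= 0; i-- )); do
        DAYLST+=( "$(date +%j --date="$i days ago")" )
    done
}
```

With that in place, the three DAYLST assignments could be swapped for a prompt, a read into lookback, and a call to build_day_list "$lookback"; the rest of the script would not need to change because DAYNUM is already computed from the array length.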

Still on the agenda is getting a slave to recover unassisted after a failure is detected but as my shell scripting abilities improve the possibility of it being realized grows.

EC2, MySQL Replication, Monitoring, and You!

So in a full turn of events I’ve gone back to replication as the most cost-effective solution for creating a high-availability environment for MySQL in EC2. The problem of the development team issuing schema changes frequently and without notification hasn’t changed, but I have gotten a little more sophisticated about how to deal with them kicking the slave in the teeth when they issue schema changes with impunity.

What I’ve done is build off the backup scripts I have written about previously, especially since they work so well, and create the beginnings of a metascript to oversee the slaves. It is aptly named slaver. This metascript checks the state of the slave and acts based on its state: slave up, run backups; slave down, issue notifications.

#!/bin/bash
### Slaver v0.0.1
### this script is intended to check on the status of the slave
### if the slave is down (IO or SQL) it will send an email out

### set the variables that we are checking for ###
slaver1=$(mysql -Bse 'show slave status \G;' | grep Slave_IO_Running)
slaver2=$(mysql -Bse 'show slave status \G;' | grep Slave_SQL_Running)
IO=${slaver1:29}
SQL=${slaver2:29}
COMBO=$IO-$SQL
count=$(mysql -Bse 'show slave status \G;' | grep -c Yes)

### this is sanity check testing stuff ###

#count=1
#echo $IO
#echo $SQL
#echo $COMBO
#echo $count
### this is sanity check testing stuff ###

### run the exception check ###
if [[ "$count" == "2" ]] ; then
/opt/s3sync/db_backup.sh
exit
else
### create status file and mail it ###
date >> status.txt
mysql -Bse 'show slave status \G;' >> status.txt
mutt you@yourhome.com -s "Slave Status :: DOWN" < status.txt
rm status.txt
fi
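
One thing to watch with counting every line that contains Yes is that other fields in the slave status can also report Yes (Master_SSL_Allowed, for example), which could make a half-dead slave look healthy. A slightly stricter check might look only at the two thread-state lines. This is a sketch, with slave_is_up as a made-up name, reading the status output on stdin:

```bash
#!/bin/bash
# Sketch: decide slave health from the two thread-state lines only,
# reading `show slave status \G` output on stdin.
# slave_is_up is an illustrative name.
slave_is_up() {
    local status io sql
    status=$(cat)
    # Match the exact field names (with the colon) and take the value after ": ".
    io=$(awk -F': ' '/Slave_IO_Running:/ {print $2}' <<< "$status")
    sql=$(awk -F': ' '/Slave_SQL_Running:/ {print $2}' <<< "$status")
    [[ $io == "Yes" && $sql == "Yes" ]]
}
```

The exception check could then become if mysql -e 'show slave status \G' | slave_is_up; then ... with the backup in the first branch and the notification in the else, and the status would only be fetched once per run.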

The next pieces to build out will be freezing the automated deletion of the old backup sets and attempting recovery of the slave if it is down. To get started on the latter I made some changes to the backup routine on the master:

#! /bin/bash

# Hourly cron job to upload to current bucket
# This is built off what we are currently running

# set date variables
DAYNOW=$(date +%j)
TIMENOW=$(date +%H%M)
# set the environment
export AWS_ACCESS_KEY_ID=xxxyyyzzz
export AWS_SECRET_ACCESS_KEY=xxxyyyzzz
export SSL_CERT_DIR=/opt/s3sync/certs

sleep 1m

# grab info about the binlog and position of the database

status1=$(mysql -e 'show master status \G' | grep mysql)
status2=$(mysql -e 'show master status \G' | grep Position)
sql=${status1:18}
posit=${status2:18}

mkdir /mnt/tmp/backup/you-$DAYNOW-$TIMENOW

echo A.$sql >> \
/mnt/tmp/backup/you-$DAYNOW-$TIMENOW/master-$DAYNOW-$TIMENOW.txt
echo B.$posit >> \
/mnt/tmp/backup/you-$DAYNOW-$TIMENOW/master-$DAYNOW-$TIMENOW.txt

# dump database
mysqldump geezeo > \
/mnt/tmp/backup/you-$DAYNOW-$TIMENOW/you-$DAYNOW-$TIMENOW.sql

# tar SQL dump
cd /mnt/tmp/backup

tar -chf - you-$DAYNOW-$TIMENOW | gzip - > you-$DAYNOW-$TIMENOW.tar.gz

rm -r /mnt/tmp/backup/you-$DAYNOW-$TIMENOW/

# copy tar to S3
cd /opt/s3sync
ruby s3sync.rb -vr --ssl /mnt/tmp/backup/ you_db_backups:$DAYNOW

#clean up
rm /mnt/tmp/backup/*.gz*

The key piece here is capturing the binlog number and position. With those two pieces captured, it becomes much easier to automate a recovery of the slave from the master’s backup.
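
Since the script asks for the master status twice, the file name and position could in principle come from two different moments. A possible tightening, sketched here with the made-up name master_coords, is to parse both out of a single status call and emit them in the same A./B. format:

```bash
#!/bin/bash
# Sketch: pull the binlog file and position out of ONE
# `show master status \G` result read on stdin, so both values
# come from the same snapshot. master_coords is an illustrative name.
master_coords() {
    awk -F': ' '
        $1 ~ /^ *File$/     { file = $2 }
        $1 ~ /^ *Position$/ { pos = $2 }
        END {
            # Fail rather than write a half-empty coordinates file.
            if (file == "" || pos == "") exit 1
            print "A." file
            print "B." pos
        }'
}
```

Usage would be along the lines of mysql -e 'show master status \G' | master_coords >> followed by the master-$DAYNOW-$TIMENOW.txt path. This does not by itself guarantee the dump matches those coordinates, since writes that land between the status call and the mysqldump would still move the position.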

More to follow…

EC2: Pound + Apache, Mongrel Cluster, and MySQL Cluster

Alternately, I should be titling this my 36 hour nightmare. Last week, high off the presentation, I built out and deployed the following configuration.

[Diagram: EC2 Cluster]

Everything was nice and tight, and after loading QA data it ran like a champ, but the problem was that the QA data was pretty thin, being only a fraction of the size of the production data. When we loaded production data into it, which by the way took nearly an hour to import, performance in the Cluster ground to a halt and we were faced with MySQL timing out the mongrels. Needless to say, after another 36 hours of work we abandoned this model and are looking at plain old replication for our data backend.

What could have given us all that grief? A couple of things spring to mind. The instances have 1.7GB of RAM and a single-core processor, which for now works like a champ for a single MySQL server but for whatever reason is not enough for a cluster under load. Also, running both SQL and Data Node services on the same box was likely less than inspired, as the SQL service would spin up, chew into the remaining RAM, and often dominate the CPU. However, when we launched the cluster we were also running some grossly inefficient queries with little or no indexing in the tables. A huge issue.

So we pulled back. At the moment we are still running the three-legged system (one instance running Pound, Apache, Monit, and Mongrels, one Harvester, and one MySQL instance), but we made significant changes to the DB so that all the bloated joins that Ruby likes to make are hitting indexed tables, and we tweaked my.cnf to boost the key buffer to 30% of RAM. Things seem better and we bought ourselves a little breathing room, but we are still hitting the limit of the number of mongrels we can run on a single instance (10 seems to be the upper threshold for stability), so we need to work out a method for building out a replicated set that will auto-recover after the countless data migrations that the dev team performs. That will be fun!
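
For the curious, the 30% figure works out to roughly half a gigabyte on these instances. A tiny sketch of the arithmetic, assuming the setting in question is MySQL's key_buffer_size and using the made-up function name key_buffer_line:

```bash
#!/bin/bash
# Sketch: work out a key buffer size as a percentage of RAM
# and print it as a my.cnf-style line. key_buffer_line is an illustrative name;
# key_buffer_size is assumed to be the setting meant above.
key_buffer_line() {
    local ram_mb=$1 pct=${2:-30}
    echo "key_buffer_size = $(( ram_mb * pct / 100 ))M"
}
```

Taking 1.7GB as about 1740MB, key_buffer_line 1740 prints key_buffer_size = 522M.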





Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States