A Public Apology

The problems that plagued LISHost yesterday are about 99% cleared up. I still have 2 databases to restore, one for LISFeeds, and one other that seems to have some serious issues. So I'd like to publicly apologize to everyone for all the problems.

My apologies again if your site was affected and my apologies if a site you regularly read was down yesterday as well. Yesterday was a tough day and before some of the details fade from my already cloudy memory, I thought I'd write just a bit.

I started my day just like any other day, around 6am, by checking my email. This Thursday was different than most because I had 2 emails from panicked people in Europe saying their sites had just gone down. Upon further review... sure enough, MySQL was behaving badly. Last weekend I had moved the MySQL data directories over to a new hard drive on the server to speed things up a bit. During testing I had run across a problem where MySQL moved fine, but Apache would look for MySQL in the old location, causing all database-driven sites to lose their databases. It didn't seem to be a very common problem, but I did eventually figure out what to do. It turned out that MySQL has some special files that help it cache things, and rebuilding those files took care of the problem. So I moved MySQL on Easter Sunday and the big move went perfectly. All my planning and testing had paid off, or so I thought.

About 5:00am EST yesterday MySQL had troubles. It decided it needed to restart itself, and when it did, it forgot where it was. Or Apache forgot where it was, or it forgot to tell Apache where it went; in any case, something went wrong. Either a main MySQL table got corrupted, or one of those special files did, or something else did, and MySQL essentially got lost. It took me a while to figure out what was going on, and once I did, the solution turned out to be more of a problem than the problem. After some poking and prodding, one of two things happened: either 1) there was a disk error, or 2) the script I ran assumed all was well with MySQL and really screwed things up because of the corrupted tables or files. In any case, within about 3 seconds half of the most important files MySQL needed had either been emptied or deleted. As much as I'd love to go back and try to recreate exactly what happened so I know for sure, that would require breaking everything again, so I'm going to skip that part.

What a feeling that was! Imagine realizing you just lost 3 days of hard work for about 50 people; there isn't a profanity strong enough to express what I was feeling. It took a good half hour, maybe more, until I figured out just what it was going to take to get all the backups loaded back into MySQL. It wasn't going to be easy. I couldn't just blow out the entire installation because some databases weren't affected, and the backups wouldn't really allow me to restore things if I did it the easy way. So each database needed to be tested. If it was bad, I needed to delete the files by hand, run CREATE, then DROP, then CREATE again from the MySQL command line, then untar the backup and run it back in. If there were no errors, I could test the database and move on. Unfortunately, for reasons I don't fully understand, MySQL didn't like some of the backups. Even though they came from the exact same server just a few days earlier, it seemed that some files were written in a way that MySQL no longer liked. This is something I need to explore so I can understand the problem. Because of this, some of those backup files needed to be edited by hand and then loaded.
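
For the technically curious, the per-database routine boiled down to something like the Python sketch below. It's simplified (a plain drop-and-recreate instead of the exact CREATE/DROP/CREATE dance), and the paths, database names, and credentials are placeholders, not the real LISHost setup.

```python
#!/usr/bin/env python
"""Rough sketch of the per-database restore routine described above.

Everything here is a placeholder: the backup directory, the database
names, and the assumption that the mysql client can find its
credentials in a ~/.my.cnf file.
"""
import os
import subprocess

BACKUP_DIR = "/var/backups/mysql"        # hypothetical path
DATABASES = ["lisnews", "lisfeeds"]      # hypothetical database names


def mysql(*args, stdin=None):
    """Run the stock mysql command-line client and raise if it fails."""
    return subprocess.run(["mysql", "-u", "root"] + list(args),
                          stdin=stdin, check=True)


for db in DATABASES:
    # Clear out whatever half-broken copy is still there, then recreate
    # the database empty so MySQL has a clean slate to load into.
    mysql("-e", "DROP DATABASE IF EXISTS `%s`" % db)
    mysql("-e", "CREATE DATABASE `%s`" % db)

    # Stream the gunzipped dump straight back into the empty database.
    dump_path = os.path.join(BACKUP_DIR, db + ".sql.gz")
    gunzip = subprocess.Popen(["gunzip", "-c", dump_path],
                              stdout=subprocess.PIPE)
    mysql(db, stdin=gunzip.stdout)
    gunzip.wait()

    # Quick sanity check: list the tables so obvious failures show up now.
    mysql("-e", "SHOW TABLES", db)
```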

To make matters even worse, we had one domain come under intense spammer attack, which essentially froze the server on me and required a reboot at a terribly inappropriate time. Luckily no damage was done, and once I fought them off everything was ok.

So finally around 10pm last night almost everything was back in place. I still need to rebuild LISFeeds, and one more database, but I think all is well on the server front now.

Lessons Learned:

More/Better Backups: MySQL should've been backing itself up every day. I should've changed the schedule as soon as I added the new drives. Joe suggested we do it hourly, and if space and server power allow, I will do that as well. I was reading up on MySQL backup options and they seem surprisingly limited. I also need to do some dumps without data to make the files more manageable.
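
For anyone wondering what that might look like, here's a rough Python sketch of the kind of cron job I have in mind; the paths and database names are made up, and it assumes mysqldump can find its credentials in a ~/.my.cnf file.

```python
#!/usr/bin/env python
"""Sketch of a cron-able MySQL backup: one gzipped dump per database,
plus a schema-only (--no-data) dump so there's a small file to check
structure against. Paths, database names, and credentials are placeholders.
"""
import datetime
import gzip
import subprocess

BACKUP_DIR = "/var/backups/mysql"        # hypothetical path
DATABASES = ["lisnews", "lisfeeds"]      # hypothetical database names

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")


def dump(db, outfile, extra_args=()):
    """Run mysqldump for one database, gzipping the output as it streams."""
    cmd = (["mysqldump", "-u", "root", "--single-transaction"]
           + list(extra_args) + [db])
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    with gzip.open(outfile, "wb") as out:
        for chunk in iter(lambda: proc.stdout.read(64 * 1024), b""):
            out.write(chunk)
    if proc.wait() != 0:
        raise RuntimeError("mysqldump failed for %s" % db)


for db in DATABASES:
    # Full dump: schema plus data, one file per database, timestamped.
    dump(db, "%s/%s-%s.sql.gz" % (BACKUP_DIR, db, stamp))
    # Schema-only dump: tiny, and easier to eyeball when something breaks.
    dump(db, "%s/%s-%s-schema.sql.gz" % (BACKUP_DIR, db, stamp),
         extra_args=["--no-data"])
```

Run hourly (or nightly if space gets tight) out of cron, that leaves one manageable file per database, which also lines up with the quick-restore point below.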

Don't panic: Like the HG2G says, don't panic. I tried too many things too quickly first thing and lost track of what was working and what wasn't.

Backups need to be designed for quick restore: The user directories are backed up in a different way that I think will make it easy to restore from a failure, and I need to change MySQL to follow something similar. I designed the current backup method so everyone could grab a copy of their files, but it overlooks my own needs.

People are incredibly forgiving and patient: With only one exception, everyone who lost data was very understanding and patient. I had most of the sites restored within a few hours, and everyone back to normal in no more than 15 hours, so it wasn't the worst failure of all time, but it was still pretty bad as far as I'm concerned.

Backup immediately: Had I run a quick backup before I started tinkering I might have avoided some problems. I didn't because I didn't think the problems were serious enough to warrant a fresh backup. I was obviously wrong, and I should've erred on the side of caution.

With Friends Like These: You know who your true friends are because they send you sarcastic nasty ecards when you're busy working your ass off trying to save their databases.

Email is Not Conversation: Gmail failed miserably when faced with what I threw at it yesterday: well over 100 emails from about 80 or 90 people, many with the same subject. Gmail grouped many of them together, but not all. It made finding email I wanted to save a pain, replying to email difficult, and browsing through what I'd already done almost impossible. Email wasn't broken, and it didn't need to be fixed with a new interface.

I like what I do: This was a bad day that will live on for a long time on my top 10 bad days list, but I never felt like I wanted to pull the plug, even for a second.

Comments

No worries, Blake. Mr. Shoe and I came across the mysql error, which caused some temporary panic on Mr. Shoe's part (mostly along the lines of, "Damn, what did I break when I added that last module?"). But never for a minute were we annoyed with you.


Actually, we were kind of psyched at how quickly the problem was taken care of.

ah the joys of being an ISP:)

i'm assuming your 'cache' problem was the innodb cache?

what are you using to back up your mysql db's? i've found a cron'd script that'll do a mysqldump (one that turns off foreign key constraints for a bit, and then turns them back on, etc) works best. tar and gzip it and it's safe.

the only situation where that doesn't work is a heavily used db. but to get around that, create a slave of the master db and pull the backups from the slave. that way the backup doesn't tie up the master db for writes while it runs.
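
A minimal sketch of that slave-based approach, with the replica hostname, user, and database names as placeholders; the only real change from dumping the master is the --host argument pointing mysqldump at the slave:

```python
#!/usr/bin/env python
"""Sketch of the slave-based backup idea from the comment above: point
mysqldump at the replica so the master is never tied up during the dump.
The hostname, user, paths, and database names are placeholders.
"""
import subprocess

SLAVE_HOST = "db-slave.example.com"      # hypothetical replica
DATABASES = ["lisnews", "lisfeeds"]      # hypothetical database names

for db in DATABASES:
    outfile = "/var/backups/mysql/%s-from-slave.sql.gz" % db
    with open(outfile, "wb") as out:
        dump = subprocess.Popen(
            ["mysqldump", "--host", SLAVE_HOST, "-u", "backup",
             "--single-transaction", db],
            stdout=subprocess.PIPE)
        gzip_proc = subprocess.Popen(["gzip", "-c"], stdin=dump.stdout,
                                     stdout=out)
        # Close the parent's copy of the pipe so mysqldump gets SIGPIPE
        # if gzip exits early (standard idiom from the subprocess docs).
        dump.stdout.close()
        gzip_proc.wait()
        dump.wait()
```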