Monday, December 06, 2010
Tumblr has been down for more than 12 hours due to an issue with their database cluster. Here is the comment I left on GigaOm.com:
This is the freshest lesson for entrepreneurs and startups:
- Learn to value your data
- Implement a high availability plan
- Plan a disaster recovery strategy
“Tumblr likely has the resources to recover…”
I really hope that holds true, but remember: data is the only irreplaceable asset of an organization. Once it’s gone, it’s gone.
When I was handling the disaster at Fotolog (massive database corruption when our SAN crashed), I couldn’t find any company or consulting firm ready to handle the situation and help with data recovery. It was a miracle that I came across the concept of DUDE (Data Unloading by Data Extraction) and, in sheer desperation, started writing InnoDB data recovery programs. At Fotolog, we had all the basic infrastructure in place for redundancy and high availability. The component that caused the disaster was the one we relied upon most: “the financial grade strength SAN.”
The point I am trying to make is that having cash in the bank + a large user base + really smart engineers doesn’t provide any guarantee that your data will be safe in case of a disaster.
Times like these put incredible stress on those handling the situation. I feel for the folks at Tumblr and am hoping for a speedy recovery.
Good luck Tumblr guys! You’re in my thoughts.
Frank
Friday, January 25, 2008
When no disaster recovery plan helps
Regardless of how "prepared" and "ready" one feels for a disaster, it will, in one form or another, inevitably happen. The best thing you can do is continuously revise and test your disaster recovery plan, strengthening it each time against every kind of disaster you can think of. Things generally go wrong when you least expect them to.
I was getting chills reading about how Charter Communications, a St. Louis-based ISP, accidentally deleted 14,000 active email accounts along with any attachments they carried. All the deleted data of active customers is irretrievable. As someone responsible for the data of one of the top 15 most heavily trafficked sites in the world (according to Alexa), I know I'd HATE to be in the shoes of the person responsible for this.
As I was reading the news story, I was constantly thinking about the title of Jay and Mike's 2006 presentation: "What Do You Mean There's No Backup?"
Once a disaster happens, you can immediately think of the possible ways it could have been avoided. The real challenge is implementing ways of avoiding all types of disasters before they happen.
For instance, to protect against such a disaster, or at the very least be able to recover from its effects, Charter Communications could have:
1. fully tested the script on a QA/test box to ensure that no records of active users would be deleted.
2. created a backup of the data by taking a file system snapshot just before running the script, so that the deleted data could be recovered. Depending on your operating system and storage system, there are many tools that let you take file system snapshots, such as fssnap (Solaris) and LVM (Linux). A minimal sketch of this appears after this list.
3. had a recoverable backup. There are plenty of cases out there where either no backup exists or the one that does exist turns out to be corrupt. With a periodic backup, Charter could have, for instance, just announced to their customers that they lost their new emails since last week, instead of dropping the ball and saying that *all* their email is lost. Even an off-site backup would have helped in this case, provided a selective restore from it was possible.
BTW, just a few days ago, I was testing a random sample of backups and found the backups of one database to be corrupt. That triggered a system-wide check of backups. The best approach I have found is to have a list of backups from all databases emailed to me; my report contains information about backups still running at the time the report was generated and the backups created the previous night.
4. If the deleted data was in a database such as MySQL, recovery from this disaster would have been possible by keeping a slave intentionally behind its master (see the note after this list).
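To make item 2 concrete, here is a minimal sketch of taking an LVM snapshot just before running a destructive cleanup script. The volume group, logical volume, snapshot size, and script name are all made up for illustration; the exact commands depend on your storage layout and file system.

# Assumption: the mail data lives on the logical volume /dev/vg0/maildata
# and the volume group has enough free extents for a snapshot.
lvcreate --snapshot --size 10G --name maildata_pre_cleanup /dev/vg0/maildata

# Run the cleanup only after the snapshot exists.
./purge_inactive_accounts.sh

# If the script deleted the wrong data, mount the snapshot read-only and copy files back.
mkdir -p /mnt/maildata_pre_cleanup
mount -o ro /dev/vg0/maildata_pre_cleanup /mnt/maildata_pre_cleanup

# Once you are confident the run was clean, drop the snapshot to reclaim the space.
umount /mnt/maildata_pre_cleanup
lvremove /dev/vg0/maildata_pre_cleanup

As for item 4, a deliberately lagged replica gives you a similar rewind window at the database level: if the slave is kept, say, an hour behind, a bad DELETE on the master can be stopped from ever reaching it. Tools such as Maatkit's mk-slave-delay exist specifically to keep a MySQL slave a fixed amount of time behind its master.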
What are some of the other ways you can think of to avoid a disaster or to execute a recovery plan?
There are many ways a disaster like this can be triggered. A few seemingly bizarre but very real ones come to mind:
- What if, in a hurry, you accidentally re-ran a previously executed DELETE command stored in your mysql client history on the wrong server? Or re-ran a disastrous command from your shell history in the wrong directory?
- What if you used a list of IDs generated from your QA/test machine to delete users from production machines/databases? Oh and the IDs were generated from an auto-increment column?
Can you think of more?
Sure, there are ways to protect against each kind of disaster (one small example follows below). The question then is: are you prepared against 'all' of them?
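As one small, partial safeguard for the "wrong server" scenario above, a couple of mysql client settings make that particular mistake harder to commit; they are no substitute for backups. This is only a sketch of one possible ~/.my.cnf addition; the prompt format is a matter of taste, and safe-updates only rejects UPDATE and DELETE statements that lack a WHERE or LIMIT clause.

# Append client-side safety nets to ~/.my.cnf (sketch; adjust to taste).
cat >> ~/.my.cnf <<'EOF'
[mysql]
# Show user, host, and current schema in the prompt so the "wrong server" stands out.
# Backslashes are doubled, as recommended for prompt values in option files.
prompt=(\\u@\\h) [\\d]>\\_
# Refuse UPDATE and DELETE statements that have no WHERE or LIMIT clause.
safe-updates
EOF

With this in place, a production hostname staring back at you in the prompt is one more chance to catch the mistake before hitting Enter.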
The disaster recovery plan of your company may help you steer out of such a disaster, but in Charter's case, their DR plan didn't cover this. They do, however, plan to reimburse affected customers $50. I don't know whether that will be enough to keep customers from switching.
If you are someone responsible for administering and executing disaster recovery plan(s) for your company, you may find my "Disaster is Inevitable--Are you ready" session at the MySQL Conference 2008 interesting. Plus, we can have a great conversation afterward. :)
See also: disaster recovery, disaster recovery journal, mysql conference
Labels: backup, disaster recovery, fssnap, mysql conference
Tuesday, January 22, 2008
Speaking at MySQL Conference 2008
I have been meaning to blog about this for quite some time, but time seems to be the scarcest resource in my life.
This year, I will be presenting three sessions at MySQL Conference:
Disaster Is Inevitable—Are You Prepared?
What’s the worst database disaster you expect to happen? Are you prepared? Does your architecture support quick recovery, or are there recovery bottlenecks built throughout it? What will happen if InnoDB crashes beyond repair, or if you suffer massive, irreparable data corruption? What can you do now to be better prepared when disaster does happen? Do you have data restoration tools and procedures in place in case you need to resort to extreme measures? Join us in this eye-opening, heart-racing, real-life-inspired presentation by Fotolog’s Director of Database Infrastructure, Farhan “Frank” Mashraqi, to find out the answers to these questions and more.
Optimizing MySQL and InnoDB on Solaris 10 for World's Largest Photo Blogging Community
Fotolog is a top 19 Internet destination with more than 12 million members, 315 million photos, and more than 3 billion page views a month. In just a few years, Fotolog has become a social phenomenon in Europe and South America. Through modifications to its data architecture, Fotolog was able to serve four times the number of users with the same number of database servers. A non-conventional, hybrid presentation that conveys the importance of scalability, performance tuning, and schema optimization in a practical way.
The Power of Lucene
Lucene is a powerful, high-performance, full-featured text search engine library written entirely in Java, suitable for applications of any size that require full-text search in heterogeneous environments.
In this presentation, learn how you can use Lucene with MySQL to offer powerful search capabilities within your application. The presentation will cover installation, usage, and optimization of Lucene, and how to interface a Ruby on Rails application with Lucene using a custom Java server. This session is highly recommended for those looking to add cross-platform, database-independent full-text search capability to their applications.
Registration is now open. See you all soon!