IX Outage – Post Issue Technical Report


As many of you know, last week we experienced a major outage, affecting multiple email and database servers for just under 5 days.

It was, without a doubt, the worst technical issue we’ve ever experienced as a company.

Thankfully, all systems were brought back up without any loss of data.  Although we’re happy that we could restore all your data, we know the duration of the incident, as well as the incident itself, is unacceptable.

To make certain that something like this never happens again, we launched a full investigation to determine the root cause of the issue, outline the steps we took to handle it, and help us take preventative measures against this type of problem in the future.

I want to share this report with you, both to satisfy the curiosities of our more tech-savvy customers, and to illustrate the amount of time, work, and research it took to resolve this difficult issue.  I also want to share the steps we’ll be taking to prevent this issue in the future, which are included at the end of the report.

So, here is the post-issue follow-up my system administrators delivered to me this morning. I think you’ll find it informative, and I hope it sheds some light on the mess that was last week:

Incident Name: Storage Outage – sas3

Incident Date: 2014-03-02
Report Date: 2014-03-14

Services Impacted:

Storage sas3 on dbmail01
93 shared VMs (mail and MySQL for cp9-11); resources of 30,366 accounts were affected.

Incident Root Cause:

The affected SAN is an array of drives in a RAID 50 configuration made up of two RAID 5 groups (one parity group consisting of the even-numbered drives and one consisting of the odd-numbered drives). The array can survive two simultaneous disk failures as long as they are not in the same parity group. In our outage, drive 6 failed and drive 10 was added to the RAID to rebuild the group. During the rebuild, drive 0 failed, causing us to lose the even-numbered parity group. This occurred just before 4AM EST on 3/2/2014 and put the RAID into an unrecoverable state. Because there was a large potential for data loss, we contacted our hardware vendor’s support line before acting and were escalated to their engineering team. Total call time was 10 hours.

[Image: RAID failure]
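For the more tech-savvy readers, the failure tolerance described above can be sketched in a few lines of Python. This is only an illustrative model, not the vendor's logic: the function names are hypothetical, and the only facts taken from the report are that drives are split into even and odd RAID 5 groups and that each group tolerates one missing member.

```python
from collections import Counter


def parity_group(drive: int) -> str:
    """Return which RAID 5 group a drive belongs to (even or odd slot number)."""
    return "even" if drive % 2 == 0 else "odd"


def raid50_survives(missing_drives) -> bool:
    """True if no RAID 5 group is missing more than one member.

    A drive counts as missing if it has failed and its replacement
    has not finished rebuilding yet.
    """
    missing_per_group = Counter(parity_group(d) for d in set(missing_drives))
    return all(count <= 1 for count in missing_per_group.values())


# Drive 6 alone: the even group is degraded but still online.
print(raid50_survives([6]))

# Drive 0 fails before the rebuild of drive 6 completes: the even group is lost.
print(raid50_survives([6, 0]))
```

Under this model, one failure in each group is survivable, while two failures in the same group (as happened with drives 6 and 0) are not.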

Response:

To regain access to the data, we had to manually disable slots 10 and 15 (the spare drives) so that the RAID would not attempt to rebuild. Next, we reseated drive 6, which brought it online, but not as part of the RAID. This allowed the entire RAID to come back online in a degraded state with drive 0 active. Because drive 0 was still failing, we knew the RAID was in a very fragile state and that we had to move forward with great care or risk losing data.

Our hardware vendor showed us a binding procedure that allowed us to move the affected volumes off the storage system. We learned that if we triggered another failure in drive 0 at any point during this process, the RAID would go offline and we could lose access to the data. With this in mind, we migrated the volumes one at a time to reduce the stress on drive 0 and, with it, the chance of the drive failing again. We were methodical and deliberate in our approach and, thankfully, we migrated all data from the storage without triggering another failure in drive 0. The process completed, and all customers were back online as of 3/6/2014 just after 6PM EST. The whole process took almost 5 days.
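As a quick check of the "almost 5 days" figure, the duration can be worked out from the two timestamps given in this report. The exact minutes are an assumption: the report only says "just before 4AM" and "just after 6PM", so both are rounded to the hour here.

```python
from datetime import datetime

# Rounded timestamps from the report (both EST); exact minutes are assumed.
outage_start = datetime(2014, 3, 2, 4, 0)
service_restored = datetime(2014, 3, 6, 18, 0)

duration = service_restored - outage_start
duration_hours = duration.total_seconds() / 3600
duration_days = duration_hours / 24

print(f"{duration_hours:.0f} hours, or about {duration_days:.2f} days")
```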

Timeline:

You can find a timeline of events, from the initial outage to the final server’s reactivation, on our status blog.

What We’re Doing To Prevent This:

Improve Monitoring

Currently, our automated hardware checks notify us when a storage system has an issue of any type. While sophisticated, they’re not specific enough to tell us what the actual problem is. For instance, if a drive fails, we get a general notification rather than a ‘drive x has failed’ message. We are looking into using more specific, granular notifications for individual disks.
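To illustrate the difference between a general alert and a per-disk one, here is a minimal sketch. It is not our monitoring system: the state names ("ok", "failed", "rebuilding") and the function name are hypothetical, and the input is assumed to be a mapping of slot number to a state that some hardware check has already collected.

```python
def describe_disk_alerts(disk_states):
    """Turn per-disk states into specific, human-readable alerts.

    disk_states maps a slot number to a state string (hypothetical values).
    Healthy disks produce no alert.
    """
    alerts = []
    for slot in sorted(disk_states):
        state = disk_states[slot]
        if state == "failed":
            alerts.append(f"drive {slot} has failed")
        elif state == "rebuilding":
            alerts.append(f"drive {slot} is rebuilding")
        elif state != "ok":
            alerts.append(f"drive {slot} reports unknown state '{state}'")
    return alerts


for message in describe_disk_alerts({0: "ok", 6: "failed", 10: "rebuilding"}):
    print(message)
```

The point is simply that each alert names the disk and the condition, so whoever is on call knows which slot to look at first.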

Proactive Hardware Replacement

It may be possible to check via SNMP for things like disk errors on specific disks before they actually fail out of the RAID and trigger a rebuild. Replacing those disks proactively should result in fewer drive failures and fewer rebuilds.
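The idea can be sketched as a simple threshold check. This is an assumption-heavy illustration: how the error counters are actually read over SNMP depends on the storage vendor and is not shown, and the function name and the threshold of 10 errors are hypothetical.

```python
def disks_to_replace(error_counts, threshold=10):
    """Return the slots whose error counter has reached the threshold.

    error_counts maps a slot number to an error count that is assumed
    to have been polled already (for example over SNMP).
    """
    return sorted(
        slot for slot, errors in error_counts.items() if errors >= threshold
    )


# Hypothetical readings: slot 0 is accumulating errors but has not failed yet.
print(disks_to_replace({0: 42, 2: 0, 6: 3}))
```

A disk flagged this way could be swapped out during a planned window instead of failing on its own in the middle of another rebuild.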

Switch All Arrays to More Stable RAID

Our storage arrays currently use RAID 50. Although this is standard, a rebuild can take more than eight hours to complete. That is a window of eight hours or more in which we risk losing two drives from the same parity group. We can decrease this risk by moving to a RAID 10 setup, which rebuilds significantly faster, in about 3 hours.
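A rough way to compare the two rebuild windows is to estimate the chance that a second critical drive fails before the rebuild finishes. The sketch below is a simplified model, not a measurement: it assumes drives fail independently at a constant rate, and the 5% annual failure rate and the six other drives in the RAID 5 group are hypothetical figures chosen only for illustration. Only the 8-hour and 3-hour rebuild times come from this report.

```python
def second_failure_probability(critical_drives, rebuild_hours, annual_failure_rate):
    """Chance that at least one critical drive fails during the rebuild window.

    critical_drives: drives whose failure during the rebuild would take
    the group offline (the rest of a RAID 5 group, or the single mirror
    partner in RAID 10).
    """
    hours_per_year = 24 * 365
    # Probability that one drive survives the window, scaled from the annual rate.
    survive_one = (1.0 - annual_failure_rate) ** (rebuild_hours / hours_per_year)
    return 1.0 - survive_one ** critical_drives


# Hypothetical inputs: 6 other drives in the RAID 5 group, 5% annual failure rate.
raid50_risk = second_failure_probability(6, 8, 0.05)
raid10_risk = second_failure_probability(1, 3, 0.05)

print(f"RAID 50 rebuild window risk: {raid50_risk:.6f}")
print(f"RAID 10 rebuild window risk: {raid10_risk:.6f}")
```

Real drives under rebuild load, or drives that are already showing errors, are likely to fail more often than this model assumes, so the absolute numbers are optimistic. The comparison still shows why a shorter window with fewer critical drives is safer.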

Thanks for reading and again, we’re so sorry about this inconvenience. If you have questions, feel free to ask them in the comments.

Conclusion

Along with the steps we’ll be taking to improve our hardware, we also need to work on improving our speed and accuracy when it comes to identifying affected services and affected customers. This information should then make it to the blogs quickly so the customers know we are aware they are affected, and that we are working to fix it. The faster we can do this, the easier it is to pinpoint problems and implement fixes.

During this outage, we took too long to get detailed technical information (especially ETAs) communicated to you. We think it’s better to get some good information out, even if we have to admit that it’s only a rough estimate and the best we have right now, than to delay providing any real information. Many of you agreed.

Now that we understand the details of this issue, we can sleep a little better knowing that this won’t be likely to happen again and we’re all on the same page about what went wrong. We’re incredibly grateful for your trust, and we will continue to work tirelessly to regain and keep it.


Happy Holidays from IX!


Raise your hand if you spent the last few days running around like crazy, buying, wrapping and giving gifts for what feels like forever. A lot of really tired hands just went up at IX!

We do all this running around, buying presents, making food, trying and failing to avoid crowds, because we want to see the people we love smile!

The “smile countdown” has officially begun. And this year, the IX team got together to make 127 disadvantaged kids smile, and smile they did!

Supporting the Franklin County Children Services Holiday Wish charity has become an IX family tradition, and this was our most successful year to date!

Have you ever seen what an IX-Mas Tree made out of gifts looks like?  Watch below.

Enjoy these times, savor the smiles, and don’t forget to smile yourself!

Happy Holidays and a Happy New Year from IX Web Hosting!

Your Team at IX

PS:  Leave us your comments below! We’d love to hear what you think!


Get Your Site Featured with IX’s New Website Directory


Earlier this year, we mentioned a few of the features that would be launching alongside MyIX, and one of the first ones is ready for you to see:

Introducing the IX Web Hosting Website Directory!

With this new catalog, we’ve made it possible to see how similar sites and competitors in your industry are creating their websites. You can also browse each other’s sites to get some ideas for your own, as well as cross-promote your services to help improve your SEO.

Additionally, we will pick new websites from our network every week to be featured on our site directory. Those customers will get a badge to add to their site so they can show off their cool new status!


And guess what? YOU can be part of this new feature!

Check out the sites we currently have featured at http://www.ixwebsites.com/. While you’re there, browse through the different categories and see if anyone has a site that has a similar focus to yours.

And, if you want to add your site to our directory, just log in to your control panel at my.ixwebhosting.com and follow these steps:

1. Click “My Profile” on the top tool bar.


2. Click “Website Information” on the left.


3. Click “Add Website Profile” at the top and fill out the information about your site!


The more detail you provide, the more information we can list for your site in the directory!

So go ahead and check it out now! You can be one of the first to get your site added and viewable by other IX customers.

Have fun!


Keeping the Crazy out of Your Comments


Recently, Google announced that they are rolling out comment moderation abilities to a handful of YouTube users, with plans for a widespread release in the future.

This is a welcome addition to the YouTube community, and anyone who has ever been brave enough to view comments on even the most innocent of YouTube videos knows exactly why. For beneath nearly every YouTube video lies some of the most ignorant, racist, and profane nonsense you could ever imagine.

So, I started thinking… if a universally popular site like YouTube is experiencing these problems, what about our customers here at IX? What is an upstart blogger or online business owner supposed to do if they want user feedback, but want to avoid this type of online nastiness?

The Problem

Comment spam is a very odd phenomenon that takes place on the internet. The ability to anonymously say anything you want, coupled with the absence of accountability, turns some people into raving maniacs who choose to take out their frustrations and insecurities anywhere they can.

This problem isn’t only affecting YouTube, either. Take a look at the comment section of any website that handles movie reviews, news, politics, or gaming. You’re sure to find your fair share of senseless aggression and filth.

So, the question is, how do website owners defend themselves from these malicious parties?

Know Your Enemy

First, let’s identify where all of this garbage is coming from. There are 3 major categories of inappropriate comment sources that you’ll want to filter out:

1. Trolls – Trolls are people who get a kick out of ruffling feathers. They will purposely try to start an online ruckus by presenting a ridiculous opinion in a serious manner, or play devil’s advocate just to rile people up. They rarely believe what they write, but they take pleasure in getting under people’s skin.

2. Bots – If you ever see a comment that sounds like an advertisement, it’s most likely a spam bot. Much like spam emails, these automated bots leave comments on popular videos and blogs with promises of great wealth for minimal effort. Here’s an example:

[Image: example of a spam bot comment]

Note: Finding this bot spam took me literally 30 seconds. It was the most recent comment on the popular Gangnam Style music video on YouTube.

3. Genuine Craziness – Even though a lot of comment section trash is generated by people just being jerks to rile people up, there are some folks out there who have an agenda and legitimately believe in what they’re saying. These people can be particularly off-putting to level-headed comment posters because they’re typically angry, offensive, illogical, and relentless.

Filter out the Filth

Now that you know where all the junk is coming from, here are some methods to keep it out of your comment sections and message boards.

Turn off Anonymous Comments – If your blog software or comment section of your site has a setting that allows people to comment anonymously, make sure to switch it off. This will cut down on a very large number of spam bots and trolls.

Captcha – This free program prompts the user to type in a series of letters or numbers before they make a comment to ensure that they are a live human being and not an automated spam bot. You may have seen these at one time or another on an online store or when signing up for a newsletter.

[Image: reCAPTCHA prompt]

Captcha plugins are available for popular blogging and CMS software like WordPress, MediaWiki, and ASP.NET. Check out their website for more information.

Text Filters – Check to see if your blog or forum software of choice has a plugin that will filter out profanity, sexual references, and racial slurs. Some of these filters will just replace offending words with asterisks, while others will restrict an offending message from being posted at all. This should drive away or at least censor people who want to simply coat your comment section with four-letter words.
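If you’re curious how a text filter works under the hood, here is a minimal sketch of both behaviors mentioned above: masking offending words with asterisks, and rejecting a comment outright. It is not taken from any particular plugin. The function names are hypothetical, and the blocked-word list uses two mild placeholder words that you would swap for your own list.

```python
import re

# Hypothetical placeholder list; a real filter would use a much longer one.
BLOCKED_WORDS = {"darn", "heck"}

# Match blocked words only as whole words, ignoring case.
_BLOCKED_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in sorted(BLOCKED_WORDS)) + r")\b",
    re.IGNORECASE,
)


def mask_comment(text):
    """Replace each blocked word with asterisks of the same length."""
    return _BLOCKED_PATTERN.sub(lambda match: "*" * len(match.group(0)), text)


def is_allowed(text):
    """Return False if the comment contains any blocked word."""
    return _BLOCKED_PATTERN.search(text) is None


print(mask_comment("Well, DARN it"))
print(is_allowed("what the heck"))
```

Matching whole words only keeps the filter from mangling innocent words that happen to contain a blocked one, though a determined poster can still get around a simple list like this with creative spelling.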

Don’t Over-Censor

Not all confrontation is necessarily bad when it comes to online comments. While filtering out the bad apples makes a comment area a better, more peaceful place, don’t be tempted to start banning people because of minor disagreements or legitimate gripes.

For instance, if you sell a product, and someone leaves a message on your comment board that says, “Your product broke after 3 days. Not a happy customer,” you might be tempted to remove the content or ban the user so they don’t spread any more ‘bad press’ about your company. However, this is a huge mistake and can backfire for a number of reasons.

For one, your customers are your revenue base, so it’s best to keep them happy. Ignoring or deleting users that have real complaints doesn’t fix the underlying problem, and will only cause more headaches in the future.

Secondly, if you’re caught intentionally ignoring or censoring displeased customers, you’ll soon find yourself in a firestorm of bad press. Nothing shows your customer base that you don’t care about them more than deleting bad reviews to skew public reception in your favor. People will take to Facebook, Twitter, Yelp, and anywhere else that people will listen to decry your name and shady business practices. When it’s all said and done, the act of censoring a negative review can hurt far more than the negative review itself.

It’s best to address gripes and complaints head on, and in public. Make your own comment profile and respond to customers’ concerns directly. Let your users and customers see that you’re not afraid to address the issue and help someone who’s unhappy. It demonstrates your willingness to work with people and your openness to criticism.

Conclusion

These tips should help you ward off the majority of cyber-crazies, but there isn’t an everlasting cure-all. You have to be a constant gardener when it comes to your website’s forums. So, check your comment boards and user forums from time to time to make sure they haven’t been consumed by the internet’s darker side.

Thanks for reading, and good luck!


IX’s New Kitty Commercial


Everyone knows the internet wouldn’t survive without its two most vital ingredients:

1. Web hosting. No web hosting means no web sites, and no web sites essentially means no internet.

2. Cats. It’s common knowledge that the internet is dependent on people sharing cute cat pictures and videos in order to remain relevant.

So, when our VP of Operations adopted an abandoned kitten that was rescued from an empty storage facility, we took the opportunity to combine both major components of the internet into one adorable commercial.

Click here to watch!

http://www.youtube.com/watch?v=eyfkmfiTjDQ


We're Always There When You Need Us The Most!

Your Dedicated Support

At IX, we take care of our customers. And dedicated support is one of the ways we prove to you again and again that we are here to help you every step of the way, regardless of your skill level. With IX dedicated support, you get a support technician personally assigned to assist you. You get their name, number, email, social media connections, and work schedule! It's just one more facet of our service which proves our deeply rooted belief that being a great hosting provider requires not just cutting-edge technologies, but also the best in support and service.