IX Outage – Post Issue Technical Report

As many of you know, last week we experienced a major outage, affecting multiple email and database servers for just under 5 days.

It was, without a doubt, the worst technical issue we’ve ever experienced as a company.

Thankfully, all systems were brought back up without any loss of data.  Although we’re happy that we could restore all your data, we know the duration of the incident, as well as the incident itself, is unacceptable.

To make certain that something like this never happens again, we launched a full investigation to determine the root cause of the issue, outline the steps we took to handle it, and help us take preventative measures against this type of problem in the future.

I want to share this report with you, both to satisfy the curiosity of our more tech-savvy customers and to illustrate the amount of time, work, and research it took to resolve this difficult issue.  I also want to share the steps we’ll be taking to prevent this issue in the future, which are included at the end of the report.

So, here is the post-issue follow-up my system administrators delivered to me this morning. I think you’ll find it informative, and I hope it sheds some light on the mess that was last week:

Incident Name: Storage Outage – sas3

Incident Date: 2014-03-02
Report Date: 2014-03-14

Services Impacted:

Storage sas3 on dbmail01
93 shared VMs (mail and MySQL for cp9-11); resources of 30,366 accounts were affected.

Incident Root Cause:

The affected SAN consists of an array of drives in a RAID 50 configuration built from two RAID 5 groups (one parity group made up of the even-numbered drives and one made up of the odd-numbered drives).  The array can survive two simultaneous disk failures as long as they are not in the same parity group.  In our outage, drive 6 failed and drive 10 was added to the RAID to rebuild the group.  During the rebuild, drive 0 also failed, which cost us the even-numbered parity group.  This occurred just before 4 AM EST on 3/2/2014 and left the RAID in an unrecoverable state.  Because there was a large potential for data loss, we contacted our hardware vendor’s support line before acting and were escalated to their engineering team.  Total call time was 10 hours.
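To make the parity-group rule concrete, here is a minimal Python sketch of the layout described above. The function names are ours and purely illustrative; the only facts taken from the report are the even/odd grouping and the drive numbers involved.

```python
def parity_group(drive: int) -> str:
    """Return the RAID 5 group a drive slot belongs to (even or odd slots)."""
    return "even" if drive % 2 == 0 else "odd"


def array_survives(missing_drives: set[int]) -> bool:
    """Each RAID 5 group tolerates one missing member; RAID 50 needs every group alive."""
    losses = {"even": 0, "odd": 0}
    for drive in missing_drives:
        losses[parity_group(drive)] += 1
    return all(count <= 1 for count in losses.values())


print(array_survives({6}))     # True: one drive missing from the even group
print(array_survives({6, 1}))  # True: one drive missing from each group
print(array_survives({6, 0}))  # False: two drives missing from the even group
```

The last case is the one we hit: drive 6 was still being rebuilt onto drive 10 when drive 0 failed, so the even-numbered group was missing two members at once.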

Response:

In order to regain access to the data, we had to manually disable slots 10 and 15 (the spare drives) so that the RAID would not attempt to rebuild.  Next, we reseated drive 6 which brought it online, but not as part of the RAID.  This allowed the entire RAID to come back online in a degraded state with drive 0 active.  Because drive 0 was still failing, we knew the RAID was in a very fragile state and that we had to move forward with great care or we would risk losing data.

Our hardware vendor showed us a binding procedure that allowed us to move the affected volumes off the storage system.  We learned that if we triggered another failure in drive 0 at any point during this process, the RAID would go offline and we could lose access to the data.  With this in mind, we migrated the volumes one at a time to reduce the stress on drive 0 and, with it, the chance of another failure.  We were methodical and deliberate in our approach and, thankfully, we succeeded in migrating all data from the storage without triggering another failure in drive 0.  The process completed, and all customers were back online as of 3/6/2014 just after 6 PM EST.  The whole process took almost 5 days.
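For anyone who wants to check the "almost 5 days" figure, the arithmetic can be sketched in a few lines of Python. The two timestamps are approximations of "just before 4 AM" and "just after 6 PM" from this report, not exact log entries.

```python
from datetime import datetime

# Approximate timestamps taken from the report (both EST).
outage_start = datetime(2014, 3, 2, 4, 0)
all_back_online = datetime(2014, 3, 6, 18, 0)

duration = all_back_online - outage_start
hours = duration.total_seconds() / 3600
print(f"{hours:.0f} hours, or about {hours / 24:.1f} days")  # 110 hours, or about 4.6 days
```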

Timeline:

You can find a timeline of events on our status blog–from the initial outage to the final server’s reactivation.

What We’re Doing To Prevent This:

Improve Monitoring

Currently, our automated hardware checks are set to notify us when a storage system has an issue of any type.  While sophisticated, these checks are not specific enough to tell us what the actual problem is.  For instance, if a drive fails, we get a general notification rather than a ‘drive x has failed’ message.  We are looking into using more specific, granular notifications for individual disks.
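The difference between the two styles of notification can be sketched as follows. This is an illustration only: the `DiskStatus` record, the state names, and both functions are hypothetical and do not describe our actual monitoring system.

```python
from dataclasses import dataclass


@dataclass
class DiskStatus:
    array: str  # storage system name, e.g. "sas3"
    slot: int   # drive slot number
    state: str  # hypothetical states: "ok", "failed", "rebuilding"


def generic_alerts(disks: list[DiskStatus]) -> list[str]:
    """General style: one vague message per affected storage system."""
    arrays = sorted({d.array for d in disks if d.state != "ok"})
    return [f"{name}: storage issue detected" for name in arrays]


def granular_alerts(disks: list[DiskStatus]) -> list[str]:
    """Granular style: one message per disk that is not healthy."""
    return [f"{d.array}: drive {d.slot} is {d.state}" for d in disks if d.state != "ok"]


disks = [
    DiskStatus("sas3", 0, "ok"),
    DiskStatus("sas3", 6, "failed"),
    DiskStatus("sas3", 10, "rebuilding"),
]
print(generic_alerts(disks))   # ['sas3: storage issue detected']
print(granular_alerts(disks))  # ['sas3: drive 6 is failed', 'sas3: drive 10 is rebuilding']
```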

Proactive Hardware Replacement

It may be possible to check via SNMP for things like disk errors on specific disks before they actually fail out of the RAID and trigger a rebuild. This should result in fewer drive failures and fewer rebuilds.
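As a rough sketch of the idea, assume the per-disk error counters have already been collected (for example over SNMP; the actual OIDs are vendor-specific and not shown here). The counter values and the threshold of 50 below are made up for illustration.

```python
def drives_to_replace(error_counts: dict[int, int], threshold: int = 50) -> list[int]:
    """Return the slots whose error counter has reached the (assumed) threshold."""
    return sorted(slot for slot, errors in error_counts.items() if errors >= threshold)


# Hypothetical per-slot error counters from one polling run.
sample_counts = {0: 72, 1: 0, 6: 140, 7: 3}
print(drives_to_replace(sample_counts))  # [0, 6]
```

A disk flagged this way could be swapped during a planned window, rather than failing out of the RAID on its own schedule.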

Switch All Arrays to More Stable RAID

Our storage arrays currently use RAID 50. Although this is standard, a rebuild can take more than eight hours to complete. That is a window of eight hours or more in which we risk losing a second drive from the same parity group.  We can decrease this risk by moving to a RAID 10 setup, which rebuilds significantly faster, in about 3 hours.
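To give a feel for why the shorter window matters, here is a back-of-the-envelope sketch. It assumes independent drive failures at a constant rate, a 3% annual failure rate, and seven drives per RAID 5 group; all three are our assumptions for illustration, not measured values. The only structural fact used for RAID 10 is that a rebuilding mirror depends on a single partner drive.

```python
import math


def second_failure_probability(exposed_drives: int, rebuild_hours: float,
                               annual_failure_rate: float) -> float:
    """Chance that at least one exposed drive fails during the rebuild window,
    assuming independent failures at a constant rate."""
    hourly_rate = -math.log(1.0 - annual_failure_rate) / (365 * 24)
    return 1.0 - math.exp(-exposed_drives * hourly_rate * rebuild_hours)


AFR = 0.03  # assumed annual failure rate per drive

# RAID 50: the 6 surviving drives of an assumed 7-drive group, 8-hour rebuild.
raid50 = second_failure_probability(exposed_drives=6, rebuild_hours=8, annual_failure_rate=AFR)
# RAID 10: only the mirror partner is critical, 3-hour rebuild.
raid10 = second_failure_probability(exposed_drives=1, rebuild_hours=3, annual_failure_rate=AFR)

print(f"RAID 50: {raid50:.6f}  RAID 10: {raid10:.6f}  ratio: {raid50 / raid10:.1f}x")
```

Under these assumptions the RAID 50 window is roughly 16 times riskier (6 drives times 8 hours versus 1 drive times 3 hours). Real failures during a rebuild tend to be correlated, since the drives are the same age and under extra load, so the absolute numbers are likely optimistic; the comparison is the point.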

Thanks for reading and again, we’re so sorry about this inconvenience. If you have questions, feel free to ask them in the comments.

Conclusion

Along with the steps we’ll be taking to improve our hardware, we also need to work on improving our speed and accuracy when it comes to identifying affected services and affected customers. This information should then make it to the blogs quickly so the customers know we are aware they are affected, and that we are working to fix it. The faster we can do this, the easier it is to pinpoint problems and implement fixes.

During this outage, we took too long to get detailed technical information (especially ETAs) communicated to you. We think it’s better to get some good information out, even if we have to admit that it’s a rough estimate and the best we have right now, than to delay providing any real information. Many of you agreed.

Now that we understand the details of this issue, we can sleep a little better knowing that this won’t be likely to happen again and we’re all on the same page about what went wrong. We’re incredibly grateful for your trust, and we will continue to work tirelessly to regain and keep it.

Important Information about HIPAA Compliance

As part of IX Web Hosting’s ongoing compliance initiatives, we have identified recent changes in the Health Insurance Portability and Accountability Act (“HIPAA”) that may impact some of IX’s customers. Specifically, these changes require Covered Entities and all of their Business Associates who create, receive, maintain, transmit, or have access to protected health information (or where the possibility exists that the protected health information in the business associate’s custody or control could be compromised) to independently comply with HIPAA.

To help identify those customers who may be impacted, IX is asking its customers to notify IX if they are considered a Covered Entity or Business Associate under HIPAA and store or transmit electronic protected health information using IX’s services. IX will assume that the recent changes in HIPAA do not impact IX customers who do not identify themselves as a Covered Entity or Business Associate. IX has also updated its Terms of Service with customers to prohibit the use of protected healthcare information on websites that IX hosts.

If you answer “yes” to both of the following questions, please contact legal@ixwebhosting.com no later than 09/23/2013.

  • Is your business a Covered Entity or a Business Associate as defined by the Health Insurance Portability and Accountability Act of 1996 or HIPAA (45 CFR 160.103)?
  • If so, are you maintaining or transmitting Protected Health Information or PHI (defined in 45 CFR 160.103) using any of the IX services?

If you did not answer “yes” to both of the above questions, there is no need for further action.

Please see below for some frequently asked questions.

We appreciate your cooperation and look forward to continuing to serve you.

********************************************************


If I answer “yes” to both questions will my service be impacted or change as a result of the new regulations? What if I answer “no”?

If you answer “yes” to both questions your services will change. You must transfer to another hosting provider. Upon request, we can provide you with the name of a provider who is HIPAA compliant and operating in a HIPAA compliant data center. If you answer “no”, there will be no changes to your services at this time.


If I answer “yes” to both questions what will IX do with the information that I provide?

IX will use the information to recommend you transfer your account to a provider who can supply HIPAA compliant services.


Will I be contacted by an IX representative whenever new regulations apply to my business?

Not necessarily. We may contact you if the regulations apply to and affect IX’s provision of services to you.

Nice Try, Tornado

No tornados @ IX!

As some of you may have read on our official Facebook page, IX Web Hosting was recently hit with a pretty nasty storm. The high winds lifted the roofs off surrounding buildings, toppled tractor-trailer rigs, and shattered the windows of vehicles and buildings alike.

Shots of the surrounding damage

And, while the subsequent power outage has disrupted many businesses in the area, the employees at IX Web Hosting are still here, chugging away to keep this situation from affecting our customers in any way.

Though the storm struck without warning, we were able to make sure our customers’ accounts were completely unaffected by this incident. Our emergency generators are running at full power to make sure our servers are operating without interruption. We are having additional fuel delivered as needed so that we remain up as long as the power remains out.

This means our shared, VPS, and Cloud services are currently up and running, and should remain unaffected until the power outage is over. We will, of course, update you if anything changes or if we have any reason to suspect that this might change.

Fortunately, no IX employees were injured during the storm, and our building did not sustain any physical damage. As for an ETA for when the local power outage will end, we’ve heard from our power provider, AEP, that power should be restored by 11:00 PM EST. [UPDATE: Power is restored! We had our generators refueled again just in case, but we’re back on grid power!]

In the meantime, check out this video we captured of the storm touching down across the street, featuring our videographer, Stephi Kurz, and me, Mike Nichols, correctly identifying the storm as a “thing.”

Important WordPress Security Update

If you’re a customer who uses WordPress, you have probably already noticed the issues concerning logging into your WordPress control panel.

We wanted to send out this notification to alert anyone who hasn’t been briefed on the situation, as well as give some additional explanation about what is going on, how we’re handling it, and why we’re handling it in this manner.

  • A global brute force attack on WordPress’ wp-login.php file began on April 11th. This attack affected WordPress users worldwide and was experienced by virtually every web hosting company.

    • A ‘brute force’ attack is when an automated program (often running on a network of compromised computers, known as a ‘botnet’) repeatedly attempts to log into a password-protected site by trying different passwords until it finds the right one.
  • We implemented a server-side check to reduce the number of wp-login requests, but found that the attackers adapted by increasing the time between login attempts.
  • On April 12th, we noticed the botnet activity ramp up dramatically, and we were forced to block all traffic to wp-login pages. This was a temporary solution that contained the brute force attack in the following ways:

    • Customer WordPress sites were able to stay up and running
    • All incoming brute force requests were stopped
    • This also kept out any unwanted, malicious intrusions into our customers’ sites
    • Blocking the malicious incoming traffic also stopped the slowness issues we were having on our Linux servers.

In the meantime, we began collecting attackers’ IPs so we could start blocking them.
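Without giving away the details of our own system, the general idea of collecting attacker IPs can be sketched like this. The input format, the function name, and the threshold of 20 failures are all hypothetical, and the addresses come from ranges reserved for documentation.

```python
from collections import Counter


def build_blocklist(login_attempts, max_failures=20):
    """Return the set of IPs with at least max_failures failed wp-login attempts.

    login_attempts is an iterable of (ip, succeeded) pairs gathered over the
    whole collection period, not just a short time window.
    """
    failures = Counter(ip for ip, succeeded in login_attempts if not succeeded)
    return {ip for ip, count in failures.items() if count >= max_failures}


attempts = (
    [("203.0.113.7", False)] * 250                       # persistent guessing
    + [("198.51.100.4", False)] * 2 + [("198.51.100.4", True)]  # a customer's typos
)
print(build_blocklist(attempts))  # {'203.0.113.7'}
```

Counting over the whole collection period, rather than per minute, is one way to still catch an attacker who slows down to slip under a rate limit.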

  • On April 13th, we started using the data we’d collected on the attackers’ IPs to block them from connecting to our servers. This was a slow process that took time to refine and put in place as a permanent solution.
  • On April 16th, we removed the block on each server for wp-login once the new system was implemented across all of our servers. Users should now be able to log into their WordPress sites. Once you log in, we recommend that you change your password to something very strong (e.g. a mixture of upper and lowercase letters, numbers, and special characters like #, $, and &). You can find instructions on how to change your password here: http://codex.wordpress.org/Resetting_Your_Password.
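If you would rather not invent a strong password yourself, a short script can generate one. The sketch below uses Python’s standard `secrets` module; the length of 16 and the choice of #, $, and & as special characters are just example settings.

```python
import secrets
import string

SPECIALS = "#$&"


def generate_password(length: int = 16) -> str:
    """Return a random password containing lowercase, uppercase, digit, and special characters."""
    if length < 4:
        raise ValueError("length must be at least 4 to fit every character class")
    alphabet = string.ascii_letters + string.digits + SPECIALS
    while True:
        candidate = "".join(secrets.choice(alphabet) for _ in range(length))
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)
                and any(c in SPECIALS for c in candidate)):
            return candidate


print(generate_password())
```

Store the result in a password manager rather than reusing it across sites.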

The tactics used in the attack are changing daily (sometimes even hourly), and we are responding with adjustments of our own. While we currently have the situation under control, we are still watching and reacting to the attack to make sure it doesn’t begin affecting our servers again.

Although we can’t announce too many details about our attempts to block the attack (because we don’t want to give too much information to the attackers), we still want you to know that we are aware of the situation, and are working on it. Keep an eye on the status blog for major updates as the situation progresses.

Thank you for your patience as we continue to defend against this attack.

Sincerely,

Lisa Grice

Director of Customer Operations

IX Web Hosting

The importance of maintaining your web applications

Web application maintenance is vital to the health and security of not only your website but your entire hosting account.  It directly affects both your security and your reputation on the Internet.  Failure to maintain web applications is one of the leading causes of hacked sites, and a hacked site suffers a negative impact on its Search Engine Optimization (SEO).

What is a web application?

Generally, a web application is a type of application that is accessible over a network and usually uses a browser as the primary interface.  A more in-depth explanation can be found on Wikipedia, but that is beyond the intended scope of this post.  Web applications come in many flavors and serve many purposes.  Some are designed to help you manage and display content to your visitors.  Some are used as shopping carts to help you display and sell your products.  Others are designed to display content in a gallery format.  In the next section, I include a list of many of the popular web applications.

Why should I upgrade?

One reason new versions of applications are released is that new features are added.  That’s usually the first thought that comes to mind when you hear about a new version:  What was added?  Sometimes a new version is released and nothing obvious appears to have changed.  Chances are, that version was released to patch one or more security vulnerabilities.  Most applications include a change log file; reading it will explain what has changed from one version to the next.  To help illustrate the importance of these upgrades, I have compiled a list of some of the more common web applications and their corresponding advisory listings at the popular security site, Secunia:

You will notice that many of the advisories listed have been resolved by installing the current version.  While hosting account security hinges on many aspects of your account access, up-to-date software goes a long way towards keeping your account secure.  There are also several tutorials available that suggest configuration changes to make your applications more secure; here are two to get you started:

How do I maintain my site?

The very first thing to do is to back up your site.  It is imperative that you back up your own site before every major change.  Do not depend on the hosting company’s regularly scheduled backups!  What would happen if the hosting company’s regularly scheduled backup ran after you had made changes that did not have the desired effect?  You would have nothing to revert to.

Also, do not depend on plugins to handle your critical backups.  Consider for a moment:  All of my backups use a plugin inside my application.  After applying an update, access to my dashboard is broken.  I cannot restore from my backup without dashboard access.  This makes about as much sense as keeping the spare keys to your locker inside the locker.  Sure the spare keys are safe, but you’ll never open the locker to get your spares if you lose your main key.

To properly back up the site, you will want to download a copy of your web files using an FTP application and also export a SQL dump of the active database.  Both of these actions are outside the scope of this article, but check back in the coming weeks for more info.  Until then, Google is your friend!
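As a rough starting point only, here is a Python sketch of the two pieces. It assumes you have already downloaded your web files to a local folder, and that the standard `mysqldump` client is installed; the function names, folder names, and database details are placeholders you would replace with your own.

```python
import tarfile
from datetime import datetime
from pathlib import Path


def archive_site_files(site_dir: Path, backup_dir: Path) -> Path:
    """Pack a local copy of the site files into a timestamped .tar.gz archive."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archive_path = backup_dir / f"site-files-{stamp}.tar.gz"
    with tarfile.open(str(archive_path), "w:gz") as archive:
        archive.add(str(site_dir), arcname=site_dir.name)
    return archive_path


def mysqldump_command(database: str, user: str, host: str, output_file: Path) -> list[str]:
    """Build a mysqldump command line; a bare --password makes the client prompt for it."""
    return [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--password",
        f"--result-file={output_file}",
        database,
    ]
```

The command list can be passed to `subprocess.run(..., check=True)` on a machine that is allowed to connect to your database. Keep the archive and the SQL dump together, somewhere other than your hosting account.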

Once you have a solid backup, log into the application’s dashboard and update each plugin individually before attempting the core application update.  This order is important as some plugins will need to be updated to be compatible with your application’s newest version.  This can be time consuming, but it is worth the effort of updating one module at a time.  If you succumb to the temptation to update them all at once and your testing shows some aspect of your site is broken, you will not know which plugin was the cause of the error.
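The reasoning behind updating one plugin at a time can be shown with a small sketch. Everything here is hypothetical: `apply_update` and `site_is_healthy` stand in for whatever update and testing steps your application actually requires.

```python
from typing import Callable, Iterable, Optional


def update_one_at_a_time(plugins: Iterable[str],
                         apply_update: Callable[[str], None],
                         site_is_healthy: Callable[[], bool]) -> Optional[str]:
    """Update plugins one by one and test after each.

    Returns the first plugin whose update broke the site, or None if all went well.
    """
    for plugin in plugins:
        apply_update(plugin)
        if not site_is_healthy():
            return plugin  # stop here: this update is the culprit
    return None


# Simulated run: pretend the "gallery" update breaks the site.
updated: list[str] = []
culprit = update_one_at_a_time(
    ["seo-tools", "gallery", "contact-form"],
    apply_update=updated.append,
    site_is_healthy=lambda: "gallery" not in updated,
)
print(culprit)  # gallery
print(updated)  # ['seo-tools', 'gallery']
```

Because the loop stops at the first failed test, the broken plugin is identified immediately and the remaining plugins are left untouched until the problem is sorted out.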

When your testing is complete and all plugins are updated, then it is advisable to review your active plugins to ensure their compatibility with the core application’s update.  Personally, I would make a backup of this configuration now.  While not absolutely necessary, it will save you time if the core update breaks your site.  Now follow the update steps recommended by your core application.  Some have buttons to update within the dashboard, some require more intricate steps.

I like to take another full backup of the known good configuration after completing the updates of the plugins and the core application.

Many web applications offer the ability to sign up for their newsletter.  This is a great way to keep yourself apprised of updates and will help you to continue maintaining your site.  Keep your web hosting account (as well as your visitors) safe!
