IX Outage – Post Issue Technical Report

As many of you know, last week we experienced a major outage affecting multiple email and database servers for just under five days.

It was, without a doubt, the worst technical issue we’ve ever experienced as a company.

Thankfully, all systems were brought back up without any loss of data.  Although we’re happy that we could restore all your data, we know the duration of the incident, as well as the incident itself, is unacceptable.

To make certain that something like this never happens again, we launched a full investigation to determine the root cause of the issue, outline the steps we took to handle it, and help us take preventative measures against this type of problem in the future.

I want to share this report with you, both to satisfy the curiosities of our more tech-savvy customers, and to illustrate the amount of time, work, and research it took to resolve this difficult issue.  I also want to share the steps we’ll be taking to prevent this issue in the future, which are included at the end of the report.

So, here is the post-issue follow-up my system administrators delivered to me this morning. I think you’ll find it informative, and I hope it sheds some light on the mess that was last week:

Incident Name: Storage Outage – sas3

Incident Date: 2014-03-02
Report Date: 2014-03-14

Services Impacted:

Storage sas3 on dbmail01
93 shared VMs (mail and MySQL for cp9-11); resources of 30,366 accounts were affected.

Incident Root Cause:

The affected SAN comprises an array of drives in a RAID 50 configuration consisting of two RAID 5 groups (one parity group made up of the even-numbered drives and one made up of the odd-numbered drives). The array can survive two simultaneous disk failures as long as they are not part of the same parity group. In the case of our outage, drive 6 failed, and spare drive 10 was added to the RAID to rebuild the group. During the rebuild process, drive 0 failed, costing us the even-numbered parity group. This occurred just before 4 AM EST on 3/2/2014 and put the RAID into an unrecoverable state. Because there was a large potential for data loss, we contacted our hardware vendor’s support line before acting and were escalated to their engineering team. Total call time was 10 hours.

[Image: diagram of the RAID 50 failure]
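
For our more tech-savvy readers, here is a minimal sketch of why the second failure was fatal. It models only the even/odd grouping described above; the spare slots come from the report, but treat the layout details as illustrative assumptions rather than confirmed specifics of the actual SAN.

```python
# Minimal model of the RAID 50 layout described above: two RAID 5
# parity groups, one holding the even-numbered drives and one the
# odd-numbered drives. Each RAID 5 group tolerates a single lost
# drive, so the array survives only if neither group loses two.
# (Slot details are assumptions based on the report.)

SPARES = {10, 15}  # hot-spare slots, not part of either parity group

def parity_group(drive: int) -> str:
    return "even" if drive % 2 == 0 else "odd"

def raid50_survives(failed_drives: set[int]) -> bool:
    failures = {"even": 0, "odd": 0}
    for drive in failed_drives - SPARES:  # spare failures don't count
        failures[parity_group(drive)] += 1
    # Each RAID 5 parity group tolerates exactly one failed drive.
    return all(count < 2 for count in failures.values())

print(raid50_survives({6}))     # True:  one failure, rebuild possible
print(raid50_survives({6, 9}))  # True:  one failure in each group
print(raid50_survives({6, 0}))  # False: two even-group drives, as in the outage
```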

Response:

In order to regain access to the data, we had to manually disable slots 10 and 15 (the spare drives) so that the RAID would not attempt another rebuild. Next, we reseated drive 6, which brought it online, though not as part of the RAID. This allowed the entire RAID to come back online in a degraded state with drive 0 active. Because drive 0 was still failing, we knew the RAID was in a very fragile state and that we had to move forward with great care or risk losing data.

Our hardware vendor showed us a binding procedure that allowed us to move the affected volumes off the storage system. We learned that if we triggered another failure in drive 0 at any point during this process, the RAID would go offline and we could lose access to the data. With this in mind, we began migrating the volumes one at a time, reducing the stress on the drive and, with it, the chance of another failure. We were methodical and deliberate in our approach, and thankfully we migrated all data off the storage without triggering another failure in drive 0. The process completed, and all customers were back online as of 3/6/2014 just after 6 PM EST. The whole process took almost 5 days.
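
The vendor’s procedure itself is proprietary, but the shape of the one-at-a-time approach can be sketched as follows. The commands, volume names, and migration target below are placeholders, not a real CLI; the point is the structure: check the fragile drive, migrate a single volume, and stop at the first sign of trouble.

```python
import subprocess
import sys

# Hypothetical sketch of the one-volume-at-a-time migration described
# above. The real procedure used our vendor's proprietary tooling; the
# commands here ("san_health", "san_migrate") and the volume and target
# names are placeholders, not a real CLI.

VOLUMES = ["vol01", "vol02", "vol03"]  # placeholder volume names

def drive0_is_stable() -> bool:
    # Placeholder health probe for the failing drive (drive 0 in the
    # incident); a real check would parse controller or SMART output.
    result = subprocess.run(["san_health", "--drive", "0"],
                            capture_output=True, text=True)
    return result.returncode == 0

for volume in VOLUMES:
    if not drive0_is_stable():
        sys.exit(f"Drive 0 degrading; stopping before migrating {volume}")
    # One volume at a time keeps I/O load on the fragile drive low,
    # reducing the chance of triggering a second, fatal failure.
    subprocess.run(["san_migrate", volume, "--target", "spare-array"],
                   check=True)
    print(f"{volume} migrated")
```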

Timeline:

You can find a timeline of events on our status blog, from the initial outage to the final server’s reactivation.

What We’re Doing To Prevent This:

Improve Monitoring

Currently, our automated hardware checks notify us when a storage system has an issue of any type. While useful, these checks are not specific enough to tell us what the actual problem is. For instance, if a drive fails, we get a general notification rather than a ‘drive x has failed’ message. We are looking into more specific, granular notifications for individual disks.
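
One possible shape for such a per-disk check is sketched below, using smartmontools’ `smartctl -H` to query each disk’s overall SMART health. The device list and the notification hook are illustrative assumptions, not our actual configuration.

```python
import subprocess

# Sketch of per-disk health checks via smartmontools. A real
# deployment would enumerate devices and page the on-call admin;
# here the device list and notify() hook are stand-ins.

DISKS = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # assumed device names

def notify(message: str) -> None:
    print(f"ALERT: {message}")  # placeholder for a real pager/email hook

for disk in DISKS:
    result = subprocess.run(["smartctl", "-H", disk],
                            capture_output=True, text=True)
    # smartctl sets bit 3 of its exit status when the disk reports a
    # failing SMART health status, giving a per-drive signal instead
    # of one generic "storage has a problem" notification.
    if result.returncode & 0b1000 or "FAILED" in result.stdout:
        notify(f"drive {disk} is failing SMART health checks")
```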

Proactive Hardware Replacement

It may be possible to check via SNMP for things like disk errors on specific disks before they actually fail out of the RAID and trigger a rebuild. This should result in fewer drive failures and fewer rebuilds.
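
A sketch of what such a poll could look like follows. The OID is a placeholder, since real per-disk error counters live under vendor-specific enterprise OIDs that vary by controller; this shows the shape of the check, not a drop-in script.

```python
import subprocess

# Sketch of a proactive SNMP poll for per-disk error counters.
# Hostname, OID, and threshold below are illustrative assumptions.

SNMP_HOST = "storage01.example.com"        # assumed hostname
DISK_ERROR_OID = "1.3.6.1.4.1.99999.1.2.0" # placeholder enterprise OID
ERROR_THRESHOLD = 10                       # assumed replacement threshold

def read_error_count(host: str, oid: str) -> int:
    # net-snmp's snmpget with -Oqv prints just the value of the OID.
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", "public", "-Oqv", host, oid], text=True)
    return int(out.strip())

errors = read_error_count(SNMP_HOST, DISK_ERROR_OID)
if errors > ERROR_THRESHOLD:
    print(f"Replace this disk proactively: {errors} media errors reported")
```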

Switch All Arrays to More Stable RAID

Our storage arrays currently use RAID 50. Although this is a standard configuration, a rebuild can take more than eight hours to complete. That is an 8-hour window during which we risk losing two drives from the same parity group. We can decrease this risk by moving to a significantly faster RAID 10 setup, which can rebuild in about 3 hours.
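
As a rough illustration of why the window length matters (the failure rate and drive count below are assumptions for the sake of arithmetic, not measurements from our fleet), the chance of a second drive failing during a rebuild grows roughly in proportion to how long the rebuild takes:

```python
# Illustrative arithmetic only: AFR and drive count are assumed.
AFR = 0.03                # assumed annualized failure rate per drive
HOURS_PER_YEAR = 24 * 365
AT_RISK_DRIVES = 6        # assumed drives left in the degraded parity group

def p_second_failure(rebuild_hours: float) -> float:
    p_one = AFR * rebuild_hours / HOURS_PER_YEAR  # per-drive risk in the window
    return 1 - (1 - p_one) ** AT_RISK_DRIVES

print(f"RAID 50, 8-hour rebuild: {p_second_failure(8):.5%}")
print(f"RAID 10, 3-hour rebuild: {p_second_failure(3):.5%}")
```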

Thanks for reading and again, we’re so sorry about this inconvenience. If you have questions, feel free to ask them in the comments.

Conclusion

Along with the steps we’ll be taking to improve our hardware, we also need to work on improving our speed and accuracy when it comes to identifying affected services and affected customers. This information should then make it to the blogs quickly so the customers know we are aware they are affected, and that we are working to fix it. The faster we can do this, the easier it is to pinpoint problems and implement fixes.

During this outage, we took too long to get detailed technical information (especially ETAs) communicated to you. We think it’s better to get some good information out, even if we have to admit it consists of rough estimates and is the best we have at the time, than to delay providing any real information. Many of you agreed.

Now that we understand the details of this issue, we can sleep a little better knowing that this won’t be likely to happen again and we’re all on the same page about what went wrong. We’re incredibly grateful for your trust, and we will continue to work tirelessly to regain and keep it.

36 Comments to "IX Outage – Post Issue Technical Report"

  1. Hi,

    Thanks for the details.

    I have two questions related to such downtime and the BCP.

    1. Knowing that data migration from the fragile RAID would take a long time, how could IX have helped small businesses keep receiving their new emails, either by using a forwarder or by providing mail service on other servers?

    2. The initial position that the engineers/administrators doing the recovery work couldn’t estimate when services would be restored created even more uncertainty. What does IX plan to do about this?

    I hope these points have already been addressed and will be shared, so that we can know and trust. Failure can occur at any time, for the worst reasons, beyond the imagination of planning engineers. We as customers (and in some cases the public) rely on the ability and procedures of the service provider to keep our pulse running.

    regards

  2. What will be done to financially compensate me for my loss of business as a result of the outage?

    • IX: John B.

      Hello Steve,

      If you’d like to discuss compensation, please get in touch with us via phone or live chat.

  3. Honestly, I am really not confident in ixwebhosting now. We found the lack of updates during the outage frustrating. You forced customers to consider alternatives (we are) because of the lack of updates to the Status Blog, with no workarounds or help offered. We operate two “small” businesses that are our world. And the shutdown was traumatic.

    • IX: Jared E.

      Dan,

      I am very sorry to hear that. I assure you, the ordered list took a while to complete, and estimating the data transfer required statistics from a number of servers. Telling someone, “Your server will be up at 2pm today” and then updating that to 6pm the next day can be catastrophic for those trying to make plans around the downtime. We had to make the estimates as accurate as possible.

      If you’d like to see some info on the server issues, or what we are doing to prevent this in the future, check this out:

      http://www.ixwebhosting.com/blog/2014/03/ix-outage-post-issue-technical-report/

  4. Do you back up your SAN data? There’s no mention of it in this post. If you had lost all the data on the array, would you have been able to restore it from a backup? If so, how long would it have taken to restore everything?

  5. Thank you for this report. I appreciate knowing what happened, and I want you to know that my confidence in IxWebhosting has not diminished. Stuff happens!

  6. James Repine

    I am sorry, but this is the 3rd major outage IX has had since I’ve been with you. The prevailing attitude is: yeah, we are sorry for the outage, but if you want any compensation, you have to contact us. The last 2 outages resulted in me having to call, and I was upgraded to the newest servers and given a total of about 2 years of free hosting.

    At this point, I would expect IX to be upfront and proactively offering compensation instead of leaving it up to the customer to contact you.

  7. Adel Assaad

    Can you shed some more light on the use of SMART and other hardware failure prediction mechanisms before and after the incident? Specifically, were there any SMART failure statuses that could have warned you that the array was on the verge of collapse?

  8. Dear Fathi,

    Thanks for your message and clarification. I really appreciate the way your team solved the issue, and I am aware that such things happen in a server environment. I have kept 100% trust in IX since 2009. To be honest, it’s not 100% now.
    I have one question. During the downtime, you would have spent less time restoring the MySQL DBs than restoring the mail servers. So why did your server administrators put the highest priority on the mail servers? I still do not understand this.
    Meanwhile, do you know how many customers and projects I missed during your server’s downtime? I would like to know what will be done to financially compensate me for my loss of business as a result of the outage. Your early response will be highly appreciated.

    Thanks and Regards,
    Anjula Malshan

  9. I was in charge of a very large corporation’s network. We used monitoring software like Insight Manager (at minimum) and detailed disaster recovery planning and protocols, along with offsite data storage. An outage such as this simply would not happen. Any site facility could “go away” and we could restore it at another location (and we have done it) within one day.

    Since then, monitoring software has become more robust, and data backup easier with less cost.

  10. I’d just like to say that I too lost some money (I had ads out which, of course, never got replies, plus my husband was in contract negotiations at the time, which made things awkward between him and his customer, since having no email made it hard to look serious), and I wasted a TON of time trying to find out what was going on and when things would be resolved. The way your customers were kept informed was unacceptable. BUT I have been with IX for roughly 10 years and this was the first major outage ever, and customer service has always been great. I am confident that this was a ‘lessons learned’ and we will not have to deal with another episode like it in the future.
    I think asking for compensation for money lost is not fair at this time, as I am sure the additional cost of the incident plus the loss of business is hard to bear for IX right now. And as the old saying goes, don’t put all your eggs in one basket; relying on internet access from a single source is probably not a great idea either. However, a little thank-you for long-time custom in the future (of some sort, such as a free renewal or service) would be a nice touch.

  11. Theresa

    Too little, too late. The way I am reading this, it says: “something happened that was out of our control and we did nothing wrong!” So untrue. You started this process telling everyone that no data would be lost; it was. You are still stating no data was lost… it was. I am still missing all emails that came to any of my email accounts during the outage! IX Webhosting bungled this from the get-go, is still not admitting wrongdoing, and the compensation that I was awarded? A drop in the bucket. I am currently investigating other hosts and will be migrating all of my sites.

  12. Raphael

    It is bad luck indeed that 2 drives in the same array would fail at the same time. But you shouldn’t have one giant array running all your shared websites. That 30,000 accounts would be affected by 2 hard drive spindles failing at the same time is bad, bad architecture design. There’s no other way to put it. Improving monitoring won’t fix stupid design.

  13. Raphael Protti

    It is unfortunate that 2 spindles failing would manage to take down 30,000 websites. It seems to reflect poor/cheap architecture design. Why not use several smaller arrays, each hosting a portion of your services, so that only some clients are affected by such drive failures, not 30,000 of them? This would make your facility more resilient, no?

  14. Have you considered RAID 60 as an alternative to RAID 50 and RAID 10? It would permit a two-drive failure in each parity group. Rebuilds would still be time-consuming, though. At least the capacity lost to redundancy would be only n-2 for RAID 60 instead of n/2 for RAID 10.

    • @Adel Assaad – We will be able to provide more information once the investigation is complete.
      @Anjula Malshan – Mail servers actually affected the biggest customer base, that’s why the priority was placed on those first. If you wish to inquire about compensation, please place a support ticket to our billing team or call us anytime, we’ll be happy to discuss it with you.
      @Nathan – Our admins are looking at all alternatives to try and prevent further issues.

  15. From the report, it seems like you guys are not doing your job.

    Either you do not have proactive monitoring in place or your systems/storage engineers are sleeping on the job.

    This is not good enough.

    You guys are saying you provide 99.xx% uptime.

    But clearly your systems are not implemented to support it; they are not suitable for the uptime figures you advertise on your website.

    I have begun to move my client sites to another hosting provider.

    I suggest you start putting things into action instead of making cosmetic changes on your website to fool customers with your marketing campaign.

    Thanks and Regards
    Ron

  16. I think you guys did a great job on the recovery. Our business does run on email; it’s our lifeblood. It would have been nice to have IX use some forwarding system, but I have to admit it gave everyone at the office time to get other tasks completed, and our phones worked as our backup. People remembered how to use the phone. Thanks for all your team’s hard work.

  17. What is your disaster plan? What would have happened if the SAN had been unrecoverable, or in an incident such as a fire or flood?

  18. Thank you for your explanation of what occurred. The hardest part for me was not being able to receive emails. If there had been a workaround for receiving emails while your servers were being rebuilt, that would have been best.

    • @T – We are deeply sorry for the e-mail issues you faced; we appreciate your patience with us.
      @Richard Hyde – We thank you for the support and kind words!
      @Ronal Kumar – We greatly apologize for this problem and wish we could do more to keep you and your clients. We understand your frustration with these circumstances.

  19. I must say, during the entire outage I had about 8 websites go down – all from my own businesses as well as multiple clients. All of their emails were down as well for several days. We had RSVPs trying to come in for our once-a-month event. Major contracts we were in the middle of producing. Yes, it was horrendous. We lost a lot of business and I had a lot of really angry clients and members. But not once, not ONCE, did I post a comment on the blogs, or contact chat support demanding to know what happened, or yell about it. I simply took it all in stride and kept checking the blog every hour on the hour for days on end until all services had been restored. I knew IX was working extremely hard around the clock to fix the issues, and yelling or clogging up the phones and chat system would have been counterproductive.

    But… I must say now….. I did get pretty mad when IX billed me for my renewal during the outage… I mean… REALLY???? I got charged for another year’s hosting while I was losing thousands of dollars every day…. yeah….

    And now I need to contact support to discuss reimbursement. Funny thing though… about 10 days ago I posted a request for tech support via the proper form, and after 8 days still had ZERO response… I finally just closed the request and figured out a workaround myself. With all this combined, I have actually been looking at a change. And I have been with IX since 2007. Seven years of loyal service and lots of referrals to boot.

    Not feeling too forgiving at this point though :(

    • @Dacia – We very much appreciate you sticking with us for so long, and we hope we can do everything possible to keep your trust. We understand how frustrating this entire issue has been, and we thank you for your patience. I’ve escalated your ticket to our admins to review this issue, and we hope to get a resolution for you very quickly.

      • Alan, now that’s the kind of customer service you guys had years ago :) If customer support contacts me regarding this, then my faith just may yet be restored :)

        Now if only I could get you guys to get rid of the really bad foreign chat support that takes 10 minutes to get an automated incorrect response from… you guys would once again be the best! LOL

  21. WOW….. What a bunch of whiners!!!
    Clearly some of your customers need to crawl out of momma’s basement and enter the real world where things don’t always go so well.

    I’ve been with IX web-hosting since 2005 and I don’t remember ever having a hiccup like this. That’s pretty good!!

    The jerk bitching about uptime must not understand how to average time.

    My site was down too, people. Of course my emails were fine.
    I cheat and host my emails with Google Apps, but it could be with anybody.

    Come to think of it, you all could cheat too, and from the kind of money you’re talking about losing, it would be worth it.

    Let’s say you host one site with two separate companies and handle your domains with a third.
    The site goes down and you log in to change the DNS to the other server.
    It might take a while for the DNS to propagate across the net, but hey, wouldn’t it be nice to be up instead of down?
    Yes, I know this might open up other issues, but at least I’m looking for solutions instead of wasting time venting.

    Anyways, shame on you all for pretending this is anyone’s fault.
    Hard drives are like tires on your car, and they all wear out together.
    Every server in the world is a catastrophe waiting to happen, no matter how many drives you have.

    Good night Fathi, keep up the good work.

  22. Hi Fathi and IXweb,
    Thanks for your transparency on this. Here is my comment – Not only is RAID 10 faster, but more durable as well. In the disaster scenario that you had and described, a RAID 10 array would not even have gone offline or caused any downtime other than an emergency maintenance advisory. This is because of mirroring redundancy and much greater fault tolerance. It is more expensive to build out since you are only utilizing half of the storage capacity, but the redundancy alone is worth it, not to mention the huge performance gains as well.

    I have 2 RAID 50 arrays set up in my house from 2 Highpoint RAID cards and 16 spindles that I bought from a popular online retailer. This is more than sufficient for the enthusiast home user, but kind of precarious in a commercial setting. RAID 50 stripes RAID 5 member volumes and RAID 5 fault tolerance is not great. If you lose 2 drives in one of the member volumes, then you are going to have a serious issue on your hands. In a RAID 10 array you can simultaneously lose up to half of the drives in the entire array and still replace & rebuild and all the while stay up and running!

    The short story is that RAID 10 is worth the extra hardware cost. As a customer I would even be OK with capacity limits if it meant hosting on RAID 10. The added performance and data protection are more than worth it. I have been with you guys for several years now and I have always appreciated the service and the way that you do business. Please make the right decisions going forward about your storage arrays. :-)

    -Robert / Concur IT Development

  23. Arvind Clemente

    When you are hosting 30,000 websites, I think you should have a redundant SAN with failover. Observing that you have a 16-drive SAN, I am assuming this is not an enterprise-class SAN like EMC, NetApp, etc. In such an environment, it is imperative that redundancy be of the highest importance.

    Even though I was not affected by this outage, I have been a client of IXwebhosting for over 8 years, with over 6 hosting accounts. Now I am seriously thinking of moving my servers elsewhere.

  24. My concern is for others like me who lost the confidence of our customers as a result of this problem. For example, I found out that the server was down in the middle of a presentation. This reflected poorly on me as a competent professional, and that confidence cannot ever be fully regained. I would like to offer a discount to my clients, but as the price of hosting has continued to rise, they are compelling me to look at other options. It seems the fixed price which was sold to them and to me has disappeared, and when I bring it up in chat or on the phone, no one seems to know what I am talking about. How about giving me the leverage to keep my job and my clients by giving us back the fixed hosting rate which was promised when we signed up with IX? I also don’t know if the uptime guarantee is still in effect, but the fine print made no tangible promise of compensation, so all I ask is that you allow me to claim the fixed rate which I initially signed up for, so that I might save my job and reputation. Thank you for your consideration.

    Jesse Bowen

    Complete Income Tax Services

    The Cross Cultural Healthcare Initiative

    Educational opportunities

    The Ending Global Conflict Foundation

    ITwebPro

    2or3d

    • IX: Omari J.

      @Jesse – I apologize about the late response. We can definitely discuss some of your concerns and work with you. Give us a call, contact live chat, or create a ticket to speak with a supervisor about addressing your points.

  25. My websites and email have been dramatically affected by this server issue (or issues). I can’t even begin to say how incredibly frustrating it is to have potential clients email me and later call me asking why I never responded to them. Often I never received their emails, or I get them days or weeks later. This makes my business look horribly unprofessional and makes me look like a terrible communicator, not to mention how many potential customers I’ve lost because of this issue.

    Customer service has provided me with no real answer as to why this keeps happening, and furthermore cannot provide me with a concrete answer on whether they are even doing anything to solve it. I was honestly told by one customer service rep that it’s because your customer emailed you through Yahoo and it’s getting blacklisted. OK, why is nothing being done to correct that? So I have to preface every customer interaction with “please don’t email me from a Yahoo account because I won’t get it”? How is that a good way to do business?

    I appreciate the fact that they are being forthcoming with information about the issue, but I have used IX for the better part of 10 years and I can honestly say that I can no longer trust them as a secure hosting provider.

  26. Thanks for the update.
    Two comments about this:
    1. I was informed by your support that you have a weekly backup plan for this array. I think one conclusion from this outage should have been: next time, restore the latest backup on another array and at least have that available. Personally, I think a daily backup makes more sense, but even a weekly backup is better than complete downtime for several days. I’m sure a company of your size has storage available for that; otherwise, you can rent some in the cloud in order to allow your customers to keep working through a hardware failure that may not be your fault, but is your responsibility.
    It’s still not clear to me, from your explanation, how you will prevent this from happening again. I understand you can’t make 100% sure that it won’t happen, but frequent backups and a mirror/DR site would be a good plan. Neither of these measures is mentioned in your technical review, and that concerns me as a client.

    2. Malfunctions occur. I think your company is not to blame for the disk failure. You are to blame, however, for poor communications, and I’m not just talking about the ETA farce. The wait times in chat, the responses on the blog, the blog’s update frequency, and the general fuzziness in communications around this incident were very unprofessional (and that’s me choosing my words carefully). The fact that your report fails to identify how you will change your crisis management and communication process is concerning too; I don’t think you grasp the severity of this specific issue, or how lacking IX was on that front. Every technology experiences problems, but as a customer, I wouldn’t want to be beholden to an organization that doesn’t see fit to communicate with me openly and honestly, but rather ignores me or gives me robot-like answers. Even the fact that you didn’t see fit to notify affected clients by e-mail (each of us gave you an external email address when registering for the service, so the email servers’ downtime had nothing to do with it) by itself shows how little you care about your customers.

    I’d like to see a response for that, and preferably not from your support representatives in the chat.

    • IX: Jared E.

      Giri,

      I will address these concerns as best I can.

      1. As far as backups go: restoring them would require recreating the server first, then copying the backups over. Since the server drive was completely gone, the server creation would have to be done manually, and that would take a couple of days. Putting our admins on that task instead of resolving the larger issue would have delayed resolution.

      Prevention is something I do have more information about. We are upgrading our systems and have also installed a new type of monitoring tool that will give us more information as an issue is occurring. I can’t give much detail about the hardware or software for security reasons.

      2. I greatly apologize. We had some internal information, but details such as ETAs and an idea of the resolution time really were not available until the list was being created. For a good 50+ hours of the outage, I was on this blog, getting new info first. We were going to our admins and getting what they had. We certainly can’t give out anything that would be inaccurate, so we don’t post speculation. As this issue was not similar to other issues we have had, estimating time could only be done once we had numbers to work from.

      Generally, our communication is much better, and I and everyone here agree we need to improve it. I do apologize for the confusion and the slow flow of information.

      As for the chat waits: suddenly, after the issue, we had hundreds of people waiting at any given time. We only have so many employees, and even the ones working overtime couldn’t make up for 50 times the usual number of chats.

      Giri, I do greatly apologize you had to be one of the customers affected by this. I promise you that we are working hard to make sure no one with IX ever experiences anything like this again.

      Edit: I can give a bit more detail on the hardware. We are upgrading from RAID 50 to RAID 10.

  27. If an outage happens again, will there be any compensation, such as money back or any other consideration…?

    • IX: Antonio S.

      Hello Afzal,

      I apologize for the delay. Please contact us or submit a help desk ticket, and we will be happy to explain compensation for downtime.
