Keep It Running or Fix It Quick?

By | July 9, 2015 in ITIL

I was recently involved in a discussion about IT services and how to deliver acceptable levels of availability. The discussion was triggered by a failure of the London air traffic control (ATC) system on 12 December 2014, but the ideas apply to any system, not just safety-critical services like air traffic control.

Although the ATC failure did not last long, its impact was enormous: many flights were diverted, leaving lots of aircraft in the wrong place. Airline schedules took a full day to get back to normal, many passengers were stranded, and travel plans were badly disrupted.

There are two ways to improve the availability of an IT service. One is to reduce the frequency of failure. The other is to reduce the time needed to recover from it.

The ATC system is a safety-critical service. Failure is unacceptable because it could result in deaths and injuries, which is why planes had to be grounded. Some of my colleagues argued that, since failure of the ATC system is unacceptable, it should have been designed to prevent any possible failure; fast recovery would not have helped, as planes would still have been grounded. I argued, however, that in the real world we can never prevent every possible failure, so reducing recovery time will always be essential.
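The two ways of improving availability can be put into numbers with the common steady-state approximation, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to restore service. The sketch below is illustrative only: the figures are assumptions chosen for the example, not measurements from the ATC system or any real service.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure every 1,000 hours, 4 hours to recover.
baseline = availability(1000, 4)

# Option 1: fail half as often (double the MTBF).
fewer_failures = availability(2000, 4)

# Option 2: recover twice as fast (halve the MTTR).
faster_recovery = availability(1000, 2)

if __name__ == "__main__":
    for label, value in [
        ("baseline", baseline),
        ("fewer failures", fewer_failures),
        ("faster recovery", faster_recovery),
    ]:
        print(f"{label}: {value:.4%}")
```

In this simple model, halving the recovery time buys exactly the same availability as halving the failure rate. The model says nothing about how much damage each individual outage does, which is a separate question.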

I found support for my view in an article published by The Register, which said that controllers can continue to operate for up to 8 minutes after losing access to flight plans (which is what happened on 12 December), but that after 8 minutes they must start to divert planes. So a failure that recovers within 8 minutes has negligible impact, whereas one that lasts even a few minutes longer has a major impact.
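One way to see why this threshold matters is to compare two outage histories with the same total downtime. The sketch below assumes the 8-minute tolerance described above; the outage durations and the function name are hypothetical, invented purely for illustration.

```python
GRACE_MINUTES = 8  # tolerance before planes must be diverted, per the article


def disruptive_outages(durations_minutes, grace_minutes=GRACE_MINUTES):
    """Return only the outages that outlast the grace period."""
    return [d for d in durations_minutes if d > grace_minutes]


# Two hypothetical histories, each with 24 minutes of total downtime.
many_short = [6, 6, 6, 6]
one_long = [24]

if __name__ == "__main__":
    for name, history in [("many short", many_short), ("one long", one_long)]:
        print(name, "total:", sum(history),
              "disruptive:", disruptive_outages(history))
```

Both histories would produce the same percentage availability, yet only the single long outage crosses the threshold. A percentage figure alone cannot tell these two cases apart; recovery time can.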

It’s Not Just About Air Traffic Control Systems

I have come across similar issues in many other IT services. In one case we designed a service that could fully fail over to a backup location within 300 milliseconds of any hardware or software failure (yes, that really is less than a third of a second). Clearly this kind of solution is not needed for the sort of IT services that most of us work with, but it certainly was a viable solution for this particular customer, albeit one that was difficult to design and expensive to provide.

About 20 years ago I was involved in a project to provide laptops to mobile engineers. The service let the engineers collect and update their calls remotely, which gave a significant competitive advantage over the previous telephone-based system. Management suggested that we lock down the laptops to prevent the engineers from making changes that could affect the key business application, but I know something about the way engineers behave, and I didn’t think this would be possible. The solution we designed gave every engineer a CD that took about 20 minutes to restore the laptop to its initial working configuration – and, crucially, without erasing any of the data they had already stored on it. This meant that nothing they did to the laptop could result in extended downtime, unless they actually managed to physically break it.

I often see service level agreements that specify availability in the form of percentage uptime, with figures like 99.95% availability during business hours. The problem with this is that it is almost impossible to design a solution to meet this target. We can predict the likely frequency of hardware failures, but most real IT failures aren’t due to predictable hardware failures; they are caused by complex interactions of people, processes, software and networks. In these circumstances the best we can do is to have a good plan for restoring service to our users when things do go wrong, and this means getting the designers to focus on recovery time.
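To see how little room a figure like 99.95% leaves, it helps to turn the percentage into a downtime budget. The sketch below assumes that "business hours" means 8 hours a day, 5 days a week, 52 weeks a year; that definition is my assumption for the example and a real SLA would state its own service hours.

```python
def downtime_budget_minutes(availability_pct: float, service_hours: float) -> float:
    """Minutes of downtime allowed within the agreed service hours."""
    return service_hours * 60 * (100 - availability_pct) / 100


# Assumption: 8 hours a day, 5 days a week, 52 weeks a year.
BUSINESS_HOURS_PER_YEAR = 8 * 5 * 52

budget = downtime_budget_minutes(99.95, BUSINESS_HOURS_PER_YEAR)

if __name__ == "__main__":
    print(f"Allowed downtime: {budget:.1f} minutes per year")
```

Under that assumption the target allows roughly an hour of downtime in a whole year, so a single incident with a slow recovery could use up the entire budget.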

How many of your IT services have been designed with recovery time as a key design constraint? How confident are you that you could recover each of your IT services within a time that is acceptable to your customers? How well tested are your recovery plans? If you can’t confidently provide a positive answer to all of these questions, then maybe it’s time to review how you plan to meet your customers’ availability needs.

About Stuart Rance

Stuart is an ITSM and security consultant, working with clients all round the world. He is one of the authors of ITIL 4, as well as an author of ITIL Practitioner, ITIL Service Transition, and Resilia: Cyber Resilience Best Practice. He is also a trainer, teaching standard and custom courses in ITSM and information security management, and an examiner helping to create ITIL and other exams. Now that his children have all left home, he has plenty of time on his hands for contributing to our blog - lucky us!
