In my recent blog 5 New Year’s Resolutions for ITSM Practitioners, I recommended that people should think about how they manage availability. I was surprised by the number of people who contacted me to say that they thought I had got this one wrong since availability management isn’t that important. I don’t think I got it wrong and this blog explains why.
Well-run IT services make a huge contribution to a customer’s business, and this means that when those services aren’t available, the negative impact can be huge too. If we don’t plan to deliver the right level of availability, we’re just relying on luck, and, as we all know, hoping for the best isn’t a viable management strategy.
Availability management activities can be broken up into three distinct areas: understanding what availability your customers actually need, designing services to deliver it, and managing availability on an ongoing basis once services are live.
How available does a critical IT service need to be? How often have you worked with customers where there are no clear availability requirements beyond “It must be available all the time” or “We want 100% availability”? The biggest problem with this kind of target is that it gives you no guidance on how to design the service and its recovery mechanisms. You should never agree to targets like these. You will fail. Maybe not this week or this month, but eventually there will be some downtime. What’s more, you will not have the tools and the knowledge you need to manage that downtime effectively.
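To see why a bare percentage makes a poor target, it helps to translate it into the downtime it actually permits. Here is a minimal sketch; the figures and function are illustrative, not taken from any real agreement:

```python
# Translate an availability percentage into the downtime it permits per year.
# Illustrative only: a real target should also define the measurement period,
# what counts as "down", and the maximum acceptable length of a single outage.

HOURS_PER_YEAR = 365 * 24  # 8760; leap years ignored for simplicity

def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year permitted by a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99):
    print(f"{target}% availability still allows "
          f"{allowed_downtime_hours(target):.2f} hours of downtime per year")
```

Even a 99.9% target permits nearly nine hours of downtime a year, and the percentage alone says nothing about whether that arrives as one long outage or many short ones — which is exactly the design guidance a bare number fails to give you.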
One IT organization that I worked with designed a shiny new service that included clustered servers, RAID disks, dual network paths, and many similar technologies. The solution was designed so that a routine hardware failure would cause the service to stall for no more than 20 to 40 seconds. Not bad, you might think. Nobody could possibly notice such a short interruption of service. Sadly, the availability target was poorly defined.
When I elicited the customer's actual requirements, it turned out that they needed the solution to recover within 250 ms (that's a quarter of a second) whenever any component failed. A longer recovery time than this would have had a huge business impact. A major business project had to be put on hold for a year so that a new IT service design could be created!
You must define availability requirements based on a thorough understanding of how the IT service supports the customer’s business processes, and what the impact of failure will be over different time periods.
When you design services to deliver the level of availability your customer needs, regardless of the technology you choose, you need to consider two things: reducing the frequency of failure and reducing the time taken to recover after failure.
Failure happens less often when you use reliable infrastructure, with simple, well-understood components. Ideally you should standardize on a small number of infrastructure components that you know well; this can help to reduce the chance of unexpected interactions. You should also use fully tested, well-designed software, reusing existing software components wherever possible, rather than starting every project from scratch. The modern trend towards use of containers facilitates this, as new solutions are largely built from existing containers that are already well tested and well understood.
Reducing time to recover is probably the most important aspect of availability design. If you think through all the different ways that IT services could fail, and plan how you would recover from each of them, then you can make a huge difference to availability. The concept of anti-fragile takes this to another level by accepting that failure cannot be prevented and designing services that can rapidly recover from almost any failure.
It isn’t enough to make sure that you include availability management when you first design and deploy services. You have to manage the availability of those services on an ongoing basis to ensure that agreed levels of availability are met. Here are three activities that need to be performed:
Failover and recovery measures that you have invested in may not work when you need them if you don't test them at regular intervals. I have reviewed many different IT services to help identify risks, and in nearly every review I have discovered countermeasures that would not have worked when they were needed. Sometimes this was for technical reasons, but often it was simply because people did not know the exact steps they were supposed to follow after a failure. A regular schedule of testing verifies that everyone knows exactly what to do. So make sure the testing covers all of your recovery mechanisms: not just failover, but also restoring backups, invoking service continuity plans, and anything else that might be needed.
Organizations that use the anti-fragile approach may actually inject random failures on a regular basis to make absolutely sure that the recovery mechanisms work correctly.
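The idea behind such failure injection can be sketched very simply. The class below is a hypothetical illustration, not a real chaos-engineering tool: it deliberately fails a fraction of calls so that the fallback (recovery) path is exercised routinely rather than only on the day it is really needed.

```python
import random

class FailureInjector:
    """Hypothetical failure-injection helper (not a real tool): deliberately
    fails a fraction of calls so the fallback path is exercised routinely."""

    def __init__(self, failure_rate: float, seed: int = 0):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so injections are repeatable
        self.injected = 0               # count of deliberately failed calls

    def call(self, primary, fallback):
        """Run `primary`, but randomly pretend it failed and run `fallback`
        instead, proving that the recovery path actually works."""
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            return fallback()
        return primary()

# Exercise a made-up primary/fallback pair a few hundred times.
injector = FailureInjector(failure_rate=0.2, seed=1)
results = [injector.call(lambda: "primary", lambda: "fallback")
           for _ in range(500)]
print(f"injected {injector.injected} failures out of {len(results)} calls")
```

If the fallback ever raises an error here, you have found a broken recovery mechanism in a controlled test rather than during a real outage.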
It can be very difficult to measure and report the availability of an IT service in terms that are meaningful to customers. If one user has a faulty laptop, then the service may not be available to that user at a critical time, and this could be just as significant as a failure of a server or a critical application. Similarly, if just one transaction fails, this may be of low significance, or it could have a major impact on the business. Your measurement and reporting should reflect the customers’ experience of availability, not just data about technology components. This measurement and reporting must include the entire end-to-end IT service, not just the application or the server.
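One way to keep measurement anchored to the customer experience is to derive availability from observed transaction outcomes rather than from component uptime. A hypothetical sketch, with invented transaction records:

```python
# Availability measured from what users actually experienced, not from
# component uptime. The transaction records here are invented for illustration.

def user_experienced_availability(transactions):
    """Fraction of user transactions that succeeded end to end."""
    if not transactions:
        return None  # no traffic: availability is undefined, not 100%
    successes = sum(1 for t in transactions if t["succeeded"])
    return successes / len(transactions)

# 3 failed transactions out of 1000: every server may have been "up"
# throughout, yet users experienced only 99.7% availability.
sample = [{"succeeded": True}] * 997 + [{"succeeded": False}] * 3
print(f"{user_experienced_availability(sample):.1%}")
```

A measure like this automatically captures failures anywhere in the end-to-end chain, including the ones component monitoring never sees.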
Simply producing a report each month that shows when you missed the agreed targets is of very little benefit. Measuring and reporting availability is of most value when it helps you ensure that you meet your targets.
If you monitor availability trends, you can identify IT services that might be in danger of breaching their targets in the future. This gives you time to identify improvements you can make to ensure you do meet the targets. You can also set warning thresholds so that you are aware of services that might breach their targets during the current measurement period, and take measures before a breach happens. For example, one of my customers has a process that detects any service that has exceeded 50% of its allowed downtime during a quarter, and then intervenes to ensure the target is met.
This active intervention has enabled them to reliably meet their availability targets for many years.
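The 50% early-warning rule described above can be sketched as follows; the service names, targets, and downtime figures are all invented for illustration:

```python
# Sketch of an early-warning check: compare downtime consumed so far this
# quarter against the downtime budget implied by each availability target,
# and flag any service that has used more than half of its budget.

QUARTER_HOURS = 91 * 24  # roughly 2184 hours in a quarter

def downtime_budget_hours(availability_target_pct: float) -> float:
    """Downtime allowed per quarter by a given availability target."""
    return QUARTER_HOURS * (1 - availability_target_pct / 100)

def at_risk(services, threshold=0.5):
    """Return names of services past `threshold` of their downtime budget."""
    return [
        name
        for name, (target_pct, downtime_so_far) in services.items()
        if downtime_so_far > threshold * downtime_budget_hours(target_pct)
    ]

services = {
    # name: (availability target %, downtime so far this quarter in hours)
    "payments": (99.9, 1.5),   # budget ~2.18 h, already past 50% -> flag
    "intranet": (99.0, 5.0),   # budget ~21.8 h, well under 50% -> fine
}
print(at_risk(services))  # -> ['payments']
```

Flagging services at the halfway point leaves half the budget in hand, which is what makes intervention before a breach possible.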
If you just monitor and report availability, then sometimes you will have to tell your customers that you failed. Alternatively, you could implement some of the ideas in this blog to actively manage the availability of your IT services, and ensure that you reliably meet your customers’ expectations. Which do you think is better? Why not start now by discussing the ideas I bring up here with your IT team and thinking about which you can adopt straight away?