IT organizations often spend huge amounts of time, money, and other resources on managing incidents, but they spend surprisingly little on problem management work that might reduce the number of incidents in the first place. This is often due to poor understanding of the difference between incidents and problems, and insufficient knowledge or understanding of how to manage problems.
Many people confuse incidents and problems, so let’s start by making the distinction clear.
So, incident management helps you get the business working again, while problem management helps you prevent future incidents, or at least make them less painful when they do happen.
In the bad old days, when IT was a very technically focused function, most IT teams didn’t distinguish incidents from problems. If something broke, then somebody would work on it until it was mended. IT technicians paid little attention to the business impact of whatever it was they were working on, and the customer just had to wait until it was fixed. But when we learned to distinguish between incidents and problems, this changed. Organizations could set up incident management to focus on doing whatever is needed to get the business working again, leaving problem management to deal with any underlying technical issues.
Take, for example, a printer that isn’t working. There is no need for the customer to wait until the printer is mended. When we practice incident management, we can simply help the customer route their printout to a different printer. Obviously, this doesn’t get the printer fixed, but that probably doesn’t concern the customer. We can of course go on to repair the faulty printer, but this is now a problem management activity, with lower priority because the outcome has very little direct impact on the customer.
Before you can start managing your problems, you need to identify them. Here are some common ways that organizations identify problems:
These are some of the ways that problems can be identified, and you should make sure that you take full advantage of these, but also think about all the different ways that a problem could be identified in your organization. For example, think about your suppliers, software developers, or infrastructure teams, and make sure that you actively capture all of their input and use it to log problems.
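Problem management has two objectives: to reduce the number of incidents by fixing their underlying causes, and to reduce the impact of the incidents that still happen.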
Many organizations only think about the first of these objectives. They do root cause analysis to understand the problem, and then take steps to rectify whatever was causing it. This may take some time, and while a complete technical solution may prove very effective once it is in place, the business is likely to continue suffering in the meantime.
The best organizations that I work with start by thinking about how to reduce the impact of incidents, the second of our two objectives. They ask themselves, “What should we do if this happens again right now?” This might be a difficult question for technical support staff to answer if they don’t fully understand the problem, but it’s much better than just leaving service desk agents to flounder when the same thing does happen again. Organizations with really effective problem management make sure there is a well-documented workaround in place as quickly as can be managed. They review the workaround every time it’s used to identify possible improvements, and they go on improving it as they learn more about the cause.
So one benefit of thinking about reducing the impact of incidents, before you start analyzing their root causes, is that you reduce any business impact much faster.
But there’s a second benefit – sometimes it turns out that a workaround is so effective that there is actually no need to understand the root cause or fix it. Here’s a perfect example:
One of my friends had a gas leak under the concrete floor in his kitchen. He turned off the gas and called out the emergency gas fitter. The gas fitter ran a pipe around the outside of the kitchen so that the gas cooker could be reconnected, and arranged for a structural engineer to diagnose the leak properly so that they would know where to dig up the kitchen floor. The structural engineer called a week later and said it would cost a huge amount of money to dig up the concrete and fix the pipe, but fortunately this was unnecessary because the new gas pipe was perfectly safe and did the job.
This example doesn’t involve IT, but I’ve used it anyway because it is a particularly good illustration of the fact that a safe and effective workaround is often enough. Why waste good money fixing an application when you have a workaround to the problem and it no longer has an impact on the business?
Of course, if your workaround is not sufficient then you do need to investigate the root cause of the problem. There are many different techniques you can use for this. My favourite approaches are timeline analysis and Kepner-Tregoe problem solving.
Timeline analysis is such an easy way to investigate a problem that it barely deserves a name. You simply list everything that happened in time order and then look for patterns. What is important is that you get all the data from multiple sources and then sort it by date and time, regardless of where it came from. So your timeline may have entries from system logs, emails, service desk records, and many other sources. This simple approach is surprisingly effective at building a complete picture of what’s been going on.
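A timeline like this can be sketched in a few lines of Python. The example below is only an illustration: the `Event` type, the `build_timeline` helper, and all of the sample entries are hypothetical, and it assumes every source has already been converted to the same timestamp format and time zone.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    """One timeline entry (hypothetical structure for this sketch)."""
    timestamp: datetime
    source: str
    description: str


def build_timeline(*sources):
    """Merge events from every source into one list ordered by timestamp."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


# Invented sample data standing in for real exports from each source.
system_log = [
    Event(datetime(2024, 3, 4, 9, 2), "system log", "Disk usage on print server reaches 95%"),
    Event(datetime(2024, 3, 4, 9, 15), "system log", "Print spooler service restarts"),
]
service_desk = [
    Event(datetime(2024, 3, 4, 9, 10), "service desk", "User reports print jobs stuck in queue"),
]
emails = [
    Event(datetime(2024, 3, 4, 8, 55), "email", "Supplier confirms driver update was deployed"),
]

timeline = build_timeline(system_log, service_desk, emails)
for event in timeline:
    print(f"{event.timestamp:%Y-%m-%d %H:%M}  [{event.source}]  {event.description}")
```

Keeping the source name on every entry matters: once the events are interleaved, it is the only way to tell where each observation came from when you start looking for patterns.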
I have to declare an interest here: I used to teach Kepner-Tregoe, a proprietary approach to problem solving, but I do think that it is incredibly effective. It is a very structured approach, where you define the problem across a number of different dimensions (what, where, when, extent) and you also bound the problem by identifying what is NOT failing. You can then review the distinctions between what is failing and what is not to identify possible causes.
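The idea of comparing what is failing with what is NOT failing can be roughed out in code. This is a loose illustration of that one step, not the proprietary method itself: the `find_distinctions` helper, the attribute names, and the printer data are all invented for the example.

```python
def find_distinctions(failing, working):
    """Return attribute values shared by every failing item and by no working item."""
    distinctions = {}
    for attribute in failing[0]:
        failing_values = {item[attribute] for item in failing}
        working_values = {item[attribute] for item in working}
        # A distinction: all failing items agree, and no working item matches.
        if len(failing_values) == 1 and failing_values.isdisjoint(working_values):
            distinctions[attribute] = next(iter(failing_values))
    return distinctions


# Hypothetical data: what IS failing and what IS NOT failing.
failing_printers = [
    {"name": "P1", "location": "Floor 2", "driver": "v5.1", "model": "LaserA"},
    {"name": "P2", "location": "Floor 3", "driver": "v5.1", "model": "LaserB"},
]
working_printers = [
    {"name": "P3", "location": "Floor 2", "driver": "v4.9", "model": "LaserA"},
    {"name": "P4", "location": "Floor 3", "driver": "v4.9", "model": "LaserB"},
]

print(find_distinctions(failing_printers, working_printers))
```

With this sample data, the only attribute that separates the failing printers from the working ones is the driver version, which makes it the first possible cause to investigate. A distinction is only a candidate cause, so it still needs to be tested against everything else you know about the problem.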
Problem management presents a great opportunity to reduce both the number and the impact of IT incidents on your customers. If you are not already doing problem management you should certainly be thinking about introducing it. And once you do, I think you may find my guide to problem management metrics worth a read.
If you are already doing problem management, then make sure that you have the right balance between devising workarounds and investigating root causes. You don’t always have to understand the cause to resolve the problem, but you do always need to put a workaround in place as quickly as you can. You may even want to think about how to integrate incident and problem management, but whatever approach you take, you should make sure that you focus on value. Think first and last about what will be best for your customers and users.