Few IT organizations are really good at problem management; it is often used only to manage the aftermath of major incidents. I think one reason for this is confusion in the way we distinguish incident management from problem management. We could do a much better job if we changed how we think about these concepts.
I see two big issues with the way we currently define incident and problem management.
1. Failures that have not yet impacted service to users are not well handled by either incident management or problem management.
The ITIL definition of an incident says it is:
"An unplanned interruption to an IT service or a reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident. For example, failure of one disk from a mirror set."
Even though I wrote this ITIL definition, I really don’t agree with the final sentence. If there has been a component failure that has no impact on any users then we don’t need to follow most steps of the incident management process, and we don’t want the outcome of incident management, which is to restore service to users as quickly as possible.
We can’t use problem management to manage these failures, because ITIL defines a problem as:
"A cause of one or more Incidents. The cause is not usually known at the time a Problem Record is created, and the Problem Management Process is responsible for further investigation."
Something that has not yet had an impact on any users is definitely not a problem, and it’s not helpful to call it an incident either. I think that we should separate these kinds of issues, call them faults, and manage them with a "fault management" process. Fault management is well known in engineering as the process that detects, isolates and corrects malfunctions, which is exactly what is needed for this kind of thing.
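To make the proposed split concrete, here is a minimal sketch of how a service desk tool might route a failure. The names `FailureEvent`, `Route` and `route_failure` are hypothetical and not taken from ITIL or any product; the only rule encoded is the one argued above, that a failure with no user impact is a fault rather than an incident.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    # Hypothetical process names used only for this illustration.
    INCIDENT = "incident management"
    FAULT = "fault management"


@dataclass
class FailureEvent:
    component: str
    users_impacted: bool  # has service to any user been affected yet?


def route_failure(event: FailureEvent) -> Route:
    """Send user-impacting failures to incident management, all others to fault management."""
    return Route.INCIDENT if event.users_impacted else Route.FAULT


# One disk of a mirror set fails: no user notices, so it is a fault.
disk = FailureEvent(component="mirror set disk", users_impacted=False)
# A server outage that users can see is an incident.
outage = FailureEvent(component="mail server", users_impacted=True)

print(route_failure(disk).value)    # fault management
print(route_failure(outage).value)  # incident management
```

In practice the hard part is deciding whether users have been impacted, which this sketch simply takes as an input.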
2. Analysis of incident trends is part of a separate process (problem management), rather than integrated with the underlying process as it would be in every other service management process.
The work that we define as proactive problem management has nothing to do with managing problems. Analysing incident records to spot trends, and proposing changes to resolve the underlying causes of incidents, is really continual improvement for incident management. Separating this activity from incident management, and from other continual improvement activity, can lead to the following consequences:
If we create a fault management process, as suggested above, and we then move responsibility for continual improvement of incident management to the owners of that process, then we end up with a much simpler approach.
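As an illustration of the trend analysis that would move into continual improvement of incident management, here is a rough sketch that counts closed incident records by category and flags the ones that keep recurring. The record layout, the `category` field and the threshold of three are assumptions made for this example, not part of any ITIL definition.

```python
from collections import Counter


def recurring_categories(incidents, threshold=3):
    """Return (category, count) pairs seen at least `threshold` times, most frequent first."""
    counts = Counter(record["category"] for record in incidents)
    return [(category, n) for category, n in counts.most_common() if n >= threshold]


# Hypothetical incident records, as they might be exported from a service desk tool.
incidents = [
    {"id": 1, "category": "password reset"},
    {"id": 2, "category": "printer offline"},
    {"id": 3, "category": "password reset"},
    {"id": 4, "category": "vpn timeout"},
    {"id": 5, "category": "password reset"},
    {"id": 6, "category": "vpn timeout"},
    {"id": 7, "category": "vpn timeout"},
    {"id": 8, "category": "password reset"},
]

print(recurring_categories(incidents))
# [('password reset', 4), ('vpn timeout', 3)]
```

Each flagged category is a candidate for an improvement proposal owned by the incident management process, rather than something to hand over to a separate process.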
Changes to incident management that we would need are:
The new fault management process would include:
This gives a much simpler split of work than the incident / problem management split that we currently use.
Just one thing will make this new approach difficult to implement: few IT organizations are really good at continual service improvement. But that is a topic for another blog.