When I read the Event Management process chapter in the Service Operation book, I had a feeling of approval. Finally, they are getting it. But then I was confronted with the definition of an incident and I realized that they still didn’t get it. Let me explain.
The definition of an incident, according to the glossary, is: “An unplanned interruption to an IT service or reduction in the quality of an IT Service”. That part I’m 100% in agreement with. But the definition continues: “Failure of a configuration item that has not yet affected Service is also an incident. For example Failure of one disk from a mirror set”. A failure of one disk from a mirror set clearly is not an interruption of an IT Service, and it does not lead to a reduction in the quality of the IT Service either. There is only a higher risk of an interruption, should the other disk fail as well. So they added an exception to the definition of an incident, making a clear definition debatable. BTW, since when is a hard disk in a mirror set a configuration item?
OK, you need to address the disk failure, and some administrator should look into it. A work order should be created: replace the hard disk in the mirror set and check that the mirror is restored. Incident Management should only be about solving service interruptions urgently. That is why we’ve taken out the non-urgent requests. And that is why the Event Management process has the notion of the Alert (p. 41 of the Service Operation book): “The purpose of the alert is to ensure that the person with the skill appropriate to deal with the event is notified”. Interestingly enough, there is no definition for Alert in the glossary.
The nice aspect of the Alert is that you can use it to schedule corrective actions, taking the urgency and the subsequent rushing out of the equation. The next day an operator with sufficient skills, documented work instructions and access rights takes the work order list and replaces the hard disk. Without being rushed, without interrupting the service, and within a reasonable time (within 24 hours). If you had followed the book, you would have created an incident with a low priority, thus low on the list, and somewhere between 4 hours and 4 days the hard disk might have been replaced. Since Incident Management is always dealing with new incidents with a possibly higher impact, you would never know for sure that these low-impact incidents will be handled in time. Plus you have to explain to your customer a higher number of incidents that they cannot relate to, since they have not experienced a service interruption (there wasn’t any).
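To make the routing rule concrete, here is a minimal sketch in Python of how I would split events between Incident Management and a scheduled work order. It is not taken from the ITIL books: the names `Event`, `Ticket` and `route` are my own, and the 24-hour deadline is simply the "reasonable time" I used above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Event:
    description: str
    service_interrupted: bool  # users cannot use the service
    quality_reduced: bool      # service works, but is degraded for users


@dataclass
class Ticket:
    kind: str                  # "incident" or "work_order" (hypothetical labels)
    description: str
    due: Optional[datetime]    # None means: handle immediately


def route(event: Event, now: datetime) -> Ticket:
    """Only events that actually affect the service become incidents."""
    if event.service_interrupted or event.quality_reduced:
        return Ticket("incident", event.description, None)
    # No impact on the service: schedule corrective work instead.
    # The 24-hour window is an assumption, not an ITIL requirement.
    return Ticket("work_order", event.description, now + timedelta(hours=24))


# Example: the failed disk in a mirror set (the timestamp is arbitrary).
now = datetime(2024, 1, 1, 9, 0)
ticket = route(Event("Disk 1 of mirror set failed", False, False), now)
print(ticket.kind, ticket.due)  # work_order 2024-01-02 09:00:00
```

The point of the sketch is that the decision depends only on whether the service is affected, not on whether some component has failed. The failed disk never enters the incident queue, so it does not compete with real interruptions for priority and does not show up in the customer's incident count.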