IT organizations of large companies face 5 critical and about 80 less critical alarm messages every day: many minor IT malfunctions and a few central ones. The latter may be business-critical, e.g. the outage of an online banking service. The current Gartner Hype Cycle ITSM 2020, which annually presents the most important developments and tools on the IT Service Management (ITSM) market, underlines the increasing importance of IT Service Alerting (ITSA). ITSA tools are linked to ITSM systems and monitoring applications in order to distribute alarm notifications to their recipients and manage them automatically. The term coined by Gartner focuses on IT service; in Europe, however, the term IT alarm management is more common.
This blog article describes the central role of event management, shows what an ideal alerting process looks like, and gives recommendations on what to do now.
False alarms (or false positives) occur every day in IT operations. Usually, administrators receive too many alarms, not too few, and only very few of them are really relevant. False positives are also often sent as a collective e-mail to a group of employees, so it remains unclear whether anyone is responsible and who should respond.
The central challenge for professional alarm management is therefore not collecting data but filtering and classifying information, then evaluating and processing it in a useful manner. This is what event management does.
Event management is based on the definition and administration of event monitoring mechanisms and rules. This is where the alarm concept is set. First, the configuration items (CIs) that make up each service are defined, i.e. which applications or infrastructure components are relevant for alerting in case of a malfunction, along with any dependencies between them. In general, the criticality of individual events is defined together with their effects on the associated services and any consequences. Which roles need to be informed in case of a malfunction, e.g. IT operator, network administrator, service owner? Which communication channels should be used (depending on the type of malfunction), and how are escalation paths defined? Other important aspects are the affected customer(s) and the location of the devices. In case of a malfunction, this information makes it possible to quickly recognize effects (impact analysis) and causes (root-cause analysis). Step-by-step instructions stored in a solution database reduce the mean time to repair (MTTR).
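To make the alarm concept more tangible, the following minimal Python sketch stores the items named above (CI, service, criticality, roles, channels, dependencies, customer, location) as plain records. All names and values are hypothetical illustrations and are not taken from any specific ITSM or alerting product.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmRule:
    """One entry of a hypothetical alarm concept for a single CI."""
    ci: str                 # configuration item the rule applies to
    service: str            # business service the CI belongs to
    criticality: str        # assumed values: "critical" or "minor"
    notify_roles: list      # roles to inform in case of a malfunction
    channels: list          # communication channels to use
    depends_on: list = field(default_factory=list)  # CIs this CI needs
    customer: str = ""
    location: str = ""


# Illustrative data only.
RULES = [
    AlarmRule("wan-shanghai", "online-banking", "critical",
              ["network administrator", "service owner"],
              ["push", "sms", "voice"],
              customer="Retail banking", location="Shanghai"),
    AlarmRule("pin-app-server", "online-banking", "critical",
              ["IT operator"], ["push", "sms"],
              depends_on=["wan-shanghai"],
              customer="Retail banking", location="Shanghai"),
    AlarmRule("report-batch", "reporting", "minor",
              ["IT operator"], ["email"], location="Frankfurt"),
]


def impacted_services(failed_ci, rules):
    """Simple impact analysis: services whose CIs are the failed CI
    or depend on it directly (no transitive lookup in this sketch)."""
    return {r.service for r in rules
            if r.ci == failed_ci or failed_ci in r.depends_on}


def roles_to_notify(failed_ci, rules):
    """Roles stored for the failed CI itself; empty list if unknown."""
    for r in rules:
        if r.ci == failed_ci:
            return r.notify_roles
    return []
```

In practice this information would usually live in a CMDB rather than in code; the sketch only shows which questions the stored data has to answer.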
Business service modelling is the basis for event correlation and hence for effective alarm management. A central CMDB is important in this context. This is where the service tree is maintained using automated processes.
The added value of upstream event correlation is clear: if, for example, the WAN of an international company fails in Shanghai, the responsible persons receive ONE alarm message. Without correlation, a multitude of affected CIs would generate alarms as well ("Server XYZ down", "Router XYZ down", etc.), so that hundreds of alarms or tickets would be registered. In practice, the number of "false positives" can be reduced drastically, and the quality of the information together with a well-structured process ensures fast troubleshooting.
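The idea behind the Shanghai example can be sketched in a few lines of Python. This is a simplified illustration, assuming a hypothetical service tree in which every CI depends on at most one parent; real correlation engines work on richer models taken from the CMDB.

```python
# Hypothetical service tree: each CI maps to the CI it depends on
# (None marks the top of the chain).
DEPENDS_ON = {
    "wan-shanghai": None,
    "router-sha-01": "wan-shanghai",
    "server-sha-01": "router-sha-01",
    "server-sha-02": "router-sha-01",
}


def root_cause(ci, failed):
    """Walk up the dependency chain and return the topmost failed CI."""
    cause = ci
    parent = DEPENDS_ON.get(ci)
    while parent is not None:
        if parent in failed:
            cause = parent
        parent = DEPENDS_ON.get(parent)
    return cause


def correlate(failed_cis):
    """Group failed CIs by root cause: one alarm per group."""
    failed = set(failed_cis)
    groups = {}
    for ci in sorted(failed):
        groups.setdefault(root_cause(ci, failed), []).append(ci)
    return groups


alarms = correlate(
    ["server-sha-01", "router-sha-01", "wan-shanghai", "server-sha-02"]
)
# All four failures collapse into a single alarm keyed by "wan-shanghai".
```

If only a single server fails while the router and the WAN are up, the same function returns that server as its own root cause, so genuine isolated failures are still reported.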
This article is an excerpt from a comprehensive whitepaper.
Example of an "ideal" alerting process
As outlined above, only the interaction between monitoring, event management, capacity management, incident management and alerting modules ensures that technical failures are avoided or that errors are corrected as quickly as possible.
This is demonstrated with the following example:
During online banking activities, customers are unable to access the page on which the PIN for transactions is requested. Money transfers, standing orders, etc. are therefore not possible. In the background, central event management collects the data and analyzes the causes of the malfunction. Ideally, the system rectifies the malfunction itself, e.g. by restarting the application. Experience shows that such self-healing mechanisms reduce the number of alarms by approx. 20 percent. If the malfunction was caused, for example, by a capacity problem (server full) or by failed hardware (overheated server), AI-based predictive monitoring might recognize and solve these problems in time.
If an alert has to be sent, the system enriches the problem message with important structured information: where the affected device is located, which function exactly is faulty, which customers are affected, which service level agreements (SLAs) apply to the service, what the solution is, etc.
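A minimal sketch of such an enrichment step in Python might look as follows. The lookup table, its keys and its values are hypothetical; in a real setup the details would typically come from the CMDB and the solution database.

```python
# Hypothetical details per CI (illustrative values only).
CI_DETAILS = {
    "pin-service": {
        "location": "Data center A, rack 12",
        "customers": ["Online banking users"],
        "sla": "99.9% availability",
        "solution": "Restart the PIN application, then check the logs",
    },
}


def enrich(event):
    """Return a copy of the raw event with the stored CI details added.
    Unknown CIs are passed through unchanged."""
    details = CI_DETAILS.get(event["ci"], {})
    return {**event, **details}


alert = enrich({"ci": "pin-service", "message": "PIN page unreachable"})
```

The recipient then gets the location, the affected customers, the SLA and the proposed solution in one message instead of having to look them up.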
Depending on the malfunction, the service technician stored in the system is informed by push message via an app, by SMS or through other channels.
The technician uses the app or SMS to acknowledge the problem, signaling that they will take care of the malfunction and repair the device with the specific solution provided. If, however, the person in charge cannot be reached over the different communication channels, the problem is escalated based on a defined, sequential alarm concept: the next person on the on-call service team is informed and, if required, the team leader as well.
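The sequential escalation described above can be sketched as a simple loop. This is an illustration under stated assumptions: `notify` and `acknowledged` are hypothetical callbacks, and `acknowledged` stands in for waiting a defined time for a response, which a real alerting system would handle with timeouts.

```python
def escalate(chain, notify, acknowledged):
    """Notify each person in order, channel by channel, until someone
    acknowledges. Returns the name of the person who took over,
    or None if nobody in the chain responded."""
    for person in chain:
        for channel in person["channels"]:
            notify(person["name"], channel)
            if acknowledged(person["name"]):
                return person["name"]
    return None


# Hypothetical on-call chain for the online banking example.
chain = [
    {"name": "on-call-1", "channels": ["push", "sms"]},
    {"name": "on-call-2", "channels": ["push", "sms"]},
    {"name": "team-lead", "channels": ["voice"]},
]

sent = []  # records every notification attempt
owner = escalate(
    chain,
    notify=lambda name, channel: sent.append((name, channel)),
    acknowledged=lambda name: name == "on-call-2",  # first person is unreachable
)
```

In this run the first technician is tried on both channels without success, the second one acknowledges on the first attempt, and the team leader is never disturbed.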
In this example, the online banking problem will usually have top priority. In globally operating organizations with distributed service teams, the team whose service hours cover the current time is alerted. This ensures 24/7 service based on the "follow the sun" principle.
Whereas critical malfunctions are always reported via app, SMS or voice, the system uses, for example, e-mail to send alarms about less critical issues, day and night. Since on-call service teams are often paid only when they are actually called out, costs can be reduced.
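The routing rules from the last two paragraphs can be combined in a short sketch. The team names, the service windows in UTC and the criticality labels are assumptions made for illustration only.

```python
# Hypothetical service windows as (team, start hour, end hour) in UTC.
TEAMS = [
    ("APAC", 0, 8),
    ("EMEA", 8, 16),
    ("AMER", 16, 24),
]


def team_on_duty(utc_hour):
    """Follow the sun: return the team whose window covers the hour."""
    for name, start, end in TEAMS:
        if start <= utc_hour < end:
            return name
    raise ValueError(f"no team covers hour {utc_hour}")


def channels_for(criticality):
    """Critical alarms go out via app, SMS or voice; everything else by e-mail."""
    if criticality == "critical":
        return ["push", "sms", "voice"]
    return ["email"]


def route(criticality, utc_hour):
    """Decide who is alerted and over which channels."""
    return {
        "team": team_on_duty(utc_hour),
        "channels": channels_for(criticality),
    }


night_outage = route("critical", 3)   # handled by the team in its daytime
minor_issue = route("minor", 20)      # sent by e-mail only
```

Because minor issues only generate e-mails, nobody on call is actively called out for them, which is where the cost saving mentioned above comes from.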
Recommendations for action for companies
The following practical tips have proven to be important building blocks for the success of many alarm management projects:
Identify the IT-based processes that may be business-critical; estimate the risk of malfunctions; if required, invest in an alerting concept including the associated technology.
Before deciding on an alerting tool, concentrate on improving the underlying monitoring tools and event management processes, because alerting tools are no universal remedy for bad event management.
Use event management to eliminate "false positives" (i.e. "malfunction noise") before you take care of critical errors and their alarm communication. Define for each IT stakeholder what information they need and in what form, to avoid "spam".
Ease of use and individual on-call service planning are important factors for service teams to accept a solution.