The purpose of the Monitoring and Event Management practice is to systematically observe services and service components, and record and report selected changes of state identified as events. This practice identifies and prioritizes infrastructure, services, business processes, and information security events, and establishes the appropriate response to those events, including responding to conditions that could lead to potential faults or incidents. ServiceNow is a widely used ITIL-based cloud application that supports the Event Management process.
What is Event Management? A Successful IT Service Operations Strategy
Definition of Event in ITIL 4 Foundation: “Any change of state that has significance for the management of a service or other configuration item (CI). Events are typically recognized through notifications created by an IT service, CI, or monitoring tool.”
An event can be defined as any detectable or discernible occurrence that has significance for the management of the IT infrastructure or the delivery of an IT service, together with evaluation of the impact a deviation might cause to the services. The Event Management process is at the core of a successful ITIL IT Service Operations strategy.
Events are typically notifications created by an IT Service, configuration item, or monitoring tool. Effective IT Service Operations is dependent on knowing the status of the infrastructure and detecting any deviation from the normal or expected operation. This is provided by good monitoring and control systems, which are based on two types of tools:
- Active monitoring tools that poll key configuration items to determine their status and availability. Any exceptions will generate an alert that needs to be communicated to the appropriate tool or team for action.
- Passive monitoring tools that detect and correlate operational alerts or communications generated by configuration items.
The Objectives / Purpose of Event Management
The purpose of Event Management is to provide the entry point for the execution of many Service Operation processes and activities. In addition, it provides a way of comparing actual performance and behavior against design standards and Service Level Agreements.
Other objectives include:
- Provides the ability to detect, interpret and initiate appropriate action for events
- The basis for operational monitoring and control and entry point for many service operation activities
- Provides operational information, as well as warnings and exceptions, to aid automation
- Supports continual service improvement activities of service assurance and reporting and service improvement
The Scope of Event Management
Event management can be applied to any aspect of IT Service Management that needs to be controlled and which can be automated. These include:
- Configuration Items (CIs):
  - Some CIs will be included because they need to stay in a constant state
  - Some CIs will be included because their status needs to change frequently, and event management can be used to automate this and update the CMS
- Environmental conditions (e.g. fire and smoke detection)
- Software license monitoring for usage to ensure optimum/legal license utilization and allocation
- Security (e.g. intrusion detection)
- Normal activity (e.g. tracking the use of an application or the performance of a server)
The Difference Between Monitoring and Event Management
These two areas are very closely related but slightly different in nature. Event Management is focused on generating and detecting meaningful notifications about the status of the IT Infrastructure and Services.
Whilst Monitoring is required to detect and track these notifications, monitoring is broader than event management. For example, monitoring tools will check the status of a device to ensure that it is operating within acceptable limits, even if that device is not generating events.
Examples of Events
Events that signify regular operation:
- Notification that a scheduled workload has completed
- A user has logged in to use an application
- An email has reached its intended recipient
Events that signify an exception:
- A user attempts to log on to an application with the incorrect password
- An unusual situation has occurred in a business process that may indicate an exception requiring further business investigation (e.g. a web page alert indicates that a payment authorization site is unavailable – impacting financial approval of business transactions)
- A device’s CPU is above the acceptable utilization rate
- A PC scan reveals the installation of unauthorized software
Events that signify unusual, but not exceptional, operation:
- Server’s memory utilization reaches within 5% of its highest acceptable performance level
- The completion time of a transaction is 10% longer than normal
The Value to the Organization and Benefits of Event Management
Event management’s value to the organization is generally indirect; however, it is possible to determine the basis for its value as follows:
- Event management provides mechanisms for the early detection of incidents. In many cases, it is possible for the incident to be detected and assigned to the appropriate group for action before any actual service outage occurs.
- Event management makes it possible for some types of automated activity to be monitored by exception – thus removing the need for expensive and resource-intensive real-time monitoring while reducing downtime.
- When integrated into other service management processes (for example, availability or capacity management), event management can signal status changes or exceptions that allow the appropriate person or team to respond early, thus improving the performance of the process. This, in turn, will allow the business to benefit from more effective and more efficient service management overall.
- Event management provides a basis for automated operations, thus increasing efficiencies and allowing expensive human resources to be used for more innovative work, such as designing new or improved functionality or defining new ways in which the business can exploit technology for increased competitive advantage.
The Activities of Event Management in ITIL 3 and ITIL 4
The Service Design phase of the service lifecycle should define which events need to be generated and then specify how this can be done for each type of Configuration Item (CI). During the Service Transition phase, the event generation options would be set and tested.
Event occurs – Events occur continuously, but not all of them are detected or registered. It is, therefore, important that everybody involved in designing, developing, managing, and supporting IT services and the IT infrastructure that they run on understands what types of events need to be detected.
Event Notification – Most CIs are designed to communicate certain information about themselves in one of two ways:
- A device is interrogated by a management tool, which collects certain targeted data. This is often referred to as polling.
- The CI generates a notification when certain conditions are met. The ability to produce these notifications must be designed and built into the CI, for example, a programming hook inserted into an application.
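The two notification styles above can be sketched in a few lines. This is only an illustration: the `ConfigurationItem` class, its `listeners` hook, and the 90% CPU condition are hypothetical names and values chosen for the example, not part of ITIL or of any monitoring product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ConfigurationItem:
    """Hypothetical CI that supports both polling and self-generated notifications."""
    name: str
    cpu_percent: float = 0.0
    listeners: List[Callable[[str, str], None]] = field(default_factory=list)

    def status(self) -> Dict[str, float]:
        # Polling: the management tool asks and the CI answers.
        return {"cpu_percent": self.cpu_percent}

    def set_cpu(self, value: float, notify_above: float = 90.0) -> None:
        # Notification: the CI itself raises an event when a condition is met.
        # The 90% default is an assumed condition for illustration.
        self.cpu_percent = value
        if value > notify_above:
            for listener in self.listeners:
                listener(self.name, f"cpu_percent={value}")


def poll(cis: List[ConfigurationItem]) -> Dict[str, Dict[str, float]]:
    """Interrogate every CI once and collect the targeted data."""
    return {ci.name: ci.status() for ci in cis}
```

With polling, the management tool decides when data is collected. With notifications, the hook has to be built into the CI in advance, which is why the design phase matters.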
Event Detection – Once an event notification has been generated, it will be detected by an agent running on the same system, or transmitted directly to a management tool, specifically designed to read and interpret the meaning of the event.
Event Filtering – The purpose of filtering is to decide whether to communicate the event to a management tool or to ignore it. If ignored, the event will usually be recorded in a log file on the device, but no further action will be taken.
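As a minimal sketch of the filtering decision, assuming events arrive as dictionaries with a `type` key: the set of forwarded types and the logger name below are invented for the example.

```python
import logging
from typing import Optional

logger = logging.getLogger("device.events")

# Hypothetical set of event types the management tool cares about.
FORWARDED_TYPES = {"threshold_exceeded", "device_down", "login_failed"}


def filter_event(event: dict) -> Optional[dict]:
    """Return the event if it should be forwarded, otherwise log it and return None."""
    if event.get("type") in FORWARDED_TYPES:
        return event
    # Ignored events are still recorded locally, but no further action is taken.
    logger.info("ignored event: %s", event)
    return None
```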
The Event Management Lifecycle (Significance of Events) – Every organization will have its own categorization of the significance of an event, but it is suggested that at least these three broad categories be represented:
- Informational (INFO): This refers to an event that does not require any action and does not represent an exception. Informational events are typically stored in the system or service log files and kept for a predetermined period.
  Examples of informational events include:
  - A device has come online
  - A transaction is completed successfully
- Warning (WARN / ALERT): A warning is an event that is generated when a service or device is approaching a threshold. Warnings are intended to notify the appropriate person, process or tool so that the situation can be checked, and appropriate action taken to avoid an exception.
  Examples of warning events are:
  - Memory utilization on a server is currently at 65% and increasing. If it reaches 75%, response times will be unacceptably long and the Operational Level Agreement for that department will be breached.
  - The collision rate on a network has increased by 15% within a defined short period of time (e.g. an hour).
- Exception (ERROR): An exception means that a service or device is currently operating abnormally. Typically, this means that an Operational Level Agreement or Service Level Agreement has been breached and the business has been impacted. Exceptions could represent a total failure, impaired functionality, or degraded performance.
  Examples of exception events include:
  - A server is down
  - Response time of a standard transaction across the network has slowed to more than 15 seconds
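The three categories can be expressed as a simple threshold check. The sketch below borrows the 65% and 75% figures from the memory utilization example; treating 65% as the point where warnings start is an assumption made for illustration, and real thresholds would come from the organization's own agreements.

```python
from enum import Enum


class Significance(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


def classify_utilization(percent: float,
                         warn_at: float = 65.0,
                         breach_at: float = 75.0) -> Significance:
    """Map a utilization reading to one of the three significance categories."""
    if percent >= breach_at:
        return Significance.EXCEPTION      # agreed level breached
    if percent >= warn_at:
        return Significance.WARNING        # approaching the threshold
    return Significance.INFORMATIONAL      # normal operation, log only
```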
Event correlation – If an event is significant, a decision must be made about exactly what the significance is and what actions need to be taken to deal with it. It is here that the meaning of the event is determined.
Trigger – If the correlation activity recognizes an event, a response will be required. The mechanism used to initiate that response is also called a trigger. There are many different types of triggers, each designed specifically for the task it must initiate. Some examples could include:
- Incident triggers that generate a record in the incident management system
- Change triggers that generate a request for change
- A trigger resulting from an approved request for change that has been implemented but caused the event, or from an unauthorized change that has been detected
- Scripts that execute specific actions
- Paging systems that will notify a person or team of an event
- Database triggers that restrict access of a user to specific records or fields, or that create or delete entries in the database
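One way to picture triggers is as a lookup from the outcome of correlation to the mechanism that starts the response. The sketch below assumes correlation has already tagged the event with a `correlated_as` label. The labels and handler functions are hypothetical, and the handlers only return a description instead of calling a real incident or paging system.

```python
from typing import Callable, Dict, Optional


def incident_trigger(event: dict) -> str:
    return f"incident record created for {event['source']}"


def change_trigger(event: dict) -> str:
    return f"request for change raised for {event['source']}"


def paging_trigger(event: dict) -> str:
    return f"on-call team paged about {event['source']}"


# Hypothetical mapping from correlation result to trigger.
TRIGGERS: Dict[str, Callable[[dict], str]] = {
    "service_down": incident_trigger,
    "capacity_trend": change_trigger,
    "security_alarm": paging_trigger,
}


def respond(event: dict) -> Optional[str]:
    """Fire the trigger matching the correlated event, if one is defined."""
    trigger = TRIGGERS.get(event.get("correlated_as", ""))
    return trigger(event) if trigger else None
```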
Response selection – At this point of the process, there are several response options available:
Event logged – There will be a record of the event and any subsequent actions.
Auto Response – Some events are understood well enough that the appropriate response has already been defined and automated. This is normally a result of good design or previous experience (within problem management). The trigger will initiate the action and then evaluate whether it was completed successfully.
If not, an incident or problem record will be created. Examples of auto responses include rebooting a device, restarting a service, locking a device or application to protect it against unauthorized access.
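The auto-response pattern (act, verify, and fall back to a record if the action did not work) might look like the following sketch. The function name and its callback parameters are assumptions. In practice `action` could restart a service and `create_incident` would call the incident management tool.

```python
from typing import Callable


def auto_respond(action: Callable[[], None],
                 verify: Callable[[], bool],
                 create_incident: Callable[[str], None],
                 description: str) -> bool:
    """Run a predefined response, check that it worked, and raise an incident if not."""
    try:
        action()
        succeeded = verify()
    except Exception:
        # A failing action or check counts as an unsuccessful response.
        succeeded = False
    if not succeeded:
        create_incident(f"auto response failed: {description}")
    return succeeded
```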
Alert and human intervention – If the event requires human intervention, it will need to be escalated. The purpose of the alert is to ensure that the person with the skills appropriate to deal with the event is notified. The alert will contain all the information necessary for the person to determine the appropriate action.
Incident, problem, or change? – Some events will represent a situation where the appropriate response will need to be handled through the incident, problem or change management process.
- Create a Request for Change (RFC).
- Create an Incident Record – As with an RFC, an incident can be created as soon as an exception is detected, or when the correlation engine determines that a specific type or combination of events represents an incident.
- Open or link to a problem record – It is rare for a problem record to be opened without related incidents. In most cases, this step refers to linking an incident to an existing problem record. This will assist the problem management teams to reassess the severity and impact of the problem and may result in a changed priority to an outstanding problem.
- Special types of incident – In some cases, an event will indicate an exception that does not directly impact any IT service, e.g. unauthorized entry to a data center. In this case, the incident will be logged using an incident model that is appropriate for this type of exception, e.g. a security incident. The incident should be escalated to the group that manages that type of incident. As there is no outage, the incident model used should reflect that this was an operational issue rather than a service issue. These incidents should not be used to calculate downtime, and can, in fact, be used to demonstrate how proactive IT has been in making services available.
Review actions – As thousands of events are generated daily, it is not possible to review every one. However, it is important to check that any significant events or exceptions have been handled appropriately or to track trends or counts of event types, etc. In many cases, this can be done automatically.
Close event – Some events will remain open until a certain action takes place, for example, an event that is linked to an open incident. However, most events are not opened or closed. Informational events are simply logged and then used as input to other processes, such as backup and storage management. Auto-response events will typically be closed by the generation of a second event. For example, a device generates an event and is rebooted through auto-response – as soon as that device is successfully back online, it generates an event that effectively closes the loop and clears the first event.
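The "second event clears the first" behavior can be sketched as a small tracker. The event type names `device_offline` and `device_online`, and the dictionary shape of an event, are assumptions for the example.

```python
from typing import Dict, Tuple


class OpenEvents:
    """Tracks open events per device; a clearing event closes the matching open one."""

    # Hypothetical pairing of an opening event type with the type that clears it.
    CLEARED_BY = {"device_offline": "device_online"}

    def __init__(self) -> None:
        self.open: Dict[Tuple[str, str], dict] = {}

    def handle(self, event: dict) -> None:
        device, kind = event["device"], event["type"]
        if kind in self.CLEARED_BY:
            # Opening event: keep it until its counterpart arrives.
            self.open[(device, kind)] = event
            return
        for opening, clearing in self.CLEARED_BY.items():
            if kind == clearing:
                # Clearing event: close the loop for this device, if anything was open.
                self.open.pop((device, opening), None)
```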
Monitoring and Event Management Key Activities in ITIL 4
The processes and procedures needed in the monitoring and event management practice must address these key activities and more:
- Identifying what services, systems, CIs, or other service components should be monitored, and establishing the monitoring strategy
- Implementing and maintaining monitoring, leveraging both the native monitoring features of the elements being observed as well as the use of designed-for-purpose monitoring tools
- Establishing and maintaining thresholds and other criteria for determining which changes of state will be treated as events, and choosing criteria to define each type of event (informational, warning, or exception)
- Establishing and maintaining policies for how each type of detected event should be handled to ensure proper management
- Implementing processes and automation required to operationalize the defined thresholds, criteria, and policies.
The Terminology of Event Management
- Event – A change of state that has significance for the management of a configuration item or IT service.
- Trigger – An indication that some action or response to an event may be needed.
- Alert – A warning that a threshold has been reached or something has been changed. (An event has occurred).
Risk of Event Management
The key risks are failure to obtain adequate funding, failure to ensure the correct level of filtering, and failure to maintain momentum in rolling out the necessary monitoring agents across the IT infrastructure. If any of these risks is not addressed, it could adversely affect the success of Event Management.
Critical Success Factors of Event Management
To obtain the necessary funding, a compelling Business Case should be prepared showing how the benefits of effective Event Management can far outweigh the costs, giving a positive return on investment.
One of the most important CSFs is achieving the correct level of filtering. This is complicated by the fact that the significance of events changes.
For example, a user logging into a system today is normal, but if that user later leaves the organization and still tries to log in, it is a security breach.
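A minimal sketch of why a fixed rule is not enough: the same login event is classified differently depending on who currently belongs to the organization. The function name and the idea of passing in a set of active users are assumptions for illustration.

```python
from typing import Set


def classify_login(user: str, active_users: Set[str]) -> str:
    """Classify a login attempt using the current staff list, not a fixed rule."""
    if user in active_users:
        return "informational"  # normal operation: logged, no action required
    return "exception"  # account of someone no longer in the organization: security event
```

Because `active_users` changes over time, the significance of an identical event changes with it, which is what makes filtering hard to get right once and for all.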
There are three keys to the correct level of filtering, as follows:
- Integrate Event Management into all Service Management processes where feasible. This will ensure that only the events significant to these processes are reported.
- Design new services with Event Management in mind. Effective Event Management is not something to be designed only after a service has been deployed into Operations. Since Event Management is the basis for monitoring the performance and availability of a service, the exact targets and mechanisms for monitoring should be specified and agreed during the Availability and Capacity Management processes.
- Trial and error. No matter how thoroughly Event Management is prepared, there will be classes of events that are not properly filtered. Event Management must, therefore, include a formal process to evaluate the effectiveness of filtering.
Proper planning is needed for the rollout of the monitoring agent software across the entire IT Infrastructure. This should be regarded as a project with realistic timescales and adequate resources being allocated and protected throughout the duration of the project.
Event Management Relationship with other ITIL Processes
The primary process relationships are with incident, problem, and change management, which handle exception events as detailed within the event management process.
Capacity and Availability Management are critical in defining what events are significant, what appropriate thresholds should be, and how to respond to them. In return, event management will improve the performance and availability of services by responding to events when they occur and by reporting on actual events and patterns of events to determine (by comparison with Service Level Agreement targets and KPIs) if there is some aspect of the infrastructure design or operation that can be improved.
Configuration Management can use events to determine the status of any CI in the infrastructure. Comparing events with the authorized baselines in the Configuration Management System (CMS) will help to determine whether there is unauthorized change activity taking place in the organization.
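Comparing what events report against the authorized baseline can be sketched as a dictionary diff. The attribute names and version strings in the usage example are invented, and a real CMS would expose its baseline through its own interface.

```python
from typing import Dict, List


def find_unauthorized_changes(baseline: Dict[str, str],
                              observed: Dict[str, str]) -> List[str]:
    """Return the attributes whose observed value differs from the authorized baseline."""
    all_keys = baseline.keys() | observed.keys()
    return sorted(k for k in all_keys if baseline.get(k) != observed.get(k))
```

For example, with a baseline of `{"os": "linux-5.10", "nginx": "1.24"}` and an observed state of `{"os": "linux-5.10", "nginx": "1.25", "telnetd": "0.17"}`, the function reports `nginx` (changed) and `telnetd` (added) as candidates for investigation.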
Asset Management can use event management to determine the lifecycle status of assets. For example, an event could be generated to signal that a new asset has been successfully configured and is now operational.
Events can be a rich source of information that can be processed for inclusion in Knowledge Management Systems. For example, patterns of performance can be correlated with business activity and used as input into future design and strategy decisions.
Event Management can play an important role in ensuring that potential impact on Service Level Agreements is detected early, and any failures are rectified as soon as possible so that impact on service targets is minimized.
Event Management Roles and Responsibilities Matrix (RACI)
The ITIL Event Management RACI (responsibility matrix) is the starting point for implementing a successful Event Management process and for defining the Event Management steps in detail across different types of applications and software. Training in ITIL Event Management from a reputable source helps in understanding the process before implementing it in the business. The examples below show how an ITIL Event Management RACI matrix, or Event Management roles and responsibilities, can be defined.
IT Operations Manager – Process Owner
- An IT Operations Manager will be needed to take overall responsibility for several Service Operation activities. For instance, this role will ensure that all day-to-day operational activities are carried out in a timely and reliable way.
IT Operator / Technician
- IT Operators are the staff who perform the day-to-day operational activities.
- Typical responsibilities include performing backups, ensuring that scheduled jobs are performed, and installing standard equipment in the data center.
Responsibility Matrix (RACI): ITIL Event Management
|ITIL Role / Sub-Process|IT Operations Manager|IT Operator|(Event Monitoring System)|Other roles involved|
|---|---|---|---|---|
|Maintenance of Event Monitoring Mechanisms and Rules|AR|R|–|R|
|Event Filtering and 1st Level Correlation|A|–|R|–|
|2nd Level Correlation and Response Selection|A|R|R|–|
|Event Review and Closure|AR|–|–|–|
A: Accountable according to the RACI Model: Those who are ultimately accountable for the correct and thorough completion of the Event Management process.
R: Responsible according to the RACI Model: Those who do the work to achieve a task within Event Management.
Other roles involved, in cooperation as appropriate: IT Operations Manager, Access Manager, Capacity Manager, Availability Manager, IT Service Continuity Manager, Information Security Manager, Applications Analyst, and/or Technical Analyst.
ITIL Event Management Metrics and KPIs
You can define event management metrics during the design phase of IT services. Decide what types of events need to be generated and how they will be generated for each type of CI. Typical event management metrics include:
- Number of events by category
- Number of events by significance
- Number and percentage of events that required human intervention
- Number and percentage of events that resulted in incidents or changes
- Number and percentage of events caused by existing problems or known errors
- Number and percentage of repeated or duplicated events. This will help in the tuning of the Correlation Engine to eliminate unnecessary event generation and can also be used to assist in the design of better event generation functionality in new services
- Number and percentage of events indicating performance issues (for example, growth in the number of times an application exceeded its transaction thresholds over the past six months)
- Number and percentage of events indicating potential availability issues (e.g. failovers to alternative devices, or excessive workload swapping)
- Number and percentage of each type of event per platform or application
- Number and ratio of events compared with the number of incidents
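Several of these metrics reduce to counting over an event log. The sketch below assumes each event is a dictionary with `source`, `type`, and `significance` keys plus optional `human_intervention` and `incident_id` fields. These field names are hypothetical, and duplicates are approximated as repeats of the same source and type.

```python
from collections import Counter
from typing import Dict, List


def event_metrics(events: List[dict]) -> Dict[str, object]:
    """Compute a few count/percentage metrics from a list of event records."""
    total = len(events)

    def pct(n: int) -> float:
        return round(100.0 * n / total, 1) if total else 0.0

    human = sum(1 for e in events if e.get("human_intervention"))
    incidents = sum(1 for e in events if e.get("incident_id"))
    # Every occurrence beyond the first of a (source, type) pair counts as a duplicate.
    seen = Counter((e["source"], e["type"]) for e in events)
    duplicates = sum(count - 1 for count in seen.values())

    return {
        "by_significance": dict(Counter(e["significance"] for e in events)),
        "human_intervention": (human, pct(human)),
        "resulted_in_incident": (incidents, pct(incidents)),
        "duplicates": (duplicates, pct(duplicates)),
    }
```

Each tuple holds the number and the percentage, mirroring the "number and percentage" wording of the metrics above.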
Event Management Process Flow Chart
Following ITIL 3 best practices and the details explained in this article, the process flow below can help you understand Event Management better.
Event Management processes events, generates alerts, and manages alert and incident resolution. It either pulls events from supported external event sources using a server, or receives events pushed from external event sources. Alerts can then be handled in the following ways:
- Acknowledge alerts.
- Create a task such as an incident, problem, or change.
- If automatic remediation tasks apply to the alert, begin automatic remediation to start a workflow.
- Complete all tasks or remediation activities.
- Close alerts for resolved issues.
- Add additional information, such as a knowledge article for future reference.
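The alert handling steps above can be modeled as a small state machine. This is a sketch under assumptions: the state names and the allowed transitions are one plausible reading of the list, not the data model of any particular tool.

```python
from typing import Dict, List, Set

# Hypothetical alert states and the transitions allowed between them.
ALLOWED: Dict[str, Set[str]] = {
    "open": {"acknowledged"},
    "acknowledged": {"task_created", "remediating"},
    "task_created": {"remediating", "resolved"},
    "remediating": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),
}


class Alert:
    def __init__(self, description: str) -> None:
        self.description = description
        self.state = "open"
        self.notes: List[str] = []

    def transition(self, new_state: str) -> None:
        """Move the alert to a new state, rejecting steps that skip the lifecycle."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state

    def add_note(self, text: str) -> None:
        # Additional information, e.g. a reference to a knowledge article.
        self.notes.append(text)
```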
Monitoring and Event Management Contribution to ITIL 4 Service Value Chain
The monitoring and event management practice contributes to the service value chain and is involved in all value chain activities except plan:
Improve
The monitoring and event management practice is essential to the close observation of the environment to evaluate and proactively improve its health and stability.
Engage
Monitoring and event management may be the source of internal engagement for action.
Design and Transition
Monitoring data informs design decisions. Monitoring is an essential component of transition: it provides information about the transition success in all environments.
Obtain/Build
Monitoring and event management supports development environments, ensuring their transparency and manageability.
Deliver and Support
The practice guides how the organization manages internal support of identified events, initiating other practices as appropriate.
The Event Management process is one of the most important processes of ITIL IT Service Operations. It helps the IT operations team to monitor and take the required actions in time to avoid future incidents and problems, and to reduce the cost of the Change Management process.
You should implement a monitoring tool in your IT infrastructure landscape to monitor and generate events when required, and to create IT tickets following the Incident Management process, so that the IT operations team can respond effectively and create long-term value for the business.
Finally, make sure your event logs capture the appropriate level of detail: what happened, when it happened, how it was handled, who it was escalated to, and any details of communication with other people or systems to support any actions taken. Let me know what you think and what you would add to improve the ITIL Event Management process.