Each time I ask a CISO or a technical expert about their number of events per second (EPS), I get the same answers: “No idea.”, “A lot of events”, “EPS WTF?”. Most stakeholders are not aware of, or don’t understand, how important EPS metrics are as a design factor during the inventory and scoping phase of a Log or Event Management (SIEM) project. Why are EPS metrics so important?
EPS metrics usages
EPS metrics will help you to:
- Acquire an appropriate Log or Event Management solution
Most Log & Event Management vendors argue that their products support thousands of events per second. Their products are surely designed to support that number of EPS, and the vendor will surely ask about your own EPS metrics. Most of the time, unless it is an appliance, a Log & Event Management solution relies on third-party hardware and software. You will typically have dedicated servers (with a limited amount of CPU, RAM and NICs), a SAN storage connection (with limited capacity, I/O, speed, etc.), an attached external database (with its own critical metrics), a backup solution, network bandwidth, etc. EPS metrics will help you design part of your architecture and determine part of your costs (CAPEX / OPEX). The more EPS you have, the more you will need a scalable and highly available architecture. If you acquire a Log or Event Management appliance, you will be limited de facto by the vendor’s solution.
If you do not determine your EPS metrics during the acquisition process, you will very likely end up with a solution that is oversized or undersized compared to the real needs of your initial scope. But never forget that the EPS rate is only one factor in the final selection of your Log or Event Management solution.
- Respond appropriately to compliance and/or regulatory requirements
If you are subject to compliance and/or regulatory requirements that impose log and event retention policies, EPS metrics will help you determine your online and offline storage requirements. Your retention periods are dictated by those requirements, but your storage requirements are not. How many gigabytes or terabytes will you need to cover your retention periods?
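To give an idea of the arithmetic, here is a minimal sketch in Python. It assumes you already know your sustained EPS rate and an average event size in bytes; the function name and the figures (500 EPS, 300 bytes per event, 365 days) are purely hypothetical, and the result ignores compression, indexing overhead and backups.

```python
SECONDS_PER_DAY = 86_400


def storage_bytes(eps, avg_event_bytes, retention_days):
    """Raw storage needed to keep `retention_days` of events.

    Assumption: a constant event rate and a constant average event size.
    Compression and index overhead are deliberately ignored.
    """
    return eps * avg_event_bytes * SECONDS_PER_DAY * retention_days


# Hypothetical inputs: 500 EPS, 300-byte events, one year of retention.
needed = storage_bytes(eps=500, avg_event_bytes=300, retention_days=365)
print(f"{needed / 1e12:.2f} TB")  # prints "4.73 TB"
```

Run it once with your NE rate and once with your PE rate to get a low and a high estimate.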
- Improve your Capacity Management
During the day-to-day operation of your Log & Event Management solution, your storage has to be monitored to ensure that capacity meets current and future business requirements in a cost-effective manner. EPS metrics, measured against a baseline, will help you improve your application sizing and performance management, and build a capacity plan.
Depending on your EPS metrics, you may have to redesign your technical infrastructure, for example by clustering your SIEM solution or by creating an out-of-band network to deal with bandwidth limitations.
- Improve your Incident Management
Once you have an EPS baseline per device and/or per infrastructure, an abnormal variation in your event rate may indicate that an unauthorized change has been made, that a device is misconfigured, or that you are under attack.
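One simple way to spot such a variation is to compare the current rate with the mean and standard deviation of the baseline. The sketch below is only an illustration of that idea: the function name, the three-sigma threshold and the sample values are my own assumptions, and a real deployment would need to account for daily and weekly cycles.

```python
from statistics import mean, stdev


def is_abnormal(baseline, current_eps, k=3.0):
    """Return True when `current_eps` is more than `k` standard
    deviations away from the baseline mean (in either direction)."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # needs at least two baseline samples
    return abs(current_eps - mu) > k * sigma


# Hypothetical baseline: EPS samples for one device during normal activity.
baseline = [40, 42, 38, 41, 39, 43, 40]

print(is_abnormal(baseline, 41))   # within the usual range
print(is_abnormal(baseline, 120))  # sudden spike, e.g. a scan or a DoS
print(is_abnormal(baseline, 0))    # device stopped logging
```

Note that a drop to zero is flagged as well as a spike: a device that suddenly stops sending events is just as interesting as one that floods you.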
- Improve your Service Level Management
As an MSSP (Managed Security Service Provider), if you agree on an EPS baseline with your customer during the scope definition, it will be easier to include EPS guarantees and/or limits in the SLA. EPS metrics can be integrated into an SLA in the same way as network bandwidth, with concepts such as “burstable EPS”, “peak EPS” and “95th percentile EPS”.
- Provide some useful KPIs
Once you have an EPS baseline, you will be able to gather some interesting KPIs, for example the total number of audited events over a period of time, EPS versus correlated events, etc.
And there are surely other good reasons to determine your events per second 🙂
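As an illustration of the “95th percentile” idea borrowed from bandwidth billing, here is a small sketch using the nearest-rank method: sort the EPS samples and keep the value below which 95% of them fall, so that short bursts are not charged. The function name and the sample values are hypothetical, and an actual SLA would have to define the sampling interval and the exact percentile method.

```python
def percentile_95(samples):
    """Nearest-rank 95th percentile of a list of EPS samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical samples: 19 intervals at 100 EPS and one burst at 900 EPS.
samples = [100] * 19 + [900]

print(max(samples))            # 900, the peak EPS
print(percentile_95(samples))  # 100, the burst is ignored
```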
EPS metrics definitions and methodology
The best definitions of EPS metrics I have read are in the SANS whitepaper “Benchmarking Security Information Event Management (SIEM)”, published in February 2009. I will recap the metric definitions and the methodology for creating your EPS baseline.
There are two EPS metric definitions:
- Normal Events per second (NE): The NE metric represents the normal event rate for a device, or for your Log or Event Management scope.
- Peak Events per second (PE): The PE metric represents the peak event rate for a device, or for your Log or Event Management scope. PE reflects abnormal activity on devices that creates temporary EPS peaks, for example DoS, port scanning, mass SQL injection attempts, etc. The PE metric is the more important one because it determines your real EPS requirements.
Depending on your activities and your SIEM infrastructure, you will need these metrics for both activities: NE and PE for Log Management, and NE and PE for Event Management. A Log Management solution has its own EPS limits, which are not the same as those of the Event Management solution. This depends on your future Log & Event Management infrastructure: if you place a Log Management solution in front of the Event Management solution, you will be able to filter out unnecessary events before they are forwarded from Log Management to Event Management. I really recommend splitting the two activities across dedicated solutions.
Also, to get meaningful EPS metrics, we recommend that you analyse a period of 90 days of logs. The analysed logs should cover all your normal and peak activities. If you only analyse a short period of time, your EPS metrics will most likely not reflect reality.
Methodology:
To define your initial scope, ask yourself some simple questions. Which compliance or regulatory requirements need to be covered by the Log Management scope? Which initial “use cases”, or policies, will you monitor through the Event Management solution? Scope definition could fill a dedicated blog post, so I will not go further into how to determine it.
Identify and inventory all devices that should be integrated into your Log or Event Management scope. From your scope definition you will identify a certain number of required devices, some of which run the same technology (for example: 4 Check Point firewalls, 2 Apache web servers, etc.). These identical devices don’t have the same roles and activities, so they will most likely have different EPS metrics.
- Identify log locations and required events
For each device, identify the log location and the log retention period, and within these log files identify the events required for monitoring your “use cases” or policies. For Log Management, log everything. For Event Management, if you have a Log Management solution in front of the Event Management solution, you will only need certain log patterns. Identify these patterns and extract them into dedicated log files. Event Management is not about logging everything: don’t treat your SIEM solution as long-term storage, that role belongs to Log Management.
You will then probably have one original log file for the Log Management scope, and one derived log file for the Event Management scope.
- Identify NE and PE metrics for devices and get the PE grand total
Here comes the log-fu and the mathematics. You will need some shell skills to extract all the necessary information, and you can simply use Excel to analyse it.
Identify the PE rate of each of your devices and sum all PE numbers to come up with a grand total for your environment. It is unlikely that all devices in your scope will ever produce events at their maximum rate simultaneously.
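The grand total itself is a plain sum. As a minimal sketch, with hypothetical device names and PE values that do not come from a real environment:

```python
def pe_grand_total(peak_eps_by_device):
    """Sum the per-device peak EPS rates into a worst-case total."""
    return sum(peak_eps_by_device.values())


# Hypothetical per-device PE rates, in events per second.
peak_eps_by_device = {
    "fw-01": 120,
    "fw-02": 80,
    "ids-01": 12,
    "web-01": 45,
}

print(pe_grand_total(peak_eps_by_device))  # 257
```

Because the devices rarely peak at the same moment, this figure is best read as an upper bound for sizing and not as a rate you expect to observe.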
Example of PE rate analysis
In this example (Google Docs), I have an IDS exposed to the Internet, and I will do some statistical analysis. We will analyse one month of logs to determine the PE metric for this device. First gather the number of events per day and calculate your average and median EPS per day (number of events per day / 86400 seconds). In this example I have an average EPS rate of 0.03 and a median EPS rate also equal to 0.03. But as you can see, 12 days have an average EPS rate above 0.03, and one day peaks at an average EPS rate of 0.08.
We will zoom in on 2011-04-10, which has the peak average EPS rate of 0.08, to determine the exact peak rate for that day. This time the events are plotted per minute. We can see that the peak is located between 09:42 PM and 09:59 PM. We can also see that our PE rate, measured at one-minute intervals over the entire day, is now 6.27 (number of events per minute / 60) and no longer 0.08!
We will zoom in on this time interval to identify our PE more precisely, this time counting events per second. We can see that the real PE rate is 12 and not 6.27!
As this example shows, if you don’t analyse your logs precisely, you will not be able to determine your exact NE and PE rates. The PE grand total clearly does not represent a real PE rate, but it will help you avoid a Log or Event Management solution that is undersized in terms of EPS limits.
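If you prefer a script to a spreadsheet, the per-day calculation can be sketched as follows. The daily counts below are made up for illustration (they are not the data from my spreadsheet); I only picked them so that they give rates of 0.03 and 0.08.

```python
from statistics import mean, median

SECONDS_PER_DAY = 86_400


def daily_eps(events_per_day):
    """Convert {day: number of events} into {day: average EPS for that day}."""
    return {day: count / SECONDS_PER_DAY for day, count in events_per_day.items()}


# Made-up daily event counts for an IDS.
events_per_day = {
    "2011-04-08": 2592,
    "2011-04-09": 2592,
    "2011-04-10": 6912,
}

rates = daily_eps(events_per_day)
for day, eps in rates.items():
    print(day, round(eps, 2))

print("average:", round(mean(rates.values()), 3))
print("median:", round(median(rates.values()), 3))
```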
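The effect of the interval size can be reproduced with a few lines of Python. This is only a sketch on synthetic data: the timestamps are plain seconds that I generate myself (one event per minute for an hour, plus a burst of 12 events within a single second), and the function name is hypothetical.

```python
from collections import Counter


def peak_eps(timestamps, bucket_seconds):
    """Peak event rate when events are counted in buckets of
    `bucket_seconds` seconds and the busiest bucket is averaged."""
    buckets = Counter(ts // bucket_seconds for ts in timestamps)
    return max(buckets.values()) / bucket_seconds


# Synthetic data: background noise plus a one-second burst of 12 events.
background = [minute * 60 for minute in range(60)]
burst = [1805] * 12
timestamps = background + burst

print(peak_eps(timestamps, 3600))  # per hour: the burst is invisible
print(peak_eps(timestamps, 60))    # per minute: the burst is diluted
print(peak_eps(timestamps, 1))     # per second: 12.0
```

The same events give three very different “peak” figures, which is exactly why the analysis has to go down to the second before you trust a PE number.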