The MITRE ATT&CK Evaluation is a unique exercise in many ways. One of its distinguishing features is the absence of any scores or ratings to enable the direct comparison of participating solutions (as happens in anti-malware tests). The result of the Evaluation is a complex table of assessments of all detections that a given security solution has produced, for different stages of the attacks of a specific adversary. We’ve already published some of our initial findings, based on the ‘matrix’ representation of our test results.
But other ways to access the Evaluation results are also available. Alongside the ATT&CK framework, MITRE offers Joystick, a data analysis tool that lets users explore the results in graphic form. This article contains some observations based on MITRE visualizations of the Round 2 Evaluation, as well as data analysis using other statistical instruments. We’ll explore the Evaluation results in four important directions:
- Coverage of the attack's operational flow, in terms of missed steps
- Coverage of individual techniques, in terms of high-quality detections
- Number of detections in specific categories
- Number of detections with specific modifiers
Coverage of the attack’s operational flow
Let's start with the Joystick diagrams that MITRE uses to display test results on its official webpage. The diagram for each vendor shows a timeline of steps, rather than a matrix of tactics/techniques. We’re looking here at the operational flow, and the number of detections of each category at each step in that flow – a step being a group of actual test runs at one stage of the Evaluation. Because they represent the actual attack implementation flow, these steps don’t follow the same order as the tactics and techniques in the ATT&CK matrix.
The test attack itself comprised 20 steps, but MITRE has chosen to omit Step 19 from the evaluation results.
The detection categories in the Joystick diagram (see Figure 1) are color-coded. ‘None’ and ‘Telemetry’ are the ‘lowest’ outcomes, represented in two shades of blue. ‘None’ means that the product didn't notice the suspicious action at all, while ‘Telemetry’ means that some data was collected, but no detection logic was applied, and the event was not labelled (as malicious or otherwise). So, basically, ‘Telemetry’ is not a true detection - just an event log. True detections are represented by the other four categories, in shades of yellow and green: ‘MSSP’, ‘General’, ‘Tactic’ and ‘Technique’.
Where only ‘None’ or ‘Telemetry’ blues are shown on the diagram for a step, and there are no yellows or greens, we may call this a missed step – a stage of the attack where no true detections were made. Here, for example, is the results diagram for one vendor's evaluation (not Kaspersky). Several missed steps can be seen – look at columns 2, 8, 10, 12, 13, and 17:
Is it a failing for a security solution to miss so many attack stages? Well, you must decide for yourself. We think it would be better to minimize such missed steps. Here’s the number of missed steps for each of the vendors evaluated:
An interesting question: which ‘missed steps’ matter most? It depends, of course, on the particular case and the ultimate goal of the specific attack. But remember one crucial limitation of the ATT&CK Evaluation: the security product is not allowed to take any preventative/remediation actions at any stage of the attack during testing. This fails to reflect real-life conditions, where prevention modules should stop any attack at the earliest possible stage. So we can say that in the real world, where each successive stage brings a progressively greater chance that the attack has already been detected, the later a missed step occurs, the less risk it represents.
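As a rough illustration of this reasoning, here is a minimal Python sketch for flagging missed steps and noting how far into the operational flow each one falls. The per-step data is invented (the real results would first need to be flattened into a similar step-to-categories mapping), but the rule is the one described above: a step counts as missed when none of its test runs produced anything better than ‘None’ or ‘Telemetry’.

```python
# Illustrative sketch only: the per-step data below is invented, not the real
# Round 2 results. Each step maps to the detection categories recorded for the
# test runs (sub-steps) it contains.
step_results = {
    1: ["Technique", "Telemetry"],
    2: ["None", "Telemetry"],            # only 'blues' -> a missed step
    3: ["Tactic", "Telemetry", "MSSP"],
    # ... steps 4-20 would follow in a complete data set
}

TRUE_DETECTIONS = {"MSSP", "General", "Tactic", "Technique"}

def missed_steps(results):
    """Return the steps where no true detection (yellow/green category) was made."""
    return sorted(
        step for step, categories in results.items()
        if not TRUE_DETECTIONS.intersection(categories)
    )

total_steps = max(step_results)
for step in missed_steps(step_results):
    # The later the missed step, the greater the chance the attack was already caught.
    print(f"Step {step} missed ({step} of {total_steps} steps into the flow)")
```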
For example, you’ll see that a large group of vendors missed only one step in the Evaluation. Of these, Microsoft, Symantec, F-Secure, FireEye, Secureworks and GoSecure all missed step 10 (Persistence Execution), while TrendMicro and Palo Alto missed step 13 (Local Enumeration). These steps occur somewhere near the middle of the attack's operational flow.
For Kaspersky, the only step without good detections was step 18 (Exfiltration), the last but one in the attack chain (see Figure 1). In a real-world situation, the security solution would have detected the attack long before this step was reached.
The diagram below allows you to estimate the danger of missed steps based on the distance of these steps from the beginning of the attack:
Coverage of individual techniques
It should be noted that a broad-brush comparison of Evaluation results is not always that useful – after all, your organization will probably never come up against some of the attack techniques emulated in the MITRE test. For this reason, the MITRE Joystick data analysis tool allows you to select your own parameters and explore the Evaluation results this way.
You can also use your own statistical instruments to further explore the Evaluation data most relevant to your organization’s particular needs and resources. For example, you could choose a particular technique from the Evaluation Round 2 scope - and see how it was covered by high-quality detections, such as 'Technique' or 'MSSP'. We've previously listed Round 2 techniques where Kaspersky's solution achieved 100% visibility - all the test runs for each of these techniques resulted in high-quality detections:
To show why this metric matters, let's take a closer look at a couple of techniques, each involving several test runs - one from the beginning of the emulated attack (T1086 PowerShell) and one from the end (T1002 Data Compressed).
As you can see, not every solution evaluated detected all the test-runs of the selected techniques. These ‘patchy’ performances must raise issues about whether the solutions involved should be relied upon to provide consistent protection against these techniques.
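One way to quantify this consistency yourself is sketched below. The records are invented for illustration, and the field names are ours rather than MITRE's - in practice you would first flatten the published results into one (vendor, technique, category) entry per test run. The aggregation then counts a technique as fully covered only when every one of its test runs produced a 'Technique' or 'MSSP' detection.

```python
# Illustrative sketch only: one invented (vendor, technique, category) record
# per test run; the published results would first be flattened into this shape.
from collections import defaultdict

records = [
    ("VendorA", "T1086 PowerShell",      "Technique"),
    ("VendorA", "T1086 PowerShell",      "MSSP"),
    ("VendorA", "T1002 Data Compressed", "Telemetry"),
    ("VendorB", "T1086 PowerShell",      "Technique"),
    ("VendorB", "T1002 Data Compressed", "Technique"),
]

HIGH_QUALITY = {"Technique", "MSSP"}

# Group the categories of every test run by (vendor, technique) ...
runs = defaultdict(list)
for vendor, technique, category in records:
    runs[(vendor, technique)].append(category)

# ... then count, per vendor, the techniques where every run was high quality.
full_coverage = defaultdict(int)
for (vendor, technique), categories in runs.items():
    if all(category in HIGH_QUALITY for category in categories):
        full_coverage[vendor] += 1

for vendor, count in sorted(full_coverage.items()):
    print(f"{vendor}: {count} technique(s) with 100% high-quality coverage")
```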
The diagram below shows the total number of Round 2 techniques for which each vendor achieved 100% coverage with high-quality detection categories ('Technique' or 'MSSP').
Detections of specific categories
You may also want to explore the detections in some particular categories. For example, the next diagram shows the total number of MSSP detections made by different vendors:
You can see that six of the vendors evaluated didn't support their EDR performance with any MSSP services at all. At Kaspersky, we believe that while many attack methods can be detected automatically, there are also those that require human expertise to uncover (in this test, for instance, no vendor achieved good enough fully automatic detections on steps 4 and 11). That's why we feel that a comprehensive solution should combine a fully automated security product with a threat hunting service. During our Evaluation, the latter element was successfully handled by the Kaspersky Managed Protection service, which benefits from both automatic and manual detection.
The combined metric ‘high-quality detection’ ('Technique' or 'MSSP') above grows out of this concept of a combined solution. The diagram below shows the number of test-runs where the vendors gained these high-quality detections:
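Both of these counts are easy to reproduce once the results are reshaped into per-vendor, per-test-run sets of categories. The sketch below uses invented data purely for illustration; the two counters correspond to the MSSP tally above and the high-quality tally just described.

```python
# Illustrative sketch only: each invented record is the set of detection
# categories produced for a single test run of the emulated attack.
test_runs = {
    "VendorA": [{"Telemetry", "MSSP"}, {"Technique"}, {"None"}],
    "VendorB": [{"Telemetry"}, {"General", "Tactic"}, {"Technique", "MSSP"}],
}

HIGH_QUALITY = {"Technique", "MSSP"}  # the combined 'high-quality detection' metric

for vendor, runs in test_runs.items():
    mssp_total = sum("MSSP" in categories for categories in runs)
    high_quality_runs = sum(bool(categories & HIGH_QUALITY) for categories in runs)
    print(f"{vendor}: {mssp_total} MSSP detection(s), "
          f"{high_quality_runs} test run(s) with a high-quality detection")
```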
Detections with specific modifiers
In addition to the main categories, each detection in the ATT&CK Evaluation can be marked with special modifiers that provide more information about the detection. One of the most important is Configuration Change: this modifier is used when a vendor has had to tune the product after the start of the evaluation.
By default, an out-of-the-box solution must be used in the test, with default settings. Each product can also be tuned by its vendor prior to the test, but all such configuration changes must be reported and shown on a dedicated page of the MITRE site, so that customers can apply these changes to their own copies of the product and reproduce both the attack and the product's response.
However, if a product didn’t detect some security events during the first two days of the Evaluation, the vendor could obtain permission for additional tuning and re-testing on day 3. The detections gained after this re-configuration are marked ‘Configuration Change’. Simply put, if a customer uses this security solution without this additional re-configuration, it will very probably miss all such detections. And obviously, no security officer in the real world is going to be happy about having to constantly re-configure the product.
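If you want to estimate how much a result depends on that day-3 tuning, filtering out detections that carry the 'Configuration Change' modifier is straightforward. The records below are invented for illustration (each detection reduced to a category plus a list of modifiers); the real results would first be flattened into the same shape.

```python
# Illustrative sketch only: each invented detection is reduced to its category
# plus the list of modifiers attached to it in the evaluation results.
detections = [
    {"category": "Technique", "modifiers": []},
    {"category": "Tactic",    "modifiers": ["Configuration Change"]},
    {"category": "MSSP",      "modifiers": ["Delayed"]},
]

# Keep only what an out-of-the-box deployment would have produced, i.e. drop
# everything gained after the day-3 re-configuration.
out_of_the_box = [
    d for d in detections if "Configuration Change" not in d["modifiers"]
]
tuned_only = len(detections) - len(out_of_the_box)

print(f"{tuned_only} detection(s) relied on a configuration change")
print(f"{len(out_of_the_box)} detection(s) available with default settings")
```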
So it could be interesting to see which solutions show the fewest 'Configuration Change' modifiers. Here they are:
Another way to explore detection types and modifiers in the Evaluation results has been proposed by Josh Zelonis from Forrester. He has calculated the ‘Actionability’ of alerts produced by different solutions. The idea is to estimate both the efficiency of the alerts (not too many) and their quality (how well alerts add to an understanding of the story). Based on the Forrester analyst's formula, vendors demonstrated these levels of Alert Actionability:
In conclusion, the ATT&CK Evaluation results should be viewed from the perspective of your own security goals and infrastructure features. Having said this, a good security solution should demonstrate the widest possible coverage of attack techniques and attack stages (steps, tactics) - and the ATT&CK Evaluation is an effective way of visually assessing this.
More materials about ATT&CK Evaluation and how ATT&CK is used in Kaspersky products can be found here – Kaspersky.com/MITRE