This page deals with SOC and CERT management.
- Must read
- Challenges
- SOC organization
- CSIRT organization
- TTP knowledge base reference
- Data quality and management
- Key documents for a SOC
- Detection assessment
- Global self assessment
- Reporting
- To go further
- FIRST, Building a SOC
- NCSC, Building a SOC
- FIRST, CERT-in-a-box
- FIRST, CSIRT Services Framework
- ENISA, Good practice for incident management
- CIS, 8 critical security controls
- CMM, SOC-CMM
- LinkedIn Pulse, Evolution Security Operations Center
- Gartner, Cybersecurity business value benchmark
- LogRhythm, 7 metrics to measure the effectiveness of your SOC
- Google, Modernize your SOC for the future
- Signalblur, Getting started with ATT&CK heatmaps
- TheHackerNews, NIST CSF v2
- FIRST, ISO 27035 Practical value for CSIRT and SOCs
- Infoblox, NIS2 & NCSC CAF
As per the aforementioned article, I recommend keeping in mind the following common challenges for a SOC/CSIRT:
- No real need for tiering (L1/L2/L3)
- this is an old model for service providers, not necessarily for a SOC!
- as per MITRE paper (p65):
In this book, the constructs of “tier 1” and “tier 2+” are sometimes used to describe analysts who are primarily responsible for front-line alert triage and in-depth investigation/analysis/ response, respectively. However, not all SOCs are arranged in this manner. In fact, some readers of this book are probably very turned off by the idea of tiering at all [38]. Some industry experts have outright called tier 1 as “dead” [39]. Once again, every SOC is different, and practitioners can sometimes be divided on the best way to structure operations. SOCs which do not organize in tiers may opt for an organizational structure more based on function. Many SOCs that have more than a dozen analysts find it necessary and appropriate to tier analysis in response to these goals and operational demands. Others do not and yet still succeed, both in terms of tradecraft maturity and repeatability in operations. Either arrangement can succeed if by observing the following tips that foreshadow a longer conversation about finding and nurturing staff in “Strategy 4: Hire AND Grow Quality Staff.”
Highly effective SOCs enable their staff to reach outside their assigned duties on a routine basis, regardless of whether they use “tier” to describe their structure.
- Instead of tiering, and based on experience, three different teams are needed:
- security monitoring team (which actually does the "job" of detecting security incidents, and is fully autonomous)
- security monitoring engineering team (which fixes/improves security monitoring such as SIEM rules and SOA playbooks, generates reports, and helps handle uncommon use cases)
- build / project management team (which handles tool integration, SIEM data ingestion, specific DevOps tasks, and project management).
- There is a huge difference between "on-call" and "24x7":
- an "on-call" service is supposed to handle pre-validated types of alerts, with maximum severity and urgency.
- a "24x7" service is supposed to provide the same quality of service, regardless of the time of day or the date (nights, weekends, holidays).
- Here is an example of team shifts to really achieve "24x7":
Source: LinkedIn article
- Define a RACI, especially if you contract with an MSSP.
- You may want to consider my own template
- Designate among team analysts:
- triage officer;
- incident handler;
- incident manager;
- deputy CERT manager.
- Generally speaking, follow the best practices described in ENISA's "Good practice for incident management" (see "Must read").
- Use MITRE ATT&CK
- Document all detections (SIEM Rules, etc.) using MITRE ATT&CK ID, whenever possible.
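As an illustration of documenting detections with ATT&CK IDs, here is a minimal sketch of a rule catalog that flags rules with no technique ID or with a malformed one. The `DetectionRule` class, the helper names, and the rule names are hypothetical; only the ID format (`T1234` or `T1234.001` for sub-techniques) comes from ATT&CK itself.

```python
import re
from dataclasses import dataclass, field

# ATT&CK technique IDs look like T1059, or T1059.001 for a sub-technique.
ATTACK_ID = re.compile(r"T\d{4}(\.\d{3})?")


@dataclass
class DetectionRule:
    """Hypothetical catalog entry for one detection (SIEM rule, etc.)."""
    name: str
    attack_ids: list[str] = field(default_factory=list)

    def invalid_ids(self) -> list[str]:
        """Return the IDs that do not match the ATT&CK technique ID format."""
        return [i for i in self.attack_ids if not ATTACK_ID.fullmatch(i)]


def undocumented(rules: list[DetectionRule]) -> list[str]:
    """Names of the rules that carry no ATT&CK ID at all."""
    return [r.name for r in rules if not r.attack_ids]


# Illustrative rule names, not taken from any real rule set.
rules = [
    DetectionRule("Suspicious PowerShell command line", ["T1059.001"]),
    DetectionRule("Brute force on VPN gateway", ["T1110"]),
    DetectionRule("Legacy proxy rule"),
    DetectionRule("Rule with a mistyped ID", ["T10590"]),
]

print("Rules without ATT&CK ID:", undocumented(rules))
for rule in rules:
    if rule.invalid_ids():
        print(f"{rule.name}: malformed IDs {rule.invalid_ids()}")
```

This check only validates the format of an ID, not that the technique exists in the current ATT&CK release.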
- Implement an information model, like the Splunk CIM one:
- do not hesitate to extend it, depending on your needs
- make sure this data model is implemented in the SIEM, SIRP, SOA, and even the TIP.
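To make the idea of a shared information model more concrete, here is a minimal sketch of normalizing events from different sources to common field names. The target names `src`, `dest`, and `user` follow the Splunk CIM naming style; the source names, raw field names, and the `normalize` helper are assumptions made for the example, not an actual CIM implementation.

```python
# Hypothetical per-source mapping: raw field name -> common field name.
FIELD_MAP = {
    "windows_security": {
        "IpAddress": "src",
        "TargetUserName": "user",
        "Computer": "dest",
    },
    "proxy": {
        "c_ip": "src",
        "cs_username": "user",
        "cs_host": "dest",
    },
}


def normalize(source: str, event: dict) -> dict:
    """Rename known raw fields to the common model; keep the rest aside."""
    mapping = FIELD_MAP[source]
    normalized = {}
    unmapped = {}
    for key, value in event.items():
        if key in mapping:
            normalized[mapping[key]] = value
        else:
            unmapped[key] = value
    if unmapped:
        # Extension point: fields the model does not cover yet.
        normalized["unmapped"] = unmapped
    return normalized


proxy_event = {"c_ip": "10.0.0.5", "cs_username": "alice", "sc_status": "200"}
print(normalize("proxy", proxy_event))
```

Keeping unmapped fields visible, rather than dropping them, makes it easier to spot where the model needs to be extended.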
- Document an audit policy that is tailored to the detection needs/expectations of the SOC:
- the document aims to answer a generic question: what to audit/log, and on which equipment/OSes/services/apps?
- Take the Yamato Security work as an example of the audit policy required for the Sigma community rules.
- Don't forget to read the Microsoft Windows 10 and Windows Server 2016 security auditing and monitoring reference.
- Document a detection strategy, tailored to the needs and expectations regarding the SOC capabilities.
- The document should list the detection rules (SIEM searches, for instance), with key examples of results and an overview of the handling procedures.
- Document and keep up to date a detection matrix, which represents the detection capabilities for designated (feared) events, as per the known capabilities of the security sensors.
- You may want to have a look at my detection matrix template.
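As a rough illustration of what a detection matrix can capture, the sketch below crosses feared events with security sensors and lists the events that no sensor detects yet. The feared events, sensor names, and the three capability levels are hypothetical and are not taken from the template mentioned above.

```python
# Hypothetical capability levels of a sensor for a given feared event.
NONE, LOGGED, DETECTED = 0, 1, 2

# Rows: feared events. Columns: security sensors. All values are examples.
matrix = {
    "Ransomware deployment": {"EDR": DETECTED, "NDR": LOGGED, "Proxy": NONE},
    "Data exfiltration over web": {"EDR": NONE, "NDR": LOGGED, "Proxy": LOGGED},
    "Privileged account takeover": {"EDR": LOGGED, "NDR": NONE, "Proxy": NONE},
}


def best_coverage(matrix: dict) -> dict:
    """Best capability level reached by any sensor, per feared event."""
    return {event: max(sensors.values()) for event, sensors in matrix.items()}


def gaps(matrix: dict) -> list:
    """Feared events for which no sensor reaches the DETECTED level."""
    return [e for e, level in best_coverage(matrix).items() if level < DETECTED]


print("Feared events without confirmed detection:", gaps(matrix))
```

In practice the matrix would be kept up to date from purple teaming results rather than filled in by hand.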
- Run purple teaming sessions regularly over time!
- e.g.: Intrinsec, FireEye
- To do it on your own, here are a few recommended frameworks/tools:
- Frameworks:
- RedCanary Atomic Red Team
- CTID
- Tools:
- Ytisf's zoo
- Abuse.ch Malware Bazaar
- Knowbe4 ransomware simulator
- NB: don't forget to watermark the offensive tools being used! See ProtectMyTooling
- Picture the currently confirmed detection capabilities thanks to purpleteaming, with tools based on ATT&CK:
- e.g.: Vectr
- Use Security Stack Mappings to picture detection capabilities for a given security solution/environment (like AWS, Azure, NDR, etc.):
- Leverage the DeTTECT framework
- Leverage the RE&CT framework to drive detection activities
- Leverage the RE&CT framework to drive generic and fundamental containment actions.
- Read the SOC Cyber maturity model from CMM
- Run the SOC-CMM self-assessment tool
- Read the OpenCSIRT cybersecurity maturity framework from ENISA
- Run the OpenCSIRT, SIM3 self-assessment
- Read the SOC-CMM 4CERT from CMM
Generate metrics, leveraging the SIRP traceability and logging capabilities to get relevant data, as well as a bit of scripting.
As per Gartner, MTTR:
And MTTC:
Below are my recommendations for KPIs and SLAs. Unless otherwise specified, the recommended timeframes for computing these KPIs are: 1 week, 1 month, and 6 months.
- Number of alerts (SIEM).
- Number of verified alerts (meaning, confirmed security incidents).
- Top security incident types.
- Top applications associated with alerts (detections).
- Top detection rules triggering the most false positives.
- Top detection rules whose corresponding alerts take the longest to handle.
- Top 10 SIEM searches (i.e., detection rules) triggering false positives.
- Most seen TTP in detection.
- Most common incident types.
- Top 10 longest tickets before closure.
- Percentage of SIEM data that is not associated with SIEM searches (i.e., detection rules).
- Percentage of known endpoints with company-required security solutions.
- Percentage of critical and high-risk applications that are protected by multifactor authentication.
- Ratio of always-on personal privileged accounts to the number of individuals in roles who should have access to these accounts.
- Percentage of employees and contractors that have completed mandatory security training.
- Percentage of employees who report suspicious emails for the standard organization-wide phishing campaigns.
- Percentage of click-throughs for the organization-wide phishing campaigns in the past 12 months.
- Number of false positives.
- Number of new detection use-cases (SIEM rules) being put in production.
- Number of new detection automation use-cases (enrichment, etc.) being put in production.
- Number of new response automation use-cases (containment, eradication) being put in production.
- Number of detection rules whose detection capability and handling process have been confirmed in a purple teaming session so far.
- MTTH: for all incidents, mean time in hours to handle (assign) the alerts.
- MTTT: for all incidents, mean time in hours to triage ("verify") the alerts.
- MTTC: for critical and medium security incidents, mean time in hours to handle the alerts and start mitigation steps (from triage to initial response).
- MTTR: for critical and medium security incidents, mean time in hours to handle the alerts and remediate them (from triage to remediation).
- Percentage of critical assets that have successfully run ransomware recovery assessment, in the past 12 months.
- Average number of hours from the request for termination of access to sensitive or high-risk systems or information, to deprovisioning of all access.
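The time-based KPIs above can be computed with a bit of scripting on top of a SIRP ticket export. Here is a minimal sketch, assuming each ticket carries a severity and ISO 8601 timestamps. The field names (`created`, `assigned`, `triaged`, `contained`, `remediated`) and the sample tickets are hypothetical, and measuring MTTH and MTTT from alert creation is an assumption, since the definitions above do not state the starting point.

```python
from datetime import datetime
from statistics import mean


def hours_between(start: str, end: str) -> float:
    """Elapsed time in hours between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600


def mean_hours(tickets, start_field, end_field, severities=None):
    """Mean elapsed hours between two ticket fields, optionally by severity."""
    selected = [t for t in tickets
                if severities is None or t["severity"] in severities]
    return mean(hours_between(t[start_field], t[end_field]) for t in selected)


# Hypothetical export from the SIRP, covering the chosen timeframe.
tickets = [
    {"severity": "critical", "created": "2024-03-04T08:00:00",
     "assigned": "2024-03-04T08:30:00", "triaged": "2024-03-04T09:00:00",
     "contained": "2024-03-04T11:00:00", "remediated": "2024-03-05T09:00:00"},
    {"severity": "medium", "created": "2024-03-04T10:00:00",
     "assigned": "2024-03-04T11:30:00", "triaged": "2024-03-04T13:00:00",
     "contained": "2024-03-04T17:00:00", "remediated": "2024-03-06T13:00:00"},
    {"severity": "low", "created": "2024-03-04T14:00:00",
     "assigned": "2024-03-04T15:00:00", "triaged": "2024-03-04T16:00:00",
     "contained": "2024-03-04T18:00:00", "remediated": "2024-03-04T20:00:00"},
]

# MTTC and MTTR only consider critical and medium incidents, as defined above.
SCOPED = {"critical", "medium"}

kpis = {
    "MTTH": mean_hours(tickets, "created", "assigned"),
    "MTTT": mean_hours(tickets, "created", "triaged"),
    "MTTC": mean_hours(tickets, "triaged", "contained", SCOPED),
    "MTTR": mean_hours(tickets, "triaged", "remediated", SCOPED),
}

for name, value in kpis.items():
    print(f"{name}: {value:.1f} h")
```

Running the same computation on exports filtered to 1 week, 1 month, and 6 months gives the three recommended views of each KPI.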
- Define SOC priorities, with feared events and offensive scenarios (TTP) to be monitored, as per risk analysis results.
- My recommendation: leverage EBIOS RM methodology (see Detection engineering).
- Leverage machine learning wherever it is relevant, meaning wherever it yields a good ratio of true positives to false positives.
- My recommendations: be careful, try not to saturate the SOC consoles with false positives, and don't forget to collect the context required to analyze (verify) the detection!
- Make sure to follow the 11 strategies for a (world class) SOC, as per MITRE paper (see Must Read).
- Publish your RFC2350, declaring what your CERT is (see 'Nice to read' on the main page)