SOC/CSIRT management:

This page deals with SOC and CERT management.

Table of Contents

Must read
Challenges
SOC organization
CSIRT organization
TTP knowledge base reference
Data quality and management
Key documents for a SOC
Detection quality assessment
Global self-assessment
Reporting
To go further

Must read

Articles/recordings

FIRST, Building a SOC
NCSC, Building a SOC
FIRST, CERT-in-a-box
FIRST, CSIRT Services Framework
ENISA, Good practice for incident management
CIS, 8 critical security controls
CMM, SOC-CMM
LinkedIn Pulse, Evolution Security Operations Center
Gartner, Cybersecurity business value benchmark
LogRhythm, 7 metrics to measure the effectiveness of your SOC
Google, Modernize your SOC for the future
Signalblur, Getting started with ATT&CK heatmaps
TheHackerNews, NIST CSF v2
FIRST, ISO 27035 Practical value for CSIRT and SOCs
Infoblox, NIS2 & NCSC CAF

Challenges

Generic ones

As per the aforementioned article, here are some typical challenges for a SOC/CSIRT:

After pandemic

As per the aforementioned article, I recommend keeping the following common challenges in mind:

SOC organization

Tiering or not tiering?

No real need for tiering (L1/L2/L3)
- this is an old model designed for service providers, not necessarily for a SOC!
- as per MITRE paper (p65):
In this book, the constructs of “tier 1” and “tier 2+” are sometimes used to describe analysts who are primarily responsible for front-line alert triage and in-depth investigation/analysis/ response, respectively. However, not all SOCs are arranged in this manner. In fact, some readers of this book are probably very turned off by the idea of tiering at all [38]. Some industry experts have outright called tier 1 as “dead” [39]. Once again, every SOC is different, and practitioners can sometimes be divided on the best way to structure operations. SOCs which do not organize in tiers may opt for an organizational structure more based on function. Many SOCs that have more than a dozen analysts find it necessary and appropriate to tier analysis in response to these goals and operational demands. Others do not and yet still succeed, both in terms of tradecraft maturity and repeatability in operations. Either arrangement can succeed if by observing the following tips that foreshadow a longer conversation about finding and nurturing staff in “Strategy 4: Hire AND Grow Quality Staff.”

Highly effective SOCs enable their staff to reach outside their assigned duties on a routine basis, regardless of whether they use “tier” to describe their structure.

SOC teams

Instead of tiering, experience suggests organizing the SOC into 3 distinct teams:
- security monitoring team (which actually does the "job" of detecting security incidents, and is fully autonomous)
- security monitoring engineering team (which fixes/improves security monitoring, such as SIEM rules and SOA playbooks, generates reports, and helps with handling uncommon use cases)
- build / project management team (which handles tool integration, SIEM data ingestion, specific DevOps tasks, and project management).

SOC shifts for 24x7

There is a huge difference between "on-call" and "24x7":
- an "on-call" service is supposed to handle pre-validated types of alerts, with maximum severity and urgency.
- a "24x7" service is supposed to provide the same quality of service, whatever the time and date (nights, weekends, holidays).
Here is an example of team shifts that truly achieve "24x7":

Source: LinkedIn article

RACI

Define a RACI matrix, especially if you contract with an MSSP.
- You may want to consider my own template

CSIRT organization

Designate among team analysts:
- triage officer;
- incident handler;
- incident manager;
- deputy CERT manager.
Generally speaking, follow the best practices described in ENISA's "Good practice for incident management" (see "Must read").

TTP (attack methods) knowledge base reference

Use MITRE ATT&CK
Document all detections (SIEM Rules, etc.) using MITRE ATT&CK ID, whenever possible.
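As an illustration of documenting detections with ATT&CK IDs, here is a minimal sketch in Python. The rule names and their mappings are hypothetical examples, not an actual rule base; only the ID format (`T` followed by 4 digits, with an optional 3-digit sub-technique suffix) comes from ATT&CK itself. The sketch checks that IDs are well formed, lists rules that have no mapping yet, and groups rules per parent technique.

```python
import re
from collections import defaultdict

# ATT&CK technique IDs look like T1059, or T1059.001 for a sub-technique.
ATTACK_ID = re.compile(r"^T\d{4}(\.\d{3})?$")

# Hypothetical rule inventory: names and mappings are illustrative only.
RULES = [
    {"name": "powershell_encoded_command", "attack_ids": ["T1059.001"]},
    {"name": "new_local_admin_account", "attack_ids": ["T1136.001", "T1098"]},
    {"name": "legacy_rule_without_mapping", "attack_ids": []},
]


def invalid_ids(rules):
    """Return (rule name, ID) pairs whose ID does not match the ATT&CK format."""
    return [
        (rule["name"], attack_id)
        for rule in rules
        for attack_id in rule["attack_ids"]
        if not ATTACK_ID.match(attack_id)
    ]


def unmapped_rules(rules):
    """Return the names of the rules that carry no ATT&CK ID at all."""
    return [rule["name"] for rule in rules if not rule["attack_ids"]]


def coverage_by_technique(rules):
    """Group rule names by parent technique (sub-technique suffix dropped)."""
    coverage = defaultdict(list)
    for rule in rules:
        for attack_id in rule["attack_ids"]:
            coverage[attack_id.split(".")[0]].append(rule["name"])
    return dict(coverage)


if __name__ == "__main__":
    print("Invalid IDs:", invalid_ids(RULES))
    print("Unmapped rules:", unmapped_rules(RULES))
    print("Coverage:", coverage_by_technique(RULES))
```

A check like this could run whenever the rule base changes, so that unmapped rules are spotted early.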

Data quality and management

Implement an information model, like the Splunk CIM one:
- do not hesitate to extend it, depending on your needs
- make sure this data model is implemented in the SIEM, SIRP, SOA and even TIP.
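To make the idea of a shared information model more concrete, here is a small Python sketch that renames vendor-specific fields to common ones. The target names `src`, `dest`, `user` and `action` follow Splunk CIM naming; the source types and the vendor field names are assumptions made up for this example, and a real mapping would be driven by your own log sources.

```python
# Hypothetical per-source mappings from vendor field names to CIM-style names.
FIELD_MAPS = {
    "vendor_a_firewall": {
        "source_ip": "src",
        "destination_ip": "dest",
        "verdict": "action",
    },
    "vendor_b_proxy": {
        "c_ip": "src",
        "cs_host": "dest",
        "cs_username": "user",
    },
}


def normalize(source_type, event):
    """Rename known vendor fields to the common model; keep unknown fields as-is."""
    mapping = FIELD_MAPS[source_type]
    normalized = {mapping.get(key, key): value for key, value in event.items()}
    # Keep track of where the event came from.
    normalized["sourcetype"] = source_type
    return normalized


if __name__ == "__main__":
    raw = {"source_ip": "10.0.0.1", "destination_ip": "10.0.0.2", "verdict": "blocked"}
    print(normalize("vendor_a_firewall", raw))
```

Once every tool (SIEM, SIRP, SOA, TIP) uses the same field names, a single search or playbook can work across log sources.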

Key documents for a SOC

Document an audit policy that is tailored to the detection needs/expectations of the SOC:
- the document aims to answer a generic question: what to audit/log, and on which equipment/OSes/services/apps?
- Take the Yamato Security work as an example of the audit policy required for the Sigma community rules.
- Don't forget to read the Microsoft Windows 10 and Windows Server 2016 security auditing and monitoring reference.
Document a detection strategy, tailored to the needs and expectations regarding the SOC capabilities.
- The document should list the detection rules (SIEM searches, for instance), with key examples of results and an overview of handling procedures.
Document and keep up to date a detection matrix, which represents the detection capabilities for designated (feared) events, as per the known capabilities of the security sensors.
- You may want to have a look at my detection matrix template.
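For readers who prefer to keep the detection matrix as data, here is a minimal Python sketch. It is not the template mentioned above: the feared events, the sensor names and the three status values (`confirmed`, `theoretical`, `none`) are all assumptions chosen for illustration.

```python
# Hypothetical detection matrix: feared event -> sensor -> detection status.
MATRIX = {
    "ransomware_execution": {"EDR": "confirmed", "SIEM": "theoretical", "NDR": "none"},
    "data_exfiltration": {"EDR": "none", "SIEM": "theoretical", "NDR": "confirmed"},
    "privileged_account_abuse": {"EDR": "none", "SIEM": "none", "NDR": "none"},
}


def blind_spots(matrix):
    """Return feared events for which no sensor offers any detection."""
    return [
        event
        for event, sensors in matrix.items()
        if all(status == "none" for status in sensors.values())
    ]


def render(matrix):
    """Render the matrix as pipe-separated text, with sensors sorted by name."""
    sensors = sorted({sensor for row in matrix.values() for sensor in row})
    lines = ["feared event | " + " | ".join(sensors)]
    for event, row in matrix.items():
        cells = [row.get(sensor, "none") for sensor in sensors]
        lines.append(event + " | " + " | ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(MATRIX))
    print("Blind spots:", blind_spots(MATRIX))
```

Keeping the matrix in a structured form makes it easier to update after each purple teaming session and to list the feared events that no sensor covers.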

Detection quality assessment

Run purple teaming sessions regularly over time!
- e.g.: Intrinsec, FireEye
- To do it on your own, here are a few recommended frameworks/tools:
  - Frameworks:
    - RedCanary Atomic Red Team
    - CTID
  - Tools:
    - Ytisf's zoo
    - Abuse.ch Malware Bazaar
    - KnowBe4 ransomware simulator
- NB: don't forget to watermark the offensive tools being used! See ProtectMyTooling
Map the detection capabilities confirmed so far through purple teaming, using ATT&CK-based tools:
- e.g.: Vectr

Detection capabilities representation

Standard for security technologies

Use Security Stack Mappings to picture detection capabilities for a given security solution/environment (like AWS, Azure, NDR, etc.):

SOC detection capabilities simplified view

Leverage the DeTTECT framework
Leverage the RE&CT framework to drive detection activities

Response capabilities representation

Response simplified view

Leverage the RE&CT framework to drive generic and fundamental containment actions.

Global self-assessment

SOC Self-assessment

Read the SOC Cyber maturity model from CMM
Run the SOC-CMM self-assessment tool

CERT/CSIRT self-assessment

Read the OpenCSIRT cybersecurity maturity framework from ENISA
- Run the OpenCSIRT, SIM3 self-assessment
Read the SOC-CMM 4CERT from CMM
- Run the SOC-CMM 4CERT self-assessment tool

Reporting

Generate metrics, leveraging the SIRP traceability and logging capabilities to get relevant data, as well as a bit of scripting.

As per Gartner, MTTR:

And MTTC:

Below are my recommendations for KPI and SLA. Unless otherwise specified, the recommended timeframes for computing these KPI are: 1 week, 1 month, and 6 months.

SOC/CSIRT KPI:

Number of alerts (SIEM).
Number of verified alerts (meaning, confirmed security incidents).
Top security incident types.
Top applications associated with alerts (detections).
Top detection rules triggering most false positives.
Top detection rules whose corresponding alerts take the longest to be handled.
Top 10 SIEM searches (i.e. detection rules) triggering false positives.
Most frequently detected TTP.
Most common incident types.
Top 10 longest tickets before closure.
Percentage of SIEM data that is not associated with SIEM searches (i.e. detection rules).
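As an example of the "bit of scripting" involved, here is a short Python sketch computing two of the KPI above from an alert export. The record layout (`rule` and `verdict` fields) and the verdict values are assumptions; a real SIRP export will have its own schema.

```python
from collections import Counter

# Hypothetical export of closed alerts from the SIRP.
ALERTS = [
    {"rule": "rule_a", "verdict": "false_positive"},
    {"rule": "rule_a", "verdict": "false_positive"},
    {"rule": "rule_a", "verdict": "false_positive"},
    {"rule": "rule_b", "verdict": "false_positive"},
    {"rule": "rule_b", "verdict": "true_positive"},
    {"rule": "rule_c", "verdict": "true_positive"},
]


def top_false_positive_rules(alerts, n=10):
    """Return the n rules with the most false positives, as (rule, count) pairs."""
    counts = Counter(a["rule"] for a in alerts if a["verdict"] == "false_positive")
    return counts.most_common(n)


def verified_ratio(alerts):
    """Return the share of alerts confirmed as security incidents (0.0 if no alerts)."""
    if not alerts:
        return 0.0
    verified = sum(1 for a in alerts if a["verdict"] == "true_positive")
    return verified / len(alerts)


if __name__ == "__main__":
    print(top_false_positive_rules(ALERTS))
    print(f"Verified alerts: {verified_ratio(ALERTS):.0%}")
```

The same pattern (filter, count, sort) covers most of the "Top N" KPI in this list.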

Compliance KPI:

Percentage of known endpoints with company-required security solutions.
Percentage of critical and high-risk applications that are protected by multifactor authentication.
Ratio of always-on personal privileged accounts to the number of individuals in roles who should have access to these accounts.
Percentage of employees and contractors that have completed mandatory security training.
Percentage of employees who report suspicious emails for the standard organization-wide phishing campaigns.
Percentage of click-throughs for the organization-wide phishing campaigns in the past 12 months.
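The percentage-based compliance KPI can be computed the same way. Below is a minimal Python sketch for the first one (endpoints with company-required security solutions). The inventory layout, the endpoint names and the list of required solutions are assumptions for illustration only.

```python
# Hypothetical set of solutions the company requires on every endpoint.
REQUIRED = {"edr", "disk_encryption"}

# Hypothetical inventory: endpoint name -> solutions actually installed.
ENDPOINTS = {
    "laptop-01": {"edr", "disk_encryption", "vpn"},
    "laptop-02": {"edr"},
    "server-01": {"edr", "disk_encryption"},
    "server-02": set(),
}


def compliant_percentage(endpoints, required):
    """Return the percentage of endpoints that have all the required solutions."""
    if not endpoints:
        return 0.0
    # `required <= installed` is True when required is a subset of installed.
    compliant = sum(1 for installed in endpoints.values() if required <= installed)
    return 100.0 * compliant / len(endpoints)


if __name__ == "__main__":
    print(f"{compliant_percentage(ENDPOINTS, REQUIRED):.1f}% of known endpoints are compliant")
```

Note that the KPI only covers known endpoints, so its value depends on how complete the asset inventory is.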

SOC/CSIRT SLA:

Number of false positives.
Number of new detection use-cases (SIEM rules) being put in production.
Number of new detection automation use-cases (enrichment, etc.) being put in production.
Number of new response automation use-cases (containment, eradication) being put in production.
Number of detection rules whose detection capability and handling process have been confirmed in a purple teaming session, so far.
MTTH: for all incidents, mean time in hours to handle (assign) the alerts.
MTTT: for all incidents, mean time in hours to triage ("verify") the alerts.
MTTC: for critical and medium security incidents, mean time in hours to handle the alerts and start mitigation steps (from triage to initial response).
MTTR: for critical and medium security incidents, mean time in hours to handle the alerts and remediate them (from triage to remediation).
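The four time-based SLA above can be sketched in Python as follows. The ticket fields (`created`, `assigned`, `triaged`, `contained`, `remediated`) are hypothetical names for timestamps that a SIRP would typically log. The sketch also assumes that MTTH and MTTT are measured from ticket creation, which the definitions above do not state explicitly.

```python
from datetime import datetime
from statistics import mean

# Hypothetical ticket export; timestamps are ISO 8601 strings, None when not reached.
TICKETS = [
    {"severity": "critical", "created": "2024-01-01T08:00:00",
     "assigned": "2024-01-01T08:30:00", "triaged": "2024-01-01T09:00:00",
     "contained": "2024-01-01T11:00:00", "remediated": "2024-01-02T09:00:00"},
    {"severity": "low", "created": "2024-01-01T10:00:00",
     "assigned": "2024-01-01T11:30:00", "triaged": "2024-01-01T13:00:00",
     "contained": None, "remediated": None},
]


def hours_between(start, end):
    """Return the elapsed time in hours between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600


def mean_hours(tickets, start_field, end_field, severities=None):
    """Mean duration in hours, skipping tickets that lack either timestamp."""
    durations = [
        hours_between(t[start_field], t[end_field])
        for t in tickets
        if (severities is None or t["severity"] in severities)
        and t.get(start_field) and t.get(end_field)
    ]
    return mean(durations) if durations else None


if __name__ == "__main__":
    scoped = {"critical", "medium"}
    print("MTTH:", mean_hours(TICKETS, "created", "assigned"))
    print("MTTT:", mean_hours(TICKETS, "created", "triaged"))
    print("MTTC:", mean_hours(TICKETS, "triaged", "contained", scoped))
    print("MTTR:", mean_hours(TICKETS, "triaged", "remediated", scoped))
```

Tickets that have not reached a given step yet are left out of the corresponding mean, so open incidents do not skew the figures.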

Compliance SLA:

Percentage of critical assets that have successfully run ransomware recovery assessment, in the past 12 months.
Average number of hours from the request for termination of access to sensitive or high-risk systems or information, to deprovisioning of all access.

To go further

Priorities

Define SOC priorities, with feared events and offensive scenarios (TTP) to be monitored, as per risk analysis results.
- My recommendation: leverage EBIOS RM methodology (see Detection engineering).

Detections enhancements

Leverage machine learning wherever it is relevant, meaning wherever it yields a good ratio of false positives to true positives.
- My recommendations: be careful not to saturate SOC consoles with false positives, and don't forget to collect the context required to analyze (verify) the detection!

Leverage best practices:

Make sure to follow the 11 strategies for a (world class) SOC, as per MITRE paper (see Must Read).

Follow the security industry standards:

Publish your RFC2350, declaring what your CERT is (see 'Nice to read' on the main page)

End

Go to main page
