The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed you can get in touch with EGI SDIS team using operations @ egi.eu.

Difference between revisions of "EGI CSIRT:Security challenges"

Revision as of 01:21, 9 July 2019





Security challenges: what are they about?

The goals of the security drills are:

  • to investigate whether sufficient information is available to conduct an audit trace as part of an incident response, and to ensure that appropriate communication channels are available.
  • to assess the incident response capabilities of the security teams involved.
  • to evaluate the efficiency of the various incident response operations aimed at containment.
  • to trigger and improve the collaboration of the full incident response chain, involving security teams from the RCs, NGIs, EGI, VOs and CAs.


Scenario: Stolen Credentials

A common problem in distributed environments is that user credentials get compromised, resulting in illicit usage of resources.

This might happen as a result of brute force attacks on weak passwords, lost/stolen hardware, phishing, or an earlier incident in which this data was harvested by the attacker. In addition, in the Cloud environment, we often see that users choose insecure (default) configurations for the services they install, or introduce other vulnerabilities, which are then quickly exploited by automated attacks constantly targeting all systems connected to the internet.

Stolen or brute-forced (ssh) credentials in distributed environments carry the additional risk that such incidents can spread rapidly, affecting multiple resource centres in multiple countries. Therefore proper access management is crucial in incident response. In the EGI infrastructure, access to resources is usually controlled based on X.509 certificates.

X.509 access management can happen at different levels. Each action has a certain delay until it takes effect, and a certain scope.

  • Resource Center / Service level: takes effect immediately; bans the user at the RC/Service.
  • Suspend the DN at VOMS: takes up to 1 week; already issued voms-proxies remain valid, but no new proxies will be issued. The scope is VO wide; the certificate could still be used within other VOs.
  • CA revokes the certificate: takes effect globally when the new CRLs are loaded by the services, up to 48 hours. The certificate will then not be accepted at any service.
  • The FedCloud user management may not be fully integrated in the central suspension and therefore requires some manual intervention by the RC admins to make sure that the DN in question cannot access the interfaces to start/stop/delete VMs.
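
The levels listed above can be summarised in a small lookup table, for example to see which action contains an incident fastest. The Python sketch below is only an illustration: the SuspensionLevel class and its field names are hypothetical, and the delays are the upper bounds quoted in the list, not measured values. The FedCloud case is left out because the text gives no delay for it.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SuspensionLevel:
    """One place where access for a DN can be cut off (hypothetical model)."""
    name: str
    max_delay: timedelta  # upper bound until the action takes effect
    scope: str


# Values taken from the list above; "immediately" is modelled as zero delay.
LEVELS = [
    SuspensionLevel("RC/service ban", timedelta(0), "single RC or service"),
    SuspensionLevel("VOMS suspension", timedelta(weeks=1), "one VO"),
    SuspensionLevel("CA revocation", timedelta(hours=48), "global"),
]


def fastest_level(levels):
    """Return the level with the smallest worst-case delay."""
    return min(levels, key=lambda level: level.max_delay)


def levels_by_delay(levels):
    """Return the level names ordered from fastest to slowest."""
    return [lvl.name for lvl in sorted(levels, key=lambda lvl: lvl.max_delay)]
```

Sorting by worst-case delay makes explicit why the RC/service ban is the first containment step even though its scope is the narrowest.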

Since suspension at the RC service level is effective immediately, it is crucial that the RC security teams, as well as the VO security teams, who manage access to their resources are trained to suspend a reported malicious certificate DN on all of their systems, to stop all running processes related to that DN, and to trace an IP/VM back to the controlling DN.

At the same time, the state of the VM in question should be preserved for later investigation, and further access to it should be suspended.

Security challenges: what is expected from sites?

Rules

The sites contacted for a challenge are asked to follow the normal security incident response procedure, and to react as if the incident were real, with the following two exceptions:

      1. No sanctions must be applied against the Virtual
         Organization (VO) that was used to submit the job / start the VM.

      2. All "multi-destination" alerts must be addressed to
         the e-mail list which has been designated for the test:

                     abuse(at)egi.eu

         Do not send them to the real multi-destination lists
         during Security Service Challenges. Instead, insert the
         originally intended "multi-destination" address(es) in
         the body of your message.
         Make sure to have the string:

                     [SSC]

         in the subject of the message.
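
The two mail conventions above (the [SSC] marker in the subject, and the originally intended addresses moved into the body) could be applied by a small helper before a message is sent. This is a minimal sketch under stated assumptions: the function names and the wording of the header line are hypothetical, and no mail is actually sent.

```python
SSC_TAG = "[SSC]"


def tag_subject(subject: str) -> str:
    """Return the subject with the [SSC] marker, adding it if it is missing."""
    if SSC_TAG in subject:
        return subject
    return f"{SSC_TAG} {subject}"


def build_body(text: str, intended_recipients: list) -> str:
    """Put the originally intended multi-destination addresses into the body."""
    header = "Originally intended recipients: " + ", ".join(intended_recipients)
    return header + "\n\n" + text
```

For example, tag_subject("Suspicious VM at our site") yields a subject starting with "[SSC]", while a subject that already carries the marker is returned unchanged.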

Scope of the SSC / Information to be gathered at the sites

In this challenge the following basic Incident Response activities will be evaluated:

  • Communications: Provide timely information to be used in Incident Response
  • Containment:
    • Suspend DNs from accessing, starting, deleting a VM
    • Snapshot a live VM associated with a reported IP, including its memory
  • Traceability:
    • IP based: given a timestamp and an IP, find the DN using a VM under the IP in question.
    • DN based: given a DN, find the IPs associated with VMs running under the DN in question.
  • Forensics
    • Network connections of IP in time range X
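
The two traceability lookups can be illustrated on a list of VM records. The sketch below assumes a hypothetical VmRecord structure with a DN, an IP and a start/end time; real sites would take this data from their cloud middleware or accounting logs, whose format is not described here. The DNs and IPs are made-up examples.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class VmRecord:
    """Hypothetical accounting record: who ran a VM on which IP, and when."""
    dn: str
    ip: str
    start: datetime
    end: Optional[datetime]  # None means the VM is still running


def dns_for_ip(records, ip, when):
    """IP based: DNs whose VM held `ip` at time `when`."""
    return sorted({
        r.dn for r in records
        if r.ip == ip and r.start <= when and (r.end is None or when <= r.end)
    })


def ips_for_dn(records, dn):
    """DN based: all IPs of VMs recorded under `dn`."""
    return sorted({r.ip for r in records if r.dn == dn})


RECORDS = [
    VmRecord("CN=John Doe,O=Example", "192.0.2.10",
             datetime(2019, 3, 1, 8, 0), datetime(2019, 3, 2, 8, 0)),
    VmRecord("CN=Jane Roe,O=Example", "192.0.2.10",
             datetime(2019, 3, 3, 9, 0), None),
    VmRecord("CN=John Doe,O=Example", "192.0.2.11",
             datetime(2019, 3, 1, 8, 0), None),
]
```

The timestamp matters for the IP based lookup because the same IP can be reused by VMs of different users over time, as the first two records show.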

For an initial response and first directions try to find answers to the following questions

  • NETWORK:
    • Are there any other suspicious connections open to/from a reported IP or jobs running under a reported DN? If so, to which IPs?
    • What are the DNs associated with the reported IP?
  • CONTAINMENT:
    • If possible suspend
    • From where (IPs) were the jobs submitted?
    • From where (IPs) did l.
    • To which VO is the user/certificate affiliated?
    • Which grid certificates (DNs) are involved in this test incident?
      # Example: DN-1: CN=John Doe, O=<SomeInstitute>,O=<Something>, ...
    • Since when has the VM been running?
      # Example: YYYY:MM:DD hh:mm
      Date:

The sites should provide the security teams with this information as soon as possible, and at the latest within one working day. The time needed to pass this information to EGI-CSIRT by replying to the alarm mail will be measured and evaluated.

What is the normal security incident response procedure?

This exercise will also test the current Incident Response Procedure, and here in particular step 5, which covers the information collected for the coordinated incident response.

Please try to follow this procedure where possible, and note/report any problems with it.

      
           PLEASE REMEMBER THAT FOR THE CHALLENGE
           THE PROCEDURE IS APPLIED WITH RESTRICTIONS
           AS STATED IN THE PREVIOUS SECTION.
           For questions please contact: ssc(at)mailman.egi.eu

More information about EGI security procedures (flowchart, formal document, forensic howto, ...) can be found here: https://wiki.egi.eu/wiki/EGI_CSIRT:Policies

Please also visit our Forensic Howto wiki pages. If you want to contribute, just send your input to irtf(at)mailman.egi.eu.

Scores, Evaluation - Report generation

We distinguish between

1) Measurable per-site operations (with target times):

  1. initial feedback: 4h
  2. find malicious jobs/processes and stop them: 4h
  3. ban the problematic certificate: 4h
  4. contain the malicious binary and send it to the incident coordinator: 24h

These will be measured by the ssc-monitor, and the scores the sites get are calculated according to the formula stated on the wiki page. Times are relative to the timestamp in the alarm ticket sent to the site. We try to make sure that the alarms are sent during office hours (09:00 - 18:00, local time).
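
Checking the target times above against the alarm timestamp is simple date arithmetic. The sketch below does not reproduce the scoring formula (it is not given in this text) and is not the ssc-monitor; it only reports which operations missed their target. The operation keys and function names are hypothetical.

```python
from datetime import datetime, timedelta

# Target times from the list above, keyed by hypothetical operation names.
TARGETS = {
    "initial_feedback": timedelta(hours=4),
    "stop_malicious_processes": timedelta(hours=4),
    "ban_certificate": timedelta(hours=4),
    "send_binary": timedelta(hours=24),
}


def within_target(operation, alarm_time, done_time):
    """True if the operation finished within its target time after the alarm."""
    return done_time - alarm_time <= TARGETS[operation]


def overdue(alarm_time, completions):
    """Names of operations that missed their target or were never reported."""
    return sorted(
        op for op in TARGETS
        if op not in completions
        or not within_target(op, alarm_time, completions[op])
    )
```

With an alarm at 10:00, feedback at 11:30 is within target, a ban at 15:00 is one hour late, and an operation with no recorded completion time is treated as overdue.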

and 2) per-site forensic operations:

  1. All sites will receive "malware" with unique per-site artifacts such as UUIDs, URLs, IPs, etc. Finding them may require more advanced forensic operations, such as memory analysis. These findings should be reported to IRTF within the SSC incident ticket as soon as they are found.

The reported artifacts will be extracted from the transactions in the ticket. The available scores will decrease over time.

Post processing, clean up

As part of the incident handling, Grid authorizations may have been withdrawn from the DN that was used to submit the job. When the incident response procedure is complete, the test operator will explicitly request restoration of any such authorizations to their original state.

SSC Evaluation Form

[Figure: Lhcb score table template]

De-briefing

When the challenge has been completed on a representative number of Sites, the test operator will ask for de-briefing input from the participating Sites. Material submitted will be used to edit a report. The report will be circulated to the contributors for comments before being presented to the EGI-CSIRT.

Communication Template Debriefing

Dear all,
thank you for your contributions to the SSC-19.03.

This message is to inform you that the SSC-19.03 is now over. You should receive the site report within the next few days.


As a clean-up step, we now ask the challenged sites to restore any credentials that were banned, in particular:

   CN=Amelie Caillet, CN=700993, CN=acaillet, OU=Users, OU=Organic Units, DC=cern, DC=ch

and
 
   CN=Cindy Denis, CN=759002, CN=cidenis, OU=Users, OU=Organic Units, DC=cern, DC=ch


Remark: Not sure if it is worth also asking them to clean up the worker nodes where the payload ran. This should have happened automatically in the meantime; the sites where we saw long-living bots were informed to kill them (in May ;-))