Difference between revisions of "EGI CSIRT:Security challenges"


Revision as of 21:23, 17 July 2017





Security challenges: what are they about?

The goals of the security drills are:

  • to investigate whether sufficient information is available to conduct an audit trace as part of an incident response, and to ensure that appropriate communication channels are available.
  • to assess the incident response capabilities of the involved security teams.
  • to evaluate the efficiency of the various incident response operations aiming at containment.
  • to trigger and improve the collaboration of the full incident response chain, involving security teams from the RCs, NGIs, EGI, VOs and CAs.

FedCloud-SSC Scenario: Vulnerable VM / Stolen Credentials

A common problem in the cloud environment is that users may choose insecure (default) configurations for services they install, or introduce other vulnerabilities, which then get exploited by automated attacks constantly targeting all systems connected to the internet.


Scenario: Stolen Credentials

A common problem in distributed environments is that user credentials get compromised, resulting in illicit usage of the resources.

This might happen in the course of brute-force attacks on weak passwords, through lost or stolen hardware, through phishing, or as a result of an earlier incident in which this data was harvested by the attacker. In addition, in the cloud environment we rather often see that users choose insecure (default) configurations for services they install, or introduce other vulnerabilities, which then get exploited by automated attacks constantly targeting all systems connected to the internet.

Stolen or brute-forced (ssh) credentials in distributed environments carry the additional risk that these incidents can spread rapidly, affecting multiple resource centres in multiple countries. Therefore proper access management is crucial in incident response. In EGI, access to the resources is usually controlled based on X.509 certificates.

X.509 access management can happen at different levels; each action has a certain delay until it takes effect and a certain scope.

  • Resource Centre / service level: takes effect immediately; bans the user at that RC/service only.
  • Suspending the DN at VOMS: can take up to 1 week; already issued VOMS proxies remain valid, but no new proxies will be issued. The scope is VO-wide; the certificate could still be used within other VOs.
  • CA revokes the certificate: takes effect when the new CRLs are loaded by the services, up to 48 hours; the scope is global. The certificate will not be accepted at any service.
  • The FedCloud user management may not be fully integrated in the central suspension and therefore requires some manual intervention by the RC admins to make sure that the DN in question cannot access the interfaces to start/stop/delete VMs.
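
The delays and scopes listed above can be written down as a small lookup table, which makes it easy to see which action takes effect first. This is a minimal sketch only: the names `BanAction`, `ACTIONS` and `by_speed` are illustrative and not part of any EGI tool, and the delays are the worst-case values quoted in the list. The manual FedCloud intervention is not modelled, since no delay is given for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BanAction:
    level: str            # who performs the action
    max_delay_hours: int  # worst-case delay quoted in the list above
    scope: str            # where the ban applies once effective


# Worst-case figures taken from the list above; all names are illustrative.
ACTIONS = [
    BanAction("CA revokes certificate", 48, "global"),
    BanAction("VOMS suspends DN", 7 * 24, "VO-wide"),
    BanAction("RC/service bans user", 0, "single RC/service"),
]


def by_speed(actions):
    """Return the actions ordered from fastest to slowest to take effect."""
    return sorted(actions, key=lambda a: a.max_delay_hours)


for action in by_speed(ACTIONS):
    print(f"{action.level}: up to {action.max_delay_hours} h, scope: {action.scope}")
```

Sorting by worst-case delay puts the RC/service-level ban first, which is why the local suspension is the step that sites are trained on.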

Since suspending at RC or service level is effective immediately, it is crucial that the RC security teams, as well as the VO security teams managing the access to their resources, are trained to suspend a reported malicious certificate DN on all of their systems, to stop all running processes related to that DN, and to trace back an IP/VM to the controlling DN.

At the same time the state of the VM in question should be preserved for later investigations and further access to it suspended.


Security challenges: what is expected from sites?

What is important to bear in mind?

The sites contacted for a challenge are asked to follow the normal security incident response procedure, and react as if the incident was real, with the two following exceptions:

      1. No sanctions must be applied against the Virtual
         Organization (VO) that was used to submit the job / start the VM.
         In case of of 

      2. All "multi-destination" alerts must be addressed to
         the e-mail list which has been designated for the test:

              fedcloud-ssc(at)egi.eu

         DO NOT use:
                     abuse(at)egi.eu

         for Security Service Challenges. Instead, insert the
         originally intended "multi-destination" address(es) in
         the body of your message.

Information to be gathered at the sites

For an initial response and first directions, answers to the following questions might be useful.

  • NETWORK:
    - Are there any other suspicious connections open? If so, to which IPs?
    - Is network monitoring data (e.g. netflows) available?
  • CONTAINMENT:
    - Does the process belong to a batch job or an interactive login?
    - From where was the login/job submission done?
    - In case it is a Grid job, the following questions are important:
       - To which VO is the user/certificate affiliated?
       - Which grid certificates (DNs) are involved in this test incident?
         # Example: DN-1: "CN=John Doe, O=<SomeInstitute>, O=<Something>, ..."
    - Since when have the jobs been running?
      # Example: YYYY:MM:DD hh:mm
      Date:
    - Trace back the job to the originating UI or WMS.
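
The questions above can be collected in a simple record so that it is obvious which answers are still missing before the reply is sent. The following is a hedged sketch, not a format required by EGI-CSIRT: the class `InitialReport`, its field names and the placeholder value `"example-vo"` are assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class InitialReport:
    """Answers to the questions above; all field names are illustrative."""
    suspicious_connections: Optional[List[str]] = None  # remote IPs, [] if none
    netflow_available: Optional[bool] = None
    batch_or_interactive: Optional[str] = None          # "batch" or "interactive"
    submission_origin: Optional[str] = None             # where the login/job came from
    vo: Optional[str] = None
    involved_dns: Optional[List[str]] = None
    running_since: Optional[str] = None                 # e.g. "YYYY:MM:DD hh:mm"
    originating_ui_or_wms: Optional[str] = None

    def unanswered(self) -> List[str]:
        """Names of the questions that have not been answered yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# "example-vo" is a placeholder, not a real VO name.
report = InitialReport(vo="example-vo", netflow_available=True)
print(report.unanswered())
```

An empty list counts as an answer (for example, no suspicious connections were found), whereas `None` means the question has not been looked at yet.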

The sites should provide the security teams with this information as soon as possible, and at the latest within one working day. The time needed to pass this information to EGI-CSIRT by replying to the alarm mail will be measured and evaluated. Replying to the alarm mail will automatically use the RTIR ticketing system.

What is the normal security incident response procedure?

This exercise will also test the current Incident Response Procedure, and here in particular step 5, which covers the aspects of incident response in a cloud environment.

Please try to follow this procedure where possible, and note and report any problems with it.

      PLEASE REMEMBER THAT FOR THE CHALLENGE
           THE PROCEDURE IS APPLIED WITH RESTRICTIONS
           STATED IN THE PREVIOUS SECTION
           In case of doubt please contact: ssc-fedcloud(at)egi.eu

More information about EGI security procedures (flowchart, formal document, forensic howto, ...) can be found here: https://wiki.egi.eu/wiki/EGI_CSIRT:Policies

Please also visit our Forensic Howto wiki pages. If you want to contribute, just send your input to egi-csirt-team(at)mailman.egi.eu.

Evaluation - Report generation

We distinguish between

1) Measurable per-site operations (with target times):

  1. initial feedback: 4h
  2. find malicious jobs/processes and stop them: 4h
  3. ban problematic certificate: 8h
  4. contain the malicious binary and send it to the incident coordinator: 24h

These will be measured by the ssc-monitor, and the points the sites get are calculated according to the formula stated on the wiki page. Times are relative to the alarm sent to the site; we try to make sure that the alarms are sent during office hours (09:00 - 18:00, local time). The target times might change; the final values will be given on the wiki page.
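
As an illustration of how the target times above relate to the alarm time, the sketch below checks each operation against its target. It does not reproduce the ssc-monitor or the scoring formula, which is not given here; the names `TARGETS` and `check_targets`, the operation keys and the example timestamps are assumptions made for this example.

```python
from datetime import datetime, timedelta

# Target times from the list above, relative to the alarm sent to the site.
TARGETS = {
    "initial_feedback": timedelta(hours=4),
    "stop_malicious_processes": timedelta(hours=4),
    "ban_certificate": timedelta(hours=8),
    "contain_and_send_binary": timedelta(hours=24),
}


def check_targets(alarm_time, completed):
    """Map each operation to True/False (target met) or None (not done yet)."""
    result = {}
    for name, target in TARGETS.items():
        done_at = completed.get(name)
        if done_at is None:
            result[name] = None
        else:
            result[name] = (done_at - alarm_time) <= target
    return result


alarm = datetime(2017, 7, 17, 9, 0)  # example timestamp only
status = check_targets(alarm, {
    "initial_feedback": alarm + timedelta(hours=1),
    "ban_certificate": alarm + timedelta(hours=9),
})
print(status)
```

Here the initial feedback meets its 4h target, the certificate ban misses its 8h target by one hour, and the two remaining operations are reported as not yet done.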

2) Collaborative investigations: Since we want to achieve cross-site communication, and possibly collaboration on the "malware" forensics, the evaluation scheme has changed accordingly. I.e. network forensics are needed, but we don't measure this, since due to the overall SSC set-up most of this information should already be available to the "more western" sites relative to the initially alarmed sites.

Banning or unbanning the pilot-job-submitter DN is based on local policies. It will not be measured, but a statement on the decision whether or not to ban/unban the pilot-job-submitter is expected.

Security challenge: how is it operated?

Participating sites

Currently the following sites can be used for the SSC:

# Format: GOC-Name NGI-NAME (VO=FedCloud)
BEgrid-BELNET NGI-NL
CESNET-MetaCloud NGI-CZ
CYFRONET-CLOUD NGI-PL
IISAS-FedCloud NGI-SK
IN2P3-IRES NGI-FRANCE
INFN-CATANIA-STACK NGI-IT 
INFN-PADOVA-STACK NGI-IT 
TR-FC1-ULAKBIM NGI-TR
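
For scripting around the challenge, the two-column list above can be parsed into a mapping from NGI to sites. This is only a sketch under the assumption that every non-comment line holds exactly a GOC name and an NGI name separated by whitespace; `SITE_LIST` and `sites_by_ngi` are names chosen for this example.

```python
from collections import defaultdict

# Copy of the list above; lines starting with "#" are treated as comments.
SITE_LIST = """\
# Format: GOC-Name NGI-NAME (VO=FedCloud)
BEgrid-BELNET NGI-NL
CESNET-MetaCloud NGI-CZ
CYFRONET-CLOUD NGI-PL
IISAS-FedCloud NGI-SK
IN2P3-IRES NGI-FRANCE
INFN-CATANIA-STACK NGI-IT
INFN-PADOVA-STACK NGI-IT
TR-FC1-ULAKBIM NGI-TR
"""


def sites_by_ngi(text):
    """Group GOC site names by NGI, skipping blank and comment lines."""
    groups = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        goc_name, ngi = line.split()
        groups[ngi].append(goc_name)
    return dict(groups)


for ngi, sites in sorted(sites_by_ngi(SITE_LIST).items()):
    print(ngi, ", ".join(sites))
```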

Tools

A framework has been developed to automate the operation of EGI security challenges.

The May 2011 release contains the PanDA framework for job submission and a prototype of the new EGI-CSIRT ticketing system based on RTIR.

The test malware is not intrusive; it does not try to gain elevated privileges.

More information about the framework is given at security drills framework.

Post processing, clean up

As part of the incident handling, Grid authorizations may have been withdrawn from the DN that was used to submit the job. When the incident response procedure is complete, the test operator will explicitly request restoration of any such authorizations to their original state.

De-briefing

When the challenge has been completed on a representative number of sites, the test operator will ask for de-briefing input from the participating sites. The material submitted will be used to compile a report. The report will be circulated to the contributors for comments before being presented to the EGI-CSIRT.

SSC5 Feedback

Moved to EGI CSIRT private wiki: https://wiki.egi.eu/csirt/index.php/SSC5_Feedback