Difference between revisions of "EGI CSIRT:Security challenges"
Revision as of 10:21, 5 July 2011
Security challenges: what are they about?
The goal of security drills is to investigate whether sufficient information is available to conduct an audit trace as part of an incident response, and to ensure that appropriate communication channels are available.
EGI-CSIRT works on this topic in two ways:
- Development of a drills framework. The drills are available to EGI sites to help them verify their security maturity.
- Challenges at EGI level. Wide-scale security challenge campaigns are organized; this contributes to security at project level.
For further information, contact ssc-monitor(at)zwaan.nikhefhousing.nl .
Security challenges: what is expected from sites?
What is important to bear in mind?
The sites contacted for a challenge are asked to follow the normal security incident response procedure and to react as if the incident were real, with the following two exceptions:
1. No sanctions must be applied against the Virtual Organization (VO) that was used to submit the job.
2. All "multi-destination" alerts must be addressed to the e-mail list which has been designated for the test: ssc-monitor(at)zwaan.nikhefhousing.nl
DO NOT use abuse(at)egi.eu for Security Service Challenges. Instead, insert the originally intended "multi-destination" address(es) in the body of your message.
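The addressing rule in exception 2 can be sketched in Python as follows. This is only an illustration: the function name and all addresses in the usage note are hypothetical placeholders, and the real list address is the one given above.

```python
from email.message import EmailMessage


def build_challenge_alert(sender, challenge_list, intended_recipients, subject, body):
    """Build an alert addressed to the challenge list only.

    The originally intended "multi-destination" addresses are not used as
    recipients; they are listed at the top of the message body instead.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = challenge_list          # the list designated for the test
    msg["Subject"] = subject
    note = "Originally intended recipients: " + ", ".join(intended_recipients)
    msg.set_content(note + "\n\n" + body)
    return msg
```

For example, `build_challenge_alert("security@site.example.org", "ssc-list@example.org", ["abuse@example.org"], "Test incident", "Details follow.")` produces a message whose only recipient is the challenge list, while the original destination appears in the body.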
Information to be gathered at the sites
For an initial response and first directions, answers to the following questions might be useful.
- NETWORK:
  - Are there any other suspicious connections open? If so, to which IPs?
  - Is network monitoring data (e.g. netflows) available?
- CONTAINMENT:
  - Does the process belong to a batch job or an interactive login?
  - From where was the login/job submission done?
  - In case it is a Grid job, the following questions are important:
    - To which VO is the user/certificate affiliated?
    - Which grid certificates (DN) are involved in this test incident?
      # Example: DN-1: CN=John Doe, O=<SomeInstitute>,O=<Something>, ...
    - Since when were the jobs running?
      # Example: YYYY:MM:DD hh:mm
      Date:
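A site could keep the answers to these questions in a small structure and render them as plain text for the reply. The following Python sketch is one possible way to do that; the class, field names and output layout are assumptions made for illustration and are not part of the EGI procedure or its templates.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InitialResponse:
    """Answers collected for a first reply (all field names are hypothetical)."""
    suspicious_connections: List[str] = field(default_factory=list)  # remote IPs
    netflow_available: Optional[bool] = None      # None means "not yet known"
    process_origin: str = "unknown"               # batch job or interactive login
    submission_host: str = "unknown"              # origin of the login/job submission
    vo: str = "unknown"                           # VO of the user/certificate
    dns: List[str] = field(default_factory=list)  # certificate DNs involved
    jobs_running_since: str = "unknown"           # e.g. "YYYY:MM:DD hh:mm"


def render(report: InitialResponse) -> str:
    """Format the collected answers as plain text for the reply mail."""
    def yes_no(value: Optional[bool]) -> str:
        return "unknown" if value is None else ("yes" if value else "no")

    connections = ", ".join(report.suspicious_connections) or "none found"
    lines = [
        "NETWORK:",
        "  Suspicious connections: " + connections,
        "  Network monitoring data available: " + yes_no(report.netflow_available),
        "CONTAINMENT:",
        "  Process belongs to: " + report.process_origin,
        "  Login/job submitted from: " + report.submission_host,
        "  VO: " + report.vo,
    ]
    for i, dn in enumerate(report.dns, start=1):
        lines.append(f"  DN-{i}: {dn}")
    lines.append("  Jobs running since: " + report.jobs_running_since)
    return "\n".join(lines)
```

Fields left at their defaults show up as "unknown", so a partially filled report can still be sent early and completed in a later update.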
The sites should provide the security teams with this information as soon as possible, and at the latest within one working day.
The time needed to pass this information to EGI-CSIRT by replying to the alarm mail will be measured and evaluated.
Replies to the alarm mail are automatically handled by the RTIR system.
What is the normal security incident response procedure?
The following is the site checklist for the normal incident response procedure.
PLEASE REMEMBER THAT FOR THE CHALLENGE THE PROCEDURE IS APPLIED WITH THE RESTRICTIONS STATED IN THE PREVIOUS SECTION. In case of doubt please contact: ssc-monitor(at)zwaan.nikhefhousing.nl
More information about EGI security procedures (flowchart, formal document, forensic howto, ...) can be found here: https://wiki.egi.eu/wiki/EGI_CSIRT:Policies
Please also visit our Forensic Howto wiki pages. If you want to contribute, just send your input to egi-csirt-team(at)mailman.egi.eu.
Evaluation - Report generation
We distinguish between
1) Measurable per-site operations (with target times):
- initial feedback: 4h
- find the malicious jobs/processes and stop them: 4h
- ban the problematic certificate: 8h
- contain the malicious binary and send it to the incident coordinator: 24h
These will be measured by the ssc-monitor, and the points the sites get are calculated according to the formula stated on the wiki page. Times are relative to the alarm sent to the site; we try to make sure that the alarms are sent during office hours (09:00 - 18:00, local time). The target times might change; the final values will be on the wiki page.
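As an illustration of how elapsed times relate to the targets, the following Python sketch compares the completion time of each operation with its target time, measured from the alarm. It does not compute points, since the scoring formula is not reproduced here. The operation labels and the function name are hypothetical.

```python
from datetime import datetime, timedelta

# Target times from the list of measurable operations; the keys are hypothetical labels.
TARGETS = {
    "initial_feedback": timedelta(hours=4),
    "jobs_stopped": timedelta(hours=4),
    "certificate_banned": timedelta(hours=8),
    "binary_sent": timedelta(hours=24),
}


def check_targets(alarm_time, completed):
    """Return {operation: (elapsed, met_target)} for each completed operation.

    `alarm_time` is the datetime of the alarm sent to the site.
    `completed` maps an operation label to the datetime it was finished.
    Operations that are not yet completed are left out of the result.
    """
    result = {}
    for operation, target in TARGETS.items():
        if operation in completed:
            elapsed = completed[operation] - alarm_time
            result[operation] = (elapsed, elapsed <= target)
    return result
```

With an alarm at 10:00, initial feedback at 12:30 falls within the 4h target, whereas a certificate ban at 19:00 on the same day (9h later) misses the 8h target.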
2) Collaborative investigations: Since we want to achieve cross-site communication, and possibly collaboration on the "malware" forensics, the evaluation schema has changed accordingly. That is, network forensics are needed, but we do not measure this: due to the overall SSC set-up, most of this information should already be available to the "more western" sites relative to the initially alarmed sites.
Ban/unban of the pilot-job-submitter DN is based on local policies. It will not be measured, but a statement on the decision whether or not to ban/unban the pilot-job submitter is expected.
Security challenge: how is it operated?
Participating sites
# Format (GOC-Name PANDA-Name NGI-NAME VO)
Taiwan-LCG2 ANALY_TAIWAN APAC atlas
Australia-ATLAS ANALY_AUSTRALIA APAC atlas
CA-SCINET-T2 ANALY_SCINET ROC-CA atlas
CA-VICTORIA-WESTGRID-T2 ANALY_VICTORIA-WG1 ROC-CA atlas
TRIUMF-LCG2 ANALY_TRIUMF ROC-CA atlas
BEIJING-LCG2 ANALY_BEIJING ROC-CA atlas
CERN-PROD ANALY_CERN CERN atlas
CYFRONET-LCG2 ANALY_CYF PL atlas
praguelcg2 ANALY_FZU CZ atlas
DESY-HH ANALY_DESY-HH DE atlas
FZK-LCG2 ANALY_FZK DE atlas
GoeGrid ANALY_GOEGRID DE atlas
HEPHY-UIBK ANALY_HEPHY-UIBK DE atlas
TUDresden-ZIH ANALY_DRESDEN DE atlas
UAM-LCG2 ANALY_UAM SPAIN atlas
pic ANALY_PIC SPAIN atlas
IFAE ANALY_IFAE SPAIN atlas
IFIC-LCG2 ANALY_IFIC SPAIN atlas
csTCDie ANALY_CSTCDIE IE atlas
IL-TAU-HEP ANALY_IL-TAU-HEP IL atlas
TECHNION-HEP ANALY_TECHNION-HEP IL atlas
WEIZMANN-LCG2 ANALY_WEIZMANN IL atlas
INFN-FRASCATI ANALY_INFN-FRASCATI Italy atlas
INFN-MILANO-ATLASC ANALY_INFN-MILANO-ATLASC Italy atlas
INFN-ROMA1 ANALY_INFN-ROMA1 Italy atlas
INFN-T1 ANALY_INFN-T1 Italy atlas
NIKHEF-ELPROD ANALY_NIKHEF-ELPROD NL atlas
SARA-MATRIX ANALY_SARA NL atlas
LIP-Coimbra ANALY_LIP-Coimbra P atlas
LIP-Lisbon ANALY_LIP-Lisbon P atlas
NCG-INGRID-PT ANALY_NCG-INGRID-PT P atlas
ITEP ANALY_ITEP RU atlas
JINR-LCG2 ANALY_JINR RU atlas
RRC-KI ANALY_RRC-KI RU atlas
RU-Protvino-IHEP ANALY_IHEP RU atlas
ru-PNPI ANALY_PNPI RU atlas
ARC-SITE-SI ARC-pikolit.ijs.si SI atlas
ARC-SITE-CH ARC-ce.lhep.unibe.ch CH atlas
ARC-SITE-liu-SE ARC-arc-ce.smokerings.nsc.liu.se SE atlas
ARC-SITE-umu-SE ARC-jeannedarc.hpc2n.umu.se SE atlas
UKI-SCOTGRID-GLASGOW ANALY_GLASGOW UK atlas
UKI-NORTHGRID-LANCS-HEP ANALY_LANCS UK atlas
UKI-SOUTHGRID-CAM-HEP ANALY_CAM UK atlas
IN2P3-LPSC ANALY_LPSC F atlas
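An NGI security officer who wants to see which of their sites take part could parse this list with a few lines of Python. The sketch below assumes the list is stored with one site per line and four whitespace-separated fields; the type and function names are hypothetical, and the sample contains only three rows taken from the list.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Site(NamedTuple):
    goc_name: str
    panda_name: str
    ngi: str
    vo: str


def parse_sites(text: str) -> List[Site]:
    """Parse 'GOC-Name PANDA-Name NGI-NAME VO' rows, one site per line."""
    sites = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '# Format' header
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        sites.append(Site(*fields))
    return sites


def by_ngi(sites: List[Site]) -> Dict[str, List[str]]:
    """Group GOC site names by NGI, keeping the order of the list."""
    groups = defaultdict(list)
    for site in sites:
        groups[site.ngi].append(site.goc_name)
    return dict(groups)


SAMPLE = """\
# Format (GOC-Name PANDA-Name NGI-NAME VO)
DESY-HH ANALY_DESY-HH DE atlas
FZK-LCG2 ANALY_FZK DE atlas
SARA-MATRIX ANALY_SARA NL atlas
"""
```

Calling `by_ngi(parse_sites(SAMPLE))` groups the two DE sites under one key and the NL site under another, which is a convenient starting point for per-NGI mail filtering or reporting.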
Tools
A framework has been developed to automate the operation of EGI security challenges.
The release of May 2011 contains the panda framework for job submission and a prototype of the new EGI-CSIRT ticketing system based on RTIR.
The test malware is not intrusive: it does not try to gain elevated privileges.
More information about the framework is given at security drills framework.
Post processing, clean up
As part of the incident handling, Grid authorizations may have been withdrawn from the DN that was used to submit the job. When the incident response procedure is complete, the test operator will explicitly request restoration of any such authorizations to their original state.
De-briefing
When the challenge has been completed on a representative number of Sites, the test operator will ask for de-briefing input from the participating Sites. Material submitted will be used to edit a report. The report will be circulated to the contributors for comments before being presented to the EGI-CSIRT.
Feedback
All NGI Security Officers participating in SSC5, please put your comments here. Comments from the NGI Security Officer point of view as well as from the site point of view are both welcome. Please indicate:
- what kind of problems you have encountered, and what problems sites had,
- ideas to solve the mentioned problems (if you have any; this is not an obligatory field :) ),
- whether the procedure and the broadcast information were clear enough for you and for sites,
- which parts of the SSC sites liked and considered useful, which they did not, and why,
- if you think that the situation during the SSC5 run revealed some weakness of our procedure, please show where,
- tips from sites on how to do what; maybe we can later build an extended tutorial for dealing with incidents,
- what questions came from sites; maybe we can add some more info on the wiki pages/templates/procedure to make it even clearer,
- and anything else which you believe can help us improve our work.
If you see a problem that someone else has already mentioned, please write it down as well; this will show the scope.
NGI: XY Security Officer: Name (Template)
Problems encountered and ideas for solutions:
- Problem One - and solution for it
- Problem Two - no idea how to solve it
Ideas for improvements:
- Let's do this in a different way, such as...
Other comments:
- comment
NGI: NL Security Officer: Sven Gabriel
NGI: DE Security Officer: Ursula Epting
Problems encountered and ideas for solutions:
- Roles/levels were mixed up (EGI-CSIRT/NGI/site). For example, NGI/site security officers had background knowledge if they were also members of EGI-CSIRT, and it was not easy to decide which knowledge could be used to solve the incident and which could not. This was communicated in the chat room, but did not reach all people. In a real case everything would of course be used, but this would have been unfair in the test, as sites will be compared.
- Tasks of the NGI security officers for their sites were unclear
- Too many goals of SSC5: test sites, test RTIR, test chat, test CSIRT team. SSC5 consumed a lot of manpower at each site; it would be regrettable if the evaluation cannot be done right because too many things were tested at the same time.
- Flood of information in general, additionally many mails arrived three times at my mailbox via different information routes.
Ideas for improvements:
- Try to clearly state duties; try not to provide background knowledge about the SSC to tested people
- Incident coordinator delegates tasks
- Isolate tests for sites from the other (CSIRT internal) ones to have meaningful results for the sites.
- Separate recipients lists, avoid overlap.
Other comments:
- exercise still useful!
- the last mail requesting final reports should have included a deadline
NGI: IT Security Officer: Riccardo Brunetti - Giuseppe Misurelli
Problems encountered and ideas for solutions:
- Excess of information and emails (most of them duplicated).
Ideally, in case of a widespread incident like the one simulated by the SSC5, the information and the results of the forensic analysis should be shared with all the involved sites as much as possible, but this must happen in a more orderly way. In my view, each NGI security officer should receive emails and notifications about their own sites only, to avoid confusion and information flood. The information which can be useful for all the sites should be put in some sort of dashboard available to the NGI Security Officers and to the global incident coordinator. This can be done using the incident ticket, like we did in the SSC5 after a while.
- Not clear which information had to be passed to the sites and which not.
After the first couple of sites had done the exercise it was clear, for example, which DN had to be banned and what the name of the executable was (or at least one of the names). It was not clear to what extent we, as incident coordinators for our NGI, had to inform the other sites about that. Moreover, at 2 or more of our sites the jobs had already finished or had been stopped when the site received the first alert, so it was not possible to perform a live analysis of the running job.
Other comments:
- Apart from the security and privacy issues that were mentioned by someone about the web access to the panda framework (concerns that I personally share with them), it was clear that in a security incident on the Grid, really valuable information can come from the VO experts, and from the people who actually know how the job submission frameworks work and where to find the traces of the submitted jobs.
We should probably try to make for the VO submission frameworks something similar to what we did for the Services Reference Cards.
NGI: PL Security Officer: Adam Smutnicki
Problems encountered and ideas for solutions:
- the drill should be started during working hours, with plenty of time left until the end of anyone's working day, for example at 11:00 (not everyone works until 18:00; in Poland a lot of people work 7:30 - 15:30). There was no clear information that sites are not required to work after hours and that this would not make them lose points.
- inconsistent information about responding to the drill: the mail "Announcement..." said "Please, reply to the alarm mail keeping the subject and provide...", but the mail "Initial alarm..." said "For reference the complete Incident Response Procedure can be found in the following link:...", and when we look into the procedure, we see that there is a template for the "Initial HEADS-UP message". This can be confusing. It should be clearly stated from the beginning that we are responding according to the procedure (not only mentioning that there is a procedure somewhere), and that the only difference is that instead of sending mail to abuse@... the site should respond to the ticket, leaving the markers in the brackets but setting the subject according to the procedure requirements.
- not clear which information appearing in the EGI CSIRT group can, should or cannot be sent to participating sites, and under what rules.
- information, guidelines and descriptions were divided between the incident response procedure and the SSC wiki page; they should be in one place, the incident response procedure, and the SSC page should only link to that information.
Ideas for improvements:
- if it is possible, all threads should be marked with NGI name, for mail filtering purpose
- clarify what should be included in "brief final report"
- templates in procedure:
- words like "encourage" and "can be used" in connection with the templates can be confusing. Someone reading them starts wondering whether to use the template or not. Let's just say that it is required: we will receive structured messages from the sites, and the sites won't have doubts about which format they should use.
- It should be clarified (within the procedure or somewhere else) that sites should send updates using the provided template, with some indication of how often
- Template for follow-up message: it is not clear whether updates should contain only new information, or whether each update should contain the whole template filled in with what is known. This makes the information less structured and more chaotic. I suggest using the whole template for each update; this keeps all information in one place, instead of having to search through the whole ticket thread. Maybe we can move the timeline up within the template, because it is used often during the first stage of incident response.
- there should be one place on the wiki containing clear and brief step-by-step instructions on what to do when there is an incident. When a site is acting under time pressure and does not have a security team, it does not have time to read and analyse the procedure. Such a tutorial should be as user-friendly as possible and contain all information aggregated in one place. The checklist and flowchart are very good, but they can be extended. E.g. 1) Inform EGI CSIRT with template 1. 2) Provide follow-up with template 2 (with information on how often and how to use the template). It should also say what kind of information the site should look for (as in the section "Information to be gathered at the sites" on the SSC wiki page).
Other comments:
- the whole SSC was very educational
NGI: UK Security Officer: MingChao
NGI: GRNET Security Officer: Christos Triantafylldis
Problems encountered and ideas for solutions:
- Investigation ownership
It appears that whenever someone tried to steal an investigation, they got the whole incident and all the investigations. This is not the foreseen behaviour. It was solved by adding a new custom field (Security Officer) to store the responsible security officer for each investigation. This also solved the issue of having 2 people responsible for one investigation, as in Italy's case.
- Mail flow
There were many mails repeating the same information (from other sources). Ideally only the people responsible for each investigation should get these mails, while everyone should only get the updates on the incident ticket.
- Single view of the status of all investigations
To ease the investigation follow-up I created a dashboard (https://ssc-rt.nikhef.nl/Dashboards/365/Current%20investigations) to have an overview of the current situation. It would be nice if such views could be created in a less manual way.
Ideas for improvements:
- It would be nice to be able to communicate information to all involved contacts but also keep information at a central point for EGI CSIRT needs. I would propose to use the incident ticket for this, where replies should go to every contact (like a broadcast, but only to sites/services that are involved) and comments store the internal information that EGI CSIRT has before releasing it
Other comments:
- I think this time we have achieved the target of having each person with one role in the whole procedure (with the exception of Leif and Ursula, who also had the site hat). In future I think we should also distinguish the infrastructure that is used (i.e. it appears that our RTIR, the main communication channel, was co-hosted with the intruder)
NGI: IE Security Officer: David O'Callaghan
Feedback provided
- https://ssc-rt.nikhef.nl/Ticket/Display.html?id=378
- https://ssc-rt.nikhef.nl/Ticket/Display.html?id=406
NGI: FR Security Officer: Dorine Fouossong
Feedback provided