Tools/Manuals/TS121

From EGIWiki

How to stop WMS jobs flooding my site

Problem

A CE is being flooded with jobs from a particular user or WMS/Condor-G node. How can this be stopped while keeping the CE available to others?

Solution

If the trouble comes from a particular WMS/Condor-G node, the firewall rules on the CE can be adjusted to refuse connections from that node. Note: it is better to actively refuse (reject) the connections than to silently drop the traffic, so that the remote host can abort the corresponding jobs with the smallest delay.
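
As an illustration of refusing rather than dropping, the following sketch only prints an iptables rule for review; it does not change the firewall. It assumes the CE uses iptables, and the function name reject_rule and the address 192.0.2.10 are hypothetical placeholders. The rule covers all TCP traffic from the host; narrowing it to specific service ports is left to the site.

```shell
#!/bin/sh
# Hypothetical helper: print an iptables rule that actively rejects TCP
# connections from one submission host. Nothing is applied here; the
# printed command is meant to be reviewed and then run by hand as root.

reject_rule() {
    # $1: IP address of the offending WMS/Condor-G node (placeholder).
    # REJECT with a TCP reset makes the remote side fail at once,
    # whereas DROP would leave it waiting for a timeout.
    echo "iptables -I INPUT -s $1 -p tcp -j REJECT --reject-with tcp-reset"
}

# 192.0.2.10 is a documentation-range address, not a real node.
reject_rule "${BAD_HOST:-192.0.2.10}"
```

The same rule with -D instead of -I removes it again once the flood is over.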

If the trouble comes from a particular user (e.g. a huge number of jobs submitted by accident), on an LCG-CE or CREAM CE the user DN should first be banned in /opt/glite/etc/lcas/ban_users.db (between double quotes, as in the grid-mapfile) to prevent further load on the service. The DN may need to stay banned for a few days, so that the remote submission hosts have enough time to fail all the jobs involved.
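
A minimal sketch of adding such an entry follows. The function name ban_dn and the example DN are hypothetical; the only assumption about the file format is the one stated above, namely one DN per line between double quotes. The function appends the entry only if it is not already present.

```shell
#!/bin/sh
# Hypothetical helper: append a quoted DN to an LCAS ban file, once.

ban_dn() {
    # $1: user DN, $2: path to the ban file
    entry="\"$1\""
    # -F: fixed string, -x: match the whole line, -q: no output
    if grep -Fxq -- "$entry" "$2" 2>/dev/null; then
        echo "already banned: $1"
    else
        printf '%s\n' "$entry" >> "$2"
        echo "banned: $1"
    fi
}

# Example call (made-up DN, to be run as root on the CE):
#   ban_dn "/DC=org/DC=example/CN=Some User" /opt/glite/etc/lcas/ban_users.db
```

Removing the line again from the file lifts the ban once the submission hosts have failed the jobs.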

Then, on a CREAM CE the jobs in question can be canceled in the batch system and (if needed) purged selectively as described here:

On an LCG-CE one can remove all traces of the jobs in question as follows:

  1. Stop the relevant daemons:
   /etc/init.d/globus-gma stop
   /etc/init.d/globus-job-manager-marshal stop
   /etc/init.d/globus-gass-cache-marshal stop
  2. In /opt/globus/tmp/gram_job_state remove the files owned by the affected (pool) account to which the user is mapped
  3. In /opt/globus/tmp/gma_state remove the single file for that account (beware that the file is owned by root)
  4. In the account's home directory move the .lcgjm and .globus subdirectories aside into a junk directory:
   mkdir junk
   mv .lcgjm .globus junk
  5. Restart the relevant daemons:
   /etc/init.d/globus-gma start
   /etc/init.d/globus-job-manager-marshal start
   /etc/init.d/globus-gass-cache-marshal start
  6. Cancel the account's remaining jobs in the batch system.
  7. Remove the junk directory created earlier (it may take many minutes):
   nice /bin/rm -r junk > /tmp/rm-$$.log 2>&1 < /dev/null &
Check the log file afterwards: it should be empty.
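
Two of the steps above lend themselves to small helpers: finding the files owned by the affected account in the job state directory, and checking that the rm log is empty. The sketch below is only an illustration; the function names list_account_files and check_rm_log and the account name in the usage comment are hypothetical. The listing function prints the files without deleting anything, so the result can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helpers for the LCG-CE cleanup steps; nothing is deleted here.

list_account_files() {
    # $1: directory to search, $2: (pool) account name
    # Prints the regular files under $1 that are owned by $2.
    find "$1" -type f -user "$2"
}

check_rm_log() {
    # $1: log file written by the background rm
    if [ ! -e "$1" ]; then
        echo "no such log: $1"
        return 2
    elif [ -s "$1" ]; then
        # A non-empty log means rm reported at least one problem.
        echo "errors logged in $1"
        return 1
    else
        echo "log empty: $1"
    fi
}

# Example calls (account name is made up):
#   list_account_files /opt/globus/tmp/gram_job_state dteam001
#   check_rm_log /tmp/rm-$$.log   # same shell that started the rm
```

Note that $$ expands to the process ID of the shell that launched the rm, so the log name only matches when the check is run from that same shell.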