
Tools/Manuals/TS87

Revision as of 08:28, 22 August 2011





The job submission and cleanup dialog between WMS/Condor-G and LCG-CE/OSG-CE goes like this:

  1. The WMS contacts the CE on port 2119 and indicates on which port the WMS should be called back by the globus-job-manager. That port is the first free port in the Globus port range defined on the WMS. The range is usually 20000-25000, so the first free port is usually within a few tens of the lower bound, i.e. 20000 + O(10).
    The WMS is also called back on a second port by the "grid_monitor" process running on the CE. A WMS will launch one such process per user on any CE that has unfinished jobs for that user. The "grid_monitor" keeps track of the state of the user's jobs and regularly reports back to its WMS. Each such process will exit after 1 hour and another instance will then be launched as needed.
  2. The WMS contacts the globus-job-manager on a few ports in the CE port range. The CE calls the WMS back several times on the ports from step 1. If all goes well, a wrapper script for the user job is then accepted by the CE for submission to the CE's batch system.
  3. When the job wrapper has been successfully submitted to the batch system, the globus-job-manager for that job is told to exit.
  4. The job wrapper eventually starts on the WN and copies the input sandbox from the WMS using globus-url-copy. (This step is specific to the WMS, not to Condor-G in general.)
  5. The job wrapper runs the user payload.
  6. The job wrapper copies the output sandbox (and the "Maradona" file with the payload's exit status) back to the WMS and exits. (This step is also specific to the WMS, not to Condor-G in general.)
  7. The "grid_monitor" running on the CE informs the WMS that the job has exited. The WMS contacts the CE again on port 2119 to restart the globus-job-manager for the job in question, which then cleans things up and sends the stderr and stdout of the job wrapper to the WMS (for a WMS job, stdout normally also contains the exit status of the user payload).
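
The "first free port in the port range" behaviour from step 1 can be illustrated with a minimal Python sketch. This is not the Globus implementation; the function name bind_first_free and its parameters lo, hi and host are hypothetical and chosen only for this example. It simply tries each port of the range in ascending order and listens on the first one that can be bound, which is why the callback ports seen on a WMS tend to cluster just above the lower bound of the range.

```python
import socket


def bind_first_free(lo, hi, host="0.0.0.0"):
    """Return (listening_socket, port) for the first bindable port in [lo, hi].

    Hypothetical helper for illustration only: it mimics the idea of
    "first free port in the configured range", nothing more.
    """
    for port in range(lo, hi + 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((host, port))
        except OSError:
            # Port is taken (or not permitted): try the next one.
            s.close()
            continue
        s.listen(128)  # same backlog as in the trace on this page
        return s, port
    raise OSError(f"no free port in range {lo}-{hi}")


if __name__ == "__main__":
    # The range 20000-25000 is the usual one quoted above; adjust as needed.
    first, p1 = bind_first_free(20000, 25000)
    second, p2 = bind_first_free(20000, 25000)
    print("callback ports:", p1, p2)
    first.close()
    second.close()
```

If a firewall between the CE and the WMS does not allow inbound connections to the whole range, the callbacks in steps 1, 2 and 7 will fail even though the initial contact on port 2119 succeeds.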

The following trace shows all calls to bind(), connect(), listen() and accept() made by the gahp_server process on a WMS for a single job submission to a CE plus the launch of a "grid_monitor", followed by the subsequent cleanup of the job (whenever a file descriptor is reused, it was first closed):


   bind( 6, {AF_INET, 20000,  0},  16 ) =  0
 listen( 6,                       128 ) =  0
   bind( 7, {AF_INET, 20001,  0},  16 ) =  0
 listen( 7,                       128 ) =  0
   bind( 8, {AF_INET, 20002,  0},  16 ) =  0
connect( 8, {AF_INET,  2119, CE},  16 ) = -1 EINPROGRESS
   bind( 8, {AF_INET, 20003,  0},  16 ) =  0
connect( 8, {AF_INET,  2119, CE},  16 ) = -1 EINPROGRESS
   bind( 9, {AF_INET, 20005,  0},  16 ) =  0
connect( 9, {AF_INET,  2119, CE},  16 ) = -1 EINPROGRESS
   bind( 8, {AF_INET, 20007,  0},  16 ) =  0
connect( 8, {AF_INET, 20007, CE},  16 ) = -1 EINPROGRESS
 accept( 6, {AF_INET, 20009, CE}, [16]) = 10
 accept( 6, {AF_INET, 20010, CE}, [16]) =  8
   bind(10, {AF_INET, 20007,  0},  16 ) =  0
connect(10, {AF_INET, 20007, CE},  16 ) = -1 EINPROGRESS
 accept( 6, {AF_INET, 20011, CE}, [16]) =  8
 accept( 7, {AF_INET, 20012, CE}, [16]) =  8
 accept( 7, {AF_INET, 20013, CE}, [16]) =  9
 accept( 7, {AF_INET, 20014, CE}, [16]) =  9
 accept( 7, {AF_INET, 20007, CE}, [16]) =  9
 accept( 7, {AF_INET, 20007, CE}, [16]) =  9
 accept( 7, {AF_INET, 20007, CE}, [16]) =  9
 accept( 7, {AF_INET, 20000, CE}, [16]) =  9
 accept( 7, {AF_INET, 20000, CE}, [16]) =  9
 accept( 7, {AF_INET, 20000, CE}, [16]) =  9
 accept( 7, {AF_INET, 20000, CE}, [16]) =  9
 accept( 7, {AF_INET, 20000, CE}, [16]) =  9
   bind( 9, {AF_INET, 20002,  0},  16 ) =  0
connect( 9, {AF_INET,  2119, CE},  16 ) = -1 EINPROGRESS
   bind( 9, {AF_INET, 20003,  0},  16 ) =  0
connect( 9, {AF_INET, 20010, CE},  16 ) = -1 EINPROGRESS
 accept( 6, {AF_INET, 20013, CE}, [16]) =  9
 accept( 7, {AF_INET, 20014, CE}, [16]) =  9
 accept( 7, {AF_INET, 20015, CE}, [16]) =  9
 accept( 6, {AF_INET, 20016, CE}, [16]) =  9
   bind( 9, {AF_INET, 20007,  0},  16 ) =  0
connect( 9, {AF_INET, 20010, CE},  16 ) = -1 EINPROGRESS
   bind( 9, {AF_INET, 20007,  0},  16 ) =  0
connect( 9, {AF_INET, 20010, CE},  16 ) = -1 EINPROGRESS
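
When comparing such a trace against firewall rules, it can help to reduce it to the ports that matter. The following Python sketch assumes the simplified line layout used on this page (not raw strace output); the regular expression and the function name summarize are hypothetical and written only for this example. It collects the local ports the WMS listens on (which the CE must be able to reach) and the remote ports the WMS connects to on the CE.

```python
import re

# Matches e.g. "   bind( 6, {AF_INET, 20000,  0},  16 ) =  0"
# and        " listen( 6,                       128 ) =  0"
CALL = re.compile(
    r"^\s*(bind|listen|connect|accept)\(\s*(\d+),\s*"
    r"(?:\{AF_INET,\s*(\d+),\s*(\w+)\},)?"
)


def summarize(trace):
    """Return (listening_ports, outbound_ports) found in the trace text."""
    bound = {}         # fd -> local port from the most recent bind()
    listening = set()  # local ports on which listen() was called
    outbound = set()   # remote ports targeted by connect()
    for line in trace.splitlines():
        m = CALL.match(line)
        if not m:
            continue
        call, fd, port = m.group(1), int(m.group(2)), m.group(3)
        if call == "bind" and port is not None:
            bound[fd] = int(port)
        elif call == "listen" and fd in bound:
            listening.add(bound[fd])
        elif call == "connect" and port is not None:
            outbound.add(int(port))
    return sorted(listening), sorted(outbound)


if __name__ == "__main__":
    sample = """
   bind( 6, {AF_INET, 20000,  0},  16 ) =  0
 listen( 6,                       128 ) =  0
   bind( 8, {AF_INET, 20002,  0},  16 ) =  0
connect( 8, {AF_INET,  2119, CE},  16 ) = -1 EINPROGRESS
"""
    print(summarize(sample))
```

Applied to the full trace above, this reports the listening ports 20000 and 20001 on the WMS and the destination ports 2119, 20007 and 20010 on the CE.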