Tools/Manuals/TS87


Dialog between WMS and LCG-CE

The job submission and cleanup dialog between WMS/Condor-G and LCG-CE/OSG-CE proceeds as follows:

  1. The WMS contacts the CE on port 2119 and indicates the port on which the globus-job-manager should call the WMS back. That port is the first free port in the Globus port range defined on the WMS. The range is usually 20000-25000, so the first free port is usually 20000 + O(10).
    The WMS is also called back on a second port by the "grid_monitor" process running on the CE. A WMS launches one such process per user on any CE that has unfinished jobs for that user. The "grid_monitor" keeps track of the state of the user's jobs and regularly reports back to its WMS. Each such process exits after 1 hour, and another instance is then launched as needed.
  2. The WMS contacts the globus-job-manager on a few ports in the CE port range. The CE calls the WMS back several times on the ports from step 1. If all went well, a wrapper script for the user job has then been accepted by the CE for submission to the CE's batch system.
  3. When the job wrapper has been successfully submitted to the batch system, the globus-job-manager for that job is told to exit.
  4. The job wrapper eventually starts on the WN and copies the input sandbox from the WMS using globus-url-copy. (This step is specific to the WMS, not to Condor-G in general.)
  5. The job wrapper runs the user payload.
  6. The job wrapper copies the output sandbox (and the "Maradona" file with the payload's exit status) back to the WMS and exits. (This step is also specific to the WMS, not to Condor-G in general.)
  7. The "grid_monitor" running on the CE informs the WMS that the job has exited. The WMS contacts the CE again on port 2119 to restart the globus-job-manager for the job in question, which then cleans things up and sends the stderr and stdout of the job wrapper to the WMS (for a WMS job, stdout normally also contains the exit status of the user payload).
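
Step 1 relies on the "first free port" in a configured range. The following is a minimal sketch of that idea only, useful for checking which port in a range would be picked on a given host; it is not how Globus itself selects ports. The function name first_free_port and the loopback default are assumptions made for illustration.

```python
import socket


def first_free_port(low, high, host="127.0.0.1"):
    """Return the first port in [low, high] that can be bound on host.

    Hypothetical helper: it only probes by binding and closing a socket,
    so the port may be taken by another process right afterwards.
    Returns None if no port in the range can be bound.
    """
    for port in range(low, high + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                # Port in use (or not permitted): try the next one.
                continue
            return port
    return None


if __name__ == "__main__":
    # 20000-25000 is the range the text describes as usual on a WMS.
    print(first_free_port(20000, 25000))
```

On a busy WMS the result would be expected to sit slightly above the bottom of the range, which matches the 20000 + O(10) estimate in step 1.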

The following trace shows all calls to bind(), connect(), listen() and accept() made by the gahp_server process on a WMS for a single job submission to a CE plus the launch of a "grid_monitor", followed by the subsequent cleanup of the job (whenever a file descriptor is reused, it was first closed):


   bind( 6, {AF_INET, 20000,  0},  16 ) =  0
 listen( 6,                       128 ) =  0
   bind( 7, {AF_INET, 20001,  0},  16 ) =  0
 listen( 7,                       128 ) =  0
   bind( 8, {AF_INET, 20002,  0},  16 ) =  0
connect( 8, {AF_INET,  2119, CE},  16 ) = -1 EINPROGRESS
   bind( 8, {AF_INET, 20003,  0},  16 ) =  0
connect( 8, {AF_INET,  2119, CE},  16 ) = -1 EINPROGRESS
   bind( 9, {AF_INET, 20005,  0},  16 ) =  0
connect( 9, {AF_INET,  2119, CE},  16 ) = -1 EINPROGRESS
   bind( 8, {AF_INET, 20007,  0},  16 ) =  0
connect( 8, {AF_INET, 20007, CE},  16 ) = -1 EINPROGRESS
 accept( 6, {AF_INET, 20009, CE}, [16]) = 10
 accept( 6, {AF_INET, 20010, CE}, [16]) =  8
   bind(10, {AF_INET, 20007,  0},  16 ) =  0
connect(10, {AF_INET, 20007, CE},  16 ) = -1 EINPROGRESS
 accept( 6, {AF_INET, 20011, CE}, [16]) =  8
 accept( 7, {AF_INET, 20012, CE}, [16]) =  8
 accept( 7, {AF_INET, 20013, CE}, [16]) =  9
 accept( 7, {AF_INET, 20014, CE}, [16]) =  9
 accept( 7, {AF_INET, 20007, CE}, [16]) =  9
 accept( 7, {AF_INET, 20007, CE}, [16]) =  9
 accept( 7, {AF_INET, 20007, CE}, [16]) =  9
 accept( 7, {AF_INET, 20000, CE}, [16]) =  9
 accept( 7, {AF_INET, 20000, CE}, [16]) =  9
 accept( 7, {AF_INET, 20000, CE}, [16]) =  9
 accept( 7, {AF_INET, 20000, CE}, [16]) =  9
 accept( 7, {AF_INET, 20000, CE}, [16]) =  9
   bind( 9, {AF_INET, 20002,  0},  16 ) =  0
connect( 9, {AF_INET,  2119, CE},  16 ) = -1 EINPROGRESS
   bind( 9, {AF_INET, 20003,  0},  16 ) =  0
connect( 9, {AF_INET, 20010, CE},  16 ) = -1 EINPROGRESS
 accept( 6, {AF_INET, 20013, CE}, [16]) =  9
 accept( 7, {AF_INET, 20014, CE}, [16]) =  9
 accept( 7, {AF_INET, 20015, CE}, [16]) =  9
 accept( 6, {AF_INET, 20016, CE}, [16]) =  9
   bind( 9, {AF_INET, 20007,  0},  16 ) =  0
connect( 9, {AF_INET, 20010, CE},  16 ) = -1 EINPROGRESS
   bind( 9, {AF_INET, 20007,  0},  16 ) =  0
connect( 9, {AF_INET, 20010, CE},  16 ) = -1 EINPROGRESS
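
A trace like the one above is easier to compare against firewall rules once it is summarized per port. The sketch below, under the assumption that the lines keep exactly the layout shown here, counts the callbacks accepted on each listening port and the outgoing connects per remote port. The names CALL, summarize and SAMPLE are hypothetical, and SAMPLE is a shortened excerpt in the same format, not the full trace.

```python
import re
from collections import Counter

# Matches e.g. "bind( 6, {AF_INET, 20000,  0},  16 ) =  0".
# The address part is optional because listen() has none.
CALL = re.compile(
    r"\s*(bind|listen|connect|accept)\(\s*(\d+),\s*"
    r"(?:\{AF_INET,\s*(\d+),\s*(\w+)\})?"
)


def summarize(trace):
    """Return (listening, accepts, connects) for a trace in the above layout."""
    bound = {}            # fd -> local port of the most recent bind()
    listening = {}        # fd -> local port of sockets that called listen()
    accepts = Counter()   # listening port -> number of accepted callbacks
    connects = Counter()  # remote port -> number of outgoing connects
    for line in trace.splitlines():
        m = CALL.match(line)
        if not m:
            continue
        call, fd, port = m.group(1), int(m.group(2)), m.group(3)
        if call == "bind":
            bound[fd] = int(port)
        elif call == "listen" and fd in bound:
            listening[fd] = bound[fd]
        elif call == "accept" and fd in listening:
            accepts[listening[fd]] += 1
        elif call == "connect":
            connects[int(port)] += 1
    return listening, accepts, connects


SAMPLE = """\
   bind( 6, {AF_INET, 20000,  0},  16 ) =  0
 listen( 6,                       128 ) =  0
   bind( 7, {AF_INET, 20001,  0},  16 ) =  0
 listen( 7,                       128 ) =  0
   bind( 8, {AF_INET, 20002,  0},  16 ) =  0
connect( 8, {AF_INET,  2119, CE},  16 ) = -1 EINPROGRESS
 accept( 6, {AF_INET, 20009, CE}, [16]) = 10
 accept( 7, {AF_INET, 20012, CE}, [16]) =  8
 accept( 7, {AF_INET, 20013, CE}, [16]) =  9
"""

if __name__ == "__main__":
    listening, accepts, connects = summarize(SAMPLE)
    print("listening:", listening)
    print("accepted callbacks per local port:", dict(accepts))
    print("outgoing connects per remote port:", dict(connects))
```

In the trace, the two listening sockets correspond to the two callback ports from step 1, and the connects to port 2119 correspond to the WMS contacting the CE gatekeeper.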