Difference between revisions of "Tools/Manuals/TS87"
Latest revision as of 13:48, 23 November 2012
Dialog between WMS and LCG-CE
The job submission and cleanup dialog between WMS/Condor-G and LCG-CE/OSG-CE goes like this:
1. The WMS contacts the CE on port 2119 and indicates on which port the WMS should be called back by the globus-job-manager. That port is the first free port in the Globus port range defined on the WMS. The range is usually 20000-25000, so the first free port is usually 20000 + O(10).
   The WMS is also called back on a second port by the "grid_monitor" process running on the CE. A WMS will launch one such process per user on any CE that has unfinished jobs for that user. The "grid_monitor" keeps track of the state of the user's jobs and regularly reports back to its WMS. Each such process will exit after 1 hour and another instance will then be launched as needed.
2. The WMS contacts the globus-job-manager on a few ports in the CE port range. The CE calls the WMS back several times on the ports from step 1. If all went well, a wrapper script for the user job has then been accepted by the CE for submission to the CE's batch system.
3. When the job wrapper has been successfully submitted to the batch system, the globus-job-manager for that job is told to exit.
4. The job wrapper eventually starts on the WN and copies the input sandbox from the WMS using globus-url-copy. (This step is specific to the WMS, not to Condor-G in general.)
5. The job wrapper runs the user payload.
6. The job wrapper copies the output sandbox (and the "Maradona" file with the payload's exit status) back to the WMS and exits. (This step is also specific to the WMS, not to Condor-G in general.)
7. The "grid_monitor" running on the CE informs the WMS that the job has exited. The WMS contacts the CE again on port 2119 to restart the globus-job-manager for the job in question, which then cleans things up and sends the stderr and stdout of the job wrapper to the WMS (for a WMS job, stdout normally also contains the exit status of the user payload).
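The "first free port in the range" selection described in the first step above can be illustrated with a small Python sketch. This is not how the WMS or Globus actually implements it; the function name `first_free_port` and the use of the loopback address are assumptions made purely for illustration. It simply tries to bind each port of a range in order and returns the first one that succeeds, which is why a lightly loaded host ends up with ports close to the lower bound.

```python
import socket


def first_free_port(low, high, host="127.0.0.1"):
    """Return the first port in [low, high] that can be bound, or None.

    Hypothetical helper: it only mimics the "first free port in the
    configured range" behaviour described in the text.
    """
    for port in range(low, high + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Port is in use (or not permitted); try the next one.
                continue
            return port
    return None


if __name__ == "__main__":
    # 20000-25000 is the range the text calls usual for a WMS.
    print(first_free_port(20000, 25000))
```

The probe socket is closed again before the function returns, so the result is only a hint: another process may take the port in the meantime.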
The following trace shows all calls to bind(), connect(), listen() and accept() made by the gahp_server process on a WMS for a single job submission to a CE plus the launch of a "grid_monitor", followed by the subsequent cleanup of the job (whenever a file descriptor is reused, it was first closed):
bind( 6, {AF_INET, 20000, 0}, 16 ) = 0
listen( 6, 128 ) = 0
bind( 7, {AF_INET, 20001, 0}, 16 ) = 0
listen( 7, 128 ) = 0
bind( 8, {AF_INET, 20002, 0}, 16 ) = 0
connect( 8, {AF_INET, 2119, CE}, 16 ) = -1 EINPROGRESS
bind( 8, {AF_INET, 20003, 0}, 16 ) = 0
connect( 8, {AF_INET, 2119, CE}, 16 ) = -1 EINPROGRESS
bind( 9, {AF_INET, 20005, 0}, 16 ) = 0
connect( 9, {AF_INET, 2119, CE}, 16 ) = -1 EINPROGRESS
bind( 8, {AF_INET, 20007, 0}, 16 ) = 0
connect( 8, {AF_INET, 20007, CE}, 16 ) = -1 EINPROGRESS
accept( 6, {AF_INET, 20009, CE}, [16]) = 10
accept( 6, {AF_INET, 20010, CE}, [16]) = 8
bind(10, {AF_INET, 20007, 0}, 16 ) = 0
connect(10, {AF_INET, 20007, CE}, 16 ) = -1 EINPROGRESS
accept( 6, {AF_INET, 20011, CE}, [16]) = 8
accept( 7, {AF_INET, 20012, CE}, [16]) = 8
accept( 7, {AF_INET, 20013, CE}, [16]) = 9
accept( 7, {AF_INET, 20014, CE}, [16]) = 9
accept( 7, {AF_INET, 20007, CE}, [16]) = 9
accept( 7, {AF_INET, 20007, CE}, [16]) = 9
accept( 7, {AF_INET, 20007, CE}, [16]) = 9
accept( 7, {AF_INET, 20000, CE}, [16]) = 9
accept( 7, {AF_INET, 20000, CE}, [16]) = 9
accept( 7, {AF_INET, 20000, CE}, [16]) = 9
accept( 7, {AF_INET, 20000, CE}, [16]) = 9
accept( 7, {AF_INET, 20000, CE}, [16]) = 9
bind( 9, {AF_INET, 20002, 0}, 16 ) = 0
connect( 9, {AF_INET, 2119, CE}, 16 ) = -1 EINPROGRESS
bind( 9, {AF_INET, 20003, 0}, 16 ) = 0
connect( 9, {AF_INET, 20010, CE}, 16 ) = -1 EINPROGRESS
accept( 6, {AF_INET, 20013, CE}, [16]) = 9
accept( 7, {AF_INET, 20014, CE}, [16]) = 9
accept( 7, {AF_INET, 20015, CE}, [16]) = 9
accept( 6, {AF_INET, 20016, CE}, [16]) = 9
bind( 9, {AF_INET, 20007, 0}, 16 ) = 0
connect( 9, {AF_INET, 20010, CE}, 16 ) = -1 EINPROGRESS
bind( 9, {AF_INET, 20007, 0}, 16 ) = 0
connect( 9, {AF_INET, 20010, CE}, 16 ) = -1 EINPROGRESS
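To read a trace like the one above more easily, it can help to condense it into three facts: which local ports the WMS listens on, which remote ports it connects to, and how many callbacks arrive on each listening port. The Python sketch below does that for text in the simplified notation used here. The names `CALL`, `SAMPLE` and `summarize` are hypothetical, the sample is just the first few calls of the trace, and the parser is not meant for real strace output, whose format differs.

```python
import re
from collections import Counter

# Matches calls in the simplified notation of the trace above, e.g.
#   bind( 6, {AF_INET, 20000, 0}, 16 ) = 0
#   listen( 6, 128 ) = 0
CALL = re.compile(
    r"(bind|connect|listen|accept)\(\s*(\d+),\s*"
    r"(?:\{AF_INET,\s*(\d+),\s*(\w+)\}|(\d+))"
)

SAMPLE = """
bind( 6, {AF_INET, 20000, 0}, 16 ) = 0
listen( 6, 128 ) = 0
bind( 7, {AF_INET, 20001, 0}, 16 ) = 0
listen( 7, 128 ) = 0
bind( 8, {AF_INET, 20002, 0}, 16 ) = 0
connect( 8, {AF_INET, 2119, CE}, 16 ) = -1 EINPROGRESS
accept( 6, {AF_INET, 20009, CE}, [16]) = 10
accept( 7, {AF_INET, 20012, CE}, [16]) = 8
"""


def summarize(trace):
    """Return (listening, outbound, inbound) for a trace string.

    listening: fd -> local port the fd listens on
    outbound:  remote port -> number of connect() calls to it
    inbound:   local listening port -> number of accept() calls on it
    """
    bound = {}            # fd -> local port most recently bound to it
    listening = {}
    outbound = Counter()
    inbound = Counter()
    for match in CALL.finditer(trace):
        name, fd = match.group(1), int(match.group(2))
        port = int(match.group(3)) if match.group(3) else None
        if name == "bind":
            bound[fd] = port
        elif name == "listen":
            listening[fd] = bound.get(fd)
        elif name == "connect":
            outbound[port] += 1
        elif name == "accept":
            inbound[listening.get(fd)] += 1
    return listening, outbound, inbound


if __name__ == "__main__":
    listening, outbound, inbound = summarize(SAMPLE)
    print("listening:", listening)
    print("outbound connects by remote port:", dict(outbound))
    print("callbacks by local port:", dict(inbound))
```

Because the parser uses `finditer` over the whole text, it works whether the calls are laid out one per line or run together on a single line. For the full trace it shows two listening ports, which would be consistent with the two callback ports described in the first step, although the trace itself does not say which port serves which process.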