The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

Difference between revisions of "Tools/Manuals/TS60"

From EGIWiki
Latest revision as of 18:48, 21 June 2014



ssh problem from WN to CE

Full message

Various error messages, usually not directly showing an ssh/scp problem; see other job submission errors.

Before a job starts, the batch system needs to copy the job wrapper script and the user proxy to the WN. Torque/PBS, and possibly other batch systems, normally rely on scp to let the WN copy those files from the CE host before the real job starts.

Similarly, after the job has finished, the stdout and stderr of the job wrapper need to be copied to the CE.

The various scp invocations may fail for several reasons. If the failures are intermittent, then the SSH daemon on the CE may not have been configured to allow sufficient simultaneous connections.

Diagnosis

  • Run "pbsnodes -a" to see the list of WNs, and make sure the keys for the CE and the WNs are present in /etc/ssh/ssh_known_hosts on both the CE and the WNs. Also ensure that no old keys are present. Check that the WNs are listed in /etc/ssh/shosts.equiv on the CE.
  • Look in /var/log/messages or /var/log/secure on the CE for hints.
  • From the WN, as a grid account (e.g. a pool user), try ssh or scp to the CE. (Note: this does not work for root, even if everything is set up properly.) A password prompt means the ssh trust relationship is not being used, which indicates a problem.
  • On your CE, check if /var/spool/pbs/server_priv/nodes and /etc/ssh/ssh_known_hosts both use fully qualified domain names.
  • Check whether pool accounts on the CE and WN are enabled for interactive login in /etc/loginusers, if such a file is configured in /etc/pam.d/sshd, /etc/pam.d/login, /etc/pam.d/system-auth, etc.
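
The first diagnosis step can be partly automated. The following is a minimal sketch, not part of the original manual: the helper name missing_hosts and the host names in the usage comments are hypothetical, and the pbsnodes parsing assumes that node names are the only unindented lines of its output. It prints every host that is not listed in the first field of a given file, which covers both ssh_known_hosts (comma-separated host list) and shosts.equiv (one host per line). Hashed known_hosts entries cannot be matched this way.

```shell
#!/bin/sh
# missing_hosts FILE HOST...
# Print each HOST that does not appear in the first field of FILE.
# Hypothetical helper for illustration; plain-text (unhashed) entries only.
missing_hosts() {
    file=$1
    shift
    for host in "$@"; do
        if ! awk -v h="$host" '
            NF == 0 || $1 ~ /^#/ { next }      # skip blank lines and comments
            {
                n = split($1, names, ",")      # known_hosts lists "name,ip,..."
                for (i = 1; i <= n; i++)
                    if (names[i] == h) found = 1
            }
            END { exit (found ? 0 : 1) }
        ' "$file"; then
            echo "$host"
        fi
    done
}

# Possible usage on the CE (assumes node names start unindented lines):
#   nodes=$(pbsnodes -a | awk '/^[^ \t]/ { print $1 }')
#   missing_hosts /etc/ssh/ssh_known_hosts $nodes
#   missing_hosts /etc/ssh/shosts.equiv $nodes
#
# Possible check from a WN, as a pool user (ce.example.org is a placeholder);
# BatchMode=yes makes ssh fail instead of asking for a password:
#   ssh -o BatchMode=yes ce.example.org true && echo "trust relationship OK"
```

An empty result means every host was found; any host printed should be looked at before moving on to the solution steps.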

Solution

A possible cause is duplicate entries for the WNs in the ssh configuration on the CE. In that case:

  • Remove shosts.equiv and ssh_known_hosts files from /etc/ssh directory on the CE and WNs.
  • Re-run the following scripts on the CE; they are usually also run as cron jobs.
      /usr/sbin/edg-pbs-knownhosts
      /usr/sbin/edg-pbs-shostsequiv
  • Re-run the following script on each WN; it is usually also run as a cron job.
      /usr/sbin/edg-pbs-knownhosts
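
Before removing the files, it may help to confirm that duplicates are really present. The sketch below is an illustration only: duplicate_hosts is a hypothetical helper, and it treats a host as duplicated when the same host name appears more than once with the same key type (one host with several different key types is normal). It ignores hashed entries and marker lines such as @cert-authority.

```shell
#!/bin/sh
# duplicate_hosts FILE
# Print "host keytype" pairs that occur more than once in an
# ssh_known_hosts-style file. Hypothetical helper for illustration.
duplicate_hosts() {
    awk '
        NF < 2 || $1 ~ /^#/ { next }           # skip blanks, comments, short lines
        {
            n = split($1, names, ",")          # first field: "name,ip,..."
            for (i = 1; i <= n; i++)
                count[names[i] " " $2]++       # key on host name + key type
        }
        END {
            for (k in count)
                if (count[k] > 1) print k
        }
    ' "$1" | sort
}

# Possible usage on the CE:
#   duplicate_hosts /etc/ssh/ssh_known_hosts
```

If the function prints nothing, duplicate keys are probably not the cause and the other diagnosis steps are worth revisiting.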

If insufficient connections are allowed to the SSH daemon on the CE:

  • Add the MaxStartups parameter to sshd_config on the CE:
      MaxStartups 100
  • Restart sshd on the CE.
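
To see what is currently configured before editing, something like the following could be used. It is a sketch under assumptions: maxstartups_of is a hypothetical helper, it only understands the whitespace-separated "keyword value" form, and the /etc/ssh/sshd_config path and the restart command in the comments depend on the distribution. It prints the value of the first uncommented MaxStartups line, since sshd uses the first value it obtains for a keyword and treats keywords case-insensitively.

```shell
#!/bin/sh
# maxstartups_of FILE
# Print the value of the first uncommented MaxStartups directive in FILE,
# or nothing if the directive is absent. Hypothetical helper for illustration.
maxstartups_of() {
    awk 'tolower($1) == "maxstartups" { print $2; exit }' "$1"
}

# Possible usage on the CE (path is an assumption):
#   maxstartups_of /etc/ssh/sshd_config
#
# The effective value, including built-in defaults, can be shown as root with:
#   sshd -T | grep -i maxstartups
#
# After editing, restart the daemon, for example (distribution dependent):
#   service sshd restart
```

An empty result means the directive is not set in that file and sshd is using its built-in default.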