Tools/Manuals/TS60

Revision as of 14:34, 25 May 2011 by Aesch


ssh problem from WN to CE

Full message

Various error messages, usually not pointing directly at an ssh/scp problem; see the other job submission errors. After a job has finished, a Torque/PBS WN has to copy the stdout and stderr of the job wrapper back to the CE, usually with scp. This copy can fail for several reasons. If the failures are intermittent, the SSH daemon on the CE may be configured to allow too few concurrent connections.

Diagnosis

  • Run "pbsnodes -a" to list the WNs, then make sure the host keys of the CE and of every WN are present in /etc/ssh/ssh_known_hosts on both the CE and the WNs. Also make sure no stale keys remain. Check that the WNs are listed in /etc/ssh/shosts.equiv on the CE.
  • Look in /var/log/messages or /var/log/secure on the CE for hints.
  • From the WN, as a grid account (e.g. a pool user), try ssh or scp to the CE. (Note: this does not work for root, even if everything is set up properly.) A password prompt means the ssh trust relationship is not being used, which indicates a problem.
  • On the CE, check that /var/spool/pbs/server_priv/nodes and /etc/ssh/ssh_known_hosts both use fully qualified domain names.
  • If /etc/loginusers is referenced from /etc/pam.d/sshd, /etc/pam.d/login, /etc/pam.d/system-auth, etc., check that the pool accounts on the CE and the WNs are enabled for interactive login in that file.
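
The file checks above can be partly automated. The following is a minimal sketch, not part of the original guide: the function names missing_hosts, short_names and trust_works are hypothetical, and it assumes that the first field of each line in the nodes file is the host name and that ssh_known_hosts is not hashed.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, file formats are assumed.

# Print every host from NODES_FILE that has no entry in TRUST_FILE.
# TRUST_FILE may be ssh_known_hosts (comma-separated host list in field 1)
# or shosts.equiv (one host per line).
# Usage: missing_hosts NODES_FILE TRUST_FILE
missing_hosts() {
    nodes_file=$1
    trust_file=$2
    awk '!/^#/ && NF { print $1 }' "$nodes_file" | while read -r host; do
        if ! awk -v h="$host" '
            !/^#/ && NF {
                n = split($1, names, ",")
                for (i = 1; i <= n; i++) if (names[i] == h) found = 1
            }
            END { exit !found }' "$trust_file"; then
            echo "$host"
        fi
    done
}

# Print node names that are not fully qualified (no dot in the name).
# Usage: short_names NODES_FILE
short_names() {
    awk '!/^#/ && NF && $1 !~ /\./ { print $1 }' "$1"
}

# Run on a WN as a pool user: succeeds only if login to the CE works
# without a password. BatchMode=yes makes ssh fail instead of prompting.
# Usage: trust_works CE_HOSTNAME
trust_works() {
    ssh -o BatchMode=yes "$1" true
}
```

On the CE, for example, "missing_hosts /var/spool/pbs/server_priv/nodes /etc/ssh/shosts.equiv" would list the WNs that still need an entry, and an empty result from short_names means all node names contain a domain part.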

Solution

A possible cause is duplicate entries for the WNs in the ssh configuration on the CE:
  • Remove the shosts.equiv and ssh_known_hosts files from the /etc/ssh directory on the CE and the WNs.
  • Re-run the following scripts on the CE (they usually also run as cron jobs):
/opt/edg/sbin/edg-pbs-knownhosts
/opt/edg/sbin/edg-pbs-shostsequiv
  • Re-run the following script on each WN (it usually also runs as a cron job):
/opt/edg/sbin/edg-pbs-knownhosts
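
Before removing the files, it can help to confirm that duplicates are really present. The sketch below is an illustration, not part of the original procedure: the function name duplicate_hosts is hypothetical, and it assumes an unhashed ssh_known_hosts in the usual "hosts keytype key" layout. It treats a host as duplicated only when it appears more than once with the same key type, since one line per key type is normal.

```shell
#!/bin/sh
# Sketch only: duplicate_hosts is a hypothetical helper.

# Print "host keytype" for every host that appears on more than one
# line with the same key type in a known_hosts-style file.
# Usage: duplicate_hosts KNOWN_HOSTS_FILE
duplicate_hosts() {
    awk '!/^#/ && NF {
        n = split($1, names, ",")
        for (i = 1; i <= n; i++) count[names[i] " " $2]++
    }
    END { for (k in count) if (count[k] > 1) print k }' "$1" | sort
}
```

For example, "duplicate_hosts /etc/ssh/ssh_known_hosts" on the CE prints nothing when every host has at most one key per type.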

If insufficient connections are allowed to the SSH daemon on the CE:

  • Set the MaxStartups parameter in sshd_config on the CE, for example:
MaxStartups 100
  • Restart sshd on the CE.
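
To check which value is currently configured, a small helper such as the following can be used. This is a sketch under assumptions: the function name max_startups is hypothetical, the keyword and its value are assumed to be separated by whitespace, and the first uncommented occurrence is taken as the one in effect.

```shell
#!/bin/sh
# Sketch only: max_startups is a hypothetical helper.

# Print the value of the first uncommented MaxStartups line in an
# sshd_config-style file; print nothing if the keyword is not set.
# Keyword matching is case-insensitive.
# Usage: max_startups CONFIG_FILE
max_startups() {
    awk 'tolower($1) == "maxstartups" { print $2; exit }' "$1"
}
```

For example, "max_startups /etc/ssh/sshd_config" (the usual location, which may differ on your system) prints nothing if the daemon is still running with its built-in default. How sshd is restarted depends on the platform; on a SysV-init system this is typically "service sshd restart".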