Tools/Manuals/TS60
Revision as of 14:34, 25 May 2011 by Aesch
ssh problem from WN to CE
Full message
- The error messages vary and usually do not point directly at an ssh/scp problem; see the other job submission errors. After a job has finished, a Torque/PBS WN has to copy the stdout and stderr of the job wrapper back to the CE, usually with scp. This copy can fail for several reasons. If the failures are intermittent, the SSH daemon on the CE may be configured to allow too few connections.
Diagnosis
- Run "pbsnodes -a" to see the list of WNs, and make sure the host keys of the CE and of all WNs are present in /etc/ssh/ssh_known_hosts on the CE and on the WNs. Also make sure no old keys are left in that file. Check that the WNs are listed in /etc/ssh/shosts.equiv on the CE.
- Look in /var/log/messages or /var/log/secure on the CE for hints.
- From the WN, as a grid account (e.g. a pool user), try ssh or scp to the CE. (Note: this does not work for root, even if everything is set up properly.) If you get a password prompt, the ssh trust relationship between WN and CE is not working.
- On your CE, check if /var/spool/pbs/server_priv/nodes and /etc/ssh/ssh_known_hosts both use fully qualified domain names.
- If a file such as /etc/loginusers is configured in /etc/pam.d/sshd, /etc/pam.d/login, /etc/pam.d/system-auth, etc., check that the pool accounts on the CE and WNs are enabled for interactive login in that file.
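Two of the checks above can be scripted. The following is only a sketch: the function names, the SSH override variable and the host name ce.example.org are made up for illustration and are not part of any grid middleware. It assumes a nodes file whose first field on each line is the host name, and it relies on the standard OpenSSH option BatchMode=yes, which makes ssh fail instead of asking for a password.

```shell
# Print the node names in a Torque nodes file that contain no dot,
# i.e. that are not fully qualified.
# $1: path to the nodes file (first field of each line = host name)
non_fqdn_nodes() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }   # skip blank lines and comments
        $1 !~ /\./ { print $1 }         # no dot: short host name
    ' "$1"
}

# Try a passwordless login to the CE; run this on a WN as a pool user.
# $1: host name of the CE
# The SSH variable is a hypothetical hook to swap in another client
# (e.g. for a dry run); by default plain ssh is used.
check_ssh_trust() {
    if "${SSH:-ssh}" -o BatchMode=yes -o ConnectTimeout=10 "$1" true; then
        echo "OK: passwordless login to $1 works"
    else
        echo "FAIL: no passwordless login to $1"
        return 1
    fi
}
```

On the CE one would call non_fqdn_nodes /var/spool/pbs/server_priv/nodes; on a WN, as a pool user, check_ssh_trust ce.example.org with the real CE name. A FAIL here corresponds to the password prompt described above.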
Solution
- The cause may be duplicate entries for the WNs in the ssh configuration on the CE.
- Remove the shosts.equiv and ssh_known_hosts files from the /etc/ssh directory on the CE and on the WNs.
- Re-run the following scripts on the CE; they are usually also run as cron jobs.
/opt/edg/sbin/edg-pbs-knownhosts
/opt/edg/sbin/edg-pbs-shostsequiv
- Re-run the following script on each WN; it is usually also run as a cron job.
/opt/edg/sbin/edg-pbs-knownhosts
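To confirm that duplicate entries are really the problem before removing the files, something like the following may help. It is a sketch, not part of the edg scripts: the function name is made up, and it assumes plain (unhashed) known_hosts lines of the form "names keytype key" without leading markers. A host name counts as duplicated only if it appears more than once with the same key type, since one line per key type is normal.

```shell
# List "hostname keytype" pairs that occur on more than one line.
# $1: path to a known_hosts-style file, e.g. /etc/ssh/ssh_known_hosts
duplicate_known_hosts() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }       # skip blank lines and comments
        {
            n = split($1, names, ",")      # first field may hold several names
            for (i = 1; i <= n; i++)
                count[names[i] " " $2]++   # count per name and key type
        }
        END {
            for (k in count)
                if (count[k] > 1) print k
        }
    ' "$1" | sort
}
```

Running duplicate_known_hosts /etc/ssh/ssh_known_hosts on the CE prints nothing if every name has at most one key per type. For a single host, ssh-keygen -F with the host name and -f with the file shows the matching lines.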
If the SSH daemon on the CE allows too few connections:
- Set the MaxStartups parameter in sshd_config on the CE, for example:
MaxStartups 100
- Restart sshd on the CE.
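To see which MaxStartups value is currently set in the file, a small helper like the one below can be used. This is a sketch with a made-up function name; it assumes the usual "keyword value" layout separated by whitespace and relies on sshd using the first value it finds for a keyword.

```shell
# Print the first MaxStartups value found in an sshd_config file.
# Prints nothing if the keyword is absent or commented out, in which
# case sshd falls back to its built-in default.
# $1: path to sshd_config, e.g. /etc/ssh/sshd_config
max_startups() {
    awk 'tolower($1) == "maxstartups" { print $2; exit }' "$1"
}
```

For example, max_startups /etc/ssh/sshd_config. After the restart (typically "service sshd restart" on SysV-init systems), "sshd -T" run as root prints the effective configuration on OpenSSH versions that support the -T option, which allows checking that the new value is in use.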