Tools/Manuals/TS60

From EGIWiki







ssh problem from WN to CE

Full message

Various error messages, usually not directly showing an ssh/scp problem; see other job submission errors.

Before a job starts, the batch system needs to copy the job wrapper script and the user proxy to the WN. Torque/PBS, and possibly other batch systems, normally rely on scp for this: the WN copies those files from the CE host before the real job starts.

Similarly, after the job has finished, the stdout and stderr of the job wrapper need to be copied to the CE.

The various scp invocations may fail for several reasons. If the failures are intermittent, then the SSH daemon on the CE may not have been configured to allow sufficient simultaneous connections.

Diagnosis

  • Run "pbsnodes -a" to list the WNs, then make sure the host keys of the CE and of every WN are present in /etc/ssh/ssh_known_hosts on both the CE and the WNs, with no old keys left over. Also check that the WNs are listed in /etc/ssh/shosts.equiv on the CE.
  • Look in /var/log/messages or /var/log/secure on the CE for hints.
  • From the WN, as a grid account (e.g. a pool user), try ssh or scp to the CE. (Note: this does not work for root, even if everything is set up properly.) A password prompt means the ssh trust relationship is not being used.
  • On the CE, check that /var/spool/pbs/server_priv/nodes and /etc/ssh/ssh_known_hosts both use fully qualified domain names.
  • If /etc/pam.d/sshd, /etc/pam.d/login, /etc/pam.d/system-auth, etc. are configured to use /etc/loginusers, check that the pool accounts on the CE and WNs are enabled for interactive login in that file.
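
Some of the checks above can be scripted. The following is a minimal sketch, not an official EGI tool: the function names are hypothetical, and it assumes that the first field of each line in the Torque nodes file is the node name and that the known_hosts file is not hashed. BatchMode makes ssh fail instead of asking for a password, which suits the trust test from the WN.

```shell
#!/bin/sh
# Hypothetical helpers for the diagnosis steps; names are illustrative only.

# list_nodes NODES_FILE
# Print the node name (first field) of every non-empty, non-comment line.
list_nodes() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# missing_keys NODES_FILE KNOWN_HOSTS_FILE
# Print each node that has no entry in the known_hosts file.
# Assumes unhashed entries: the first field is a comma-separated host list.
missing_keys() {
    list_nodes "$1" | while read -r node; do
        if ! awk -v h="$node" '
            { n = split($1, a, ","); for (i = 1; i <= n; i++) if (a[i] == h) found = 1 }
            END { exit (found ? 0 : 1) }' "$2"; then
            echo "$node"
        fi
    done
}

# non_fqdn_nodes NODES_FILE
# Print node names that contain no dot, i.e. are not fully qualified.
non_fqdn_nodes() {
    list_nodes "$1" | grep -v '\.' || true
}

# check_trust CE_HOST
# Run on a WN as a pool account. Returns non-zero if ssh would need a password.
check_trust() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}
```

On the CE one might then run: missing_keys /var/spool/pbs/server_priv/nodes /etc/ssh/ssh_known_hosts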

Solution

A possible cause is duplicate entries for the WNs in the ssh configuration on the CE.

  • Remove shosts.equiv and ssh_known_hosts files from /etc/ssh directory on the CE and WNs.
  • Re-run the following scripts on the CE; they usually also run as cron jobs:
/usr/sbin/edg-pbs-knownhosts
/usr/sbin/edg-pbs-shostsequiv
  • Re-run the following script on the WN; it usually also runs as a cron job:
/usr/sbin/edg-pbs-knownhosts
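
To confirm that duplicates are really present before removing the files, a small check like the one below may help. This is a sketch under assumptions: the function names are hypothetical, the known_hosts file is assumed to be unhashed, and a "duplicate" is taken to mean the same host name appearing more than once with the same key type (for example an old and a new RSA key).

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative only.

# duplicate_host_keys KNOWN_HOSTS_FILE
# Print "host keytype" pairs that occur more than once.
duplicate_host_keys() {
    awk 'NF && $1 !~ /^#/ {
             n = split($1, a, ",")
             for (i = 1; i <= n; i++) print a[i] " " $2
         }' "$1" | sort | uniq -d
}

# duplicate_lines FILE
# Print non-blank lines that occur more than once, e.g. in shosts.equiv.
duplicate_lines() {
    grep -v '^[[:space:]]*$' "$1" | sort | uniq -d
}
```

For example: duplicate_host_keys /etc/ssh/ssh_known_hosts and duplicate_lines /etc/ssh/shosts.equiv. Empty output suggests the problem lies elsewhere.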

If insufficient connections are allowed to the SSH daemon on the CE:

  • Add the MaxStartups parameter to sshd_config on the CE, or raise its value if it is already set:
MaxStartups 100
  • Restart sshd on the CE.
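
Before editing, it can be useful to see whether MaxStartups is already set. The sketch below uses a hypothetical function name and assumes the usual "keyword value" layout separated by whitespace; it does not handle the less common "keyword=value" form. sshd keywords are case-insensitive and the first value found is the one used, so the function stops at the first match.

```shell
#!/bin/sh
# Hypothetical helper; the name is illustrative only.

# max_startups SSHD_CONFIG_FILE
# Print the value of the first uncommented MaxStartups line, or nothing if unset.
max_startups() {
    awk 'tolower($1) == "maxstartups" { print $2; exit }' "$1"
}
```

For example: max_startups /etc/ssh/sshd_config (the path is an assumption; it is the common location). Empty output means the built-in default applies. On a running system, "sshd -T" as root prints the effective configuration and can serve as a cross-check.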