ssh problem from WN to CE

Full message

Various error messages, usually not directly showing an ssh/scp problem; see other job submission errors.

Before a job starts, the batch system needs to copy the job wrapper script and the user proxy to the WN. Torque/PBS, and possibly other batch systems, normally rely on scp so that the WN can copy those files from the CE host before the real job starts.

Similarly, after the job has finished, the stdout and stderr of the job wrapper need to be copied to the CE.

The various scp invocations may fail for several reasons. If the failures are intermittent, then the SSH daemon on the CE may not have been configured to allow sufficient simultaneous connections.

Diagnosis

Run "pbsnodes -a" to list the WNs, and make sure the host keys for the CE and all WNs are present in /etc/ssh/ssh_known_hosts on both the CE and the WNs. Also ensure that no old keys are present. Check that the WNs are listed in /etc/ssh/shosts.equiv on the CE.
Look in /var/log/messages or /var/log/secure on the CE for hints.
From the WN, as a grid account (e.g. a pool user), try ssh or scp to the CE. (Note: this does not work for root, even if everything is set up properly.) If you get a password prompt, the ssh trust relationship is not being used and something is misconfigured.
On your CE, check that /var/spool/pbs/server_priv/nodes and /etc/ssh/ssh_known_hosts both use fully qualified domain names.
If a file such as /etc/loginusers is configured in /etc/pam.d/sshd, /etc/pam.d/login, /etc/pam.d/system-auth, etc., check that the pool accounts on the CE and WN are enabled for interactive login in that file.
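
The file checks above can be partly automated. The following is a minimal sketch, not an official tool: it assumes that each line of the nodes file starts with the node's host name, that ssh_known_hosts uses plain (unhashed) comma-separated host names, and that shosts.equiv lists one host per line. The function names are hypothetical.

```python
def first_fields(text):
    """Return the first whitespace-separated field of each non-comment line."""
    fields = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            fields.append(line.split()[0].lower())
    return fields


def known_host_names(known_hosts_text):
    """Collect the plain-text host names listed in an ssh_known_hosts file."""
    names = set()
    for line in known_hosts_text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0].startswith("@"):  # optional marker before the host list
            fields = fields[1:]
        if not fields or fields[0].startswith("|"):  # hashed entry, not comparable
            continue
        names.update(h.lower() for h in fields[0].split(","))
    return names


def check_nodes(nodes_text, known_hosts_text, equiv_text):
    """Report WNs that are not FQDNs or are missing from either ssh file."""
    nodes = first_fields(nodes_text)
    known = known_host_names(known_hosts_text)
    equiv = set(first_fields(equiv_text))
    return {
        "not_fqdn": [n for n in nodes if "." not in n],
        "missing_known_hosts": [n for n in nodes if n not in known],
        "missing_shosts_equiv": [n for n in nodes if n not in equiv],
    }


if __name__ == "__main__":
    # Made-up sample contents; on a real CE, read the three files instead.
    nodes = "wn01.example.org np=4\nwn02 np=4\n"
    known = "ce.example.org,192.0.2.1 ssh-rsa AAAA1\nwn01.example.org ssh-rsa AAAA2\n"
    equiv = "wn01.example.org\n"
    for problem, hosts in check_nodes(nodes, known, equiv).items():
        print(problem, hosts)
```

On a CE, the three arguments would be the contents of /var/spool/pbs/server_priv/nodes, /etc/ssh/ssh_known_hosts and /etc/ssh/shosts.equiv. Any host reported in one of the lists is a candidate cause of the scp failures.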

Solution

If there are duplicate entries for the WNs in the ssh configuration on the CE:

Remove the shosts.equiv and ssh_known_hosts files from the /etc/ssh directory on the CE and WNs.
Re-run the following scripts on the CE; they are usually also run as cron jobs.

/usr/sbin/edg-pbs-knownhosts
/usr/sbin/edg-pbs-shostsequiv

Re-run the following script on the WN; it is usually also run as a cron job.

/usr/sbin/edg-pbs-knownhosts
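
To confirm that conflicting entries really are the cause before removing the files, one can look for hosts that appear in ssh_known_hosts with more than one key of the same type. This is a minimal sketch under the assumption that the file uses plain (unhashed) host names; the function name is hypothetical and the sample data is made up.

```python
from collections import defaultdict


def conflicting_keys(known_hosts_text):
    """Return {(host, key type): keys} for hosts listed with differing keys."""
    seen = defaultdict(set)
    for line in known_hosts_text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0].startswith("@"):  # optional marker before the host list
            fields = fields[1:]
        if len(fields) < 3 or fields[0].startswith("|"):  # malformed or hashed
            continue
        hosts, key_type, key = fields[0], fields[1], fields[2]
        for host in hosts.split(","):
            seen[(host.lower(), key_type)].add(key)
    # Identical repeated lines are harmless; only differing keys are reported.
    return {pair: keys for pair, keys in seen.items() if len(keys) > 1}


if __name__ == "__main__":
    sample = (
        "wn01.example.org ssh-rsa AAAAold\n"
        "wn01.example.org ssh-rsa AAAAnew\n"
        "wn02.example.org ssh-rsa AAAAsame\n"
    )
    for (host, key_type), keys in conflicting_keys(sample).items():
        print(host, key_type, len(keys), "different keys")
```

A host that shows up here has probably been reinstalled since its old key was recorded, which matches the advice to make sure no old keys are present.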

If insufficient connections are allowed to the SSH daemon on the CE:

Add the MaxStartups parameter to sshd_config on the CE, for example:

MaxStartups 100

Restart sshd on the CE.
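
To see which MaxStartups value a given sshd_config sets, the file can be scanned as in the sketch below. It relies on the sshd_config rule that the first value given for a keyword is the one used, and on keywords being case-insensitive. It is a simplification: Include directives are not followed and parsing stops at the first Match block. The function names are hypothetical.

```python
def configured_max_startups(sshd_config_text):
    """Return the first MaxStartups value in the text, or None if it is not set."""
    for line in sshd_config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # sshd_config accepts "Keyword value" as well as "Keyword=value".
        parts = line.replace("=", " ", 1).split()
        keyword = parts[0].lower()
        if keyword == "match":
            break  # global settings end where the first Match block begins
        if keyword == "maxstartups" and len(parts) > 1:
            return parts[1]
    return None


def start_threshold(value):
    """Return the first number of a 'start[:rate:full]' MaxStartups value."""
    return int(value.split(":")[0])


if __name__ == "__main__":
    # Made-up sample; on a real CE, read the sshd_config file instead.
    sample = "# sshd settings\nPort 22\nMaxStartups 100\n"
    value = configured_max_startups(sample)
    if value is None:
        print("MaxStartups is not set; the sshd built-in default applies")
    else:
        print("MaxStartups", value, "threshold", start_threshold(value))
```

If the function returns None, or a threshold well below the number of job slots that may start or finish at the same moment, the intermittent failures described above are a plausible consequence.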
