The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

Difference between revisions of "HOWTO04 Site Certification Manual tests"

{{Template: Op menubar}} {{Template:Doc_menubar}}
{{TOC_right}}
[[Category:Sites Documentation]]
= Check the functionality of the grid elements  =


Be sure that the site's GIIS URL is contained in the Top level BDII/Information System your NGI will use for your certification. <!--Note that this BDII can also contain uncertified sites. -->
{{DeprecatedAndMovedTo|new_location=https://docs.egi.eu/providers/operations-manuals/howto04_site_certification_manual_tests/}}


<br> Note that the examples here use the Italian NGI and sites. Please substitute '''YOUR OWN''' NGI and site credentials when running the test.
[[Category:Operations_Manuals]]
 
== lcg-CE checks  ==
 
Verify the authentication and authorization on the CE by running a simple command, e.g.:
 
'''''$ globus-job-run inaf-ce-01.ct.pi2s2.it /bin/hostname'''''
 
(You could also use /usr/bin/whoami or any other simple command.)
 
Check if the lcg-CE GridFTP server is working:
 
'''''$ globus-url-copy -dbg -v -vb file:/home/csys/goncalo/teste.txt gsiftp://ce02.lip.pt/tmp/txt'''''
 
'''''$ uberftp ce02.lip.pt'''''
 
<br> If the batch system is PBS, check the worker nodes (WNs) with the following command:
 
'''''$ globus-job-run pbs-enmr.cerm.unifi.it /usr/bin/pbsnodes -a'''''
 
Verify that the batch system works. Make sure that the queue you are querying really exists and that your VO is enabled on it. For example:
 
'''''$ globus-job-run ce02.lip.pt:2119/jobmanager-fork /bin/pwd'''''
 
'''''$ globus-job-run ce-cyb.ca.infn.it/jobmanager-lcglsf -queue poncert /bin/pwd'''''
 
Check the DGAS processes on the CE (for example with ''ps ax | grep dgas'').
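
The commands in this section can be run one after another with a small wrapper that reports whether each one succeeded. The following is only a sketch: <code>run_check</code> is a hypothetical helper name, and it assumes that each command exits with a non-zero status on failure.

```shell
# Hypothetical helper: run one check command and report whether it succeeded.
# Usage: run_check "description" command [args...]
run_check() {
    local description=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK: $description"
    else
        echo "FAIL: $description"
        return 1
    fi
}

# Example calls (not executed here; set CE_HOST to YOUR OWN CE):
# run_check "authentication" globus-job-run "$CE_HOST" /bin/hostname
# run_check "fork jobmanager" globus-job-run "$CE_HOST:2119/jobmanager-fork" /bin/pwd
```

The command output is discarded on purpose; rerun a failing command by hand to see the details.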
 
== Cream-CE checks  ==
 
Open your browser to
 
'''''<nowiki>https://<hostname-of-cream-ce>:8443/ce-cream/services</nowiki>'''''
 
A page with links to the CREAM WSDL should be shown.
 
Try a GridFTP transfer (for example with globus-url-copy or uberftp) towards that CREAM CE, e.g.:
 
'''''$ globus-url-copy gsiftp://&lt;hostname-of-cream-ce&gt;/opt/glite/yaim/etc/versions/ig-yaim file:/tmp/ig-version-test''''' 
 
Try the following command:
 
'''''$ glite-ce-allowed-submission &lt;hostname-of-cream-ce&gt;:8443'''''
 
It should report:
 
Job Submission to this CREAM CE is enabled 
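
When this check is scripted, the expected message can be matched directly. This is a minimal sketch: <code>cream_submission_enabled</code> is a hypothetical helper name, and it assumes the message is printed exactly as quoted above.

```shell
# Hypothetical helper: succeed only if the text on stdin contains the
# message quoted above for a CREAM CE that accepts submissions.
cream_submission_enabled() {
    grep -q "Job Submission to this CREAM CE is enabled"
}

# Example (not executed here; set CREAM_HOST to YOUR OWN CREAM CE):
# glite-ce-allowed-submission "$CREAM_HOST:8443" 2>&1 | cream_submission_enabled
```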
 
Try a submission to Cream-CE using the glite-ce-job-submit command, e.g.:
 
$ /bin/cat sleep.jdl
[
executable="/bin/sleep";
arguments="1";
]
 
'''''$ glite-ce-job-submit -a -r &lt;hostname-of-cream-ce&gt;:8443/&lt;queue&gt; sleep.jdl'''''
 
'''''$ glite-ce-job-submit -a -r ce-cr-02.ts.infn.it:8443/cream-lsf-cert sleep.jdl'''''
<nowiki>https://ce-cr-02.ts.infn.it:8443/CREAM127814374</nowiki>
 
Check the status of that job, which should eventually be DONE-OK:
 
'''''$ glite-ce-job-status <nowiki>https://ce-cr-02.ts.infn.it:8443/CREAM127814374</nowiki>'''''
2010-07-27 11:55:37,986 WARN - No configuration file suitable for loading. Using built-in configuration
******  JobID=<nowiki>[https://ce-cr-02.ts.infn.it:8443/CREAM127814374]</nowiki>
  Status        = [DONE-OK]
  ExitCode      = [0]
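
Since the job does not reach DONE-OK immediately, the status query usually has to be repeated. The sketch below extracts the <code>Status</code> field from output in the format shown above and polls until a final state is reached. The function names, the 30 second interval and the list of final states are assumptions, not part of the gLite tools.

```shell
# Hypothetical helper: print the value inside "Status = [...]" from
# glite-ce-job-status output read on stdin.
cream_job_status() {
    sed -n 's/^[[:space:]]*Status[[:space:]]*=[[:space:]]*\[\(.*\)][[:space:]]*$/\1/p'
}

# Hypothetical polling loop. $1 is the https://... job identifier printed
# by glite-ce-job-submit, $2 the maximum number of attempts (default 20).
wait_for_cream_job() {
    local job_id=$1 attempts=${2:-20} status
    while [ "$attempts" -gt 0 ]; do
        status=$(glite-ce-job-status "$job_id" 2>/dev/null | cream_job_status)
        case $status in
            DONE-OK) echo "$status"; return 0 ;;
            # Assumed set of unsuccessful final states.
            DONE-FAILED|ABORTED|CANCELLED) echo "$status"; return 1 ;;
        esac
        attempts=$((attempts - 1))
        sleep 30
    done
    echo "TIMEOUT"
    return 2
}
```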
 
Try a submission to that CE using the glite-ce-job-submit command, and then try to cancel the job (using the glite-ce-job-cancel command):
 
$ /bin/cat sleep2.jdl
[
executable="/bin/sleep";
arguments="1000";
]
 
$ glite-ce-job-submit -a -r cecream-cyb.ca.infn.it:8443/cream-lsf-poncert sleep2.jdl
<nowiki>https://cecream-cyb.ca.infn.it:8443/CREAM126335182</nowiki>
 
$ glite-ce-job-cancel <nowiki>https://cecream-cyb.ca.infn.it:8443/CREAM126335182</nowiki>
 
$ glite-ce-job-status <nowiki>https://cecream-cyb.ca.infn.it:8443/CREAM126335182</nowiki>
2010-07-27 12:18:26,973 WARN - No configuration file suitable for loading. Using built-in configuration
******  JobID=<nowiki>[https://cecream-cyb.ca.infn.it:8443/CREAM126335182]</nowiki>
  Status        = [CANCELLED]
  ExitCode      = []
  Description  = [Cancelled by user]
 
== ARC CE checks  ==
 
A first test can be done using ARC's <code>ngstat</code> command:
 
'''''$ export X509_USER_PROXY=/etc/nagios/globus/userproxy.pem-ops'''''
'''''$ export LD_LIBRARY_PATH=/opt/nordugrid/lib64:/opt/nordugrid/lib'''''
'''''$ /opt/nordugrid/bin/ngstat -q -l -c &lt;CE hostname&gt; -t 20'''''
...
... plenty of output
...
 
If a [https://tomtools.cern.ch/confluence/display/SAM/SAM+setup+for+ARC+services monitoring host of your NGI] is available, then the probes can easily be executed from there:
 
Check the status of the CE with:
 
'''''$ /usr/libexec/grid-monitoring/probes/org.ndgf/ARCCE-status -H &lt;CE hostname&gt; -x /etc/nagios/globus/userproxy.pem-ops'''''
Status is active
 
Test gsiftp:
 
'''''$ /usr/libexec/grid-monitoring/probes/org.ndgf/ARCCE-auth -H &lt;CE hostname&gt; -x /etc/nagios/globus/userproxy.pem-ops'''''
gsiftp OK
 
Check the version of the CAs:
 
'''''$ /usr/libexec/grid-monitoring/probes/org.ndgf/ARCCE-caver -H &lt;CE hostname&gt; -x /etc/nagios/globus/userproxy.pem-ops'''''
version = 1.38 - All CAs present
 
Check the versions of ARC and Globus:
 
'''''$ /usr/libexec/grid-monitoring/probes/org.ndgf/ARCCE-softver -H &lt;CE hostname&gt; -x /etc/nagios/globus/userproxy.pem-ops'''''
nordugrid-arc-0.8.3.1, globus-5.0.3
 
Copy a file:
 
'''''$ /usr/libexec/grid-monitoring/probes/org.ndgf/ARCCE-gridftp -H &lt;CE hostname&gt; -x /etc/nagios/globus/userproxy.pem-ops'''''
Job finished successfully
 
Submit a test job:
 
'''''$ /usr/libexec/grid-monitoring/probes/org.ndgf/ARCCE-jobsubmit -H &lt;CE hostname&gt; --vo ops -x /etc/nagios/globus/userproxy.pem-ops'''''
Job submission successful
 
Check the LFC:
 
'''''$ /usr/libexec/grid-monitoring/probes/org.ndgf/ARCCE-lfc -H &lt;CE hostname&gt; -x /etc/nagios/globus/userproxy.pem-ops'''''
Job finished successfully
 
Check the SRM:
 
'''''$ /usr/libexec/grid-monitoring/probes/org.ndgf/ARCCE-srm -H &lt;CE hostname&gt; -x /etc/nagios/globus/userproxy.pem-ops'''''
Job finished successfully
 
Before continuing, make sure that the probes for all the services the CE intends to offer actually succeed.
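
The probes listed above can be run in one pass. The wrapper below is only a sketch: the function name is hypothetical, and it assumes the probes follow the usual Nagios convention of returning a non-zero exit status when they do not succeed. Remove from <code>ARC_PROBES</code> the services the CE does not offer.

```shell
# Probe names taken from the commands above.
ARC_PROBES="status auth caver softver gridftp jobsubmit lfc srm"

# Hypothetical wrapper: run every probe against one CE and count failures.
# $1 = CE hostname, $2 = proxy file, $3 = probe directory (optional)
run_arc_probes() {
    local ce_host=$1 proxy=$2
    local probe_dir=${3:-/usr/libexec/grid-monitoring/probes/org.ndgf}
    local failed=0 name extra
    for name in $ARC_PROBES; do
        extra=""
        # As in the command above, ARCCE-jobsubmit also gets "--vo ops".
        [ "$name" = "jobsubmit" ] && extra="--vo ops"
        # $extra is intentionally unquoted so that it splits into two words.
        if "$probe_dir/ARCCE-$name" -H "$ce_host" $extra -x "$proxy" >/dev/null 2>&1; then
            echo "OK: ARCCE-$name"
        else
            echo "FAIL: ARCCE-$name"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Example (not executed here; set CE_HOST to YOUR OWN CE):
# run_arc_probes "$CE_HOST" /etc/nagios/globus/userproxy.pem-ops
```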
 
== SE checks  ==
 
Check if the GridFTP server on the SE works:
 
$ uberftp inaf-se-01.ct.pi2s2.it
 
For a StoRM SE, check if the SRM client works (the published information shows the right port to use):
 
$ /opt/storm/srm-clients/bin/clientSRM ping -e httpg://sunstorm.cnaf.infn.it:8444
============================================================
Sending Ping request to: httpg://sunstorm.cnaf.infn.it:8444
============================================================
Request status:
statusCode="SRM_SUCCESS"(0)
explanation="SRM server successfully contacted"
============================================================
SRM Response:
versionInfo="v2.2"
otherInfo (size=2)
[0] key="backend_type"
[0] value="StoRM"
[1] key="backend_version"
[1] value="&lt;FE:1.5.0-1.sl4&gt;&lt;BE:1.5.3-4.sl4&gt;"
============================================================
 
<br> Try to write to the SE. Be sure your UI points to an information system (IS) that contains the SE (you may use your certification BDII).
 
1) Set a top BDII that publishes the SE you have to test
 
$ export LCG_GFAL_INFOSYS=&lt;TopBDII hostname&gt;:2170
 
2) Copy a file from the local filesystem to the SE, registering it in the LFC. The command output will include a SURL that you can use later for other tests.
 
A SURL is a path of the type: srm://srm01.ncg.ingrid.pt/ibergrid/iber/generated/2011-02-01/file4034a935-8d7a-48f4-914f-16f2634d4802
 
$ lcg-cr -v --vo &lt;VO&gt; -d &lt;Your SE&gt; -l lfn:/grid/&lt;VO&gt;/test.txt file:&lt;/path/to/your/local/file&gt;
 
3) Create a new replica on another SE (to check the third-party transfer between two SEs)
 
$ lcg-rep -v --vo &lt;VO&gt; -d &lt;Other SE&gt; &lt;SURL&gt;
 
4) List Replicas
 
$ lcg-lr -v --vo &lt;VO&gt; lfn:/grid/&lt;VO&gt;/test.txt
 
5) Delete all replicas
 
$ lcg-del -v --vo &lt;VO&gt; -a &lt;guid&gt;
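
Steps 2) to 5) can be chained in a script. The following is a sketch under stated assumptions: the function names are hypothetical, the SURL is taken from the <code>lcg-lr</code> listing, and the final <code>lcg-del</code> is given the LFN instead of the GUID, which assumes that your lcg-utils version accepts an LFN there.

```shell
# Hypothetical helper: print every SURL (srm://...) found in the text on stdin.
extract_surls() {
    grep -o 'srm://[^[:space:]]*'
}

# Hypothetical sequence for steps 2) to 5).
# $1 = VO, $2 = your SE, $3 = other SE, $4 = absolute path of a local file
se_replica_test() {
    local vo=$1 se=$2 other_se=$3 local_file=$4
    local lfn="lfn:/grid/$vo/test.txt" surl
    lcg-cr -v --vo "$vo" -d "$se" -l "$lfn" "file:$local_file" || return 1
    surl=$(lcg-lr --vo "$vo" "$lfn" | extract_surls | head -n 1)
    lcg-rep -v --vo "$vo" -d "$other_se" "$surl" || return 1
    lcg-lr -v --vo "$vo" "$lfn" || return 1
    # Assumption: lcg-del accepts the LFN here as well as the GUID.
    lcg-del -v --vo "$vo" -a "$lfn"
}
```

Remember to export LCG_GFAL_INFOSYS as in step 1) before calling <code>se_replica_test</code>.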
 
== Job submission  ==
 
Submit a test job to either '''lcg-CE''' or '''Cream-CE''' through the '''WMS''', i.e. using the '''glite-wms-job-submit''' command. If the site supports MPI, also submit an MPI test job. The NGI_IT certification WMS is gridit-cert-wms.cnaf.infn.it.
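
A minimal test job for the WMS could look like the sketch below. The JDL content, the file names and the helper name <code>write_test_jdl</code> are illustrative assumptions; the options <code>-a</code>, <code>-o</code> and <code>-i</code> are used here for automatic proxy delegation and for storing and reading the job identifier, so check them against the client version installed on your UI.

```shell
# Hypothetical helper: write a minimal JDL that runs /bin/hostname on the
# worker node and retrieves its standard output and error.
write_test_jdl() {
    cat > "$1" <<'EOF'
Executable = "/bin/hostname";
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = {"std.out", "std.err"};
EOF
}

# Example (not executed here):
# write_test_jdl hostname.jdl
# glite-wms-job-submit -a -o jobid.txt hostname.jdl
# glite-wms-job-status -i jobid.txt
```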
 
== Registration into 1st level HLR  ==
 
'''NOTE: this step is needed if your infrastructure uses DGAS as accounting system'''
 
After the site has entered production, its resources need to be registered in the HLR. Ask the site admins to open a ticket to the HLR administrators, passing them the following information:
 
*grid queue names, in the form:
**gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
 
*non-grid queue names, in the form:
**hostname:queue
 
*Name, surname and certificate subject of each site admin
*Certificate subject of the Computing Element
 
Finally, the site admins have to open a ticket to the DGAS support unit asking to enable the forwarding of accounting data from the 2nd level HLR to APEL.
 
== Certification Job  ==
 
The [[Cert Job|test job]] checks several things, like the environment on the WN and the installed rpms. Moreover, it performs some replica management tests. With a "grep TEST" you can get a summary of the results. In case of errors, check in detail what has gone wrong.
 
As already said, if the site supports any flavour of MPI, launch an MPI test job, like [[SiteCertMan/MPI Job Cert|this]].
 
Don't forget to set a reasonable value in ''CPUNumber'': what matters most is that your job starts running quickly.
 
If you want less output in the .out and .err files, comment out the following line in the file mpi-start-wrapper.sh:
 
export I2G_MPI_START_DEBUG=1
 
A successful output will look like the following (extract):
 
[...]
mpi-start [DEBUG  ]: using user supplied startup&nbsp;: '/opt/mpich-1.2.7p1/bin/mpirun '
mpi-start [DEBUG  ]: =&gt; MPI_SPECIFIC_PARAMS=
mpi-start [DEBUG  ]: =&gt; I2G_MPI_PRECOMMAND=
mpi-start [DEBUG  ]: =&gt; MPIEXEC=/opt/mpich-1.2.7p1/bin/mpirun
mpi-start [DEBUG  ]: =&gt; I2G_MACHINEFILE_AND_NP=-machinefile /tmp/tmp.iBypc12521 -np 6
mpi-start [DEBUG  ]: =&gt; I2G_MPI_APPLICATION=/home/dteam022/globus-tmp.t3-wn-13.11955.0/https_3a_2f_2falbalonga.cnaf.infn.it_3a9000_2fI06uWaKi1evxL3tTF-DTOg/hello
mpi-start [DEBUG  ]: =&gt; I2G_MPI_APPLICATION_ARGS=
mpi-start [DEBUG  ]: /opt/mpich-1.2.7p1/bin/mpirun -machinefile /tmp/tmp.iBypc12521 -np 6 /home/dteam022/globus-tmp.t3-wn-13.11955.0/https_3a_2f_2falbalonga.cnaf.infn.it_3a9000_2fI06uWaKi1evxL3tTF-DTOg/hello
Process 4 on t3-wn-37.pn.pd.infn.it out of 6
Process 3 on t3-wn-34.pn.pd.infn.it out of 6
Process 1 on t3-wn-13.pn.pd.infn.it out of 6
Process 2 on t3-wn-34.pn.pd.infn.it out of 6
Process 5 on t3-wn-37.pn.pd.infn.it out of 6
Process 0 on t3-wn-13.pn.pd.infn.it out of 6
[...]
 
== Globus checks  ==
 
'''These checks should be executed depending on the services registered in GOCDB under a Resource Centre. Not all services are compulsory for a RC, but upon registration of new ones, the corresponding tests should be executed.'''<br><br>
 
=== GSISSH  ===
 
Initialize a grid proxy and check if GSISSH works:<br>
<pre>$ grid-proxy-init
$ gsissh USER@HOST -p 2222 /bin/date
(Debug with: gsissh USER@HOST -vvv -p 2222 /bin/date)
</pre>
=== GridFTP  ===
 
Check if upload works:<br>
<pre>$ globus-url-copy file:/tmp/test.txt gsiftp://HOST:2811/tmp/test.txt
(Debug with: globus-url-copy -dbg -v -vb file:/tmp/test.txt gsiftp://HOST:2811/tmp/test.txt)
</pre>
<br>Check if download works:
<pre>$ globus-url-copy gsiftp://HOST:2811/tmp/test.txt file:/tmp/test.txt
(Debug with: globus-url-copy -dbg -v -vb gsiftp://HOST:2811/tmp/test.txt file:/tmp/test.txt)
</pre>
<br>Delete the remote file:
<pre>$ uberftp HOST 'rm /tmp/test.txt'
(Debug with: uberftp HOST 'rm /tmp/test.txt' -debug 3)
</pre>
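
The upload and download checks can be combined into a round trip that also verifies the file content. This is only a sketch: the function name, the remote file name and the <code>GUC</code> variable (which lets you substitute another copy command) are assumptions. The remote file is not removed; delete it with <code>uberftp</code> as shown above.

```shell
# Copy command; override GUC to use a different client or a wrapper.
GUC=${GUC:-globus-url-copy}

# Hypothetical round-trip check: upload a file, download it again and
# compare the two copies.
# $1 = GridFTP host, $2 = absolute path of a local test file
gridftp_roundtrip() {
    local host=$1 src=$2 back rc
    local remote="gsiftp://$host:2811/tmp/roundtrip-test.$$"
    back=$(mktemp) || return 1
    if "$GUC" "file:$src" "$remote" &&
       "$GUC" "$remote" "file:$back" &&
       cmp -s "$src" "$back"; then
        rc=0
    else
        rc=1
    fi
    rm -f "$back"
    return "$rc"
}

# Example (not executed here; set HOST to YOUR OWN GridFTP server):
# gridftp_roundtrip "$HOST" /tmp/test.txt && echo "GridFTP round trip OK"
```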
=== GRAM  ===
 
Check authentication:
<pre>$ globusrun -a -r HOST:2119
</pre>
<br>Check job submission:
<pre>$ globusrun -s -r HOST:2119 '&amp;(executable="/bin/date")'
</pre>
== Unicore checks  ==
 
This testing manual assumes that the test instance has not been added to the “Global” registry. The “Global” registry does not have to be global (for the whole infrastructure): it is a registry used by a group of sites which work together. For example, each Resource Infrastructure Provider can have its own “Global” registry.<br><br>It is suggested to add the instance to the “Global” registry only after it has been tested and works properly. For this reason, these instructions refer to the local registry.
 
=== Preliminary testing  ===
 
After installation and configuration, start all the services and check that they are functioning properly. To avoid errors and warnings in the logs, first start the TSI and the Gateway, and then Unicore/X (which requires the other two servers to operate).<br>The first verification step is to check that the log files of all services are properly configured and that the services are running. The logs for Unicore/X and the Gateway are in the standard locations ''/var/log/unicore/unicorex/unicorex.log'' and ''/var/log/unicore/gateway/gateway.log''. If there is no log file, check ''/var/log/unicore/unicorex/unicorex-startup.log'' or ''/var/log/unicore/gateway/gateway-startup.log''. These files contain the servers' standard output and can be useful for generic, system-wide issues such as a missing Java virtual machine.<br>Check the log files carefully for warnings and errors. They should show only information about the start of the service, without any warnings (the WARN label) or errors (the ERROR label).<br>In case of problems, proceed according to the information found in the log files. If it is unclear, increase the logging detail (for Unicore/X and the Gateway). This is set in the files ''/etc/unicore/gateway/logging.properties'' and ''/etc/unicore/unicorex/logging.properties''. UNICORE uses the log4j logging subsystem. When you change the logging parameters, you do not need to restart the component.<br>After all services have initialized successfully, you can begin to test them in practice. Connect to the site with any UNICORE client (URC or UCC). Since the registration of the newly created VSite in the global registry was initially turned off, you should use the local registry.<br>The local registry address is: ''https://GATEWAY_ADDRESS/VSITE_NAME/services/Registry?res=default_registry.''<br>For testing, it is recommended to execute a script that displays the user name. This should be the user associated with the certificate.<br>
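
The log inspection described above can be scripted. The sketch below prints every line carrying the WARN or ERROR label and fails if any is found, or if a log file is missing. The function name is hypothetical, and matching the labels as whole words is an assumption about the log4j layout in use.

```shell
# Hypothetical helper: print WARN and ERROR lines from the given log files.
# Succeeds only if every file is readable and contains no such line.
check_unicore_logs() {
    local log found=0
    for log in "$@"; do
        if [ ! -r "$log" ]; then
            echo "MISSING: $log"
            found=1
        elif grep -Ew 'WARN|ERROR' "$log"; then
            found=1
        fi
    done
    [ "$found" -eq 0 ]
}

# Example (not executed here):
# check_unicore_logs /var/log/unicore/unicorex/unicorex.log /var/log/unicore/gateway/gateway.log
```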
 
=== Testing using the URC  ===
 
#Testing should start with setting up the user's credentials.
#A local registry should be added in the URC Grid Browser view.
#The registry contents should be listed by double-clicking on its node. It is worth enabling the display of all sites by clicking the "Show" button in the Grid Browser and selecting "All services" from the list. If you see a red cross on a service, click on it and check the details of the error message in the URC and the error on the server side.
#If all services are available, you can submit a job. At the same time, it is recommended to monitor the Unicore/X and TSI logs for errors.<br>
 
=== Testing using the UCC  ===
 
#Configure the UCC credentials.
#Configure the registry in the UCC preferences file (the registry property).
#Invoke:
<pre> ./ucc shell
  ucc&gt; connect
  You can access 1 target system(s).
  ucc&gt; list-sites
  VSITE_NAME https://GATEWAY_ADDRESS/VSITE_NAME/services/TargetSystemService?res=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
  ucc&gt; list-storages
  SHARE https://GATEWAY_ADDRESS/VSITE_NAME/services/services/StorageManagement?res=default_storage
  ucc&gt; ls u6://SHARE
  /a5063ea0-ecbe-4097-9abc-f55ec9437376
  /3f501d37-5851-4c9e-a1da-5ad7b9f16633
  /3bf169c2-2149-4564-b827-0b6560a3dd35
  ...
  ucc&gt; list-applications
  Applications on target system &lt;VSITE_NAME&gt;
  R 2.10.0
  BLAST 2.2.22
  POVRay 3.6.1
  ...
</pre>
You should get output similar to the above.<br>Then test the file transfer:<br>
<pre>  ucc&gt; put-file -s LOCAL_FILE_PATH -t https://GATEWAY_ADDRESS/VSITE_NAME/services/StorageManagement?res=default_storage#TARGET_FILE_NAME
</pre>
and job submission:<br>
<pre>  ucc&gt; run -s VSITE_NAME JOB_FILE_PATH.u
  SUCCESSFUL exit code: 0
</pre>
If an error occurs, you can add the "-v" flag to any of these commands to increase UCC verbosity. As in the URC case, it is advisable to monitor the Unicore/X and TSI log files at the same time.<br>
 
=== After testing  ===
 
<br>If testing was successful, you can enable the registration of the VSite in the global registry.<br>
 
<br>
 
Back to Site Certification GIIS Check [https://wiki.egi.eu/wiki/Operations/HOWTO03 HOWTO03]
 
Back to Resource Centre registration and certification procedure [https://wiki.egi.eu/wiki/PROC09#Resource_Centre_certification PROC09]  
 
= Revision history  =
 
{| cellspacing="0" cellpadding="5" border="1" align="center"
|-
! Version
! Authors
! Date
! Comments
|-
| 1.0
| Alessandro Paolini
| 2010-12-15
| first draft
|-
| 1.1
| Alessandro Paolini
| 2010-12-16
| added links to certification job pages
|-
| 1.2
| Alessandro Paolini
| 2011-06-08
| added some other lcg-utils test
|-
| 1.3
| Malgorzata Krakowian
| 2012-10-15
| added Globus and Unicore check instructions
|}
 
<br>

Latest revision as of 13:51, 25 August 2021