The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "QosCosGrid Platform"

From EGIWiki
 
(45 intermediate revisions by 3 users not shown)
{{Tech menubar}} {{EGIPlatforms submenu}} {{TOC_right}}
{{Tech menubar}}
[[Category:Technology]]


{{PlatformIntegrator|name=PSNC}} {{TechnologyProvider|name=PSNC}} {{MoreInfo|location=http://www.qoscosgrid.org/}}  
{{EGIPlatforms submenu}}  


The QosCosGrid (QCG) platform provides resource reservation, co-allocation and management capabilities, giving users HPC-like performance and scalability. Connecting many local computing resources, QosCosGrid provides advanced monitoring and job execution capabilities for distributed and parallel C, C++, Fortran and Java applications.
{{Template:Deprecated}}
{{TOC_right}}


= The QosCosGrid platform =


== Overview ==
{{PlatformIntegrator|name=PSNC}}
{{TechnologyProvider|name=PSNC}}
{{MoreInfo|location=http://www.qoscosgrid.org/}}


[[Image:QCGPlatformDiagram.png|thumb|right|250px|QCG Platform diagram]]  
[http://www.egi.eu/community/collaborations/PSNC.html Public overview]


The QCG platform comprises four key middleware services and several community-facing services that act as the user's main entry points to the QCG platform. General integration with the EGI Core Infrastructure platform is provided for Security, Monitoring, Accounting and Service endpoint discovery.
The QosCosGrid (QCG) platform provides advance resource reservation, co-allocation and management capabilities, giving users HPC-like performance and scalability. Connecting many local computing resources, QosCosGrid provides advanced monitoring and job execution capabilities for distributed and parallel C, C++, Fortran and Java applications.


'''The QCG-Computing''' system forms the low-level block of the QCG platform, providing the main job execution capabilities. It is typically deployed fronting a compute cluster that is managed through an LRMS. QCG supports PBS, PBSPro, SLURM, LL, LSF, (S)GE and Torque out of the box. QCG-Computing is directly integrated with the EGI Core Infrastructure Platform for security purposes: it accepts user authentication from EGI's X.509v3 PKI system and authorises user access to computing resources based on generated grid-mapfiles.
= Platform overview =


The '''QCG-Notification''' system provides asynchronous notification of job progress to any subscribed notification consumer. The notification system is particularly useful (and required) for deployments that support cross-cluster compute jobs. The system supports direct end-user notification, provided the user is properly subscribed as a notification consumer (out of scope for this documentation); however, the typical use case of the QCG-Notification system is to support workflow engines and cross-cluster coordinating services (see below) in tracking the progress of individual tasks. The main ''Notification Producer'' is the QCG-Computing system, which exposes a diverse set of notification topics on compute jobs, while the main ''Notification Consumer'' is the QCG-Broker service. While the services usually exchange SOAP messages, the QCG-Notification service supports other delivery protocols, namely e-mail and Jabber (XMPP). Moreover, the QCG-Notification service is involved in scenarios where users receive notifications about changes in job output (e.g. a new minimal energy computed).
[[Image:QCGPlatform.png|thumb|right|250px|The QCG Platform]] 


The '''QCG-Broker''' is responsible for finding and consigning compute jobs to resources that are exposed through QCG-Computing systems, as per the requirements of users or higher-level tools. It does so by monitoring the state of connected QCG-Computing services and then directly submitting the job to the QCG-Computing system that best matches the requirements. Moreover, the QCG-Broker service is capable of co-allocating resources of multiple sites, thus enabling cross-cluster parallel jobs as well as multi-scale cross-cluster compute jobs.
The QCG platform comprises three key middleware services, two infrastructure integration systems, and several community-facing services that act as the user's main entry points to the QCG platform. General integration with the EGI Core Infrastructure platform is provided for Security, Monitoring, Accounting and Service endpoint discovery.


The QCG platform also includes several Research Community services and tools, represented by its most prominent member, the '''QCG-ScienceGateways''' system. These services typically provide portal services to the consuming end-user communities, but also include a mobile client (QCG-Mobile). Another application gaining popularity is QCG-Icon, a lightweight desktop application for Windows, Mac OS X and Linux platforms that aims to provide transparent access to applications installed on remote clusters.
==== QCG-Computing ====
The QCG-Computing system forms the low-level block of the QCG platform, providing the main job execution capabilities. It is typically deployed fronting a compute cluster that is managed through an LRMS. QCG supports PBS, PBSPro, SLURM, LL, LSF, (S)GE and Torque out of the box. QCG-Computing is directly integrated with the EGI Core Infrastructure Platform for security purposes: it accepts user authentication from EGI's X.509v3 PKI system and authorises user access to computing resources based on generated grid-mapfiles. QCG-Computing supports parallel compute jobs and multi-scale jobs out of the box, provided that a suitable parallel and/or multi-scale toolkit is installed in the cluster.
; Capabilities
: Job Execution, Advance Reservation <br> <u>Optional</u>: Parallel Job (ProActive, OpenMPI), Multi-scale Job (MUSCLE)


The '''QCG-Accounting''' system is not a user-facing service. It queries the QCG-Computing system for accounting information and feeds it as EMI-CAR into the EGI Core Infrastructure accounting system.
==== QCG-Notification ====
The QCG-Notification system provides asynchronous notification of job progress to any subscribed notification consumer. The system supports direct end-user notification, provided she is properly subscribed as a notification consumer (out of scope for this documentation); however the typical use case of the QCG-Notification system is to support workflow engines and cross-cluster coordinating services (see below) in tracking the progress of individual tasks. The default underlying message transport protocol is WS messages over SOAP; QCG-Notification also supports E-Mail (SMTP) and Jabber (XMPP) notification delivery. Though the main notification producer is the QCG-Compute service, QCG-Notification accepts any notification producer that implements the WS-Notification family of standards. The QCG-Notification system is a mandatory component of the QosCosGrid cross-platform computing capability.
; Capabilities
: Notification <br><u>Optional</u>: Cross-Cluster computing (partial)


The QCG Platform integrates with the NAGIOS monitoring system by providing '''Nagios monitoring plugins''' (not shown). Although allowing individual independent NAGIOS instances, EGI deployments of the QCG platform will integrate with the EGI SAM system by including the QCG Nagios plugins in EGI SAM for NGI-wide deployment.
==== QCG-Broker ====
Being the main notification consumer of the QCG-Notification system, the QCG-Broker is responsible for finding and consigning compute jobs to resources that are exposed through QCG-Computing systems, as per the requirements of users or higher-level tools. It does so by monitoring the state of connected QCG-Computing services and then directly submitting the job to the QCG-Computing system that best matches the requirements. Moreover, the QCG-Broker service is capable of co-allocating resources of multiple sites (through advance resource reservation provided by QCG-Computing), enabling cross-cluster computing. If combined with cluster programming toolkits for parallel jobs and multi-scale jobs, QCG-Broker provides cross-cluster parallel computing and cross-cluster multi-scale computing.
; Capabilities
: Schedule Job, Co-allocation<br><u>Optional</u>: Cross-cluster computing (partial)
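
The brokering behaviour described above can be illustrated with a minimal Python sketch. It is not the QCG-Broker implementation: the <code>ClusterState</code> record, the "best fit" selection rule and the greedy co-allocation strategy are all assumptions chosen for illustration, and the real broker additionally relies on advance reservations in QCG-Computing.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of one QCG-Computing endpoint as seen by a broker."""
    name: str
    free_cores: int


def pick_cluster(clusters, cores_needed):
    """Return the cluster that fits the request most tightly, or None.

    Assumption: "best match" means the smallest cluster that still has
    enough free cores (a best-fit rule).
    """
    fitting = [c for c in clusters if c.free_cores >= cores_needed]
    return min(fitting, key=lambda c: c.free_cores, default=None)


def co_allocate(clusters, cores_needed):
    """Spread a request over several clusters (greedy, largest first).

    Returns a mapping of cluster name to the number of cores taken there,
    or None when the combined free capacity is insufficient.
    """
    plan = {}
    remaining = cores_needed
    for cluster in sorted(clusters, key=lambda c: c.free_cores, reverse=True):
        if remaining <= 0:
            break
        take = min(cluster.free_cores, remaining)
        if take > 0:
            plan[cluster.name] = take
            remaining -= take
    return plan if remaining <= 0 else None
```

With three clusters offering 64, 16 and 128 free cores, a 32-core job fits on a single site, while a 150-core job can only be served by co-allocating cores on two sites.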
 
==== QCG Accounting ====
The QCG-Accounting system is not a user facing system. It queries the QCG-Computing system for accounting information and feeds this information to target accounting systems using plugins. Currently, plugins exist for PL-Grid accounting system (BAT), GridSafe, and APEL SSM v0.2 (soon to be replaced by EMI-CAR).
; Capabilities
: Accounting
 
==== QCG Monitoring ====
The QCG Platform integrates with the NAGIOS monitoring system by providing Nagios monitoring plugins (not shown). Although allowing individual independent NAGIOS instances, EGI deployments of the QCG platform will integrate with the EGI SAM system by including the QCG Nagios plugins in EGI SAM for NGI-wide deployment.
; Capabilities
: Monitoring
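
As an illustration of what a Nagios-style probe for a QCG service endpoint might look like, the following Python sketch follows the general Nagios plugin convention (a one-line status message plus an exit code of 0, 1, 2 or 3 for OK, WARNING, CRITICAL and UNKNOWN). It is not one of the actual QCG Nagios plugins; the TCP connect check, the thresholds and the function names are assumptions.

```python
import socket
import time

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def probe(host, port, timeout=10.0):
    """Return the TCP connect time in seconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def classify(latency, warn=1.0, crit=5.0):
    """Map a measured latency to a Nagios status (thresholds are assumptions)."""
    if latency is None or latency >= crit:
        return CRITICAL
    if latency >= warn:
        return WARNING
    return OK


def format_status(service, status, latency):
    """Build the single status line that Nagios displays for the check."""
    detail = "no connection" if latency is None else f"connect time {latency:.3f}s"
    return f"{service} {LABELS[status]} - {detail}"
```

A complete plugin would call <code>probe</code> for the configured endpoint, print the line returned by <code>format_status</code>, and exit with the status code from <code>classify</code>.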
 
==== QCG-ScienceGateways (et al.) ====
The QCG platform also includes several Research Community services and tools, represented by its most prominent member, the QCG-ScienceGateways system. These services typically provide portal services to the consuming end-user communities, but also include a mobile client (QCG-Mobile). Another application gaining popularity is QCG-Icon, a lightweight desktop application for Windows, Mac OS X and Linux platforms that aims to provide transparent access to applications installed on remote clusters.
 
= Integration with the EGI Core Infrastructure platform =
 
[[Image:QCG-EGI-Integration.png|thumb|right|300px|Integrating QCG with the EGI Core Infrastructure]]
 
The QCG platform seamlessly integrates with the EGI Core Infrastructure platform as follows:
 
; Authentication & Authorisation
: QCG-Computing authorises users by using gridmap files, a technique that is commonly used by Grid communities. User DNs are mapped to local cluster accounts, after the presented user certificate chain is cryptographically validated against the EGI Trust Anchor collection (not shown here).
 
; Accounting
:QCG Computing integrates with EGI Accounting by using a plugin for APEL/SSM. The plugin takes the accounting records generated by the QCG Accounting agent, formats them in the APEL STOMP format, and drops them as a file in the APEL/SSM service's "outgoing" directory.
 
; Monitoring
:The QCG platform includes Nagios plugins for all three key QCG platform services; these probes are bundled and invoked with the EGI SAM service. Additional Nagios installations are not needed.
 
; Information Discovery
:QCG Grid services must be registered in EGI's [https://goc.egi.eu GOC DB]. Three service types are available for registration: ''QCG.Computing'', ''QCG.Notification'', and ''QCG.Broker''. The EGI SAM framework will query the GOC DB for QCG services and invoke the Nagios plugins accordingly.<br>QCG currently is not publishing dynamic information into the BDII system. EGI and PSNC are currently exploring the feasibility of this integration.
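
The gridmap-based authorisation step mentioned under ''Authentication & Authorisation'' can be sketched in Python as follows. This is a simplified illustration, not QCG-Computing code: it assumes the common grid-mapfile layout of one quoted DN followed by one local account per line, uses made-up DNs and account names, and leaves out certificate chain validation entirely.

```python
import shlex

# Hypothetical grid-mapfile content; DNs and accounts are invented examples.
SAMPLE_GRIDMAP = '''
# quoted subject DN, then the local cluster account
"/C=PL/O=GRID/O=PSNC/CN=Jane Example" jexample
"/C=UK/O=eScience/CN=John Sample" jsample
'''


def parse_gridmap(text):
    """Parse grid-mapfile text into a {DN: local_account} dictionary.

    Blank lines, comment lines and lines that do not consist of exactly
    a DN and an account are skipped (a simplification).
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = shlex.split(line)  # honours the double quotes around the DN
        if len(parts) != 2:
            continue
        dn, account = parts
        mapping[dn] = account
    return mapping


def authorise(mapping, dn):
    """Return the local account mapped to a DN, or None if not authorised."""
    return mapping.get(dn)
```

A DN that does not appear in the mapping yields <code>None</code>, which corresponds to access being refused.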
 
= Technical Architecture =
 
This section describes the architecture of the QosCosGrid platform in more detail. Whereas the previous section gives an overview of the key subsystems and the capabilities the QCG platform offers, this section describes the fundamental architecture of the platform, how it integrates with the EGI Core Infrastructure, and the deployment scenarios, i.e. what needs to be deployed in order to offer certain capabilities of the platform.


== QCG-Computing ==


[[Image:QCGComputing.png|thumb|right|350px|Principal QCG Computing architecture]]  
[[Image:QCGComputing.png|thumb|right|350px|The components of the QCG-Computing subsystem]]  


The functionality of the QCG Computing system is provided by six components; their deployment depends on the needed functionality (see below).  
The QCG-Computing system provides the main computing capabilities available with the QosCosGrid platform. The ''Computing'' component implements most of the compute functionality; it is supported by several internal components. ''Gridmapfile'' provides user authorization based on commonly used grid-map files and is directly integrated with EGI's X.509-based user authentication infrastructure. The ''core:Core'' component is shared with the QCG-Notification system and provides shared packages and libraries. ''core:DEP'' and ''core:curl'' are compatibility packages that were bundled by PSNC to provide more recent versions or missing libraries compared to Scientific Linux 5 baseline.


The middleware components are typically deployed together into a QCG Computing service. '''QCG Compute''', '''Gridmapfile''' and '''LRMS/DRMAA''' provide the essential Compute Job and Parallel Job capabilities: QCG Computing is capable of processing parallel jobs out of the box through OpenMPI. ''QCG Compute'' provides the core job processing and management functionality. ''Gridmapfile'' is used to interface to the EGI Core Infrastructure platform by means of X.509v3 PKI and gridmap files. The ''LRMS/DRMAA'' component provides the integration with Local Resource Management Systems (LRMS) deployed by the Resource Provider. QCG Compute supports the following LRMSs:
QCG-Computing integrates with a broad set of Local Resource Management Systems (LRMS) through its ''LRMS'' component abstracting away LRMS-specifics using a publicly standardised interface. Implementations of this interface exist for PBS, PBSPro, SLURM, LL, LSF, (S)GE and Torque.
# PBS
# PBSPro
# SLURM
# LL
# LSF
# (S)GE
# TORQUE


=== QCG Computing extensions ===
QCG-Computing supports both simple compute jobs and parallel jobs out of the box - provided that a suitable parallel programming toolkit (ProActive/OpenMPI) is installed in the cluster. QCG-Computing also supports multi-scale jobs through the MUSCLE library for multi-scale jobs that do not have heterogeneous requirements.


[[Image:QCGComputingExtension.png|thumb|right|300px|Advanced QCG Computing capabilities]]
==== Interfaces & Standards ====


A speciality of the QCG Computing platform is its capability of coordinating parallel jobs that span across multiple computing clusters, and even across multiple Resource Providers. Multiscale computing refers to deploying a complex parallel compute job across several compute resources (including HPC, HTC and clusters) that are correlated and parallel. Particularly the [http://mapper-project.eu MAPPER project] makes heavy use of this capability.  
; WS-Notification 1.3
: QCG-Computing uses the [https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsn WS-Notification 1.3] family of standards to implement the role of a ''Notification Producer'' by using the ''ws-n:RegisterPublisher'' interface to register itself, and then the ''ws-n:NotificationBroker'' interface to send notification events to subscribed consumers.


QCG provides these by using a common coordinator component, the '''QCG Coordinator''', which must be deployed in "Grid space", i.e. outside Resource Provider firewalls, either as a truly public service or located in the DMZ of a Resource Provider. This Coordinator then communicates directly with the corresponding libraries deployed on the cluster worker nodes. These in turn communicate with the LRMS deployed on the cluster head node.
; OGF OGSA Basic Execution Service 1.0
: QCG-Computing uses the OGF OGSA BES 1.0 specification to expose its computing services to the Grid.  


Thus, the QCG extension consists of the ''QCG Coordinator'' plus the appropriate cluster worker node libraries as described below.
; OGF JSDL 1.0 & JSDL HPC 1.0 extension
: As mandated by the OGF BES 1.0 specification, QCG-Computing accepts compute job descriptions in JSDL 1.0. As mandated by the OGF OGSA-HPC Basic Profile, QCG-Computing also accepts the JSDL HPC 1.0 extension for job descriptions.


==== Cross-cluster parallel Jobs ====
; OGF OGSA-HPC Basic Profile 1.0
: QCG-Compute implements the [http://ogf.org/documents/GFD.114.pdf OGF OGSA-HPC BP 1.0], which defines a profile across the following specifications by incorporation: WS-I Basic Profile 1.1, OGF OGSA-BES 1.0, JSDL 1.0.


QCG is able to span parallel jobs across multiple QCG-Computing managed compute clusters, even if these are operated by different Resource Providers. This capability is available for applications written in '''C''', '''C++''', '''Fortran''' and '''Java'''.  
; OGF DRMAA 1.0
:QCG Compute uses the [http://ogf.org/documents/GFD.133.pdf OGF DRMAA 1.0] specification to integrate with various different LRMSs. QCG Compute acts as a client to the DRMAA-compliant implementations and is implemented against the DRMAA service interface. A number of DRMAA 1.0 integrations for LRMSs are available, either from PSNC directly (e.g. for IBM LoadLeveler, LL), or bundled and sourced from elsewhere. The corresponding DRMAA plugin needs to be installed and configured in the QCG Computing configuration file. An overview of available DRMAA implementations for LRMSs is available at the [http://www.drmaa.org/implementations.php DRMAA implementations] part of the DRMAA WG web site.


;'''Parallel Java'''
==== Deployment scenarios ====
:Parallel Java is provided by deploying the ''ProActive'' library on the cluster worker nodes, complementing a QCG Coordinator deployment.
;'''Parallel C'''<br/>
:Parallel C allows the execution of parallel applications written in ''C'', ''C++'' or ''Fortran''. This is done by deploying a ''patched'' OpenMPI library on the compute cluster. This patched version is fully compatible with the standard OpenMPI library, but adds the cross-cluster feature for QCG. More information is available [http://www.qoscosgrid.org/trac/qcg-openmpi here].


==== Multiscale compute jobs ====
This section provides an overview of the deployment requirements and dependencies in order to provide a specific capability.


Similar to cross-cluster parallel jobs, QCG allows multiscale compute jobs to be consigned and coordinated. This feature is deployed in a similar way: the QCG-Coordinator is either deployed, or an existing instance is reused. All that is left to deploy is the '''MUSCLE''' library on the cluster worker nodes, in the same way as the QCG-OMPI and ProActive libraries. This capability supports multi-scale workloads as defined by the [http://www.mapper-project.eu/ Mapper project] and the [http://www.complex-automata.org/ COAST project].
{| cellspacing="0" cellpadding="5" style="border:1px solid black;"
|- style="background-color:lightgray;"
! style="border-bottom:1px solid black; text-align:left;" | QCG-Computing
! style="border-bottom:1px solid black; text-align:left;" | ProActive
! style="border-bottom:1px solid black; text-align:left;" | OpenMPI
! style="border-bottom:1px solid black; text-align:left;" | MUSCLE
! style="border-bottom:1px solid black; text-align:left;" | Capability
|-
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" | Fronting a compute cluster
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" |
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" |
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" |
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" | Job Execution, Advance Reservation
|-
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" | Fronting a compute cluster
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" | On the cluster worker nodes
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" |
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" |
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" | Job Execution, Advance Reservation, Parallel Job (Java)
|-
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" | Fronting a compute cluster
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" |
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" | On the cluster worker nodes
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" |
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" | Job Execution, Advance Reservation, Parallel Job (C, C++, Fortran)
|-
| style="text-align:left; vertical-align:top;" | Fronting a compute cluster
| style="text-align:left; vertical-align:top;" |
| style="text-align:left; vertical-align:top;" |
| style="text-align:left; vertical-align:top;" | On the cluster worker nodes
| style="text-align:left; vertical-align:top;" | Job Execution, Advance Reservation, Multi-scale job
|}
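
The deployment table above can also be read as a simple mapping from installed worker-node toolkits to offered capabilities. The Python sketch below encodes that reading; the function name is hypothetical, and treating several toolkits installed side by side as additive is an assumption, since the table only lists one toolkit per row.

```python
def computing_capabilities(proactive=False, openmpi=False, muscle=False):
    """Return the capabilities of a QCG-Computing deployment.

    Each flag states whether the toolkit is installed on the cluster
    worker nodes. Job Execution and Advance Reservation are always
    available when QCG-Computing fronts a compute cluster.
    """
    capabilities = ["Job Execution", "Advance Reservation"]
    if proactive:
        capabilities.append("Parallel Job (Java)")
    if openmpi:
        capabilities.append("Parallel Job (C, C++, Fortran)")
    if muscle:
        capabilities.append("Multi-scale job")
    return capabilities
```

For example, <code>computing_capabilities(openmpi=True)</code> corresponds to the third row of the table.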


=== Interfaces & Standards ===
== QCG-Notification ==


[[Image:QCGComputingInterfaces.png|thumb|right|300px|QCG Computing: Standards based interfaces]]
[[Image:QCGNotification.png|thumb|right|350px|An overview of the QCG-Notification system]]  


QCG Computing employs a number of standards at the interfaces to its external and other modular components. These standards were developed by two well-known standardisation bodies, '''OASIS''' and '''OGF''' (the WS-I organisation is now integrated into OASIS).
The QCG-Notification system is a generic implementation of the WS-Notification family of standards (Base Notification, Brokered Notification, Topics). QCG-Notification supports all mandatory and optional elements of WS-Notification specifications, particularly topics, subscriptions, and pull points. Hence it can be integrated with many other WS-Notification compliant systems, even though QCG-Notification extends the WS-Notification standards with some management and discovery operations. However, the main use case for QCG-Notification is to provide asynchronous status notification for cross-cluster job executions managed by the QCG-Broker system.
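
The brokered notification model with topics, subscriptions and pull points can be sketched in a few lines of Python. This is an in-memory illustration of the roles only, not the QCG-Notification API and not a WS-Notification implementation: there is no SOAP messaging here, and the class, method and topic names are assumptions.

```python
from collections import defaultdict, deque


class NotificationBroker:
    """Toy broker: registered publishers send messages on topics,
    and every consumer subscribed to that topic receives them."""

    def __init__(self):
        self._publishers = set()
        self._subscriptions = defaultdict(list)

    def register_publisher(self, name):
        """Admit a notification source (the publisher role)."""
        self._publishers.add(name)

    def subscribe(self, topic, consumer):
        """Subscribe a callable consumer(topic, message) to a topic."""
        self._subscriptions[topic].append(consumer)

    def notify(self, publisher, topic, message):
        """Dispatch a message; return the number of consumers reached."""
        if publisher not in self._publishers:
            raise PermissionError(f"{publisher} is not a registered publisher")
        consumers = self._subscriptions.get(topic, [])
        for consumer in consumers:
            consumer(topic, message)
        return len(consumers)


class PullPoint:
    """Consumer that queues messages until a client fetches them."""

    def __init__(self):
        self._queue = deque()

    def __call__(self, topic, message):
        self._queue.append((topic, message))

    def get_messages(self, maximum=None):
        """Remove and return up to `maximum` queued messages (all if None)."""
        count = len(self._queue) if maximum is None else min(maximum, len(self._queue))
        return [self._queue.popleft() for _ in range(count)]
```

A push-style consumer is simply a callable, while a <code>PullPoint</code> stores messages for consumers that cannot be contacted directly, for example because they sit behind a firewall.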


==== OGF DRMAA integration with LRMS ====
The bulk of the functionality is provided by the ''Notification'' component, offloading logfile management to the ''logrotate'' component. QCG-Notification shares common functionality with QCG-Computing through the ''core:Core'' and ''core:DEP'' components. The ''Notification consumer'' component provides the bulk communication with notification consumers, providing the various different transport implementations (E-Mail, XMPP, etc.).


QCG Compute uses the [http://ogf.org/documents/GFD.133.pdf OGF DRMAA 1.0] specification to integrate with various different LRMSs. QCG Compute acts as a client to the DRMAA-compliant implementations and is implemented against the DRMAA service interface.
The QCG-Notification [http://www.qoscosgrid.org/trac/qcg-notification/wiki documentation] provides extensive information about supported use cases, roles, deployments and installation & configuration.


A number of DRMAA 1.0 integrations for LRMSs are available, either from PSNC directly (e.g. for IBM LoadLeveler, LL), or bundled and sourced from elsewhere. The corresponding DRMAA plugin needs to be installed and configured in the QCG Computing configuration file.
==== Interfaces & Standards ====


An overview of available DRMAA implementations for LRMSs is available at the [http://www.drmaa.org/implementations.php DRMAA implementations] part of the DRMAA WG web site.
; WS-Notification 1.3
: QCG-Notification is a generic implementation of the [https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsn WS-Notification 1.3] family of standards. The individual standardised interfaces are not listed here; QCG-Notification implements all mandatory and optional parts of the specifications. The exceptions are three key interfaces: ''ws-n:RegisterProducer'' is used to allow any type of notification producer to register itself with QCG-Notification, which in turn uses the ''ws-n:NotificationBroker'' interface to dispatch notifications to ''ws-n:NotificationConsumer'' instances.


==== OGF HPC BP integration for Compute job submissions ====
; WS-Resource Framework 1.2
: Since WS-Notification makes use of the [https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsrf WS-Resource Framework 1.2], QCG-Notification implements WS-Resource Framework 1.2, too.


QCG Compute implements the [http://ogf.org/documents/GFD.114.pdf OGF HPC BP] (High Performance Computing Basic Profile), which includes by reference three standard specifications and one extension: '''WS-I Basic Profile 1.1''', '''JSDL 1.0''', '''JSDL HPC 1.0''', and '''OGSA BES'''. Together, these standards ensure common Web Service interoperability (WS-I Basic Profile, by OASIS) and a common (HPC) Compute service interface (OGSA-BES, by OGF) that accepts job descriptions expressed in JSDL and its HPC extension (both OGF).
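
To give an impression of the job descriptions such a service accepts, the following Python sketch builds a minimal JSDL document that uses the HPC Profile Application extension. It is an illustration under assumptions rather than a QCG example: the helper name and job values are invented, the namespace URIs should be checked against the OGF JSDL 1.0 and HPC Profile Application Extension documents, and a real submission wraps the document in an OGSA-BES request.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as published for JSDL 1.0 and its HPC Profile
# Application extension (verify against the OGF specifications).
JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
HPCPA_NS = "http://schemas.ggf.org/jsdl/2006/07/jsdl-hpcpa"


def build_jsdl(job_name, executable, arguments=(), output=None):
    """Return a minimal JSDL job description as an XML string."""
    ET.register_namespace("jsdl", JSDL_NS)
    ET.register_namespace("jsdl-hpcpa", HPCPA_NS)

    job = ET.Element(f"{{{JSDL_NS}}}JobDefinition")
    description = ET.SubElement(job, f"{{{JSDL_NS}}}JobDescription")

    identification = ET.SubElement(description, f"{{{JSDL_NS}}}JobIdentification")
    ET.SubElement(identification, f"{{{JSDL_NS}}}JobName").text = job_name

    application = ET.SubElement(description, f"{{{JSDL_NS}}}Application")
    hpc = ET.SubElement(application, f"{{{HPCPA_NS}}}HPCProfileApplication")
    ET.SubElement(hpc, f"{{{HPCPA_NS}}}Executable").text = executable
    for argument in arguments:
        ET.SubElement(hpc, f"{{{HPCPA_NS}}}Argument").text = argument
    if output is not None:
        ET.SubElement(hpc, f"{{{HPCPA_NS}}}Output").text = output

    return ET.tostring(job, encoding="unicode")
```

Calling <code>build_jsdl("demo", "/bin/echo", ["hello"], output="stdout.txt")</code> yields a document with one ''JobDefinition'' root element that can be inspected or validated before submission.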
==== Deployment scenarios ====


The OGSA-BES standard allows the integration of notification capabilities, and standardises the use of either WS-Notification or WS-Eventing as a standardised notification interface. QCG Computing integrates with the QCG-Notification system by assuming the WS-Notification::Publisher role (see below).
Due to its nature, there are very limited deployment scenarios for QCG-Notification. It can be used stand-alone, providing a straight-forward notification framework, or it can be deployed together with QCG-Computing, providing notification on Job Execution related topics.  


== QCG Notification ==
However, QCG-Notification is '''required''' to provide cross-cluster capabilities (see below).


The QCG Notification system is a reference implementation of the WS-Notification family of standards (Base Notification, Brokered Notification, Topics).
{| cellspacing="0" cellpadding="5" style="border:1px solid black;"
|- style="background-color:lightgray;"
! style="border-bottom:1px solid black; text-align:left;" | QCG-Notification
! style="border-bottom:1px solid black; text-align:left;" | QCG-Compute
! style="border-bottom:1px solid black; text-align:left;" | Capability
|-
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" | Directly on resource (or within VM)
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" |
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" | Notification
|-
| style="text-align:left; vertical-align:top;" | Directly on resource (or within VM)
| style="text-align:left; vertical-align:top;" | Any [[QosCosGrid_Platform#Deployment_scenarios|QCG-Compute deployment scenario]] above
| style="text-align:left; vertical-align:top;" | Notification on QCG-Compute events
|}


It can be integrated with many other WS-Notification compliant systems, even though QCG Notification extends the WS-Notification standards with some management and discovery operations. QCG Notification supports all mandatory and optional elements of WS-Notification specifications, particularly topics, subscriptions, and pull points.
== QCG-Broker ==


The QCG Notification [http://www.qoscosgrid.org/trac/qcg-notification/wiki documentation] provides extensive information about supported use cases, roles, deployments and installation & configuration.
[[Image:QCGBroker.png|thumb|right|350px|A QCG-Broker system overview]]  


=== Standards & interfaces ===
The QCG-Broker is a system that links and coordinates, on a programmatic level, clusters that would otherwise be completely unrelated. In a simple deployment, it can schedule jobs to individual QCG-Computing-managed compute clusters; when connected to several QCG-Computing-managed clusters, the broker provides co-allocation of local resources.


QCG Notification supports all standardised interfaces that are defined by the WS-Notification family of specifications. Detailed information is available in the specification [https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsn documents] at OASIS. In short, the following WS-Notification interfaces and roles are implemented:
Most of the features are provided by the ''Broker'' component, implementing the necessary interfaces in order to interoperate with ''QCG-Computing'' and ''QCG-Notification''.
# BaseNotification
## NotificationConsumer
## <strike>NotificationProducer</strike> (sources of notifications are required to assume the role of a publisher in the brokered notification model)
## PullPoint
## CreatePullPoint
## SubscriptionManager
## PausableSubscriptionManager
# BrokeredNotification
## NotificationBroker
## RegisterPublisher
## PublisherRegistrationManager
# Topics
## TopicNamespace
## TopicType
## TopicSet
## TopicExpression


The WS-Notification specifications make use of the WS-Resource family of specifications. Therefore QCG Notification implements the WS-Resource set of standards and interfaces.
The optional ''Coordinator'' component, when deployed, is capable of linking compute jobs across clusters, provided that the complex job pattern was submitted through the QCG-Broker (ensuring co-allocated advance reservation of resources). In this context it is important to stress that standard OpenMPI is <u>not</u> suitable for this type of cross-cluster parallel job. PSNC provides a binary-compatible version of OpenMPI, QCG-OMPI, which provides the necessary extensions for cross-cluster parallel jobs written in C, C++ or Fortran.


== QCG Broker ==
==== Interfaces & standards ====


The QCG Broker is a system providing advanced reservation and cross-cluster job submission capabilities. It interacts with QCG Computing systems via the OGSA HPC BP interface.
; WS-Notification 1.3
: QCG-Broker uses a subset of the [https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsn WS-Notification 1.3] family of standards, as a notification consumer.


== QCG Accounting ==
; WS-Resource Framework 1.2
: Through WS-Notification, QCG-Broker also implements the WS-Resource Framework.


The QCG Accounting system is an independent service; it is usually deployed in close proximity of the QCG Computing system, since it queries the computing system's job database and parses the LRMS log files for accounting information.
; OGF OGSA HPC Basic Profile 1.0
: As a consumer of QCG-Computing services, the QCG-Broker also implements the OGF OGSA HPC Basic Profile 1.0, including the standards it incorporates (WS-I Basic Profile 1.1, OGF OGSA BES 1.0, OGF JSDL 1.0, OGF JSDL HPC 1.0 extension).


A number of plugins for accounting infrastructures exist; these are capable of translating internal accounting information into the required output format, as well as contacting the corresponding accounting endpoint.
==== Deployment scenarios ====
 
= Integration with the EGI Core Infrastructure platform =


[[Image:QCGEGICoreInfraIntegration.png|thumb|right|300px|QCG integration with the Core Infrastructure platform]]
QCG-Broker can be freely deployed and combined with the other QosCosGrid platform components (some restrictions apply, see below); however the most common deployment is to include the Coordinator component enabling cross-cluster computing. That is, when combining it with parallel job toolkits, the QCG platform will provide cross-cluster parallel job capabilities:


The QCG platform integrates seamlessly with the EGI Core Infrastructure platform.
{| cellspacing="0" cellpadding="5" style="border:1px solid black;"
|- style="background-color:lightgray;"
! style="border-bottom:1px solid black; text-align:left;" | QCG-Broker
! style="border-bottom:1px solid black; text-align:left;" | QCG-Broker Coordinator
! style="border-bottom:1px solid black; text-align:left;" | Capability
|-
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" | Directly on a resource In Grid-space on a free resource (or in a VM)
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" |
| style="border-bottom:1px solid black; text-align:left; vertical-align:top;" | Schedule Job, Co-allocation
|-
| style="text-align:left; vertical-align:top;" | Directly on a resource In Grid-space on a free resource (or in a VM)
| style="text-align:left; vertical-align:top;" | Together with the QCG-Broker
| style="text-align:left; vertical-align:top;" | Schedule Job, Co-allocation, cross-cluster job
|}


== Accounting ==
'''Important note:''' To enable cross-cluster parallel computing for C, C++ or Fortran, one <u>must</u> deploy ''QCG-OMPI'' (QCG OpenMPI) instead of a vanilla OpenMPI library.


The QCG Accounting service features an APEL SSM plugin, so that it can store accounting records in UR format for the APEL SSM to transfer to the EGI APEL database. This requires that the APEL SSM upload directory is accessible by the QCG Accounting service either directly (i.e. deployed on the same server) or by means of mounting remote directories via e.g. pNFS.
== QCG accounting ==


== Monitoring ==
[[Image:QCGAccounting.png|thumb|right|200px|The QCG Accounting agent]]


QCG does not provide its own monitoring system. Instead, NAGIOS plugins are provided for the three customer-facing QCG services Computing, Notification and Broker. The deployment and configuration of these plugins is intentionally left to the system administrator.
The QCG accounting agent is an infrastructure integration service, and not available for direct end-user consumption. It is usually deployed in close proximity to the QCG-Computing system since it queries the computing system's job database and parses the LRMS log files for accounting information.


For EGI, this requires the SAM technology provider to regularly pull the Nagios plugins and integrate these into the SAM system that is deployed by each Resource Provider, who then configures the plugins according to local QCG deployments.
A number of plugins for accounting infrastructures exist; these are capable of translating internal accounting information into the required output format, as well as contacting the corresponding accounting endpoint. For EGI, a plugin integrating QCG accounting with APEL is included, dropping accounting records stored in an output file in the "outgoing" directory of the APEL/SSM service, which must be installed alongside QCG Accounting on the same resource.


== Information Discovery ==
== QCG monitoring ==


QCG system deployments must be registered in EGI's [https://goc.egi.eu GOC DB]. For this, the following three service types are available in GOC DB:
[[Image:QCGMonitoring.png|thumb|right|200px|The QCG Accounting agent]]  


* QCG.Computing
The QCG platform also provides three Nagios plugins that enable any Nagios deployment to monitor a QCG platform deployment. In the case of EGI, Grid service monitoring is provided by the EGI SAM framework, which includes a Nagios instance and all necessary monitoring plugins bundled with it. Thus the QCG Nagios plugins will be shipped with the EGI SAM system, not with QCG platform components.
* QCG.Notification
* QCG.Broker

Latest revision as of 18:56, 9 December 2014


This article is deprecated and should no longer be used, but remains available for reference.




Platform Integrator: PSNC

Technology Provider: PSNC

More information: http://www.qoscosgrid.org/


Public overview

The QosCosGrid (QCG) platform provides advance resource reservation, co-allocation and management capabilities, giving users HPC-like performance and scalability. Connecting many local computing resources, QosCosGrid provides advanced monitoring and job execution capabilities for distributed and parallel C, C++, Fortran and Java applications.

Platform overview

The QCG Platform

 

The QCG platform comprises three key middleware services, two infrastructure integration systems, and several community-facing services that act as the user's main entry points to the QCG platform. General integration with the EGI Core Infrastructure platform is provided for Security, Monitoring, Accounting and Service endpoint discovery.

QCG-Computing

The QCG-Computing system forms the low-level building block of the QCG platform, providing the main job execution capabilities. It is typically deployed fronting a compute cluster that is managed through an LRMS. QCG supports PBS, PBSPro, SLURM, LL, LSF, (S)GE and Torque out of the box. QCG-Computing is directly integrated with the EGI Core Infrastructure Platform for Security purposes: it accepts user authentication from EGI's X.509v3 PKI system, and authorises user access to computing resources based on generated grid-mapfiles. QCG-Computing supports parallel compute jobs and multi-scale jobs out of the box, provided that a suitable parallel and/or multi-scale toolkit is installed in the cluster.

Capabilities
Job Execution, Advance Reservation
Optional: Parallel Job (ProActive, OpenMPI), Multi-scale Job (MUSCLE)

QCG-Notification

The QCG-Notification system provides asynchronous notification of job progress to any subscribed notification consumer. The system supports direct end-user notification, provided the user is properly subscribed as a notification consumer (out of scope for this documentation); however, the typical use case of the QCG-Notification system is to support workflow engines and cross-cluster coordinating services (see below) in tracking the progress of individual tasks. The default underlying message transport protocol is WS messages over SOAP; QCG-Notification also supports E-Mail (SMTP) and Jabber (XMPP) notification delivery. Though the main notification producer is the QCG-Computing service, QCG-Notification accepts any notification producer that implements the WS-Notification family of standards. The QCG-Notification system is a mandatory component of the QosCosGrid cross-platform computing capability.

Capabilities
Notification
Optional: Cross-Cluster computing (partial)

QCG-Broker

Being the main notification consumer of the QCG-Notification system, the QCG-Broker is responsible for finding and consigning compute jobs to resources that are exposed through QCG-Computing systems as per requirements of users or higher-level tools. It does so by monitoring the state of connected QCG-Computing services and then directly submitting the job to the QCG-Computing system that best matches the requirements. Moreover, the QCG-Broker service is capable of co-allocating resources of multiple sites (through advance resource reservation provided by QCG-Computing), enabling cross-cluster computing. If combined with cluster programming toolkits for parallel jobs and multi-scale jobs, QCG-Broker provides cross-cluster parallel computing and cross-cluster multi-scale computing.

Capabilities
Schedule Job, Co-allocation
Optional: Cross-cluster computing (partial)

QCG Accounting

The QCG-Accounting system is not a user facing system. It queries the QCG-Computing system for accounting information and feeds this information to target accounting systems using plugins. Currently, plugins exist for PL-Grid accounting system (BAT), GridSafe, and APEL SSM v0.2 (soon to be replaced by EMI-CAR).

Capabilities
Accounting

QCG Monitoring

The QCG Platform integrates with the NAGIOS monitoring system by providing Nagios monitoring plugins (not shown). Although allowing individual independent NAGIOS instances, EGI deployments of the QCG platform will integrate with the EGI SAM system by including the QCG Nagios plugins in EGI SAM for NGI-wide deployment.

Capabilities
Monitoring

QCG-ScienceGateways (et al.)

The QCG platform also includes several Research Community services and tools, represented by its most prominent member, the QCG-ScienceGateways system. These services typically provide portal services to the consuming end-user communities, but also include a mobile client (QCG-Mobile). Another application gaining popularity is QCG-Icon, a lightweight desktop application for Windows, Mac OS X and Linux platforms that aims to provide transparent access to applications installed on remote clusters.

Integration with the EGI Core Infrastructure platform

Integrating QCG with the EGI Core Infrastructure

The QCG platform seamlessly integrates with the EGI Core Infrastructure platform as follows:

Authentication & Authorisation
QCG-Computing authorises users by using gridmap files, a technique that is commonly used by Grid communities. User DNs are mapped to local cluster accounts, after the presented user certificate chain is cryptographically validated against the EGI Trust Anchor collection (not shown here).
Accounting
QCG Computing integrates with EGI Accounting by using a plugin for APEL/SSM. The plugin takes the accounting records generated by the QCG Accounting agent, formats them in the APEL STOMP format, and drops them as a file in the APEL/SSM service's "outgoing" directory.
Monitoring
The QCG platform includes Nagios plugins for all three key QCG platform services; these probes are bundled and invoked with the EGI SAM service. Additional Nagios installations are not needed.
Information Discovery
QCG Grid services must be registered in EGI's GOC DB. Three service types are available for registration: QCG.Computing, QCG.Notification, and QCG.Broker. The EGI SAM framework will query the GOC DB for QCG services and invoke the Nagios plugins accordingly.
QCG currently is not publishing dynamic information into the BDII system. EGI and PSNC are currently exploring the feasibility of this integration.
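
To make the Authentication & Authorisation item above more concrete, the following is a minimal sketch of how a grid-mapfile lookup could work. It assumes the commonly used Globus-style format (a quoted subject DN followed by a local account name); the function names `parse_gridmap` and `authorise`, the example DNs and the account names are hypothetical, and the sketch does not reproduce QCG-Computing's actual implementation or its certificate chain validation.

```python
import shlex


def parse_gridmap(text):
    """Return a dict mapping certificate subject DNs to local account names."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        # shlex honours the double quotes around the DN, which contains spaces.
        parts = shlex.split(line)
        if len(parts) != 2:
            continue  # skip malformed entries
        dn, accounts = parts
        # An entry may list several comma-separated accounts; use the first.
        mapping[dn] = accounts.split(",")[0]
    return mapping


def authorise(mapping, dn):
    """Return the local account for a DN, or None if the DN is not mapped."""
    return mapping.get(dn)


# Hypothetical grid-mapfile content, for illustration only.
EXAMPLE = '''
# subject DN                          local account
"/C=PL/O=GRID/O=PSNC/CN=Jane Doe" jdoe
"/C=PL/O=GRID/O=PSNC/CN=John Roe" plgroe,plgother
'''

if __name__ == "__main__":
    users = parse_gridmap(EXAMPLE)
    print(authorise(users, "/C=PL/O=GRID/O=PSNC/CN=Jane Doe"))
```

In this sketch a DN that is absent from the file is simply denied (`None`); how unmapped users are treated in a real deployment is a site policy decision.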

Technical Architecture

This section describes the architecture of the QosCosGrid platform in more detail. Whereas the previous section gives an overview of the key subsystems and the capabilities offered by the QCG platform, this section describes the fundamental architecture of the platform, how it integrates with the EGI Core Infrastructure, and the deployment scenarios that capture what needs to be deployed in order to offer certain capabilities.

QCG-Computing

The components of the QCG-Computing subsystem

The QCG-Computing system provides the main computing capabilities available with the QosCosGrid platform. The Computing component implements most of the compute functionality; it is supported by several internal components. Gridmapfile provides user authorization based on commonly used grid-map files and is directly integrated with EGI's X.509-based user authentication infrastructure. The core:Core component is shared with the QCG-Notification system and provides shared packages and libraries. core:DEP and core:curl are compatibility packages that were bundled by PSNC to provide more recent versions or missing libraries compared to Scientific Linux 5 baseline.

QCG-Computing integrates with a broad set of Local Resource Management Systems (LRMS) through its LRMS component abstracting away LRMS-specifics using a publicly standardised interface. Implementations of this interface exist for PBS, PBSPro, SLURM, LL, LSF, (S)GE and Torque.

QCG-Computing supports both simple compute jobs and parallel jobs out of the box - provided that a suitable parallel programming toolkit (ProActive/OpenMPI) is installed in the cluster. QCG-Computing also supports multi-scale jobs through the MUSCLE library for multi-scale jobs that do not have heterogeneous requirements.

Interfaces & Standards

WS-Notification 1.3
QCG-Computing uses the WS-Notification 1.3 family of standards to implement the role of a Notification Producer by using the ws-n:RegisterPublisher interface to register itself, and then the ws-n:NotificationBroker interface to send notification events to subscribed consumers.
OGF OGSA Basic Execution Service 1.0
QCG-Computing uses the OGF OGSA BES 1.0 specification to expose its computing services to the Grid.
OGF JSDL 1.0 & JSDL HPC 1.0 extension
As mandated by the OGF BES 1.0 specification, QCG-Computing accepts compute job descriptions in JSDL 1.0. As mandated by the OGF OGSA-HPC Basic Profile, QCG-Computing also accepts the JSDL HPC 1.0 extension for job descriptions.
OGF OGSA-HPC Basic Profile 1.0
QCG-Computing implements the OGF OGSA-HPC BP 1.0, which defines a profile across the following specifications by incorporation: WS-I Basic Profile 1.1, OGF OGSA-BES 1.0, JSDL 1.0.
OGF DRMAA 1.0
QCG-Computing uses OGF DRMAA 1.0 to integrate with various LRMSs. QCG-Computing acts as a client to the DRMAA-compliant implementations and is implemented against the DRMAA service interface. A number of DRMAA 1.0 integrations for LRMSs are available, either provided by PSNC directly (e.g. for IBM LoadLeveler, LL), or bundled and sourced from elsewhere. The corresponding DRMAA plugin needs to be installed and configured in the QCG-Computing configuration file. An overview of available DRMAA implementations for LRMSs is available at the DRMAA implementations part of the DRMAA WG web site.
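
As an illustration of the JSDL 1.0 item above, the sketch below builds a minimal job description with Python's standard library. It uses the JSDL 1.0 core namespace and the POSIX application extension defined in the same specification; the helper name `build_jsdl`, the job name and the executable are hypothetical, and the JSDL HPC 1.0 extension mentioned above is not shown. Whether QCG-Computing requires additional elements is not stated in this document.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OGF JSDL 1.0 specification.
JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"


def build_jsdl(job_name, executable, arguments, cpu_count):
    """Return a minimal JSDL 1.0 job definition as an XML string."""
    ET.register_namespace("jsdl", JSDL_NS)
    ET.register_namespace("jsdl-posix", POSIX_NS)

    definition = ET.Element(f"{{{JSDL_NS}}}JobDefinition")
    description = ET.SubElement(definition, f"{{{JSDL_NS}}}JobDescription")

    # Human-readable identification of the job.
    identification = ET.SubElement(description, f"{{{JSDL_NS}}}JobIdentification")
    ET.SubElement(identification, f"{{{JSDL_NS}}}JobName").text = job_name

    # What to run: a POSIX executable with its arguments.
    application = ET.SubElement(description, f"{{{JSDL_NS}}}Application")
    posix = ET.SubElement(application, f"{{{POSIX_NS}}}POSIXApplication")
    ET.SubElement(posix, f"{{{POSIX_NS}}}Executable").text = executable
    for argument in arguments:
        ET.SubElement(posix, f"{{{POSIX_NS}}}Argument").text = argument

    # Resource requirements: an exact total CPU count.
    resources = ET.SubElement(description, f"{{{JSDL_NS}}}Resources")
    total_cpus = ET.SubElement(resources, f"{{{JSDL_NS}}}TotalCPUCount")
    ET.SubElement(total_cpus, f"{{{JSDL_NS}}}Exact").text = str(cpu_count)

    return ET.tostring(definition, encoding="unicode")


if __name__ == "__main__":
    print(build_jsdl("demo-job", "/bin/echo", ["hello"], 4))
```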

Deployment scenarios

This section provides an overview of the deployment requirements and dependencies in order to provide a specific capability.

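{| cellspacing="0" cellpadding="5" style="border:1px solid black;"
|- style="background-color:lightgray;"
! style="text-align:left;" | QCG-Computing
! style="text-align:left;" | ProActive
! style="text-align:left;" | OpenMPI
! style="text-align:left;" | MUSCLE
! style="text-align:left;" | Capability
|-
| Fronting a compute cluster
|
|
|
| Job Execution, Advance Reservation
|-
| Fronting a compute cluster
| On the cluster worker nodes
|
|
| Job Execution, Advance Reservation, Parallel Job (Java)
|-
| Fronting a compute cluster
|
| On the cluster worker nodes
|
| Job Execution, Advance Reservation, Parallel Job (C, C++, Fortran)
|-
| Fronting a compute cluster
|
|
| On the cluster worker nodes
| Job Execution, Advance Reservation, Multi-scale job
|}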

QCG-Notification

An overview of the QCG-Notification system

The QCG-Notification system is a generic implementation of the WS-Notification family of standards (Base Notification, Brokered Notification, Topics). QCG-Notification supports all mandatory and optional elements of WS-Notification specifications, particularly topics, subscriptions, and pull points. Hence it can be integrated with many other WS-Notification compliant systems, even though QCG-Notification extends the WS-Notification standards with some management and discovery operations. However, the main use case for QCG-Notification is to provide asynchronous status notification for cross-cluster job executions managed by the QCG-Broker system.

The bulk of the functionality is provided by the Notification component, offloading logfile management to the logrotate component. QCG-Notification shares common functionality with QCG-Computing through the core:Core and core:DEP components. The Notification consumer component provides the bulk communication with notification consumers, providing the various different transport implementations (E-Mail, XMPP, etc.).
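
The brokered, topic-based delivery described above can be pictured with a small in-memory sketch. It is not a WS-Notification implementation: there is no SOAP transport, no pull points, and the class and method names (`NotificationBroker`, `register_publisher`, `subscribe`, `notify`) are hypothetical stand-ins for the standardised interfaces. Requiring producers to register before publishing is a simplification made for this example.

```python
from collections import defaultdict


class NotificationBroker:
    """Toy broker: producers publish on topics, consumers subscribe to topics."""

    def __init__(self):
        self._publishers = set()
        self._subscriptions = defaultdict(list)  # topic -> list of callbacks

    def register_publisher(self, name):
        """Record a producer that is allowed to publish through this broker."""
        self._publishers.add(name)

    def subscribe(self, topic, consumer):
        """Subscribe a consumer callback(topic, message) to a topic."""
        self._subscriptions[topic].append(consumer)

    def notify(self, publisher, topic, message):
        """Deliver a message to every consumer of the topic; return the count."""
        if publisher not in self._publishers:
            raise ValueError(f"unregistered publisher: {publisher}")
        consumers = self._subscriptions.get(topic, [])
        for consumer in consumers:
            consumer(topic, message)
        return len(consumers)


if __name__ == "__main__":
    broker = NotificationBroker()
    broker.register_publisher("computing-1")  # hypothetical producer name
    broker.subscribe("job/state", lambda topic, msg: print(topic, msg))
    broker.notify("computing-1", "job/state", {"job": "42", "state": "RUNNING"})
```

In the platform itself the producer would typically be a QCG-Computing service and the consumer the QCG-Broker; the separate transports (E-Mail, XMPP) correspond to different ways of implementing the consumer callback.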

The QCG-Notification documentation provides extensive information about supported use cases, roles, deployments and installation & configuration.

Interfaces & Standards

WS-Notification 1.3
QCG-Notification is a generic implementation of the WS-Notification 1.3 family of standards. The individual standardised interfaces are not listed here; QCG-Notification implements all mandatory and optional parts of the specifications. The exceptions are three key interfaces: ws-n:RegisterProducer is used to allow any type of notification producer to register itself with QCG-Notification, which in turn uses the ws-n:NotificationBroker interface to dispatch notifications to ws-n:NotificationConsumer instances.
WS-Resource Framework 1.2
Since WS-Notification makes use of the WS-Resource Framework 1.2, QCG-Notification implements WS-Resource Framework 1.2, too.

Deployment scenarios

Due to its nature, QCG-Notification has very few deployment scenarios. It can be used stand-alone, providing a straightforward notification framework, or it can be deployed together with QCG-Computing, providing notification on Job Execution related topics.

However, QCG-Notification is required to provide cross-cluster capabilities (see below).

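{| cellspacing="0" cellpadding="5" style="border:1px solid black;"
|- style="background-color:lightgray;"
! style="text-align:left;" | QCG-Notification
! style="text-align:left;" | QCG-Compute
! style="text-align:left;" | Capability
|-
| Directly on resource (or within VM)
|
| Notification
|-
| Directly on resource (or within VM)
| Any QCG-Compute deployment scenario above
| Notification on QCG-Compute events
|}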

QCG-Broker

A QCG-Broker system overview

The QCG-Broker is a system that links and coordinates, on a programmatic level, clusters that would otherwise be completely unrelated. In a simple deployment, it can schedule jobs to individual QCG-Computing-managed compute clusters. When connected to several QCG-Computing-managed clusters, the broker provides co-allocation of local resources.

Most of the features are provided by the Broker component, implementing the necessary interfaces in order to interoperate with QCG-Computing and QCG-Notification.

The optional Coordinator component, when deployed, is capable of linking compute jobs across clusters, provided that the complex job pattern was submitted through the QCG-Broker (ensuring co-allocated advance reservation of resources). In this context it is important to stress that OpenMPI is not suitable for this type of cross-cluster parallel job. PSNC provides a binary-compatible version of OpenMPI, QCG-OMPI, which provides the necessary extensions for cross-cluster parallel jobs written in C, C++ or Fortran.
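
The idea of co-allocation through advance reservation can be sketched as a search for the earliest time slot that is free on every participating cluster. The data model below (free windows as integer `(start, end)` pairs per cluster) and the function name `earliest_common_start` are assumptions made for illustration; the scheduling algorithm actually used by QCG-Broker is not described in this document.

```python
def earliest_common_start(free_windows, duration):
    """Return the earliest start time at which every cluster is free for
    `duration` time units, or None if no such time exists.

    free_windows maps a cluster name to a list of (start, end) free windows.
    """
    # The earliest feasible common start always coincides with the start
    # of some free window, so only those times need to be checked.
    candidates = sorted(
        {start for windows in free_windows.values() for start, _ in windows}
    )
    for t in candidates:
        fits_everywhere = all(
            any(start <= t and t + duration <= end for start, end in windows)
            for windows in free_windows.values()
        )
        if fits_everywhere:
            return t
    return None


if __name__ == "__main__":
    # Hypothetical free windows reported by two clusters.
    windows = {
        "cluster-a": [(0, 4), (6, 12)],
        "cluster-b": [(2, 5), (7, 15)],
    }
    print(earliest_common_start(windows, 3))
```

Once a common slot is found, a broker would still have to place a reservation on each cluster and handle the case where one of them fails; that step is omitted here.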

Interfaces & standards

WS-Notification 1.3
QCG-Broker uses a subset of the WS-Notification 1.3 family of standards, as a notification consumer.
WS-Resource Framework 1.2
Through WS-Notification, QCG-Broker also implements the WS-Resource Framework.
OGF OGSA HPC Basic Profile 1.0
As a consumer of QCG-Computing services, the QCG-Broker also implements the OGF OGSA HPC Basic Profile 1.0, including the standards it incorporates (WS-I Basic Profile 1.1, OGF OGSA BES 1.0, OGF JSDL 1.0, OGF JSDL HPC 1.0 extension).

Deployment scenarios

QCG-Broker can be freely deployed and combined with the other QosCosGrid platform components (some restrictions apply, see below); however the most common deployment is to include the Coordinator component enabling cross-cluster computing. That is, when combining it with parallel job toolkits, the QCG platform will provide cross-cluster parallel job capabilities:

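{| cellspacing="0" cellpadding="5" style="border:1px solid black;"
|- style="background-color:lightgray;"
! style="text-align:left;" | QCG-Broker
! style="text-align:left;" | QCG-Broker Coordinator
! style="text-align:left;" | Capability
|-
| Directly on a resource In Grid-space on a free resource (or in a VM)
|
| Schedule Job, Co-allocation
|-
| Directly on a resource In Grid-space on a free resource (or in a VM)
| Together with the QCG-Broker
| Schedule Job, Co-allocation, cross-cluster job
|}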

Important note: To enable cross-cluster parallel computing for C, C++ or Fortran, one must deploy QCG-OMPI (QCG OpenMPI) instead of a vanilla OpenMPI library.

QCG accounting

The QCG Accounting agent

The QCG accounting agent is an infrastructure integration service, and not available for direct end-user consumption. It is usually deployed in close proximity to the QCG-Computing system since it queries the computing system's job database and parses the LRMS log files for accounting information.

A number of plugins for accounting infrastructures exist; these are capable of translating internal accounting information into the required output format, as well as contacting the corresponding accounting endpoint. For EGI, a plugin integrating QCG accounting with APEL is included, dropping accounting records stored in an output file in the "outgoing" directory of the APEL/SSM service, which must be installed alongside QCG Accounting on the same resource.
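
The hand-over described above, where a plugin drops record files into a directory that another service picks up, can be sketched as follows. The record fields, the key/value text layout and the function names `format_record` and `drop_record` are hypothetical; the real record schema and the directory layout expected by APEL/SSM are not reproduced here. The point of the sketch is the atomic rename, which keeps a reader from ever seeing a half-written file.

```python
import os
import tempfile
import uuid


def format_record(fields):
    """Render a record as simple 'key: value' lines (illustrative format only)."""
    return "".join(f"{key}: {value}\n" for key, value in fields.items())


def drop_record(outgoing_dir, record_text):
    """Write one record file into outgoing_dir atomically and return its path."""
    os.makedirs(outgoing_dir, exist_ok=True)
    # Write to a temporary file in the same directory first, then rename it,
    # so that the final file name only ever refers to a complete record.
    fd, tmp_path = tempfile.mkstemp(dir=outgoing_dir, prefix=".tmp-")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(record_text)
    final_path = os.path.join(outgoing_dir, uuid.uuid4().hex)
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as outgoing:
        record = format_record({"JobId": "42", "WallSeconds": 3600})
        print(drop_record(outgoing, record))
```

Because the rename must stay within one file system to be atomic, this approach fits the requirement stated above that the "outgoing" directory is local to the resource running QCG Accounting.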

QCG monitoring

The QCG Accounting agent

The QCG platform also provides three Nagios plugins that enable any Nagios deployment to monitor a QCG platform deployment. In the case of EGI, Grid service monitoring is provided by the EGI SAM framework, which includes a Nagios instance and all necessary monitoring plugins bundled with it. Thus the QCG Nagios plugins will be shipped with the EGI SAM system, not with QCG platform components.