QosCosGrid Platform



This article is deprecated and should no longer be used. It remains available for reference only.

Platform Integrator: PSNC

Technology Provider: PSNC

More information: http://www.qoscosgrid.org/

Public overview

The QosCosGrid (QCG) platform provides advance resource reservation, co-allocation and management capabilities, giving users HPC-like performance and scalability. By connecting many local computing resources, QosCosGrid provides advanced monitoring and job execution capabilities for distributed and parallel C, C++, Fortran and Java applications.

Platform overview

The QCG Platform


The QCG platform comprises three key middleware services, two infrastructure integration systems, and several community-facing services that act as the user's main entry points to the QCG platform. General integration with the EGI Core Infrastructure platform is provided for Security, Monitoring, Accounting and Service endpoint discovery.


The QCG-Computing system forms the low-level building block of the QCG platform and provides the main job execution capabilities. It is typically deployed in front of a compute cluster that is managed through a Local Resource Management System (LRMS). QCG supports PBS, PBSPro, SLURM, LL, LSF, (S)GE and Torque out of the box. QCG-Computing is directly integrated with the EGI Core Infrastructure Platform for security purposes: it accepts user authentication from EGI's X.509v3 PKI system and authorises user access to computing resources based on generated grid-mapfiles. QCG-Computing supports parallel compute jobs and multi-scale jobs out of the box, provided that a suitable parallel and/or multi-scale toolkit is installed in the cluster.

Job Execution, Advance Reservation
Optional: Parallel Job (ProActive, OpenMPI), Multi-scale Job (MUSCLE)


The QCG-Notification system provides asynchronous notification of job progress to any subscribed notification consumer. The system supports direct end-user notification, provided the user is properly subscribed as a notification consumer (out of scope for this documentation); however, the typical use case of the QCG-Notification system is to support workflow engines and cross-cluster coordinating services (see below) in tracking the progress of individual tasks. By default, notifications are delivered as WS messages over SOAP; QCG-Notification also supports e-mail (SMTP) and Jabber (XMPP) delivery. Though the main notification producer is the QCG-Computing service, QCG-Notification accepts any notification producer that implements the WS-Notification family of standards. The QCG-Notification system is a mandatory component of the QosCosGrid cross-cluster computing capability.

Optional: Cross-Cluster computing (partial)


As the main notification consumer of the QCG-Notification system, the QCG-Broker is responsible for finding suitable resources exposed through QCG-Computing systems and consigning compute jobs to them, according to the requirements of users or higher-level tools. It does so by monitoring the state of the connected QCG-Computing services and then submitting the job directly to the QCG-Computing system that best matches the requirements. Moreover, the QCG-Broker service is capable of co-allocating resources of multiple sites (through the advance resource reservation provided by QCG-Computing), which enables cross-cluster computing. If combined with cluster programming toolkits for parallel jobs and multi-scale jobs, QCG-Broker provides cross-cluster parallel computing and cross-cluster multi-scale computing.

Schedule Job, Co-allocation
Optional: Cross-cluster computing (partial)

QCG Accounting

The QCG-Accounting system is not a user-facing system. It queries the QCG-Computing system for accounting information and feeds this information to target accounting systems using plugins. Currently, plugins exist for the PL-Grid accounting system (BAT), GridSafe, and APEL SSM v0.2 (soon to be replaced by EMI-CAR).


QCG Monitoring

The QCG platform integrates with the Nagios monitoring system by providing Nagios monitoring plugins (not shown). Although the plugins can be used with individual, independent Nagios instances, EGI deployments of the QCG platform integrate with the EGI SAM system, which includes the QCG Nagios plugins for NGI-wide deployment.


QCG-ScienceGateways (et al.)

The QCG platform also includes several Research Community services and tools, the most prominent of which is the QCG-ScienceGateways system. These services typically provide portals to the consuming end-user communities, but they also include a mobile client (QCG-Mobile). Another application gaining popularity is QCG-Icon, a lightweight desktop application for Windows, Mac OS X and Linux that aims to provide transparent access to applications installed on remote clusters.

Integration with the EGI Core Infrastructure platform

Integrating QCG with the EGI Core Infrastructure

The QCG platform seamlessly integrates with the EGI Core Infrastructure platform as follows:

Authentication & Authorisation
QCG-Computing authorises users by using gridmap files, a technique that is commonly used by Grid communities. User DNs are mapped to local cluster accounts after the presented user certificate chain has been cryptographically validated against the EGI Trust Anchor collection (not shown here).
Accounting
QCG-Computing integrates with EGI Accounting by using a plugin for APEL/SSM. The plugin takes the accounting records generated by the QCG Accounting agent, formats them in the APEL STOMP format, and drops them as a file in the APEL/SSM service's "outgoing" directory.
Monitoring
The QCG platform includes Nagios plugins for all three key QCG platform services; these probes are bundled and invoked with the EGI SAM service. Additional Nagios installations are not needed.
Information Discovery
QCG Grid services must be registered in EGI's GOC DB. Three service types are available for registration: QCG.Computing, QCG.Notification, and QCG.Broker. The EGI SAM framework will query the GOC DB for QCG services and invoke the Nagios plugins accordingly.
QCG currently does not publish dynamic information into the BDII system. EGI and PSNC are exploring the feasibility of this integration.

Technical Architecture

This section describes the architecture of the QosCosGrid platform in more detail. Whereas the previous section gave an overview of the key subsystems and the capabilities offered by the QCG platform, this section describes the fundamental architecture of the platform, how it integrates with the EGI Core Infrastructure, and the deployment scenarios, that is, what needs to be deployed in order to offer certain capabilities of the platform.


The components of the QCG-Computing subsystem

The QCG-Computing system provides the main computing capabilities available with the QosCosGrid platform. The Computing component implements most of the compute functionality; it is supported by several internal components. Gridmapfile provides user authorisation based on commonly used grid-map files and is directly integrated with EGI's X.509-based user authentication infrastructure. The core:Core component is shared with the QCG-Notification system and provides shared packages and libraries. core:DEP and core:curl are compatibility packages bundled by PSNC to provide libraries that are missing from, or more recent than those in, the Scientific Linux 5 baseline.
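To make the grid-mapfile authorisation step concrete, here is a minimal sketch in Python. It assumes the common grid-mapfile layout of one entry per line, with a double-quoted certificate subject DN followed by a local account name. The function names and the sample entries are hypothetical and are not taken from QCG-Computing itself; certificate chain validation is assumed to have happened beforehand.

```python
import shlex

def parse_gridmap(text):
    """Return a dict mapping certificate subject DNs to local account names."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # shlex handles the double-quoted DN, which usually contains spaces
        parts = shlex.split(line)
        if len(parts) < 2:
            continue  # malformed entry: no account given
        dn, account = parts[0], parts[1]
        # some grid-mapfiles list several comma-separated accounts; take the first
        mapping[dn] = account.split(",")[0]
    return mapping

def authorise(dn, mapping):
    """Return the local account for an already authenticated DN, or None."""
    return mapping.get(dn)

# Hypothetical entries, for illustration only.
SAMPLE = '''
# DN                                   local account
"/C=PL/O=GRID/O=PSNC/CN=Jane Doe" jdoe
"/DC=org/DC=example/CN=John Smith" jsmith
'''

if __name__ == "__main__":
    users = parse_gridmap(SAMPLE)
    print(authorise("/C=PL/O=GRID/O=PSNC/CN=Jane Doe", users))
```

A DN that is not listed yields None, which a caller would treat as "access denied".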

QCG-Computing integrates with a broad set of Local Resource Management Systems (LRMS) through its LRMS component abstracting away LRMS-specifics using a publicly standardised interface. Implementations of this interface exist for PBS, PBSPro, SLURM, LL, LSF, (S)GE and Torque.
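The idea of hiding LRMS specifics behind one interface can be illustrated with a small Python sketch. It only covers one aspect, the normalisation of scheduler-specific job states, and it is not the QCG-Computing implementation. The class and function names are hypothetical, and the state tables list just a few of the native codes reported by Torque (Q, R, C) and SLURM (PENDING, RUNNING, COMPLETED).

```python
from enum import Enum

class JobState(Enum):
    """Scheduler-independent job states (hypothetical, simplified)."""
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"
    UNKNOWN = "unknown"

# A few native state codes as reported by the schedulers' own tools.
TORQUE_STATES = {"Q": JobState.QUEUED, "R": JobState.RUNNING, "C": JobState.DONE}
SLURM_STATES = {
    "PENDING": JobState.QUEUED,
    "RUNNING": JobState.RUNNING,
    "COMPLETED": JobState.DONE,
}

class LrmsAdapter:
    """One adapter per LRMS; callers only ever see JobState values."""

    def __init__(self, name, state_table):
        self.name = name
        self._state_table = state_table

    def normalise_state(self, native_state):
        # Anything the table does not know is reported as UNKNOWN.
        return self._state_table.get(native_state, JobState.UNKNOWN)

# Registry selected by configuration, e.g. the LRMS name in a config file.
ADAPTERS = {
    "torque": LrmsAdapter("torque", TORQUE_STATES),
    "slurm": LrmsAdapter("slurm", SLURM_STATES),
}

def job_state(lrms, native_state):
    """Translate a native state code of the given LRMS into a JobState."""
    return ADAPTERS[lrms].normalise_state(native_state)

if __name__ == "__main__":
    print(job_state("torque", "R"), job_state("slurm", "PENDING"))
```

Supporting a further LRMS then means adding one adapter to the registry, while the code that consumes job states stays unchanged.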

QCG-Computing supports both simple compute jobs and parallel jobs out of the box, provided that a suitable parallel programming toolkit (ProActive/OpenMPI) is installed in the cluster. Through the MUSCLE library it also supports multi-scale jobs, as long as these do not have heterogeneous requirements.

Interfaces & Standards

WS-Notification 1.3
QCG-Computing uses the WS-Notification 1.3 family of standards to implement the role of a Notification Producer: it uses the ws-n:RegisterPublisher interface to register itself, and then the ws-n:NotificationBroker interface to send notification events to subscribed consumers.
OGF OGSA Basic Execution Service 1.0
QCG-Computing uses the OGF OGSA BES 1.0 specification to expose its computing services to the Grid.
OGF JSDL 1.0 & JSDL HPC 1.0 extension
As mandated by the OGF BES 1.0 specification, QCG-Computing accepts compute job descriptions in JSDL 1.0. As mandated by the OGF OGSA-HPC Basic Profile, QCG-Computing also accepts the JSDL HPC 1.0 extension for job descriptions.
OGF OGSA-HPC Basic Profile 1.0
QCG-Computing implements the OGF OGSA-HPC BP 1.0, which defines a profile across the following specifications by incorporation: WS-I Basic Profile 1.1, OGF OGSA-BES 1.0, JSDL 1.0.
OGF DRMAA 1.0
QCG-Computing uses OGF DRMAA 1.0 to integrate with the various LRMSs. It acts as a client to the DRMAA-compliant implementations and is implemented against the DRMAA service interface. A number of DRMAA 1.0 integrations for LRMSs are available, either provided by PSNC directly (e.g. for IBM LoadLeveler, LL) or bundled and sourced from elsewhere. The corresponding DRMAA plugin needs to be installed and configured in the QCG-Computing configuration file. An overview of available DRMAA implementations for LRMSs is available in the DRMAA implementations part of the DRMAA WG web site.
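As an illustration of the job descriptions mentioned above, the following Python sketch builds a minimal JSDL 1.0 document that uses the HPC Profile Application extension. The two namespace URIs and the element names are written from memory of the OGF specifications and should be checked against them before use. The function name, the job name and the executable are placeholders, and the sketch does not show how a document is submitted to a BES endpoint.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as published with JSDL 1.0 and the HPC Profile Application
# extension (assumption: verify against the OGF documents).
JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
HPCPA = "http://schemas.ggf.org/jsdl/2006/07/jsdl-hpcpa"

def build_jsdl(job_name, executable, arguments=(), output="stdout.txt"):
    """Return a minimal JSDL job description as an XML string."""
    ET.register_namespace("jsdl", JSDL)
    ET.register_namespace("jsdl-hpcpa", HPCPA)

    root = ET.Element(f"{{{JSDL}}}JobDefinition")
    description = ET.SubElement(root, f"{{{JSDL}}}JobDescription")

    identification = ET.SubElement(description, f"{{{JSDL}}}JobIdentification")
    ET.SubElement(identification, f"{{{JSDL}}}JobName").text = job_name

    application = ET.SubElement(description, f"{{{JSDL}}}Application")
    hpc = ET.SubElement(application, f"{{{HPCPA}}}HPCProfileApplication")
    ET.SubElement(hpc, f"{{{HPCPA}}}Executable").text = executable
    for argument in arguments:
        ET.SubElement(hpc, f"{{{HPCPA}}}Argument").text = argument
    ET.SubElement(hpc, f"{{{HPCPA}}}Output").text = output

    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    # Placeholder job: run "hostname -f" and capture its standard output.
    print(build_jsdl("demo-job", "/bin/hostname", ["-f"]))
```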

Deployment scenarios

This section provides an overview of the deployment requirements and dependencies in order to provide a specific capability.

QCG-Computing              | ProActive                   | OpenMPI                     | MUSCLE                      | Capability
Fronting a compute cluster | -                           | -                           | -                           | Job Execution, Advance Reservation
Fronting a compute cluster | On the cluster worker nodes | -                           | -                           | Job Execution, Advance Reservation, Parallel Job (Java)
Fronting a compute cluster | -                           | On the cluster worker nodes | -                           | Job Execution, Advance Reservation, Parallel Job (C, C++, Fortran)
Fronting a compute cluster | -                           | -                           | On the cluster worker nodes | Job Execution, Advance Reservation, Multi-scale job


An overview of the QCG-Notification system

The QCG-Notification system is a generic implementation of the WS-Notification family of standards (Base Notification, Brokered Notification, Topics). QCG-Notification supports all mandatory and optional elements of WS-Notification specifications, particularly topics, subscriptions, and pull points. Hence it can be integrated with many other WS-Notification compliant systems, even though QCG-Notification extends the WS-Notification standards with some management and discovery operations. However, the main use case for QCG-Notification is to provide asynchronous status notification for cross-cluster job executions managed by the QCG-Broker system.

The bulk of the functionality is provided by the Notification component, which offloads logfile management to the logrotate component. QCG-Notification shares common functionality with QCG-Computing through the core:Core and core:DEP components. The Notification consumer component handles the bulk of the communication with notification consumers and provides the different transport implementations (e-mail, XMPP, etc.).
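The brokered notification pattern described here (producers register, consumers subscribe to topics, the broker dispatches) can be sketched in a few lines of Python. This is an in-memory illustration only: it involves no SOAP, no WS-Notification message formats and no QCG code, and all class, method and topic names are hypothetical.

```python
class NotificationBroker:
    """In-memory stand-in for a notification broker with topic subscriptions."""

    def __init__(self):
        self._publishers = set()
        self._subscriptions = {}  # topic -> list of consumer callbacks

    def register_publisher(self, publisher_id):
        """Allow a notification producer to publish through this broker."""
        self._publishers.add(publisher_id)

    def subscribe(self, topic, consumer):
        """Register a callable taking (topic, message) for one topic."""
        self._subscriptions.setdefault(topic, []).append(consumer)

    def notify(self, publisher_id, topic, message):
        """Dispatch a message to all consumers of the topic.

        Returns the number of consumers the message was delivered to.
        """
        if publisher_id not in self._publishers:
            raise PermissionError(f"unregistered publisher: {publisher_id}")
        consumers = self._subscriptions.get(topic, [])
        for consumer in consumers:
            consumer(topic, message)
        return len(consumers)

if __name__ == "__main__":
    broker = NotificationBroker()
    broker.register_publisher("qcg-computing")          # hypothetical producer id
    broker.subscribe("job/state", lambda t, m: print(t, m))
    broker.notify("qcg-computing", "job/state", "job 42 is RUNNING")
```

In this picture a compute service plays the producer role and a scheduling broker the consumer role; the transports (SOAP, SMTP, XMPP) would sit behind the consumer callbacks.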

The QCG-Notification documentation provides extensive information about supported use cases, roles, deployments and installation & configuration.

Interfaces & Standards

WS-Notification 1.3
QCG-Notification is a generic implementation of the WS-Notification 1.3 family of standards. The individual standardised interfaces are not listed here; QCG-Notification implements all mandatory and optional parts of the specifications. Three key interfaces are worth naming, however: ws-n:RegisterProducer allows any type of notification producer to register itself with QCG-Notification, which in turn uses the ws-n:NotificationBroker interface to dispatch notifications to ws-n:NotificationConsumer instances.
WS-Resource Framework 1.2
Since WS-Notification makes use of the WS-Resource Framework 1.2, QCG-Notification implements WS-Resource Framework 1.2, too.

Deployment scenarios

Due to its nature, QCG-Notification has very few deployment scenarios. It can be used stand-alone, providing a straightforward notification framework, or it can be deployed together with QCG-Computing, providing notification on Job Execution related topics.

However, QCG-Notification is required to provide cross-cluster capabilities (see below).

QCG-Notification                    | QCG-Computing                                   | Capability
Directly on resource (or within VM) | -                                               | Notification
Directly on resource (or within VM) | Any QCG-Computing deployment scenario above     | Notification on QCG-Computing events


A QCG-Broker system overview

The QCG-Broker is a system that links and coordinates, on a programmatic level, clusters that would otherwise be completely unrelated. In a simple deployment, it schedules jobs to individual QCG-Computing-managed compute clusters. When connected to several QCG-Computing-managed clusters, the broker provides co-allocation of local resources.
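Co-allocation boils down to finding a time window in which every participating cluster can grant a reservation. The Python sketch below shows one simple way to compute the earliest such window from per-cluster lists of free intervals. It is an assumption-laden illustration, not the QCG-Broker algorithm: the function names, the data layout (sorted, non-overlapping (start, end) pairs in arbitrary time units) and the sample values are all hypothetical.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) intervals."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result

def earliest_common_window(free_slots, duration):
    """Return the earliest (start, end) window of the given duration that is
    free on every cluster, or None if there is no such window.

    free_slots maps a cluster name to its sorted list of free intervals.
    """
    slots = list(free_slots.values())
    if not slots:
        return None
    common = slots[0]
    for other in slots[1:]:
        common = intersect(common, other)
    for start, end in common:
        if end - start >= duration:
            return (start, start + duration)
    return None

if __name__ == "__main__":
    # Hypothetical free intervals reported by two clusters.
    free = {"cluster-a": [(0, 4), (6, 12)], "cluster-b": [(2, 7), (8, 15)]}
    print(earliest_common_window(free, 3))
```

With the sample data the common free intervals are (2, 4), (6, 7) and (8, 12), so a request for 3 time units is placed at (8, 11). A real broker would then request an advance reservation for that window on each cluster.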

Most of the features are provided by the Broker component, implementing the necessary interfaces in order to interoperate with QCG-Computing and QCG-Notification.

The optional Coordinator component, when deployed, is capable of linking compute jobs across clusters, provided that the complex job pattern was submitted through the QCG-Broker (ensuring co-allocated advance reservation of resources). Note that standard OpenMPI is not suitable for this type of cross-cluster parallel job. PSNC provides a binary-compatible version of OpenMPI, QCG-OMPI, which adds the necessary extensions for cross-cluster parallel jobs written in C, C++ or Fortran.

Interfaces & standards

WS-Notification 1.3
QCG-Broker uses a subset of the WS-Notification 1.3 family of standards, as a notification consumer.
WS-Resource Framework 1.2
Through WS-Notification, QCG-Broker also implements the WS-Resource Framework.
OGF OGSA HPC Basic Profile 1.0
As a consumer of QCG-Computing services, the QCG-Broker also implements the OGF OGSA HPC Basic Profile 1.0, together with the standards it incorporates (WS-I Basic Profile 1.1, OGF OGSA BES 1.0, OGF JSDL 1.0, OGF JSDL HPC 1.0 extension).

Deployment scenarios

QCG-Broker can be freely deployed and combined with the other QosCosGrid platform components (some restrictions apply, see below); however the most common deployment is to include the Coordinator component enabling cross-cluster computing. That is, when combining it with parallel job toolkits, the QCG platform will provide cross-cluster parallel job capabilities:

QCG-Broker QCG-Broker Coordinator Capability
Directly on a resource In Grid-space on a free resource (or in a VM) Schedule Job, Co-allocation
Directly on a resource In Grid-space on a free resource (or in a VM) Together with the QCG-Broker Schedule Job, Co-allocation, cross-cluster job

Important note: To enable cross-cluster parallel computing for C, C++ or Fortran, one must deploy QCG-OMPI (QCG OpenMPI) instead of a vanilla OpenMPI library.

QCG accounting

The QCG Accounting agent

The QCG accounting agent is an infrastructure integration service, and not available for direct end-user consumption. It is usually deployed in close proximity to the QCG-Computing system since it queries the computing system's job database and parses the LRMS log files for accounting information.

A number of plugins for accounting infrastructures exist; these translate internal accounting information into the required output format and contact the corresponding accounting endpoint. For EGI, a plugin integrating QCG accounting with APEL is included. It writes the accounting records to an output file in the "outgoing" directory of the APEL/SSM service, which must be installed alongside QCG Accounting on the same resource.
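Handing records over through a shared "outgoing" directory works best when the reader never sees a half-written file. The Python sketch below shows the usual write-then-rename approach under stated assumptions: the function name, the ".msg" suffix, the record fields and the plain "key: value" layout are placeholders and do not represent the APEL message format or the actual QCG plugin.

```python
import os
import tempfile

def drop_record(record, outgoing_dir):
    """Write one accounting record into outgoing_dir and return its path.

    The record is written to a temporary file first and then renamed, so a
    process scanning the directory only ever sees complete files.
    """
    os.makedirs(outgoing_dir, exist_ok=True)
    # Placeholder layout: one "key: value" pair per line.
    body = "".join(f"{key}: {value}\n" for key, value in record.items())

    # The temporary file lives in the same directory, so the rename below
    # stays on one filesystem and is atomic on POSIX systems.
    fd, tmp_path = tempfile.mkstemp(dir=outgoing_dir, prefix=".tmp-")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(body)

    final_path = os.path.join(outgoing_dir, f"{record['job_id']}.msg")
    os.replace(tmp_path, final_path)
    return final_path

if __name__ == "__main__":
    # Hypothetical record and directory, for illustration only.
    demo_dir = tempfile.mkdtemp()
    print(drop_record({"job_id": "42", "user": "jdoe", "wall_seconds": 120}, demo_dir))
```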

QCG monitoring


The QCG platform also provides three Nagios plugins that enable any Nagios deployment to monitor a QCG platform deployment. In the case of EGI, Grid service monitoring is provided by the EGI SAM framework, which includes a Nagios instance and all necessary monitoring plugins bundled with it. Thus the QCG Nagios plugins will be shipped with the EGI SAM system, not with QCG platform components.
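For readers unfamiliar with Nagios plugins: a plugin is an executable that prints one status line and reports its result through its exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The Python sketch below follows that convention with a simple TCP reachability check. It is not one of the QCG probes; the script name, host and port are placeholders.

```python
import socket
import sys

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_tcp(host, port, timeout=5.0):
    """Return (exit_code, status_line) for a TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} is accepting connections"
    except OSError as exc:  # refused, unreachable, timed out, ...
        return CRITICAL, f"CRITICAL - cannot connect to {host}:{port}: {exc}"

def main(argv):
    """Parse HOST and PORT, run the check, print the status line."""
    if len(argv) != 3 or not argv[2].isdigit():
        print("UNKNOWN - usage: check_service_port.py HOST PORT")
        return UNKNOWN
    code, message = check_tcp(argv[1], int(argv[2]))
    print(message)
    return code

if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

A monitoring framework would invoke such a script once per registered service endpoint and act on the exit code.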