
QosCosGrid Platform


Platform Integrator: PSNC

Technology Provider: PSNC 
More information: http://www.qoscosgrid.org/

The QosCosGrid (QCG) platform provides advanced resource reservation and management capabilities that give users HPC-like performance and scalability. By connecting many local computing resources, QosCosGrid offers advanced monitoring and job execution capabilities for distributed and parallel C, C++, Fortran and Java applications.

Overview

QCG Platform diagram

The QCG platform comprises four key middleware services, plus several community-facing services that act as the users' main entry points to the QCG platform. General integration with the EGI Core Infrastructure platform is provided for Security, Monitoring, Accounting and Service endpoint discovery.

The QCG Computing system forms the heart of the QCG platform, providing the main job execution capabilities. When configured and extended accordingly, it supports cross-cluster parallel jobs as well as multi-scale cross-cluster compute jobs, delivering HPC-like performance without being backed by an HPC system. It is typically deployed fronting one or more local compute clusters that are managed through LRMSs; QCG supports PBS, PBSPro, SLURM, LL, LSF, (S)GE and TORQUE out of the box. QCG Computing is directly integrated with the EGI Core Infrastructure Platform for security purposes: it accepts user authentication based on EGI's X.509v3 PKI system, and authorises user access to computing resources by generating and maintaining Gridmap files.
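
For illustration, a gridmap file is a plain-text mapping from X.509 distinguished names to local accounts; a sketch of a single entry (both the DN and the local user name below are hypothetical) looks like this:

  "/C=PL/O=GRID/O=PSNC/CN=Jan Kowalski" qcguser01

QCG Computing consults such entries to authorise an authenticated certificate holder as a specific local user on the cluster.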

The QCG Notification system provides asynchronous notification of job progress to any subscribed notification consumer. The notification system is particularly useful (and required) for deployments that support cross-cluster compute jobs. The system supports direct end-user notification, provided the user is subscribed as a notification consumer (out of scope for this documentation); however, the typical use case of the QCG Notification system is to support workflow engines and cross-cluster coordinating services (see below) in tracking the progress of individual tasks. The main Notification Producer is the QCG Computing system, which exposes a diverse set of notification topics on Compute Jobs and on the QCG Computing service itself.

The QCG Broker is responsible for finding and consigning compute jobs to resources that are exposed through QCG Computing systems, as per the requirements of higher-level services. It does so by monitoring the state of the connected QCG Computing services through the corresponding QCG Notification services, and then submitting the job directly to the QCG Computing system that best matches the requirements.

The QCG platform also includes several Research Community services, the most prominent of which is the QCG Science Gateways system. These services typically provide portal access for the consuming end-user communities, but also include a mobile client (QCG Mobile) and a Windows-based visualisation tool (QCG Icon), among others.

The QCG Accounting system is not a user-facing service. It queries the QCG Computing system for accounting information and feeds it as OGF Usage Records into the EGI Core Infrastructure accounting system.

The QCG Platform integrates with the Nagios monitoring system by providing Nagios monitoring plugins (not shown). Although individual, independent Nagios instances are possible, EGI deployments of the QCG platform will integrate with the EGI SAM system by including the QCG Nagios plugins in EGI SAM for NGI-wide deployment.

QCG Computing

Principal QCG Computing architecture

The functionality of the QCG Computing system is provided by six components; their deployment depends on the needed functionality (see below).

The middleware components are typically deployed together into a QCG Computing service. QCG Compute, Gridmapfile and LRMS/DRMAA provide the essential Compute Job and Parallel Job capabilities; QCG Computing is capable of processing parallel jobs out of the box through OpenMPI. QCG Compute provides the core job processing and management functionality. Gridmapfile is used to interface with the EGI Core Infrastructure platform by means of X.509v3 PKI and gridmap files. The LRMS/DRMAA component provides the integration with the Local Resource Management Systems (LRMS) deployed by the Resource Provider. QCG Compute supports the following LRMSs:

  1. PBS
  2. PBSPro
  3. SLURM
  4. LL
  5. LSF
  6. (S)GE
  7. TORQUE

QCG Computing extensions

Advanced QCG Computing capabilities

A speciality of the QCG Computing platform is its capability to coordinate parallel jobs that span multiple computing clusters, and even multiple Resource Providers. Multiscale computing refers to deploying a complex parallel compute job across several correlated compute resources (including HPC, HTC and clusters) that run in parallel. The MAPPER project in particular makes heavy use of this capability.

QCG provides these capabilities through a common coordinator component, the QCG Coordinator, which must be deployed in "Grid space", i.e. outside Resource Provider firewalls, either as a truly public service or located in the DMZ of a Resource Provider. This Coordinator then communicates directly with the corresponding libraries deployed on the cluster worker nodes. These in turn communicate with the LRMS deployed on the cluster head node.

Thus, the QCG extension consists of the QCG Coordinator plus the appropriate cluster worker node libraries as described below.

Cross-cluster parallel Jobs

QCG is able to span parallel jobs across multiple QCG-Computing managed compute clusters, even if these are operated by different Resource Providers. This capability is available for applications written in C, C++, Fortran and Java.

Parallel Java
Parallel Java is provided by deploying the ProActive library on the cluster worker nodes, complementing a QCG Coordinator deployment.
Parallel C
Parallel C allows the execution of parallel applications written in C, C++ or Fortran. This is done by deploying a patched OpenMPI library (QCG-OMPI) on the compute cluster. The patched version is fully compatible with the standard OpenMPI library, but adds the cross-cluster feature for QCG. More information is available here.
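
Because the patched library keeps the standard MPI interface, an application needs no QCG-specific source changes. The minimal C sketch below (a hypothetical hello_mpi.c, compiled with the cluster's mpicc) uses only standard MPI calls and would therefore build against either the stock or the QCG-patched OpenMPI:

  /* hello_mpi.c -- a plain MPI program using only standard MPI calls;
     nothing here is QCG-specific, which is the point: the same source
     builds against stock OpenMPI and, per the text above, QCG-OMPI. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, name_len;
      char name[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank   */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
      MPI_Get_processor_name(name, &name_len);

      printf("rank %d of %d running on %s\n", rank, size, name);

      MPI_Finalize();
      return 0;
  }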

Multiscale compute jobs

Similar to cross-cluster parallel jobs, QCG allows multiscale compute jobs to be consigned and coordinated. Deploying this feature is done in a similar way: the QCG Coordinator is either deployed or an existing instance is reused, and the only remaining component to deploy is the MUSCLE library, installed on the cluster worker nodes in the same way as the QCG-OMPI and ProActive libraries. This capability supports multi-scale workloads as defined by the MAPPER project and the COAST project.

Interfaces & Standards

QCG Computing: Standards based interfaces

QCG Computing employs a number of standards at its external interfaces and between its modular components. These standards were developed by two well-known standardisation bodies, OASIS and OGF (the WS-I organisation is now integrated into OASIS).

OGF DRMAA integration with LRMS

QCG Compute uses OGF DRMAA 1.0 to integrate with the various supported LRMSs. QCG Compute acts as a client to the DRMAA-compliant implementations and is implemented against the DRMAA service interface.

A number of DRMAA 1.0 integrations for LRMSs are available, either provided directly by PSNC (e.g. for IBM LoadLeveler, LL) or bundled from other sources. The corresponding DRMAA plugin needs to be installed and configured in the QCG Computing configuration file.
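
As an illustration of this integration layer, the following minimal sketch uses the DRMAA 1.0 C binding directly; it is hypothetical example code rather than QCG source, the submitted command (/bin/hostname) is arbitrary, and it would be linked against the DRMAA library of whichever LRMS is in use:

  /* drmaa_submit.c -- a minimal DRMAA 1.0 C client, shown only to
     illustrate the style of interface QCG Computing programs against. */
  #include <stdio.h>
  #include "drmaa.h"

  int main(void)
  {
      char error[DRMAA_ERROR_STRING_BUFFER];
      char jobid[DRMAA_JOBNAME_BUFFER];
      char jobid_out[DRMAA_JOBNAME_BUFFER];
      drmaa_job_template_t *jt = NULL;
      drmaa_attr_values_t *rusage = NULL;
      int status = 0;

      /* Open a session with the LRMS-specific DRMAA implementation. */
      if (drmaa_init(NULL, error, sizeof(error) - 1) != DRMAA_ERRNO_SUCCESS) {
          fprintf(stderr, "drmaa_init failed: %s\n", error);
          return 1;
      }

      /* Describe a trivial batch job (the command is arbitrary). */
      drmaa_allocate_job_template(&jt, error, sizeof(error) - 1);
      drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/bin/hostname",
                          error, sizeof(error) - 1);

      /* Submit the job to the LRMS and wait for completion. */
      drmaa_run_job(jobid, sizeof(jobid) - 1, jt, error, sizeof(error) - 1);
      printf("submitted job %s\n", jobid);
      drmaa_wait(jobid, jobid_out, sizeof(jobid_out) - 1, &status,
                 DRMAA_TIMEOUT_WAIT_FOREVER, &rusage, error, sizeof(error) - 1);

      /* Clean up and close the DRMAA session. */
      drmaa_release_attr_values(rusage);
      drmaa_delete_job_template(jt, error, sizeof(error) - 1);
      drmaa_exit(error, sizeof(error) - 1);
      return 0;
  }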

OGF HPC BP integration for Compute job submissions

QCG Compute implements the OGF HPC BP (High Performance Computing Basic Profile), which includes by reference three standard specifications and one extension: WS-I Basic Profile 1.1, JSDL 1.0, JSDL HPC 1.0, and OGSA-BES. Together, these standards ensure common Web Service interoperability (WS-I Basic Profile, by OASIS) and a common (HPC) Compute service interface (OGSA-BES, by OGF) that accepts job descriptions expressed in JSDL and its HPC extension (both OGF).
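
As a rough illustration (not taken from QCG documentation), a small JSDL document using the HPC Profile Application extension might look like the sketch below; the job name, executable and output file are hypothetical:

  <jsdl:JobDefinition
      xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
      xmlns:hpcpa="http://schemas.ggf.org/jsdl/2006/07/jsdl-hpcpa">
    <jsdl:JobDescription>
      <jsdl:JobIdentification>
        <jsdl:JobName>example-job</jsdl:JobName>
      </jsdl:JobIdentification>
      <jsdl:Application>
        <hpcpa:HPCProfileApplication>
          <hpcpa:Executable>/bin/hostname</hpcpa:Executable>
          <hpcpa:Output>hostname.out</hpcpa:Output>
        </hpcpa:HPCProfileApplication>
      </jsdl:Application>
    </jsdl:JobDescription>
  </jsdl:JobDefinition>

Such a document is submitted to the OGSA-BES interface of QCG Computing, which translates it into a job for the underlying LRMS.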

The OGSA-BES standard allows the integration of notification capabilities, standardising the use of either WS-Notification or WS-Eventing as the notification interface. QCG Computing integrates with the QCG Notification system by implementing the WS-Notification NotificationProducer interface (see below).

QCG Notification

About PSNC