
QosCosGrid Platform

From EGIWiki

Revision as of 11:41, 27 November 2012



Platform Integrator: PSNC

Technology Provider: PSNC

More information: http://www.qoscosgrid.org/


The QosCosGrid (QCG) platform provides advance resource reservation, co-allocation and resource management capabilities, giving users HPC-like performance and scalability. By connecting many local computing resources, QosCosGrid provides advanced monitoring and job execution capabilities for distributed and parallel C, C++, Fortran and Java applications.

Platform overview

The QCG Platform

The QCG platform comprises three key middleware services, two infrastructure integration systems, and several community-facing services that act as the users' main entry points to the QCG platform. General integration with the EGI Core Infrastructure platform is provided for Security, Monitoring, Accounting and Service endpoint discovery.

QCG-Computing

The QCG-Computing system forms the low-level building block of the QCG platform, providing the main job execution capabilities. It is typically deployed in front of a compute cluster that is managed through a Local Resource Management System (LRMS). QCG supports PBS, PBSPro, SLURM, LoadLeveler (LL), LSF, (S)GE and Torque out of the box. QCG-Computing is directly integrated with the EGI Core Infrastructure Platform for security purposes: it accepts user authentication from EGI's X.509v3 PKI system and authorises user access to computing resources based on generated grid-mapfiles. QCG-Computing supports parallel compute jobs and multi-scale jobs out of the box, provided that a suitable parallel and/or multi-scale toolkit is installed in the cluster.

Capabilities
Job Execution, Advance Reservation
Optional: Parallel Job (ProActive, OpenMPI), Multi-scale Job (MUSCLE)

QCG-Notification

The QCG-Notification system provides asynchronous notification of job progress to any subscribed notification consumer. The system supports direct end-user notification, provided the user is properly subscribed as a notification consumer (out of scope for this documentation); however, the typical use case of the QCG-Notification system is to support workflow engines and cross-cluster coordinating services (see below) in tracking the progress of individual tasks. The default message transport is WS messages over SOAP; QCG-Notification also supports e-mail (SMTP) and Jabber (XMPP) notification delivery. Although the main notification producer is the QCG-Computing service, QCG-Notification accepts any notification producer that implements the WS-Notification family of standards. The QCG-Notification system is a mandatory component of the QosCosGrid cross-platform computing capability.
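To make the fan-out concrete, here is a minimal Python sketch of topic-based delivery over the three transports named above. The class `NotificationRouter`, its methods and the topic string `job/state` are hypothetical illustrations, not the QCG-Notification API; the sketch only decides which consumer receives a message over which channel and leaves the actual SOAP, SMTP or XMPP sending out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Subscription:
    consumer: str   # endpoint URL, e-mail address or Jabber ID of the consumer
    transport: str  # "soap" (the default), "smtp" or "xmpp"


class NotificationRouter:
    """Toy topic-based router; a hypothetical sketch, not the QCG-Notification API."""

    TRANSPORTS = {"soap", "smtp", "xmpp"}

    def __init__(self) -> None:
        self._subscriptions: Dict[str, List[Subscription]] = defaultdict(list)

    def subscribe(self, topic: str, consumer: str, transport: str = "soap") -> None:
        if transport not in self.TRANSPORTS:
            raise ValueError(f"unsupported transport: {transport}")
        self._subscriptions[topic].append(Subscription(consumer, transport))

    def publish(self, topic: str, message: str) -> List[Tuple[str, str, str]]:
        """Return the deliveries that would be made as (transport, consumer, message).

        A real service would hand each delivery to a SOAP, SMTP or XMPP client
        asynchronously; the sketch only computes who gets what over which channel.
        """
        return [(s.transport, s.consumer, message)
                for s in self._subscriptions.get(topic, [])]
```

For example, subscribing a workflow engine endpoint with the default SOAP transport and an end user's e-mail address with `transport="smtp"` to the same topic makes `publish` return one delivery for each of them.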

Capabilities
Notification
Optional: Cross-Cluster computing (partial)

QCG-Broker

As the main notification consumer of the QCG-Notification system, QCG-Broker is responsible for finding resources exposed through QCG-Computing systems that meet the requirements of users or higher-level tools, and for consigning compute jobs to them. It does so by monitoring the state of the connected QCG-Computing services and then submitting each job directly to the QCG-Computing system that best matches its requirements. Moreover, QCG-Broker can co-allocate resources of multiple sites (through the advance resource reservation provided by QCG-Computing), enabling cross-cluster computing. Combined with cluster programming toolkits for parallel and multi-scale jobs, QCG-Broker provides cross-cluster parallel computing and cross-cluster multi-scale computing.
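The broker's selection step can be pictured as filter-then-rank matchmaking. The sketch below assumes hypothetical site attributes (free cores, an estimated queue wait taken from monitoring, installed applications) and a made-up ranking policy; it is not QCG-Broker's actual algorithm, which this page does not describe.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class ComputingSite:
    name: str
    free_cores: int
    queue_wait_s: float            # estimated queue wait, as last reported by monitoring
    applications: FrozenSet[str]   # software installed on the site


def select_site(sites: Sequence[ComputingSite], cores: int,
                application: str) -> Optional[ComputingSite]:
    """Pick the site that best matches a request, or None if none fits.

    Hypothetical policy: drop sites lacking cores or the application, then
    prefer the shortest queue wait, breaking ties by the most free cores.
    """
    fitting = [s for s in sites
               if s.free_cores >= cores and application in s.applications]
    if not fitting:
        return None
    return min(fitting, key=lambda s: (s.queue_wait_s, -s.free_cores))
```

Filtering first keeps hard requirements (cores, software) separate from the soft preference used for ranking, so the policy can be changed without touching the feasibility check.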

Capabilities
Schedule Job, Co-allocation, cross-cluster computing (partial)

QCG Accounting

The QCG-Accounting system is not a user-facing system. It queries the QCG-Computing system for accounting information and feeds this information to target accounting systems using plugins. Currently, plugins exist for the PL-Grid accounting system (BAT), GridSafe, and APEL SSM v0.2 (soon to be replaced by EMI-CAR).

Capabilities
Accounting

QCG Monitoring

The QCG Platform integrates with the Nagios monitoring system by providing Nagios monitoring plugins (not shown). While independent Nagios instances can be run individually, EGI deployments of the QCG platform will integrate with the EGI SAM system by including the QCG Nagios plugins in EGI SAM for NGI-wide deployment.

Capabilities
Monitoring

QCG-ScienceGateways (et al.)

The QCG platform also includes several research community services and tools, the most prominent of which is the QCG-ScienceGateways system. These services typically provide portals to the consuming end-user communities; they also include a mobile client (QCG-Mobile). Another application gaining popularity is QCG-Icon, a lightweight desktop application for Windows, Mac OS X and Linux that aims to provide transparent access to applications installed on remote clusters.

Technical Architecture

This section describes the architecture of the QosCosGrid platform in more detail. Whereas the previous section gave an overview of the key subsystems and capabilities of the QCG platform, this section describes the platform's fundamental architecture, how it integrates with the EGI Core Infrastructure, and the deployment scenarios that capture what needs to be deployed to offer particular capabilities.

QCG-Computing

The components of the QCG-Computing subsystem

The QCG-Computing system provides the main computing capabilities of the QosCosGrid platform. The Computing component implements most of the compute functionality, supported by several internal components. Gridmapfile provides user authorization based on commonly used grid-map files and is directly integrated with EGI's X.509-based user authentication infrastructure. The core:Core component is shared with the QCG-Notification system and provides common packages and libraries. core:DEP and core:curl are compatibility packages bundled by PSNC to provide newer versions of libraries, or libraries missing from the Scientific Linux 5 baseline.
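As an illustration of grid-mapfile-based authorization, the sketch below assumes the conventional Globus-style layout: one entry per line, a quoted certificate subject DN followed by one or more comma-separated local account names, with `#` starting a comment. `parse_grid_mapfile` and `authorize` are hypothetical helpers, not part of QCG-Computing's Gridmapfile component.

```python
from typing import Dict, List, Optional


def parse_grid_mapfile(text: str) -> Dict[str, List[str]]:
    """Map certificate subject DNs to the local accounts they may use."""
    mapping: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith('"'):
            end = line.index('"', 1)  # closing quote of a DN that contains spaces
            dn, rest = line[1:end], line[end + 1:]
        else:
            dn, _, rest = line.partition(" ")  # unquoted DN without spaces
        accounts = [a.strip() for a in rest.split(",") if a.strip()]
        if accounts:
            mapping[dn] = accounts
    return mapping


def authorize(mapping: Dict[str, List[str]], subject_dn: str) -> Optional[str]:
    """Return the default (first) local account for an authenticated DN, or None."""
    accounts = mapping.get(subject_dn)
    return accounts[0] if accounts else None
```

Authentication (validating the X.509 certificate) happens before this step; the grid-mapfile only answers which local account, if any, an already authenticated DN is mapped to.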

QCG-Computing integrates with a broad set of Local Resource Management Systems (LRMS) through its LRMS component, which abstracts away LRMS specifics behind a publicly standardised interface (OGF DRMAA, see below). Implementations of this interface exist for PBS, PBSPro, SLURM, LL, LSF, (S)GE and Torque.
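The LRMS abstraction follows the adapter pattern: callers use one interface and each LRMS gets its own implementation. The sketch below is hypothetical and deliberately simpler than QCG-Computing, which, per the DRMAA entry under Interfaces & Standards, talks to the LRMS through DRMAA rather than by building command lines. `sbatch --partition` and `qsub -q` are the standard SLURM and PBS/Torque submission commands, but the class and function names are made up.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class LrmsAdapter(ABC):
    """Uniform, LRMS-neutral submission interface (hypothetical)."""

    @abstractmethod
    def submit_command(self, script: str, queue: Optional[str] = None) -> List[str]:
        """Return the command line that would submit `script` to this LRMS."""


class SlurmAdapter(LrmsAdapter):
    def submit_command(self, script: str, queue: Optional[str] = None) -> List[str]:
        cmd = ["sbatch"]
        if queue:
            cmd += ["--partition", queue]  # SLURM calls its queues "partitions"
        return cmd + [script]


class PbsAdapter(LrmsAdapter):
    def submit_command(self, script: str, queue: Optional[str] = None) -> List[str]:
        cmd = ["qsub"]
        if queue:
            cmd += ["-q", queue]  # also valid for Torque, which shares PBS's qsub
        return cmd + [script]


ADAPTERS: Dict[str, LrmsAdapter] = {"slurm": SlurmAdapter(), "pbs": PbsAdapter()}


def build_submission(lrms: str, script: str, queue: Optional[str] = None) -> List[str]:
    # Callers only name the LRMS; the adapter hides its command-line syntax.
    return ADAPTERS[lrms].submit_command(script, queue)
```

Supporting another LRMS then means adding one adapter class and one registry entry, without changing any caller.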

QCG-Computing supports both simple compute jobs and parallel jobs out of the box, provided that a suitable parallel programming toolkit (ProActive or OpenMPI) is installed in the cluster. Through the MUSCLE library, QCG-Computing also supports multi-scale jobs that do not have heterogeneous requirements.

Interfaces & Standards

WS-Notification 1.3
QCG-Computing uses the WS-Notification 1.3 family of standards to act as a Notification Producer: it registers itself through the ws-n:RegisterPublisher interface and then uses the ws-n:NotificationBroker interface to send notification events to subscribed consumers.
OGF OGSA Basic Execution Service 1.0
QCG-Computing uses the OGF OGSA BES 1.0 specification to expose its computing services to the Grid.
OGF JSDL 1.0 & JSDL HPC 1.0 extension
As mandated by the OGF BES 1.0 specification, QCG-Computing accepts compute job descriptions in JSDL 1.0. As mandated by the OGF OGSA-HPC Basic Profile, QCG-Computing also accepts the JSDL HPC 1.0 extension in job descriptions.
OGF OGSA-HPC Basic Profile 1.0
QCG-Computing implements OGF OGSA-HPC BP 1.0, which profiles the following specifications by incorporation: WS-I Basic Profile 1.1, OGF OGSA-BES 1.0 and JSDL 1.0.
OGF DRMAA 1.0
QCG-Computing uses OGF DRMAA 1.0 to integrate with various LRMSs. QCG-Computing acts as a client of DRMAA-compliant implementations and is implemented against the DRMAA service interface. A number of DRMAA 1.0 implementations for LRMSs are available, either provided directly by PSNC (e.g. for IBM LoadLeveler, LL) or bundled and sourced from elsewhere. The corresponding DRMAA plugin needs to be installed and configured in the QCG-Computing configuration file. An overview of the available DRMAA implementations for LRMSs is available in the DRMAA implementations section of the DRMAA WG web site.
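A minimal job description of the kind covered by the JSDL entry above might be generated as follows. The sketch assumes the JSDL 1.0 namespace and the JSDL HPC Profile Application extension namespace as published by OGF; the job name, executable and file names are placeholders, and the document omits resource requirements and data staging.

```python
import xml.etree.ElementTree as ET
from typing import Sequence

JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
HPCPA = "http://schemas.ggf.org/jsdl/2006/07/jsdl-hpcpa"
ET.register_namespace("jsdl", JSDL)
ET.register_namespace("hpcpa", HPCPA)


def build_jsdl(job_name: str, executable: str, args: Sequence[str], output: str) -> str:
    """Return a minimal JSDL 1.0 job definition using the HPC Profile Application element."""
    root = ET.Element(f"{{{JSDL}}}JobDefinition")
    description = ET.SubElement(root, f"{{{JSDL}}}JobDescription")
    identification = ET.SubElement(description, f"{{{JSDL}}}JobIdentification")
    ET.SubElement(identification, f"{{{JSDL}}}JobName").text = job_name
    application = ET.SubElement(description, f"{{{JSDL}}}Application")
    hpc = ET.SubElement(application, f"{{{HPCPA}}}HPCProfileApplication")
    ET.SubElement(hpc, f"{{{HPCPA}}}Executable").text = executable
    for arg in args:
        ET.SubElement(hpc, f"{{{HPCPA}}}Argument").text = arg
    ET.SubElement(hpc, f"{{{HPCPA}}}Output").text = output  # file receiving stdout
    return ET.tostring(root, encoding="unicode")
```

In a BES-style submission, a document like this would be carried inside the CreateActivity request sent to QCG-Computing.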

Deployment scenarios

This section provides an overview of the deployment requirements and dependencies in order to provide a specific capability.

| QCG-Computing | ProActive | OpenMPI | MUSCLE | Capability |
|---|---|---|---|---|
| Fronting a compute cluster | | | | Job Execution, Advance Reservation |
| Fronting a compute cluster | On the cluster worker nodes | | | Job Execution, Advance Reservation, Parallel Job (Java) |
| Fronting a compute cluster | | On the cluster worker nodes | | Job Execution, Advance Reservation, Parallel Job (C, C++, Fortran) |
| Fronting a compute cluster | | | On the cluster worker nodes | Job Execution, Advance Reservation, Multi-scale job |
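The deployment table reads as a simple mapping from the libraries installed on the worker nodes to the capabilities on offer. The sketch below encodes it in Python; `capabilities` is a hypothetical helper, and the assumption that several libraries can be combined on one cluster, adding up their capabilities, goes beyond what the table states.

```python
from typing import Iterable, List

BASE_CAPABILITIES = ["Job Execution", "Advance Reservation"]

# Worker-node library -> capability it adds, in the row order of the table
LIBRARY_CAPABILITIES = {
    "ProActive": "Parallel Job (Java)",
    "OpenMPI": "Parallel Job (C, C++, Fortran)",
    "MUSCLE": "Multi-scale job",
}


def capabilities(worker_node_libraries: Iterable[str]) -> List[str]:
    """Capabilities of a cluster fronted by QCG-Computing, given its worker-node libraries."""
    installed = set(worker_node_libraries)
    extras = [cap for lib, cap in LIBRARY_CAPABILITIES.items() if lib in installed]
    return BASE_CAPABILITIES + extras
```

A cluster with no extra libraries still offers the two base capabilities, matching the first row of the table.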

QCG Computing extensions

Advanced QCG Computing capabilities

A speciality of the QCG Computing platform is its ability to coordinate parallel jobs that span multiple computing clusters, and even multiple Resource Providers. Multiscale computing refers to deploying a complex parallel compute job whose correlated, concurrently running parts are spread across several compute resources (including HPC, HTC and clusters). The MAPPER project in particular makes heavy use of this capability.

QCG provides these capabilities through a common coordinator component, the QCG Coordinator, which must be deployed in "Grid space", i.e. outside Resource Provider firewalls, either as a truly public service or in the DMZ of a Resource Provider. The Coordinator communicates directly with the corresponding libraries deployed on the cluster worker nodes, which in turn communicate with the LRMS deployed on the cluster head node.

Thus, the QCG extension consists of the QCG Coordinator plus the appropriate cluster worker node libraries as described below.

Cross-cluster parallel Jobs

QCG is able to span parallel jobs across multiple QCG-Computing managed compute clusters, even if these are operated by different Resource Providers. This capability is available for applications written in C, C++, Fortran and Java.

Parallel Java
Parallel Java is provided by deploying the ProActive library on the cluster worker nodes, complementing a QCG Coordinator deployment.
Parallel C
Parallel C allows the execution of parallel applications written in C, C++ or Fortran. This is done by deploying a patched OpenMPI library on the compute cluster. This patched version is fully compatible with the standard OpenMPI library but adds the cross-cluster feature for QCG. More information is available here.

Multiscale compute jobs

Similar to cross-cluster parallel jobs, QCG can consign and coordinate multiscale compute jobs. This feature is deployed in a similar way: the QCG Coordinator is either deployed or an existing instance is reused, and the MUSCLE library is then deployed on the cluster worker nodes in the same way as the QCG-OMPI and ProActive libraries. This capability supports multi-scale workloads as defined by the MAPPER and COAST projects.

QCG-Notification

The QCG-Notification system is a reference implementation of the WS-Notification family of standards (Base Notification, Brokered Notification, Topics).

It can be integrated with many other WS-Notification-compliant systems, although QCG-Notification extends the WS-Notification standards with some management and discovery operations. QCG-Notification supports all mandatory and optional elements of the WS-Notification specifications, in particular topics, subscriptions and pull points.
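For illustration, a producer's notification in WS-BaseNotification 1.3 is a `wsnt:Notify` element wrapping one or more `wsnt:NotificationMessage` elements, each carrying a topic and the application payload. The sketch below builds only that SOAP body content with Python's standard `xml.etree.ElementTree`; the `JobStatus` topic and payload are hypothetical, and the SOAP envelope and WS-Addressing headers are left out.

```python
import xml.etree.ElementTree as ET

# Namespaces from the OASIS WS-BaseNotification 1.3 / WS-Topics 1.3 specifications
WSNT = "http://docs.oasis-open.org/wsn/b-2"
SIMPLE_TOPIC_DIALECT = "http://docs.oasis-open.org/wsn/t-1/TopicExpression/Simple"

ET.register_namespace("wsnt", WSNT)


def build_notify(topic: str, payload: ET.Element) -> ET.Element:
    """Wrap one application payload in a wsnt:Notify element.

    Only the SOAP Body content is built; the SOAP envelope and WS-Addressing
    headers a real producer would add are omitted.
    """
    notify = ET.Element(f"{{{WSNT}}}Notify")
    message = ET.SubElement(notify, f"{{{WSNT}}}NotificationMessage")
    topic_el = ET.SubElement(message, f"{{{WSNT}}}Topic", Dialect=SIMPLE_TOPIC_DIALECT)
    topic_el.text = topic
    body = ET.SubElement(message, f"{{{WSNT}}}Message")
    body.append(payload)
    return notify
```

A hypothetical payload such as a `JobStatus` element with a `jobId` attribute and the text `RUNNING` can then be serialised with `ET.tostring(build_notify("JobStatus", status), encoding="unicode")`.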

The QCG-Notification documentation provides extensive information about supported use cases, roles, deployments and installation & configuration.

Standards & interfaces

QCG-Notification supports all standardised interfaces that are defined by the WS-Notification family of specifications. Detailed information is available in the specification documents at OASIS. In short, the following WS-Notification interfaces and roles are implemented:

  1. BaseNotification
    1. NotificationConsumer
    2. NotificationProducer (sources of notifications are required to assume the role of a publisher in the brokered notification model)
    3. PullPoint
    4. CreatePullPoint
    5. SubscriptionManager
    6. PausableSubscriptionManager
  2. BrokeredNotification
    1. NotificationBroker
    2. RegisterPublisher
    3. PublisherRegistrationManager
  3. Topics
    1. TopicNamespace
    2. TopicType
    3. TopicSet
    4. TopicExpression

The WS-Notification specifications make use of the WS-Resource family of specifications. Therefore, QCG-Notification also implements the WS-Resource set of standards and interfaces.
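The PullPoint and CreatePullPoint interfaces listed above serve consumers that cannot accept pushed notifications: a pull point created through CreatePullPoint stores incoming Notify messages until the consumer fetches them with GetMessages. The in-memory sketch below mimics that behaviour; the class is hypothetical, and the drop-oldest policy when the buffer is full is an assumption, not something this page or the specification prescribes.

```python
from collections import deque
from typing import Deque, List, Optional


class PullPoint:
    """In-memory stand-in for a WS-Notification pull point (hypothetical sketch)."""

    def __init__(self, capacity: int = 1000) -> None:
        # When the buffer is full, the oldest message is silently discarded.
        self._buffer: Deque[str] = deque(maxlen=capacity)

    def notify(self, message: str) -> None:
        """Producer or broker side: store a notification instead of pushing it."""
        self._buffer.append(message)

    def get_messages(self, maximum: Optional[int] = None) -> List[str]:
        """Consumer side: remove and return up to `maximum` messages, oldest first."""
        count = len(self._buffer) if maximum is None else min(maximum, len(self._buffer))
        return [self._buffer.popleft() for _ in range(count)]
```

This is the pattern that lets a consumer behind a firewall poll for job events instead of exposing an endpoint that the broker must reach.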

QCG-Broker

The QCG-Broker is a system providing advance reservation and cross-cluster job submission capabilities. It interacts with QCG-Computing systems via the OGSA HPC BP interface.
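Co-allocation needs a start time at which every involved cluster can honour an advance reservation. Assuming each cluster reports its free periods as (start, end) pairs in a common time unit, the earliest common start can be found as sketched below; the function is hypothetical and does not describe QCG-Broker's actual reservation negotiation.

```python
from typing import Optional, Sequence, Tuple

Window = Tuple[float, float]  # (start, end) of a free period on one cluster


def earliest_common_start(free_windows: Sequence[Sequence[Window]],
                          duration: float,
                          not_before: float = 0.0) -> Optional[float]:
    """Earliest start at which every cluster has `duration` units free, or None."""
    # The earliest feasible start is either `not_before` or the start of some
    # free window, so only those candidate times need to be checked.
    candidates = sorted({not_before} | {s for windows in free_windows
                                        for s, _ in windows if s >= not_before})
    for t in candidates:
        if all(any(s <= t and t + duration <= e for s, e in windows)
               for windows in free_windows):
            return t
    return None
```

Once a common start is found, the broker would request an advance reservation for that slot on each cluster through its QCG-Computing service.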

QCG-Accounting

The QCG-Accounting system is an independent service; it is usually deployed close to the QCG-Computing system, since it queries the computing system's job database and parses the LRMS log files for accounting information.

A number of plugins for accounting infrastructures exist; these are capable of translating internal accounting information into the required output format, as well as contacting the corresponding accounting endpoint.
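The plugin idea can be sketched as a registry of formatter functions selected by target name. `JobRecord`, the `csv` and `json` targets and `export` are hypothetical; the real plugins produce the formats required by BAT, GridSafe or APEL and also contact the accounting endpoint, which the sketch omits.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass
class JobRecord:
    """Hypothetical internal accounting record."""
    job_id: str
    user: str
    wall_seconds: int
    cpu_seconds: int


def csv_plugin(records: List[JobRecord]) -> str:
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["job_id", "user", "wall_seconds", "cpu_seconds"])
    for r in records:
        writer.writerow([r.job_id, r.user, r.wall_seconds, r.cpu_seconds])
    return out.getvalue()


def json_plugin(records: List[JobRecord]) -> str:
    return json.dumps([asdict(r) for r in records])


# Registry: supporting another target accounting system means adding one entry
PLUGINS: Dict[str, Callable[[List[JobRecord]], str]] = {
    "csv": csv_plugin,
    "json": json_plugin,
}


def export(records: List[JobRecord], target: str) -> str:
    """Translate internal records into the output format of the chosen target."""
    return PLUGINS[target](records)
```

Keeping the internal record format fixed and putting all target-specific translation in plugins is what allows one accounting collector to feed several accounting infrastructures.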

Integration with the EGI Core Infrastructure platform

QCG integration with the Core Infrastructure platform

The QCG platform integrates with the EGI Core Infrastructure platform in the areas of accounting, monitoring and information discovery, as described below.

Accounting

The QCG Accounting service features an APEL SSM plugin, which stores accounting records in UR format for APEL SSM to transfer to the EGI APEL database. This requires the APEL SSM upload directory to be accessible to the QCG Accounting service, either directly (i.e. deployed on the same server) or by mounting remote directories, e.g. via pNFS.

Monitoring

QCG does not provide its own monitoring system. Instead, Nagios plugins are provided for the three customer-facing QCG services: Computing, Notification and Broker. Deploying and configuring these plugins is intentionally left to the system administrator.

For EGI, this requires the SAM technology provider to regularly pull the Nagios plugins and integrate them into the SAM system deployed by each Resource Provider, which then configures the plugins according to its local QCG deployments.
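A Nagios plugin is an executable that prints one status line, optionally followed by `|` and performance data, and reports its result through the exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. The sketch below applies these conventions to a plain TCP reachability probe; `check_endpoint`, its thresholds and the output wording are hypothetical, and the actual QCG plugins presumably test service operations rather than bare connectivity.

```python
import socket
import time
from typing import Tuple

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(seconds: float, warn: float, crit: float) -> int:
    """Map a response time onto a Nagios state using warning/critical thresholds."""
    if seconds >= crit:
        return CRITICAL
    if seconds >= warn:
        return WARNING
    return OK


def check_endpoint(host: str, port: int, warn: float = 1.0,
                   crit: float = 5.0) -> Tuple[int, str]:
    """Return (exit_code, status_line) for a TCP connection to a QCG service."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=crit):
            elapsed = time.monotonic() - start
    except OSError as exc:  # refused, unreachable or timed out
        return CRITICAL, f"QCG CRITICAL - {host}:{port} unreachable: {exc}"
    code = classify(elapsed, warn, crit)
    return code, (f"QCG {LABELS[code]} - {host}:{port} connected in {elapsed:.3f}s"
                  f" | time={elapsed:.3f}s;{warn};{crit}")
```

Wrapped in a script, the result would be emitted with `print(line)` followed by `sys.exit(code)`, so that Nagios or SAM can read both the status text and the state.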

Information Discovery

QCG system deployments must be registered in EGI's GOC DB. For this, the following three service types are available in GOC DB:

  • QCG.Computing
  • QCG.Notification
  • QCG.Broker