
QosCosGrid Platform



Platform Integrator: PSNC

Technology Provider: PSNC

More information: http://www.qoscosgrid.org/


The QosCosGrid (QCG) platform provides advance resource reservation, co-allocation and management capabilities that give users HPC-like performance and scalability. By connecting many local computing resources, QosCosGrid provides advanced monitoring and job execution capabilities for distributed and parallel C, C++, Fortran and Java applications.

Platform overview

QCG Platform diagram

The QCG platform comprises three key middleware services, two infrastructure integration systems, and several community-facing services that act as the user's main entry points to the QCG platform. General integration with the EGI Core Infrastructure platform is provided for Security, Monitoring, Accounting and Service endpoint discovery.

QCG-Computing

The QCG-Computing system forms the low-level block of the QCG platform, providing the main job execution capabilities. It is typically deployed in front of a compute cluster that is managed through an LRMS. QCG supports PBS, PBSPro, SLURM, LL, LSF, (S)GE and Torque out of the box. QCG-Computing is directly integrated with the EGI Core Infrastructure Platform for Security purposes: it accepts user authentication from EGI's X.509v3 PKI system, and authorises user access to computing resources based on generated grid-mapfiles. QCG-Computing supports parallel compute jobs and multi-scale jobs out of the box, provided that a suitable parallel and/or multi-scale toolkit is installed in the cluster.
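The grid-mapfile authorisation step described above can be illustrated with a small sketch. It assumes the conventional grid-mapfile layout, one quoted certificate subject DN followed by one or more comma-separated local accounts per line; the function names and the sample entries are hypothetical and are not taken from QCG-Computing itself.

```python
import shlex


def parse_gridmap(text):
    """Return {subject DN: local account} parsed from grid-mapfile text."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # honours the double quotes around the DN
        if len(parts) < 2:
            continue  # malformed entry: skip it rather than guess
        dn, accounts = parts[0], parts[1]
        # Assumption: when several accounts are listed, the first one is used.
        mapping[dn] = accounts.split(",")[0]
    return mapping


def authorise(dn, mapping):
    """Return the local account for an authenticated DN, or None if unmapped."""
    return mapping.get(dn)


# Hypothetical sample entries, for illustration only.
SAMPLE = '''
# subject DN                              local account(s)
"/C=PL/O=GRID/O=PSNC/CN=Jane Example" jexample
"/C=NL/O=Example Org/CN=John Doe" grid001,grid002
'''
```

A DN that does not appear in the file yields `None`, which a service would treat as "authenticated but not authorised".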

Capabilities
Execute Job, Parallel Job (ProActive, OpenMPI), Multi-scale Job (MUSCLE), Advance Reservation

QCG-Notification

The QCG-Notification system provides asynchronous notification of job progress to any subscribed notification consumer. The system supports direct end-user notification, provided the user is properly subscribed as a notification consumer (out of scope for this documentation). However, the typical use case of the QCG-Notification system is to support workflow engines and cross-cluster coordinating services (see below) in tracking the progress of individual tasks. The default underlying message transport protocol is WS messages over SOAP; QCG-Notification also supports e-mail (SMTP) and Jabber (XMPP) notification delivery. Though the main notification producer is the QCG-Computing service, QCG-Notification accepts any notification producer that implements the WS-Notification family of standards. The QCG-Notification system is a mandatory component of the QosCosGrid cross-platform computing capability.

Capabilities
Notification, Cross-Cluster computing (partial)

QCG-Broker

As the main notification consumer of the QCG-Notification system, the QCG-Broker is responsible for finding and consigning compute jobs to resources that are exposed through QCG-Computing systems, according to the requirements of users or higher-level tools. It does so by monitoring the state of connected QCG-Computing services and then directly submitting the job to the QCG-Computing system that best matches the requirements. Moreover, the QCG-Broker service is capable of co-allocating resources of multiple sites (through advance resource reservation provided by QCG-Computing), enabling cross-cluster computing. If combined with cluster programming toolkits for parallel jobs and multi-scale jobs, QCG-Broker provides cross-cluster parallel computing and cross-cluster multi-scale computing.
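The matching step can be sketched as a filter followed by a ranking. This is only an illustration of the idea: the `ClusterState` and `JobRequest` types, their fields, and the "most spare cores wins" ranking are assumptions made for this example and do not describe the actual QCG-Broker scheduling policy.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ClusterState:
    """Hypothetical snapshot of one monitored QCG-Computing service."""
    name: str
    free_cores: int
    toolkits: Set[str] = field(default_factory=set)


@dataclass
class JobRequest:
    """Hypothetical job requirements."""
    cores: int
    toolkit: Optional[str] = None  # e.g. "OpenMPI"; None means no requirement


def select_cluster(job: JobRequest, clusters: List[ClusterState]) -> Optional[ClusterState]:
    """Pick a cluster that satisfies the job, or None if none does."""
    candidates = [
        c for c in clusters
        if c.free_cores >= job.cores
        and (job.toolkit is None or job.toolkit in c.toolkits)
    ]
    if not candidates:
        return None
    # Assumed ranking: prefer the cluster with the most cores left over.
    return max(candidates, key=lambda c: c.free_cores - job.cores)
```

A real broker would weigh many more criteria (queue length, reservations, data locality); the sketch only shows where such criteria would plug in.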

Capabilities
Schedule Job, Co-allocation, cross-cluster computing (partial)

QCG Accounting

The QCG-Accounting system is not a user-facing system. It queries the QCG-Computing system for accounting information and feeds this information to target accounting systems using plugins. Currently, plugins exist for the PL-Grid accounting system (BAT), GridSafe, and APEL SSM v0.2 (soon to be replaced by EMI-CAR).

Capabilities
Accounting

QCG Monitoring

The QCG Platform integrates with the NAGIOS monitoring system by providing Nagios monitoring plugins (not shown). Individual, independent NAGIOS instances remain possible, but EGI deployments of the QCG platform will integrate with the EGI SAM system by including the QCG Nagios plugins in EGI SAM for NGI-wide deployment.

Capabilities
Monitoring

QCG-ScienceGateways (et al.)

The QCG platform also includes several Research Community services and tools, represented by its most prominent member, the QCG-ScienceGateways system. These services typically provide portal services to the consuming end-user communities, but also include a mobile client (QCG-Mobile). Another application gaining popularity is QCG-Icon, a lightweight desktop application for Windows, Mac OS X and Linux platforms that aims to provide transparent access to applications installed on remote clusters.

Technical Architecture

This section provides more information on the key systems of the QosCosGrid platform. Its target audience is Platform Operators, validators and site admins for Staged Rollout activities.

QCG-Computing

Principal QCG Computing architecture

The functionality of the QCG Computing system is provided by six components; their deployment depends on the needed functionality (see below).

The middleware components are typically deployed together into a QCG Computing service. QCG Compute, Gridmapfile and LRMS/DRMAA provide the essential Compute Job and Parallel Job capabilities; QCG Computing is capable of processing parallel jobs out of the box through OpenMPI. QCG Compute provides the core job processing and management functionality. Gridmapfile is used to interface to the EGI Core Infrastructure platform by means of X.509v3 PKI and gridmap files. The LRMS/DRMAA component provides the integration with Local Resource Management Systems (LRMS) deployed by the Resource Provider. QCG Compute supports the following LRMSs:

  1. PBS
  2. PBSPro
  3. SLURM
  4. LL
  5. LSF
  6. (S)GE
  7. TORQUE

QCG Computing extensions

Advanced QCG Computing capabilities

A speciality of the QCG Computing platform is its capability of coordinating parallel jobs that span across multiple computing clusters, and even across multiple Resource Providers. Multiscale computing refers to deploying a complex parallel compute job across several compute resources (including HPC, HTC and clusters) that are correlated and parallel. Particularly the MAPPER project makes heavy use of this capability.

QCG provides these by using a common coordinator component, the QCG Coordinator, which must be deployed in "Grid space", i.e. outside Resource Provider firewalls, either as a truly public service or located in the DMZ of a Resource Provider. This Coordinator then communicates directly with the corresponding libraries deployed on the cluster worker nodes. These in turn communicate with the LRMS deployed on the cluster head node.

Thus, the QCG extension consists of the QCG Coordinator plus the appropriate cluster worker node libraries as described below.

Cross-cluster parallel Jobs

QCG is able to span parallel jobs across multiple QCG-Computing managed compute clusters, even if these are operated by different Resource Providers. This capability is available for applications written in C, C++, Fortran and Java.

Parallel Java
Parallel Java is provided by deploying the ProActive library on the cluster worker nodes, complementing a QCG Coordinator deployment.
Parallel C
Parallel C allows the execution of parallel applications written in C, C++ or Fortran. This is done by deploying a patched OpenMPI library on the compute cluster. This patched version is fully compatible with the standard OpenMPI library, but adds the cross-cluster feature for QCG. More information is available here.

Multiscale compute jobs

Similar to cross-cluster parallel jobs, QCG allows multiscale compute jobs to be consigned and coordinated. This feature is deployed in a similar way: either a QCG-Coordinator is deployed, or an existing instance is reused. The only thing left to deploy is the MUSCLE library on the cluster worker nodes, in the same way as the QCG-OMPI and ProActive libraries. This capability supports multi-scale workloads as defined by the MAPPER project and the COAST project.

Interfaces & Standards

QCG Computing: Standards based interfaces

QCG Computing employs a number of standards at the interfaces to its external and other modular components. These standards were developed by two well-known standardisation bodies, OASIS and OGF (the WS-I organisation is now integrated into OASIS).

OGF DRMAA integration with LRMS

QCG Compute uses OGF DRMAA 1.0 to integrate with various LRMSs. QCG Compute acts as a client to the DRMAA-compliant implementations and is implemented against the DRMAA service interface.

A number of DRMAA 1.0 integrations for LRMSs are available, either from PSNC directly (e.g. for IBM LoadLeveler, LL), or bundled and sourced from elsewhere. The corresponding DRMAA plugin needs to be installed and configured in the QCG Computing configuration file.

An overview of available DRMAA implementations for LRMSs is available at the DRMAA implementations part of the DRMAA WG web site.

OGF HPC BP integration for Compute job submissions

QCG Compute implements the OGF HPC BP (High Performance Computing Basic Profile), which includes by reference three standard specifications and one extension: WS-I Basic Profile 1.1, JSDL 1.0, JSDL HPC 1.0, and OGSA-BES. Together, these standards ensure common Web Service interoperability (WS-I Basic Profile, by OASIS) and a common (HPC) Compute service interface (OGSA-BES, by OGF) that accepts job descriptions expressed in JSDL and its HPC extension (both OGF).
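To make the job description format more concrete, the following sketch builds a minimal JSDL document that uses the HPC Profile Application extension. The namespaces and element names follow JSDL 1.0 and the HPC Profile Application extension as published by OGF; the job name, executable, arguments and output file name are hypothetical, and the sketch neither wraps the document in an OGSA-BES request nor submits it anywhere.

```python
import xml.etree.ElementTree as ET

# Namespaces of JSDL 1.0 and of the JSDL HPC Profile Application extension.
JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
HPCPA = "http://schemas.ggf.org/jsdl/2006/07/jsdl-hpcpa"


def build_jsdl(job_name, executable, arguments, cpu_count):
    """Return a minimal JSDL job description as an XML string."""
    ET.register_namespace("jsdl", JSDL)
    ET.register_namespace("jsdl-hpcpa", HPCPA)

    root = ET.Element(f"{{{JSDL}}}JobDefinition")
    desc = ET.SubElement(root, f"{{{JSDL}}}JobDescription")

    ident = ET.SubElement(desc, f"{{{JSDL}}}JobIdentification")
    ET.SubElement(ident, f"{{{JSDL}}}JobName").text = job_name

    app = ET.SubElement(desc, f"{{{JSDL}}}Application")
    hpc = ET.SubElement(app, f"{{{HPCPA}}}HPCProfileApplication")
    ET.SubElement(hpc, f"{{{HPCPA}}}Executable").text = executable
    for arg in arguments:
        ET.SubElement(hpc, f"{{{HPCPA}}}Argument").text = arg
    # Hypothetical output file name, chosen for this example only.
    ET.SubElement(hpc, f"{{{HPCPA}}}Output").text = "stdout.txt"

    resources = ET.SubElement(desc, f"{{{JSDL}}}Resources")
    cpus = ET.SubElement(resources, f"{{{JSDL}}}TotalCPUCount")
    ET.SubElement(cpus, f"{{{JSDL}}}Exact").text = str(cpu_count)

    return ET.tostring(root, encoding="unicode")
```

Calling `build_jsdl("demo", "/bin/hostname", ["-f"], 4)` yields a document of the kind a BES-compliant compute service accepts as the payload of a job submission.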

The OGSA-BES standard allows the integration of notification capabilities, and standardises the use of either WS-Notification or WS-Eventing as a standardised notification interface. QCG-Computing integrates with the QCG-Notification system by assuming the WS-Notification::Publisher role (see below).

QCG-Notification

The QCG-Notification system is a reference implementation of the WS-Notification family of standards (Base Notification, Brokered Notification, Topics).

It can be integrated with many other WS-Notification compliant systems, even though QCG-Notification extends the WS-Notification standards with some management and discovery operations. QCG-Notification supports all mandatory and optional elements of WS-Notification specifications, particularly topics, subscriptions, and pull points.

The QCG-Notification documentation provides extensive information about supported use cases, roles, deployments and installation & configuration.

Standards & interfaces

QCG-Notification supports all standardised interfaces that are defined by the WS-Notification family of specifications. Detailed information is available in the specification documents at OASIS. In short, the following WS-Notification interfaces and roles are implemented:

  1. BaseNotification
    1. NotificationConsumer
    2. NotificationProducer (sources of notifications are required to assume the role of a publisher in the brokered notification model)
    3. PullPoint
    4. CreatePullPoint
    5. SubscriptionManager
    6. PausableSubscriptionManager
  2. BrokeredNotification
    1. NotificationBroker
    2. RegisterPublisher
    3. PublisherRegistrationManager
  3. Topics
    1. TopicNamespace
    2. TopicType
    3. TopicSet
    4. TopicExpression

The WS-Notification specifications make use of the WS-Resource family of specifications. Therefore, QCG-Notification also implements the WS-Resource set of standards and interfaces.
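The interplay of the roles listed above can be illustrated with a small in-memory model. This is a conceptual sketch only: it uses plain Python objects instead of SOAP messages, exact string matching instead of topic expressions, and simply drops messages for paused subscriptions. All class, method and topic names are assumptions made for this example and do not reflect the QCG-Notification API.

```python
import itertools
from collections import deque


class PullPoint:
    """Consumer that buffers notifications until they are fetched."""

    def __init__(self):
        self._queue = deque()

    def notify(self, topic, message):
        self._queue.append((topic, message))

    def get_messages(self, maximum=None):
        """Remove and return up to `maximum` buffered notifications."""
        count = len(self._queue) if maximum is None else min(maximum, len(self._queue))
        return [self._queue.popleft() for _ in range(count)]


class NotificationBroker:
    """Forwards published notifications to subscribed consumers."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._subscriptions = {}  # subscription id -> subscription details

    def subscribe(self, topic, consumer):
        sub_id = next(self._ids)
        self._subscriptions[sub_id] = {"topic": topic, "consumer": consumer, "paused": False}
        return sub_id

    def pause(self, sub_id):
        self._subscriptions[sub_id]["paused"] = True

    def resume(self, sub_id):
        self._subscriptions[sub_id]["paused"] = False

    def unsubscribe(self, sub_id):
        del self._subscriptions[sub_id]

    def notify(self, topic, message):
        """Called by a publisher; returns the number of deliveries made."""
        delivered = 0
        for sub in self._subscriptions.values():
            if sub["topic"] == topic and not sub["paused"]:
                sub["consumer"].notify(topic, message)
                delivered += 1
        return delivered
```

In this model a job execution service would act as the publisher calling `notify`, while a scheduler would either subscribe directly or poll a `PullPoint` when it cannot accept incoming connections.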

QCG-Broker

The QCG-Broker is a system providing advance reservation and cross-cluster job submission capabilities. It interacts with QCG-Computing systems via the OGSA HPC BP interface.
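Co-allocation through advance reservation amounts to finding a time window in which every participating cluster can reserve the requested resources simultaneously. The following sketch shows one way to compute the earliest such window. The data model (per-cluster lists of free `(start, end)` intervals in arbitrary integer time units) and the function name are assumptions for illustration, not the QCG-Broker algorithm.

```python
def earliest_common_start(free_slots, duration):
    """Return the earliest start at which all clusters are free for `duration`.

    free_slots maps a cluster name to a list of (start, end) free intervals.
    Returns None when no common window exists.
    """
    # The earliest feasible start always coincides with the start of some
    # free interval, so only those instants need to be tested.
    candidates = sorted({start for slots in free_slots.values() for start, _ in slots})
    for start in candidates:
        end = start + duration
        if all(
            any(s <= start and end <= e for s, e in slots)
            for slots in free_slots.values()
        ):
            return start
    return None
```

Once a common start has been found, a broker would request a reservation for that window on each cluster and fall back to the next candidate if any single reservation is refused.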

QCG-Accounting

The QCG-Accounting system is an independent service; it is usually deployed in close proximity to the QCG-Computing system, since it queries the computing system's job database and parses the LRMS log files for accounting information.

A number of plugins for accounting infrastructures exist; these are capable of translating internal accounting information into the required output format, as well as contacting the corresponding accounting endpoint.
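The two plugin responsibilities named above, translating a record and delivering it, can be sketched as a small interface. Everything in this example is hypothetical: the class names, the record fields and both output formats are invented for illustration and are not the formats used by BAT, GridSafe or APEL, nor the real QCG-Accounting plugin API.

```python
import json
from abc import ABC, abstractmethod


class AccountingPlugin(ABC):
    """Hypothetical plugin interface: translate a record, then deliver it."""

    def __init__(self):
        self.outbox = []  # stands in for a real accounting endpoint

    @abstractmethod
    def translate(self, record):
        """Convert an internal record (a dict) to the target format."""

    def deliver(self, payload):
        """Hand the payload to the endpoint; here it is only stored."""
        self.outbox.append(payload)


class KeyValuePlugin(AccountingPlugin):
    """Illustrative 'key: value' text format."""

    def translate(self, record):
        return "\n".join(f"{key}: {record[key]}" for key in sorted(record))


class JsonPlugin(AccountingPlugin):
    """Illustrative JSON format."""

    def translate(self, record):
        return json.dumps(record, sort_keys=True)


def publish(record, plugins):
    """Send one internal accounting record through every configured plugin."""
    for plugin in plugins:
        plugin.deliver(plugin.translate(record))
```

Keeping translation and delivery separate means a new accounting infrastructure can be supported by adding one class, without touching the code that collects the records.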

Integration with the EGI Core Infrastructure platform

QCG integration with the Core Infrastructure platform

The QCG platform integrates seamlessly with the EGI Core Infrastructure platform.

Accounting

The QCG Accounting service features an APEL SSM plugin, so that it can store accounting records in UR format for the APEL SSM to transfer to the EGI APEL database. This requires that the APEL SSM upload directory is accessible by the QCG Accounting service either directly (i.e. deployed on the same server) or by means of mounting remote directories via e.g. pNFS.

Monitoring

QCG does not provide its own monitoring system. Instead, NAGIOS plugins are provided for the three customer-facing QCG services: Computing, Notification and Broker. The deployment and configuration of these plugins is intentionally left to the system administrator.
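For readers unfamiliar with Nagios plugins, the following sketch shows the general shape of one: a probe, a single status line, and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). It is a generic TCP reachability check written for this page, not one of the actual QCG plugins; the thresholds and function names are assumptions.

```python
import socket
import time

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def probe(host, port, timeout=10.0):
    """Return the TCP connect time in seconds, or None if the connect fails."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - started
    except OSError:
        return None


def classify(latency, warn, crit):
    """Map a measured latency to a Nagios exit code and status line."""
    if latency is None:
        return CRITICAL, "CRITICAL - connection failed"
    if latency >= crit:
        return CRITICAL, f"CRITICAL - connect took {latency:.2f}s"
    if latency >= warn:
        return WARNING, f"WARNING - connect took {latency:.2f}s"
    return OK, f"OK - connect took {latency:.2f}s"


def main(argv):
    """Entry point; expects argv = [program, host, port]."""
    if len(argv) != 3 or not argv[2].isdigit():
        print("UNKNOWN - usage: check_endpoint HOST PORT")
        return UNKNOWN
    # Assumed thresholds: warn after 1 second, critical after 5 seconds.
    code, text = classify(probe(argv[1], int(argv[2])), warn=1.0, crit=5.0)
    print(text)
    return code
```

When run as a script, the plugin would end with `sys.exit(main(sys.argv))` so that Nagios receives the status through the process exit code.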

For EGI, this requires the SAM technology provider to regularly pull the Nagios plugins and integrate these into the SAM system that is deployed by each Resource Provider, who then configures the plugins according to local QCG deployments.

Information Discovery

QCG system deployments must be registered in EGI's GOC DB. For this, the following three service types are available in GOC DB:

  • QCG.Computing
  • QCG.Notification
  • QCG.Broker