The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed you can get in touch with EGI SDIS team using operations @ egi.eu.

QosCosGrid Platform



Revision as of 18:23, 15 November 2012



Platform Integrator: PSNC

Technology Provider: PSNC 
More information: http://www.qoscosgrid.org/

The QosCosGrid (QCG) platform provides advanced resource reservation and management capabilities, giving users HPC-like performance and scalability. By connecting many local computing resources, QosCosGrid provides advanced monitoring and job execution capabilities for distributed and parallel C, C++, Fortran and Java applications.

The QosCosGrid platform

Overview

QCG Platform diagram

The QCG platform comprises four key middleware services and several community-facing services that act as the user's main entry points to the platform. General integration with the EGI Core Infrastructure platform is provided for Security, Monitoring, Accounting and Service endpoint discovery.

The QCG Computing system forms the heart of the QCG platform, providing the main job execution capabilities. When configured and extended accordingly, cross-cluster parallel jobs and multi-scale cross-cluster compute jobs deliver HPC-like performance without being backed by an HPC system. QCG Computing is typically deployed in front of one or more local compute clusters that are managed through LRMSs. QCG supports PBS, PBSPro, SLURM, LL, LSF, (S)GE and TORQUE out of the box. QCG Computing is directly integrated with the EGI Core Infrastructure platform for security purposes: it accepts user authentication from EGI's X.509v3 PKI system, and authorises user access to computing resources by generating and maintaining gridmap files.
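The gridmap-based authorisation step can be illustrated with a minimal sketch. It assumes the common grid-mapfile layout (a quoted certificate subject DN followed by one or more comma-separated local accounts); the function names and the sample entries are hypothetical and are not taken from QCG.

```python
import shlex

def parse_gridmap(text):
    """Return {subject DN: [local accounts]} from grid-mapfile text.

    Assumes one entry per line: a double-quoted DN, whitespace, then
    a comma-separated list of local account names.
    """
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # honours the quotes around the DN
        if len(parts) != 2:
            continue  # ignore malformed entries in this sketch
        dn, accounts = parts
        mapping[dn] = accounts.split(",")
    return mapping

def authorise(mapping, dn):
    """Return the first local account mapped to dn, or None if unmapped."""
    accounts = mapping.get(dn)
    return accounts[0] if accounts else None

# Hypothetical entries, for illustration only.
SAMPLE = '''
# subject DN                               local account(s)
"/C=PL/O=GRID/O=PSNC/CN=Jane Example" jane
"/C=UK/O=eScience/CN=John Example" john,plgjohn
'''
```

A user whose DN is absent from the file is simply not mapped to any local account, which is how access is denied in this model.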

The QCG Notification system provides asynchronous notification of job progress to any subscribed notification consumer. It is particularly useful, and in fact required, for deployments that support cross-cluster compute jobs. The system supports direct end-user notification, provided the user is properly subscribed as a notification consumer (out of scope for this documentation); however, the typical use case is to support workflow engines and cross-cluster coordinating services (see below) in tracking the progress of individual tasks. The main notification producer is the QCG Computing system, which exposes a diverse set of notification topics on compute jobs and on the QCG Computing service itself.

The QCG Broker is responsible for finding and consigning compute jobs to resources that are exposed through QCG Computing systems, as per the requirements of higher-level services. It does so by monitoring the state of connected QCG Computing services through the corresponding QCG Notification services, and then directly submitting the job to the QCG Computing system that best matches the requirements.
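The brokering idea described above (track resource state from notifications, then pick the best-matching resource) can be sketched as follows. This is a deliberately simplified model: the `ClusterState` and `JobRequest` types, the single "free cores" criterion and the selection rule are all assumptions made for illustration, not the QCG Broker's actual scheduling policy.

```python
from dataclasses import dataclass

@dataclass
class ClusterState:
    name: str        # hypothetical identifier of a QCG Computing service
    free_cores: int  # last value reported through a notification

@dataclass
class JobRequest:
    cores: int       # cores the job needs

def apply_notification(states, name, free_cores):
    """Record the latest state reported for a cluster."""
    states[name] = ClusterState(name, free_cores)

def pick_cluster(job, states):
    """Return the cluster with the most free cores that fits the job, or None."""
    candidates = [s for s in states.values() if s.free_cores >= job.cores]
    return max(candidates, key=lambda s: s.free_cores, default=None)
```

For example, after notifications reporting 8, 64 and 32 free cores for three clusters, a 16-core job would be directed to the cluster with 64 free cores, and a 128-core job would find no match.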

The QCG platform also includes several Research Community services, represented by its most prominent member, the QCG Science Gateways system. These services typically provide portal services to the consuming end-user communities, but also include a mobile client (QCG Mobile) and a Windows-based visualisation tool (QCG Icon), among others.

The QCG Accounting system is not a user-facing service. It queries the QCG Computing system for accounting information and feeds it as OGF Usage Records into the EGI Core Infrastructure accounting system.

The QCG platform integrates with the Nagios monitoring system by providing Nagios monitoring plugins (not shown). Although individual, independent Nagios instances are possible, EGI deployments of the QCG platform will integrate with the EGI SAM system by including the QCG Nagios plugins in EGI SAM for NGI-wide deployment.

QCG Computing

Principal QCG Computing architecture

The functionality of the QCG Computing system is provided by six components; their deployment depends on the needed functionality (see below).

The middleware components are typically deployed together into a QCG Computing service. QCG Compute, Gridmapfile and LRMS/DRMAA provide the essential Compute Job and Parallel Job capabilities; QCG Computing is capable of processing parallel jobs out of the box through OpenMPI. QCG Compute provides the core job processing and management functionality. Gridmapfile is used to interface with the EGI Core Infrastructure platform by means of X.509v3 PKI and gridmap files. The LRMS/DRMAA component provides the integration with the Local Resource Management Systems (LRMS) deployed by the Resource Provider. QCG Compute supports the following LRMSs:

  1. PBS
  2. PBSPro
  3. SLURM
  4. LL
  5. LSF
  6. (S)GE
  7. TORQUE

QCG Computing extensions

Advanced QCG Computing capabilities

A speciality of the QCG Computing platform is its capability of coordinating parallel jobs that span multiple computing clusters, and even multiple Resource Providers. Multiscale computing refers to deploying a complex parallel compute job across several compute resources (including HPC, HTC and clusters) that are correlated and parallel. The MAPPER project in particular makes heavy use of this capability.

QCG provides these capabilities through a common coordinator component, the QCG Coordinator, which must be deployed in "Grid space", i.e. outside Resource Provider firewalls, either as a truly public service or located in the DMZ of a Resource Provider. The Coordinator communicates directly with the corresponding libraries deployed on the cluster worker nodes. These in turn communicate with the LRMS deployed on the cluster head node.

Thus, the QCG extension consists of the QCG Coordinator plus the appropriate cluster worker node libraries as described below.

Cross-cluster parallel Jobs

QCG is able to span parallel jobs across multiple QCG-Computing managed compute clusters, even if these are operated by different Resource Providers. This capability is available for applications written in C, C++, Fortran and Java.

Parallel Java
Parallel Java is provided by deploying the ProActive library on the cluster worker nodes, complementing a QCG Coordinator deployment.
Parallel C
Parallel C allows the execution of parallel applications written in C, C++ or Fortran. This is done by deploying a patched OpenMPI library on the compute cluster. The patched version is fully compatible with the standard OpenMPI library, but adds the cross-cluster feature for QCG. More information is available here.

Multiscale compute jobs

Similar to cross-cluster parallel jobs, QCG allows multiscale compute jobs to be consigned and coordinated. This feature is deployed in a similar way: either a QCG Coordinator is deployed or an existing instance is reused. The only remaining step is to deploy the MUSCLE library on the cluster worker nodes, in the same way as the QCG-OMPI and ProActive libraries. This capability supports multiscale workloads as defined by the MAPPER project and the COAST project.

Interfaces & Standards

QCG Computing: Standards based interfaces

QCG Computing employs a number of standards at the interfaces to its external and other modular components. These standards were developed by two well-known standardisation bodies, OASIS and OGF (the WS-I organisation is now integrated into OASIS).

OGF DRMAA integration with LRMS

QCG Compute uses OGF DRMAA 1.0 to integrate with a range of LRMSs. QCG Compute acts as a client to the DRMAA-compliant implementations and is implemented against the DRMAA service interface.

A number of DRMAA 1.0 integrations for LRMSs are available, either provided by PSNC directly (e.g. for IBM LoadLeveler, LL), or bundled and sourced from elsewhere. The corresponding DRMAA plugin needs to be installed and configured in the QCG Computing configuration file.

OGF HPC BP integration for Compute job submissions

QCG Compute implements the OGF HPC BP (High Performance Computing Basic Profile), which includes by reference three standard specifications and one extension: WS-I Basic Profile 1.1, JSDL 1.0, JSDL HPC 1.0, and OGSA-BES. Together, these standards ensure common Web Service interoperability (WS-I Basic Profile, by OASIS) and a common (HPC) Compute service interface (OGSA-BES, by OGF) that accepts job descriptions expressed in JSDL and its HPC extension (both OGF).
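To give an idea of what a job description in JSDL with the HPC extension looks like, the following sketch builds a minimal document with Python's standard library. The namespace URIs are those of JSDL 1.0 and the JSDL HPC Profile Application extension as I understand them; the `build_jsdl` helper, the job name and the executable are hypothetical, and a real submission to QCG Computing may require further elements.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed from the JSDL 1.0 and HPC Profile Application specs.
JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
HPCPA = "http://schemas.ggf.org/jsdl/2006/07/jsdl-hpcpa"

def build_jsdl(job_name, executable, arguments=()):
    """Return a minimal JSDL job definition as an XML string."""
    ET.register_namespace("jsdl", JSDL)
    ET.register_namespace("jsdl-hpcpa", HPCPA)
    root = ET.Element(f"{{{JSDL}}}JobDefinition")
    desc = ET.SubElement(root, f"{{{JSDL}}}JobDescription")
    ident = ET.SubElement(desc, f"{{{JSDL}}}JobIdentification")
    ET.SubElement(ident, f"{{{JSDL}}}JobName").text = job_name
    app = ET.SubElement(desc, f"{{{JSDL}}}Application")
    hpc = ET.SubElement(app, f"{{{HPCPA}}}HPCProfileApplication")
    ET.SubElement(hpc, f"{{{HPCPA}}}Executable").text = executable
    for arg in arguments:
        # One Argument element per command-line argument, in order.
        ET.SubElement(hpc, f"{{{HPCPA}}}Argument").text = arg
    return ET.tostring(root, encoding="unicode")
```

Calling `build_jsdl("demo", "/bin/hostname", ["-f"])` yields a document that an OGSA-BES client would embed in its job creation request.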

The OGSA-BES standard allows the integration of notification capabilities, and standardises the use of either WS-Notification or WS-Eventing as a standardised notification interface. QCG Computing integrates with the QCG-Notification system by assuming the WS-Notification::Publisher role (see below).

QCG Notification

The QCG Notification system is a reference implementation of the WS-Notification family of standards (Base Notification, Brokered Notification, Topics).

It can be integrated with many other WS-Notification compliant systems, even though QCG Notification extends the WS-Notification standards with some management and discovery operations. QCG Notification supports all mandatory and optional elements of WS-Notification specifications, particularly topics, subscriptions, and pull points.

The QCG Notification documentation provides extensive information about supported use cases, roles, deployments and installation & configuration.

Standards & interfaces

QCG Notification supports all standardised interfaces that are defined by the WS-Notification family of specifications. Detailed information is available in the specification documents at OASIS. In short, the following WS-Notification interfaces and roles are implemented:

  1. BaseNotification
    1. NotificationConsumer
    2. NotificationProducer (sources of notifications are required to assume the role of a publisher in the brokered notification model)
    3. PullPoint
    4. CreatePullPoint
    5. SubscriptionManager
    6. PausableSubscriptionManager
  2. BrokeredNotification
    1. NotificationBroker
    2. RegisterPublisher
    3. PublisherRegistrationManager
  3. Topics
    1. TopicNamespace
    2. TopicType
    3. TopicSet
    4. TopicExpression

The WS-Notification specifications make use of the WS-Resource family of specifications. Therefore QCG Notification implements the WS-Resource set of standards and interfaces.
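The interplay of the roles listed above can be illustrated with a small in-memory model. It is only a conceptual sketch: real WS-Notification exchanges are SOAP messages between Web Services, and the class and method names below are hypothetical stand-ins for the NotificationBroker, Subscribe and PullPoint concepts rather than QCG Notification APIs.

```python
from collections import defaultdict, deque

class PullPoint:
    """Consumer that stores notifications until a client pulls them."""
    def __init__(self):
        self._queue = deque()

    def notify(self, topic, message):
        self._queue.append((topic, message))

    def get_messages(self, maximum=None):
        """Remove and return up to `maximum` queued notifications."""
        out = []
        while self._queue and (maximum is None or len(out) < maximum):
            out.append(self._queue.popleft())
        return out

class NotificationBroker:
    """Routes messages from publishers to consumers subscribed per topic."""
    def __init__(self):
        self._subscriptions = defaultdict(list)

    def subscribe(self, topic, consumer):
        self._subscriptions[topic].append(consumer)
        return (topic, consumer)  # handle used to unsubscribe later

    def unsubscribe(self, handle):
        topic, consumer = handle
        self._subscriptions[topic].remove(consumer)

    def publish(self, topic, message):
        # Copy the list so consumers may unsubscribe while being notified.
        for consumer in list(self._subscriptions[topic]):
            consumer.notify(topic, message)
```

In this model a publisher such as QCG Computing would call `publish` on a job-state topic, while a workflow engine behind a firewall could create a `PullPoint`, subscribe it, and poll it with `get_messages`.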

QCG Broker

The QCG Broker is a system providing advanced reservation and cross-cluster job submission capabilities. It interacts with QCG Computing systems via the OGSA HPC BP interface.

QCG Accounting

The QCG Accounting system is an independent service; it is usually deployed in close proximity to the QCG Computing system, since it queries the computing system's job database and parses the LRMS log files for accounting information.
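As an illustration of what parsing LRMS log files for accounting information involves, the sketch below reads one job-end record in the semicolon-separated layout used by PBS/TORQUE accounting logs. The sample line is invented, the field handling is simplified (values containing spaces are not supported), and the function names are hypothetical; this is not QCG Accounting's parser.

```python
def parse_record(line):
    """Split an accounting record into its header and key=value fields.

    Assumed layout: timestamp;record type;job id;space-separated key=value pairs
    """
    timestamp, rtype, job_id, payload = line.rstrip("\n").split(";", 3)
    fields = dict(item.split("=", 1) for item in payload.split() if "=" in item)
    return {"timestamp": timestamp, "type": rtype, "job_id": job_id, "fields": fields}

def walltime_seconds(hms):
    """Convert an HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Invented sample record ("E" marks a job-end record).
SAMPLE = ("11/15/2012 11:00:00;E;1234.head.example.org;"
          "user=jane group=users queue=batch start=1352973600 "
          "end=1352977200 resources_used.walltime=01:00:00")
```

The resulting dictionary contains the raw material (user, start and end times, wall time) that an accounting plugin would translate into its output format.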

A number of plugins for accounting infrastructures exist; these are capable of translating internal accounting information into the required output format, as well as contacting the corresponding accounting endpoint.

Integration with the EGI Core Infrastructure platform

QCG integration with the Core Infrastructure platform

The QCG platform integrates seamlessly with the EGI Core Infrastructure platform.

Accounting

The QCG Accounting service features an APEL SSM plugin, so that it can store accounting records in UR format for the APEL SSM to transfer to the EGI APEL database. This requires that the APEL SSM upload directory is accessible by the QCG Accounting service either directly (i.e. deployed on the same server) or by means of mounting remote directories via e.g. pNFS.
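The hand-over through a shared directory can be sketched as follows. The example builds a small record using element names from the OGF Usage Record schema as I understand it and writes it atomically into a directory. The file naming and directory layout are assumptions made for illustration; the layout actually expected by the APEL SSM is not described here and should be taken from its documentation.

```python
import os
import tempfile
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace assumed from the OGF Usage Record format.
URF = "http://schema.ogf.org/urf/2003/09/urf"
ISO = "%Y-%m-%dT%H:%M:%SZ"

def build_usage_record(local_job_id, local_user, start, end):
    """Return a minimal usage record (XML string) for one finished job.

    `start` and `end` are expected to be UTC datetimes.
    """
    ET.register_namespace("urf", URF)
    q = lambda tag: f"{{{URF}}}{tag}"
    rec = ET.Element(q("UsageRecord"))
    ET.SubElement(rec, q("RecordIdentity"), {
        q("recordId"): str(uuid.uuid4()),
        q("createTime"): datetime.now(timezone.utc).strftime(ISO),
    })
    job = ET.SubElement(rec, q("JobIdentity"))
    ET.SubElement(job, q("LocalJobId")).text = local_job_id
    user = ET.SubElement(rec, q("UserIdentity"))
    ET.SubElement(user, q("LocalUserId")).text = local_user
    wall = int((end - start).total_seconds())
    ET.SubElement(rec, q("WallDuration")).text = f"PT{wall}S"
    ET.SubElement(rec, q("StartTime")).text = start.strftime(ISO)
    ET.SubElement(rec, q("EndTime")).text = end.strftime(ISO)
    return ET.tostring(rec, encoding="unicode")

def store_record(xml_text, upload_dir):
    """Write the record into upload_dir without exposing partial files."""
    fd, tmp_path = tempfile.mkstemp(dir=upload_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(xml_text)
    final_path = os.path.join(upload_dir, f"{uuid.uuid4().hex}.xml")
    os.replace(tmp_path, final_path)  # atomic rename within one filesystem
    return final_path
```

Writing to a temporary name and renaming afterwards matters when the directory is shared or remotely mounted, because the reader then never sees a half-written record.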

Monitoring

QCG does not provide its own monitoring system. Instead, Nagios plugins are provided for the three customer-facing QCG services: Computing, Notification and Broker. The deployment and configuration of these plugins is intentionally left to the system administrator.

For EGI, this requires the SAM technology provider to regularly pull the Nagios plugins and integrate these into the SAM system that is deployed by each Resource Provider, who then configures the plugins according to local QCG deployments.
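For readers unfamiliar with Nagios plugins, the following sketch shows the general shape of one: a check that prints a single status line and reports its result through the conventional exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The check itself, a plain TCP connection test, is a stand-in chosen for illustration; the actual QCG plugins and their options are not reproduced here.

```python
import socket

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_endpoint(host, port, timeout=5.0):
    """Return (exit code, status line) for a TCP reachability check."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} accepts connections"
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"

def main(argv):
    """Entry point: argv is [host, port]; returns the plugin exit code."""
    if len(argv) != 2 or not argv[1].isdigit():
        print("UNKNOWN - usage: check_endpoint HOST PORT")
        return UNKNOWN
    code, status = check_endpoint(argv[0], int(argv[1]))
    print(status)
    return code
```

A script wrapping this would end with `sys.exit(main(sys.argv[1:]))`, so that Nagios or SAM can read the result from the process exit status. The host and port are the site-specific values that each Resource Provider configures locally.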

Information Discovery

QCG system deployments must be registered in EGI's GOC DB. For this, the following three service types are available in GOC DB:

  • QCG.Computing
  • QCG.Notification
  • QCG.Broker
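Once registered, endpoints of these service types can in principle be discovered programmatically. The sketch below only constructs a query URL; it assumes that the GOC DB programmatic interface is reachable under `/gocdbpi/public/` and offers a `get_service_endpoint` method with a `service_type` parameter, which should be verified against the GOC DB documentation before use. The helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the GOC DB programmatic interface.
GOCDB_PI = "https://goc.egi.eu/gocdbpi/public/"

# The three service types listed above.
QCG_SERVICE_TYPES = ("QCG.Computing", "QCG.Notification", "QCG.Broker")

def endpoint_query_url(service_type):
    """Return the (assumed) query URL listing endpoints of a QCG service type."""
    if service_type not in QCG_SERVICE_TYPES:
        raise ValueError(f"not a QCG service type: {service_type}")
    query = urlencode({"method": "get_service_endpoint",
                       "service_type": service_type})
    return f"{GOCDB_PI}?{query}"
```

Rejecting unknown names early guards against typos such as a lowercase `qcg.broker`, which would otherwise silently return no endpoints.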