Resource Allocation Task Force brainstorming page

From EGIWiki


This article is deprecated and should no longer be used, but it is still available for reference.




This page is a brainstorming space for the Resource Allocation Task Force. It contains ideas and issues for discussion.

Terminology

Service Level Agreement (SLA) - An agreement between a Service Provider and a customer/client. The SLA describes the IT Service, documents service level targets and specifies the responsibilities of the Service Provider and the customer/client. A single SLA may cover multiple IT Services or multiple customers/clients.

Operational Level Agreement (OLA) - An agreement between an IT Service Provider and another part of the same organisation. An OLA supports the IT Service Provider’s delivery of IT Services to Users. The OLA defines the goods or Services to be provided and the responsibilities of both parties.

Resource Allocation - The distribution of resources among competing groups of people or programs. It is used to assign the available resources economically and to satisfy customers' needs.

Please refer to the EGI Glossary for the definitions of the terms used.

Goal

The main goals are:

  • Attract new users and allocate them some grid resources
  • Provide users with a simplified procedure to find grid resources for their needs through a single central reference point
  • Foster a virtuous cycle among ‘new’ scientific communities, EGI, NGIs, Resource Centres and funding agencies in order to attract new funding to strengthen and expand the European grid infrastructure according to the needs of those scientific communities
  • Demonstrate to the national funding agencies, the EC and everybody else that the EGI infrastructure and services are valuable, useful and used by a wide variety of users


The high level concept

[Figure: RA overview]

EGI is "a place" where Customer and Resource Provider meet and negotiate resource allocation.


Resource allocation high level process

[Figure: RA processes]


Roles and activities

Role: Customer
Activities
  • Sends a request for resources
  • Uses the resources
  • Is involved in the negotiation process to establish SLAs
Who
  • National or international VO
  • New or existing VO

Role: Resource Provider
Activities
  • Provides resources to the Customer
  • Is involved (directly or indirectly) in the negotiation process to establish OLAs
Who
  • NGIs that own resources or can directly decide on resource allocation policies for a pool of resources of their sites
  • Resource Centres (can be an RP even if the NGI is not coordinating resource allocation nationally)
  • User communities (can be RPs for other user communities - see the WeNMR case)

Role: Supporter
Activities
  • Provides tools for the RA process
  • Supports Customers and RPs in the negotiation and in defining resource requirements
Who
  • EGI.eu for international and/or national VOs
  • NGIs for national VOs

Role: Supervisor
Activities
  • Provides periodic monitoring and reporting of usage and service quality
  • Enforces OLAs and SLAs
  • Capacity management
Who
  • EGI.eu for international and/or national VOs
  • NGIs for national VOs

Role: Broker
Activities
  • Collects and manages Customers' demand
  • Collects and manages RPs' offer
  • Negotiates SLAs and OLAs
  • Selects RPs to satisfy demand
Who
  • EGI.eu for international VOs
  • NGIs that can coordinate resource allocation with their sites
  • In the Open market and Freedom of choice models - Customers and/or RPs
Prerequisites

VOs entitled to use resources should be registered and in production status in the Operations Portal.

Resource Centres offering resources must be registered in GOCDB as production entities; this ensures that all security policies will be enforced.


Resource Allocation Models

Three models were introduced to examine the best option for the resource allocation process in the EGI project.

Different models may be applied depending on

  • types of Customer
    • Active or New VO
    • International or national VO
  • type of request
    • "Big" or "Small"
  • type of NGIs' role in the allocation process.

Model 1 - Open market

[Figure: RA Model1]

RPs’ resources and Customers’ requests are published on an open market (e.g. a shopping website such as eBay) and are visible to everyone.

The Supervisor regulates and supervises the market. It determines the rules of the market (e.g. tools, procedures, policies) and acts as a referee in case of agreement violation or negotiation.


Role: Supervisor
Pros
  • Knows what is going on in the infrastructure
Cons
  • Negotiation process cannot be adapted to specific needs

Role: Resource Provider
Pros
  • There are clear rules for resource allocation
  • In case of agreement violation/negotiations the Supervisor is a referee
  • Knows what the Customers' current needs are
Cons
  • Lack of external broker to help match needs to resources
  • Needs to sign multiple SLAs with different Customers

Role: Customer
Pros
  • There are clear rules for resource allocation
  • In case of agreement violation/negotiations the Supervisor is a referee
  • Knows which resources are now available
Cons
  • Lack of external broker to help match needs to resources
  • Needs to sign multiple SLAs with different Providers


Suitable for:

  • "Small" requests
  • Active Customers
  • National Customers

Options

International Customers

Regional Customers
[Figure: E Supervicer-N Broker]
  • EGI is a Supervisor
  • NGI is a Broker for its RC
  • RC provides resources

[Figure: E Supervisor-N Supporter]
  • EGI is a Supervisor
  • NGI is a Supporter for its RC
  • RC provides resources

[Figure: N Supervisor-E Supporter]
  • EGI is a Supporter
  • NGI is a Supervisor for its RC
  • RC provides resources

Model 2 - Broker

[Figure: RA Model2]

Providers’ resources and Customers’ requests are visible to the Broker.

The Broker matches Providers’ resources to Customers’ needs.

Direct interaction between Customer and Provider is limited.
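
The matching step in the Broker model could look roughly like the sketch below. It is only an illustration: the `Offer` and `Request` records, the slot counts and the greedy "largest offer first" rule are all assumptions, not something the task force has agreed on.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    provider: str     # hypothetical RP name
    free_slots: int   # capacity the RP makes visible to the Broker


@dataclass
class Request:
    customer: str     # hypothetical VO name
    slots: int        # capacity the Customer asks for


def match(request: Request, offers: list[Offer]) -> dict[str, int]:
    """Greedily cover a request, taking from the largest offers first.

    Returns a provider -> slots mapping, or an empty dict when the
    combined offers cannot cover the whole request.
    """
    if sum(o.free_slots for o in offers) < request.slots:
        return {}
    allocation: dict[str, int] = {}
    remaining = request.slots
    for offer in sorted(offers, key=lambda o: o.free_slots, reverse=True):
        if remaining == 0:
            break
        take = min(offer.free_slots, remaining)
        if take > 0:
            allocation[offer.provider] = take
            remaining -= take
    return allocation


offers = [Offer("RC-A", 100), Offer("RC-B", 300), Offer("RC-C", 50)]
print(match(Request("VO-X", 350), offers))
```

Taking the largest offers first keeps the number of Providers per request small, which fits the idea of a single cumulative agreement with the Broker; a real Broker would also have to weigh local policies and technical requirements.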


Role: Broker
Pros
  • Knows what is going on in the infrastructure
Cons
  • Most time consuming for the Broker
  • Broker has to acquire and manage the knowledge of what is available in the infrastructure (each site)
  • If the Broker asks Providers to express their interest in supporting a Customer when a request arrives, there is a risk of an opportunistic model

Role: Resource Provider
Pros
  • There are clear rules for resource allocation
  • First round of negotiation is passed to the Broker
  • Can have a cumulative OLA with the Broker and does not have to negotiate an SLA for a single Customer

Role: Customer
Pros
  • There are clear rules for resource allocation
  • Gets a single point of contact and support from one entity
  • Can have a cumulative SLA with the Broker and does not have to negotiate an SLA for a single Provider



Suitable for:

  • "Big" requests
  • Generic and specialised requests
  • Active and New Customers

Options

International Customers

Regional Customers
[Figure: E Broker-N Supporter]
  • EGI is a Broker
  • NGI is a Supporter for its RC
  • RC provides resources

[Figure: E Broker-N Broker]
  • EGI is a Broker
  • NGI is a Broker for its RC
  • RC provides resources

[Figure: E Broker-N Supporter]
  • EGI is a Supporter
  • NGI is a Broker for its RC
  • RC provides resources


Model 3 - Freedom of choice

[Figure: RA Model3]

Providers’ resources and Customers’ requests are transparent and visible to everyone.

The parties are free to decide how they want to negotiate resources and under which conditions (no regulation on the Supporter side).

The Supporter's role is to support Providers and/or Customers on demand, e.g. by providing a tool for resource allocation or helping with request fulfilment, specialised requests, etc.



Role: Supporter
Pros
  • Does not require a significant commitment of the Supporter in the process
Cons
  • We can end up with lots of different SLAs and negotiation procedures
  • Has no influence on the RA process

Role: Resource Provider
Pros
  • Freedom to apply own RA procedures
Cons
  • Has to define own RA processes
  • Needs to sign multiple SLAs with different Customers

Role: Customer
Cons
  • We can end up with lots of different SLAs and negotiation procedures
  • Has to perform Broker and Supervisor activities with each of the Providers separately
  • Needs to sign multiple SLAs with different Providers



Suitable for:

  • National Customers

Options

International Customers

Regional Customers
[Figure: N Supervisor-E Supporter]
  • EGI is a Supporter
  • NGI is a Supervisor for its RC
  • RC provides resources

[Figure: E Supporter-N Broker]
  • EGI is a Supporter
  • NGI is a Broker for its RC
  • RC provides resources

[Figure: E Supporter-N Supporter]
  • EGI is a Supporter
  • NGI is a Supporter
  • RC provides resources


Issues for discussion


Issue
Comment
Requirement
Open questions

The roles of NGI, EGI, RC in the process


  • Provide a single broker contact point to the user to hide the complexity of a network of heterogeneous RPs and to establish a single Service Level Agreement, instead of multiple ones with multiple RPs.

Scientific request review

Scientific Review Committee (SRC) - Terms Of Reference https://documents.egi.eu/secure/ShowDocument?docid=1472&version=2

  • A scientific review may not always be needed, for example when (1) the resource requirements are small, or (2) the user community/project is officially supported by one or more NGIs

Service Level Management
  • A single entity (e.g. EGI) must be responsible for ensuring the enforcement of OLAs and SLAs.
  • EGI should be responsible for all activities around service level management for the federated pool to offload VOs and RPs.
  • Simple and standardized SLA. Additional requirements should be allowed where needed by the VO. Many service levels should be made optional, as we do now in the VO ID card.
What should be included in SLA?

Technical request review

Needed to be sure that what we promise can be delivered to the Customer

  • A technical review of demand is needed to complement the initial scientific peer review (where applicable). RPs must be involved in this: to inspect VO requirements technically and understand the impact on the resource configurations (cluster, storage set-up, network etc.), to check whether the demand complies with the local policies, and to acknowledge that resource allocation is possible for them.

The acknowledgement for resources use

VT_Scientific_Publications_Repository

The recommendation about citing EGI in publications is defined here: https://documents.egi.eu/document/1369. The implementation requires a change in the Grid AUP; the issue has been raised with the Security Coordination Group, and David Kelsey is checking with the big customer (WLCG) if and how they would accept it before making the change. Work in progress.

  • VOs must periodically acknowledge usage of resources through scientific publications and press releases. The entities to be acknowledged are EGI, the NGIs and the individual RPs contributing resources. See the WeNMR best practice and the recommendations for a scientific publication repository.

Resources allocations


  • Coexistence of a dynamic pool (bigger requests) and a small static pool for smaller VOs
  • The resource allocation process should allow not only a resource request to be answered with an offer but also, vice versa, a site with free resources to advertise them.
  • The allocation processes involving multiple RPs and the liaison with other RPs should be made transparent to each RP. EGI is responsible for ensuring that the federated RPs collectively deliver the service requested. One RP is neither responsible for nor involved with the services provided by other RPs. The provisioning of resources to a federated pool should come with zero overhead to the RP.

Monitoring of usage, Reporting and evaluation
  • An external entity (e.g. EGI.eu) is responsible for reporting on usage of the federated resource pool (rather than the VO itself). Enforcement of SLAs and OLAs does not involve VOs (it is devolved to EGI.eu).
  • Efficient usage is a requirement for extension of the grant.

Customer support



  • The broker (EGI.eu) should support the customer to express technical requirements, where needed.

Quality of Service
  • Different levels of Quality of Service could be offered by the federated pool. For example: Level I. Best-effort allocation without a minimum number of slots allocated (opportunistic usage). Level II. A minimum number of slots allocated at high priority; jobs arriving at that site start executing as soon as the cluster occupancy permits.

Meeting specific needs of the VO in terms of OS, configuration, software etc.


  • The same processes that are being defined for resource allocation could be generalized to service requests, i.e. to allow new user communities to request services to be provided by NGIs, such as application porting.

Regional VOs


  • It should be up to NGIs to help the VO and allocate resources within the NGI
  • EGI provides a single point of contact for resource allocation for Customers
  • It is desirable that the same interfaces and tools offered to support international users can also be used to support national user communities. In this case the supervisory role of EGI.eu (Model 2) and all related service level management functions should be delegated to the NGI.

International VOs

EGI fully supports international VOs





Requirements

EGI

  1. EGI is responsible for negotiating with the RPs the requirements collected from VOs. For example, it is the body that investigates which RPs would be willing to implement the needed server or software setup, the required storage system, etc. It is also a single point of contact for new users (babysitting).
  2. EGI should be responsible for all activities around service level management for the federated pool to offload VOs and RPs.
  3. All EGI security policies are enforced by the RPs contributing resources to the federated pool (e.g. individual RCs must be registered as production sites in GOCDB).
  4. Processes for collection of demand, offer and negotiation of SLAs and OLAs should be automated as much as possible to ensure the service scales.

Resource Provider

  1. Elastic resource provisioning: the federated pool does not need to be defined statically. The pool should include a statically defined minimum set of resources, and additional resources could be contributed on demand by partners, so that resources can be contributed or withdrawn on a short time scale depending on which VOs submitted a request and on the short-term resource occupancy of a cluster. By doing so EGI has a chance of attracting more resources.
  2. Scientific review: a scientific review may not always be needed, for example when (1) the resource requirements are small and do not justify the overhead of a review (a threshold can be defined), (2) resources are requested to kick off the activities of a new VO, or (3) the user community/project is already supported by one or more NGIs.
  3. Technical review: a technical review of demand is needed to complement the initial scientific peer review (where applicable) and to decide which RPs can meet the technical requirements. RPs need to interact directly with a VO representative for this. RPs must be involved in order to inspect VO requirements technically and understand the impact on the resource configurations (cluster, storage set-up, network etc.), to check whether the demand complies with the local policies, and to acknowledge that resource allocation is possible for them.
  4. Match demand and offer. The resource allocation process should allow not only a resource request to be answered with an offer but also, vice versa, a site with free resources to advertise them.
  5. Quality of Service. Different levels of Quality of Service could be offered by the federated pool (the type and number of levels should be discussed). Two opposite examples of levels are: (1) best-effort allocation without a minimum number of slots allocated (opportunistic usage), (2) a minimum number of slots allocated at high priority, where jobs arriving at that site start executing as soon as the cluster occupancy permits.
  6. Acknowledgement of usage. VOs must periodically acknowledge usage of resources through scientific publications and press releases. The entities to be acknowledged are EGI, the NGIs and the individual RPs contributing resources. See the WeNMR best practice and the recommendations for a scientific publication repository.
  7. Efficient usage of resources. Resources allocated must be used by VOs. If usage is insufficient or not acknowledged properly, the agreement can be terminated.
  8. A single entity (e.g. EGI) must be responsible for ensuring the enforcement of OLAs and SLAs.
  9. The allocation processes involving multiple RPs and the liaison with other RPs should be made transparent to each RP. EGI is responsible for ensuring that the federated RPs collectively deliver the service requested. One RP is neither responsible for nor involved with the services provided by other RPs. The provisioning of resources to a federated pool should come with minimum overhead to the RP.
  10. Lightweight OLA negotiation: RPs (typically sites) should be relieved from the burden of negotiating OLAs for each VO request. The NGI could function as a proxy and sign the OLA on behalf of the individual sites.
  11. It is desirable that the same interfaces and tools offered to support international users can also be used to support national user communities. In this case the supervisory role of EGI.eu (Model 2) and all related service level management functions should be delegated to the NGI.
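
As an illustration of requirement 1 (elastic resource provisioning), the pool could be modelled as a static minimum plus on-demand contributions that RPs can add or take back. This is a minimal sketch under assumptions: the class name `ElasticPool`, the slot unit and the RP name are hypothetical, and nothing here reflects an agreed design.

```python
class ElasticPool:
    """Federated pool: a static minimum plus on-demand contributions."""

    def __init__(self, static_slots: int) -> None:
        self.static_slots = static_slots      # statically defined minimum
        self.on_demand: dict[str, int] = {}   # RP name -> extra slots

    def contribute(self, provider: str, slots: int) -> None:
        """An RP adds on-demand slots to the pool."""
        if slots <= 0:
            raise ValueError("slots must be positive")
        self.on_demand[provider] = self.on_demand.get(provider, 0) + slots

    def withdraw(self, provider: str, slots: int) -> None:
        """An RP takes back on-demand slots; the static minimum is untouched."""
        current = self.on_demand.get(provider, 0)
        if slots <= 0:
            raise ValueError("slots must be positive")
        if slots > current:
            raise ValueError("cannot withdraw more than was contributed")
        if slots == current:
            del self.on_demand[provider]
        else:
            self.on_demand[provider] = current - slots

    @property
    def capacity(self) -> int:
        """Total slots currently in the pool."""
        return self.static_slots + sum(self.on_demand.values())


pool = ElasticPool(static_slots=1000)
pool.contribute("RC-A", 200)
pool.withdraw("RC-A", 50)
print(pool.capacity)
```

Because withdrawals only ever touch the on-demand part, the capacity can never fall below the statically defined minimum.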

Customer

  1. Provide a single broker contact point to the user to hide the complexity of a network of heterogeneous RPs and to establish a single Service Level Agreement, instead of multiple ones with multiple RPs.
  2. The federated pool of resources should formally support opportunistic usage of resources (opportunistic is intended here as the minimum level of service that can be offered by an RP). This is needed to provide a formal level of engagement between VOs and RPs (which does not exist currently).
  3. Support multiple levels of services in one resource demand: users should be provided with the flexibility of requiring a minimum guaranteed amount of resources complemented by an additional amount of resources to be allocated elastically (for example opportunistically).
  4. Simple and standardized SLA. Additional requirements should be allowed where needed by the VO. Many service levels should be made optional, as we do now in the VO ID card. The broker (EGI.eu for international VOs, NGIs for national VOs) should support the customer to express technical requirements, where needed.
  5. An external entity (e.g. EGI.eu or the NGI) is responsible for reporting on usage of the federated resource pool (rather than the VO itself). Enforcement of SLAs and OLAs does not involve VOs (it is devolved to EGI.eu).
  6. Usage of a federated distributed pool comes with minimum overhead for the user community.
  7. Efficient usage and acknowledgement of usage are two requirements for extension of the grant.
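
Requirement 3 (multiple levels of service in one demand) can be pictured as a request with a guaranteed part and an elastic part. The sketch below is only one possible reading: the names `Demand` and `allocate` are hypothetical, and treating the guaranteed part as all-or-nothing is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Demand:
    guaranteed: int   # slots that must be reserved for the VO
    elastic: int      # extra slots wanted on a best-effort basis


def allocate(demand: Demand, free_slots: int) -> tuple[int, int]:
    """Return (guaranteed_granted, elastic_granted).

    Assumption: the guaranteed part is all-or-nothing, while the
    elastic part takes whatever is left, up to the amount requested.
    """
    if free_slots < demand.guaranteed:
        raise ValueError("pool cannot honour the guaranteed part")
    leftover = free_slots - demand.guaranteed
    return demand.guaranteed, min(demand.elastic, leftover)


print(allocate(Demand(guaranteed=100, elastic=50), free_slots=120))
```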

Note. The same processes that are being defined for resource allocation could be generalized to service requests, i.e. to allow new user communities to request services to be provided by NGIs, such as application porting.


'Coordinated offering of a federated resource pool' session

List of concerns collected during the discussion in the 'Coordinated offering of a federated resource pool' session (Evolving EGI Workshop)

  1. Concern 1: Do you require that all NGIs have the same model?
    • A: No, we will allow different models based on how much each NGI wants to be involved in the process.
  2. Concern 2: There are no free resources on the market, as the majority is allocated. Most of our concerns are "political", not "technical".
    • A: NGI_IT advocates an engagement by the EC to contribute funding to capacity building. Some regions in Europe are already benefiting from structural funds for building compute capacity.
  3. Concern 3: There is a culture of applying for resources for free, but we need to change this.
    • A: Resource allocation can be used to support a pay-per-use model.
  4. Concern 4: Mixing the Executive Board into RA is not a good idea.
  5. Concern 5: How is RA related to the existing OLA? Are we going to change it or have additional OLAs?
    • A: Existing OLAs are the default ones and must be applicable to any production site, regardless of whether it contributes resources to a centrally managed distributed pool. A modified (extended) version that also covers sites contributing to the pool is a possibility.
    • WeNMR: different (few) templates of OLAs could be predefined according to some parameters, like the amount of resources contributed.
  6. Concern 6: The review system should be kept as simple as possible - all requests should follow the same path regardless of the type of request or requester.
  7. Concern 7: It would be good to have a periodic review. For long-term SLAs we should check whether the resources are used sufficiently.
    • A: The scientific review process already requires that a usage review is performed periodically, and that extensions are subject to the outcome of this assessment.
  8. Concern 8: We should have a pool of resources which can be allocated quickly (without waiting a few months) to the VO so that they can start their work as soon as possible. If needed, they can then apply for more.
  9. Concern 9: An elastic central pool within 10% is a good idea. Some of the NGIs are obliged to allocate their resources to specific VRCs.
  10. Concern 10 (WeNMR): Allocation processes must be applicable not only at a VO level but also to a single scientist, as in PRACE. PRACE defines a minimum resource requirement to make a resource request eligible.
  11. Concern 11 (WeNMR): Defining shortcuts for some requests (e.g. if they were already peer reviewed at a national level) should be avoided for transparency. All requests should undergo the same review cycle.

Template for request submission

The following is the template under discussion for a form to be used by user communities to ask for resource shares.

Field - Comments
Contacts: Applicant (main contact for the application) - SSO should allow us to have the SSO account on top of what the applicant would write.
Contacts: User Community/project - name and descriptive links
Contacts: Description of the activity - why is the community requesting resources, and what is the use case they want to implement?
Time limits - When would you like to start using the pool resources?
Time limits - For how long do you expect to use the pool resources? (shall we add a max here, like 12 months?)
CPU Capacity: type of allocation - Opportunistic usage or guaranteed resources?
If guaranteed resources, total CPU time required - HEP_SPEC/hours, wall clock time
Max job duration - hepspec-hours
Min local storage - GB
Min physical memory per core - GB
Min swap size - GB
Other technical requirements - software installation/libraries etc.
Storage Capacity: type of allocation - Opportunistic usage or reserved space?
Amount of space requested (estimated in case of opportunistic request) - GB
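
To make the template more concrete, the sketch below shows how a subset of its fields might be captured and checked before a request is submitted. Everything in it is an assumption for illustration: the class and field names are hypothetical, only some template fields are covered, and the 12-month cap is simply the value floated as an open question above, not an agreed limit.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: the cap suggested as an open question in the template.
MAX_DURATION_MONTHS = 12


@dataclass
class ResourceRequest:
    applicant: str
    community: str
    description: str
    start: date
    duration_months: int
    cpu_guaranteed: bool                        # False means opportunistic usage
    cpu_hepspec_hours: Optional[float] = None   # only needed for guaranteed CPU
    storage_reserved: bool = False              # False means opportunistic usage
    storage_gb: float = 0.0                     # an estimate if opportunistic

    def problems(self) -> list[str]:
        """Return human-readable validation problems (empty if none)."""
        found = []
        if not (self.applicant and self.community and self.description):
            found.append("contact fields must be filled in")
        if not 1 <= self.duration_months <= MAX_DURATION_MONTHS:
            found.append(f"duration must be 1..{MAX_DURATION_MONTHS} months")
        if self.cpu_guaranteed and not self.cpu_hepspec_hours:
            found.append("guaranteed CPU needs a total CPU time")
        if self.storage_gb < 0:
            found.append("storage amount cannot be negative")
        return found


request = ResourceRequest(
    applicant="Jane Doe",
    community="ExampleVO",
    description="protein folding runs",
    start=date(2013, 1, 1),
    duration_months=6,
    cpu_guaranteed=True,
)
print(request.problems())
```

Returning a list of problems rather than raising on the first one lets a submission form show the applicant everything that needs fixing in one pass.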