VT VAPOR:VAPOR features description


The following provides a preliminary description of the VAPOR features.



Status of this document

This document is the result of discussions held between VAPOR partners during the first 3 months of the project (see Indico meetings). Its goal is to scope the features to be implemented in VAPOR and to provide a high-level description of each of them.

What does 'VO Support' consist of?

The mission of VO Support teams is generally to liaise between three actors: the EGI instances (UCST, UCB), the resource centres, and the VO users. The VO Support function consists of several tasks, addressed to varying extents depending on the VO. Below is a non-exhaustive list of VO Support tasks:

Depending on the size of each VO (in number of users), its experience and its funding model, VO Support may be performed by anything from a single VO manager to a strong dedicated IT support team. As a result, the tasks covered may range from only a fraction to the whole set (or more) of the tasks described above.

VAPOR primarily addresses VOs with little or no IT support, where VO Support is performed by a single VO manager or by a team of VO users contributing effort on a voluntary basis. VAPOR remains independent of any science gateway or scientific application.

Note: it is assumed that the members of VO Support teams are identified by a specific 'team' role in the VOMS server. It must be checked whether this applies to every VO.

VO users management

VAPOR proposes to implement a Users Database that stores descriptive data about the VO users, in addition to the rather simple administrative data available in the VOMS. The users database has two main goals:

Below we provide some guidelines for the design of the database, and the functions that will exploit it.

Users database

The database is multi-VO, i.e. any number of VOs can be supported and managed in VAPOR. Users should keep using the VOMS interface to register; only VO administrators are allowed to access and interact with the VAPOR users database.

Users data to be stored includes:

Where possible, duplicating existing information from other data sources (specifically the VOMS) into the VAPOR database should be avoided, although it may be unavoidable in several cases for implementation reasons. A process should track and resolve inconsistencies between the VOMS and the VAPOR users database. A synchronization mechanism can be developed, based on the available VOMS APIs (GUI, CLI, any other?).
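As an illustration of the consistency check, the sketch below compares the user DNs known to the VOMS with those stored in the VAPOR database. It assumes both lists have already been retrieved by some means; the function name `diff_users` and the sample DNs are hypothetical, and no actual VOMS API is called.

```python
def diff_users(voms_dns, vapor_dns):
    """Compare the user DNs known to the VOMS and to the VAPOR database."""
    voms, vapor = set(voms_dns), set(vapor_dns)
    return {
        # Registered in the VOMS but without a VAPOR record yet
        "missing_in_vapor": sorted(voms - vapor),
        # VAPOR record whose DN is no longer in the VOMS
        "orphaned_in_vapor": sorted(vapor - voms),
    }


# Hypothetical sample data
voms_dns = ["/O=GRID-FR/CN=Alice", "/O=GRID-FR/CN=Bob"]
vapor_dns = ["/O=GRID-FR/CN=Bob", "/O=GRID-FR/CN=Carol"]
report = diff_users(voms_dns, vapor_dns)
```

Each inconsistency found this way could then be resolved automatically or presented to a VO administrator, depending on the policy chosen.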

Related works and material:

Identification of users behind robot certificates

Periodically, robot certificate holders must be asked to enter information about the real users in the system: at a minimum the number of users, at most individual data (email, etc.). The exact content of this information is to be defined with robot users. By ticking a check box, the robot certificate owner confirms that he/she is allowed to enter such personal user information under the legislation of his/her country.

Related works and material:

User life cycle management

Currently, a new user registers on the VOMS of the VO he/she wishes to join. Then, the VOMS sends an email notification to the VO administrator who may ask the user to provide additional details about his/her activity and affiliation. Then, the VO administrator manually approves the request.

The goal is to keep the VOMS as the service to initiate the subscription of a user. The features below automate the management of subscription requests received from the VOMS:

Optional additional features:

VO Operations management for VO Support teams

Existing tools

Below we provide a short review of existing VO operation tools, with a specific focus on the VO Operations Dashboard, since it has been agreed that VAPOR will be integrated into this existing dashboard (see MoM with the dashboard developers team).

VO Nagios

For VOs using a dedicated VO Nagios instance, the VO Support work may be based on the Nagios GUI itself. Additionally, operation tools are able to sort and display Nagios alarms in a more user-friendly manner. VAPOR does not intend to develop such a tool; rather, it focuses on complementing existing tools with additional tasks.

VO Operations Dashboard

The VO Operations Dashboard provides features that greatly help the daily monitoring of resources. For information, here are some of its features:

Some other features are not implemented yet but are under study:

VO Admin Dashboard

The VO Dashboard project is developed by LIP to integrate all the information relevant to VO management. It provides a single entry point for viewing all the data, showing the key information at a glance.

Resource status indicators, statistical reports

Report GOCDB and BDII status

Before they submit a new GGUS ticket to report a problem, members of a VO Support team should know whether the faulty resource is in proper production status or, on the contrary, in a non-production status of any kind. Two different information sources must be checked:

Operation tools filter out resources in downtime (GOCDB status). They generally do not report the BDII status of the resources, assuming that the BDII and GOCDB statuses are consistent. However, the BDII status is sometimes not in production although the GOCDB status is in proper production. For instance, a site administrator may decide to close a CE queue for a VO to prevent the queue from being overloaded by a misbehaving user tool, although the CE remains in production for other VOs and thus remains in production status in the GOCDB. Conversely, when a resource centre (site) is suspended, the site is marked as suspended in the GOCDB although resources may remain in production in the BDII.

For this reason, VAPOR proposes that both statuses, GOCDB and BDII, be displayed to the VO Support team, in order to avoid submitting tickets about a resource that is not in proper production. This information must be easy to access and presented in a concise, summarised form.
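The combination of the two statuses can be sketched as follows. This is only an illustration: the status labels ("production", "closed", "suspended") and the function name `ticket_advice` are assumptions, and the actual values published by the GOCDB and the BDII would have to be mapped onto them.

```python
def ticket_advice(gocdb_status, bdii_status):
    """Return (submit_ticket, reason) from the two status sources.

    Status labels are illustrative, not the exact GOCDB/BDII vocabulary.
    """
    gocdb_ok = gocdb_status == "production"
    bdii_ok = bdii_status == "production"
    if gocdb_ok and bdii_ok:
        return True, "resource is in production in both GOCDB and BDII"
    if gocdb_ok:
        return False, f"BDII reports '{bdii_status}' although GOCDB is in production"
    if bdii_ok:
        return False, f"GOCDB reports '{gocdb_status}' although BDII is in production"
    return False, f"not in production (GOCDB: {gocdb_status}, BDII: {bdii_status})"


# A CE queue closed for the VO while the CE stays in production in the GOCDB
closed_queue = ticket_advice("production", "closed")
# A suspended site whose resources still appear in production in the BDII
suspended_site = ticket_advice("suspended", "production")
```

Both inconsistent cases described above lead to the same advice: do not open a ticket before checking with the site.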

Related works and material:

The VOMS and LFC are the most critical services of a VO. To be informed of scheduled and unscheduled downtimes of such services, any user can subscribe to the notifications sent automatically by the EGI Operations Portal. However, most users do not, and probably do not even know that this service exists. Therefore, such downtimes should be advertised to the VO users in an automatic way.

To implement this feature, it must be checked whether an API allows an application to subscribe to notifications using a defined filter. If no API is available, one solution is to monitor the downtime email broadcasts sent by the EGI Operations Portal and automatically filter those that affect the critical services of the VO. A warning email is then sent to the VO users. This action should preferably be validated by an administrator before the email is sent.
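The filtering step might look like the sketch below. It assumes the downtime broadcasts have already been parsed into simple records; the `Downtime` fields, the host names and the message format are all hypothetical. The function only produces draft messages, leaving validation and sending to an administrator.

```python
from dataclasses import dataclass


@dataclass
class Downtime:
    # Hypothetical fields extracted from a downtime broadcast
    hostname: str
    service_type: str
    description: str


def draft_warnings(broadcasts, critical_hosts, vo):
    """Keep only downtimes hitting critical hosts and draft one warning each."""
    return [
        f"[{vo}] Downtime of {d.service_type} {d.hostname}: {d.description}"
        for d in broadcasts
        if d.hostname in critical_hosts
    ]


# Hypothetical sample data
critical = {"voms.example.org", "lfc.example.org"}
broadcasts = [
    Downtime("ce01.example.org", "CE", "kernel upgrade"),
    Downtime("lfc.example.org", "LFC", "database maintenance"),
]
drafts = draft_warnings(broadcasts, critical, "myvo")
```

Here only the LFC downtime yields a draft, since the CE is not in the set of critical hosts.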

Related works and material: this could be done in the Operations Portal itself, by filtering on only a subset of services.

Monitor resources availability

VO Support teams need tools that help them have a clear overview of the activity of their VO in terms of storage and computing resources. VAPOR proposes to generate on-demand reports of the following types:

Related works and material:

VO Data Management: file migration and cleaning procedures

Storage Elements are often critical resources during the execution of scientific workflows on the grid infrastructure. It is therefore critical to handle common issues such as a storage element filling up or being decommissioned. Taking care of such issues requires specific procedures for tracking, cleaning up or migrating VO users' data. Overall this is called "VO Data Management".

HEP VOs have set up some procedures to address those issues, thus enforcing a data management policy at the VO level. This relies on the fact that users manage their data through a controlled set of tools and documented procedures. In VOs with fragmented user groups, such overall control is hardly possible; data management must therefore rely on procedures that are agnostic of users' tools and scientific applications.

In the following, we describe the issues to be tackled in VAPOR, and describe strategies adopted by other communities.

Remove old files

The fact that a user has left a VO is generally detected only when the user's membership expires, either because they were suspended automatically or because they did not sign the Acceptable Usage Policy (AUP) in time. To prevent SEs from filling up with useless data of former users, the files left behind must be tracked and deleted.

VAPOR helps to automate this process. The envisaged procedure is described below:

  1. Periodically, the VO Manager produces a list of DNs of users with membership expired for more than a given duration, e.g. 12 months: they may have failed to sign the AUP, or were suspended for some other reason.
  2. The LFC administrator (i) retrieves LFC entries belonging to those users, possibly filtering only files older than some minimum age; (ii) changes the ownership of those entries to the VO Manager DN; (iii) provides the VO Manager with this list of entries.
  3. In turn, the VO Manager can remove (lcg-del) the files from the LFC and the SEs simultaneously.

It must be studied whether this procedure meets the needs of different VOs, and whether it is acceptable to VO managers and LFC managers when they are not the same person or group of persons.
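The selection logic of steps 1 and 2 can be sketched as follows. The sketch works on plain in-memory data and does not call the VOMS or the LFC; the function names, the record fields (`path`, `owner`, `modified`) and the sample values are hypothetical.

```python
from datetime import date, timedelta


def expired_dns(expiry_by_dn, today, grace_days=365):
    """Step 1: DNs whose membership expired more than grace_days ago."""
    cutoff = today - timedelta(days=grace_days)
    return {dn for dn, expiry in expiry_by_dn.items() if expiry < cutoff}


def entries_to_reclaim(entries, dns, today, min_age_days=0):
    """Step 2(i): catalog entries owned by those DNs and old enough."""
    cutoff = today - timedelta(days=min_age_days)
    return [e for e in entries if e["owner"] in dns and e["modified"] <= cutoff]


# Hypothetical sample data
today = date(2013, 6, 1)
expiry = {"/CN=Alice": date(2012, 1, 15), "/CN=Bob": date(2013, 3, 1)}
entries = [
    {"path": "/grid/vo/alice/a.dat", "owner": "/CN=Alice", "modified": date(2011, 5, 1)},
    {"path": "/grid/vo/alice/b.dat", "owner": "/CN=Alice", "modified": date(2013, 5, 20)},
    {"path": "/grid/vo/bob/c.dat", "owner": "/CN=Bob", "modified": date(2010, 1, 1)},
]
former = expired_dns(expiry, today)
to_reclaim = [e["path"] for e in entries_to_reclaim(entries, former, today, min_age_days=180)]
```

The ownership change and the actual deletion (steps 2(ii) and 3) are deliberately left out, as they depend on the catalog tools available to the LFC administrator and the VO Manager.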

Deal with SEs filling up

To prevent SEs from filling up, it is necessary to (i) monitor SEs with little remaining free space, and (ii) trigger a procedure to track VO files and contact the file owners so that they migrate their files to some other SE. Ideally, the procedure should propose alternative SEs with enough free space, which would help users in the migration.
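A minimal sketch of the monitoring and of the proposal of alternative SEs is given below. It assumes the free and total space of each SE are already known (for instance from the information system); the 10% threshold, the function names and the SE host names are hypothetical.

```python
def nearly_full(ses, min_free_fraction=0.10):
    """SEs whose free space is below the given fraction of their total space."""
    return sorted(
        name for name, (free, total) in ses.items()
        if total > 0 and free / total < min_free_fraction
    )


def alternatives(ses, needed, exclude):
    """SEs with at least `needed` free space, most free space first."""
    candidates = [
        (free, name) for name, (free, _total) in ses.items()
        if name not in exclude and free >= needed
    ]
    return [name for _free, name in sorted(candidates, reverse=True)]


# Hypothetical sample data: (free, total) in GB
ses = {
    "se1.example.org": (50, 1000),
    "se2.example.org": (600, 1000),
    "se3.example.org": (300, 1000),
}
full = nearly_full(ses)
proposed = alternatives(ses, needed=200, exclude=set(full))
```

The list of proposed SEs could be included in the message sent to file owners when they are asked to migrate their data.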

Related works and material:

SE decommissioning

In case of SE decommissioning or resource centre decommissioning, VOs must start a decommissioning procedure to track VO files and contact the file owners so that they migrate their files to some other SE. Ideally, the procedure should propose alternative SEs with enough free space, which would help users in the migration.

This is very close to the management of full SEs. The same related tools apply.

Consistency between LFC and SEs: zombie & ghost files

Zombie files, a.k.a. dark data, are physical file replicas that no longer have an entry in any file catalog of the VO. Such files waste space on SEs and are likely to remain forever. A procedure must take care of this issue by regularly (typically once every 1 or 2 months) tracking and removing the zombie replicas.

Conversely, ghost files are file catalog entries that no longer have a corresponding physical replica. Such entries load the file catalog with useless entries that may hamper overall catalog performance.

Users of a VO may not all use the same LFC or may use any other kind of file catalog. In this case, the procedure set up to delete ghost files should allow for the configuration of (i) file name patterns, described by regular expressions, that the clean-up procedure should ignore, and (ii) a minimum age, e.g. only files older than 1 year should be deleted.
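The definitions above amount to two set differences, followed by the configurable filters. The sketch below assumes that SE dump entries and catalog entries have been normalised to the same file name key, which in practice requires mapping physical replica names to catalog names; the function names, paths and ages are hypothetical.

```python
import re


def compare(se_dump, catalog):
    """Return (zombies, ghosts) from an SE dump and a catalog listing."""
    se_files, cat_files = set(se_dump), set(catalog)
    zombies = se_files - cat_files  # on the SE, not in the catalog
    ghosts = cat_files - se_files   # in the catalog, not on the SE
    return zombies, ghosts


def filter_candidates(candidates, age_days, ignore_patterns, min_age_days):
    """Drop candidates matching an ignore pattern or younger than min_age_days."""
    ignored = [re.compile(p) for p in ignore_patterns]
    return sorted(
        name for name in candidates
        if age_days[name] >= min_age_days
        and not any(rx.search(name) for rx in ignored)
    )


# Hypothetical sample data
se_dump = ["/vo/a.dat", "/vo/b.dat", "/vo/tmp/x.dat"]
catalog = ["/vo/b.dat", "/vo/c.dat", "/vo/keep/d.dat", "/vo/e.dat"]
age_days = {"/vo/c.dat": 500, "/vo/keep/d.dat": 700, "/vo/e.dat": 30}

zombies, ghosts = compare(se_dump, catalog)
ghosts_to_delete = filter_candidates(ghosts, age_days, [r"^/vo/keep/"], 365)
```

In this example only one ghost entry is retained for deletion: one is protected by the ignore pattern and another is younger than the minimum age.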

Related works and material:

The LHCb, CMS and ATLAS VOs use the following approach: (i) supporting sites periodically produce a full dump of their SE (file names); (ii) the VO data manager compares this content with the file catalog content, removes zombie replicas from the SE (scalability issues with large SEs), and removes ghost entries from the catalog. The procedure and format of the storage dumps are generic and reusable. The comparison tools for LHCb are very experiment-specific and probably hard to adapt, but the ATLAS comparison scripts are probably reusable. Very interesting material is described in this page. More particularly:

VAPOR needs to assess how this procedure could be adapted and automated in the context of other VOs. A test phase should involve a limited set of sites willing to participate. First, the SE dump should be done on demand to check the validity of the procedure; dumps may then be automated. QMUL (UK) has agreed to start this way. GRIF (FR) has proposed to test and tune the procedures before they are passed on to sites, because (i) some sites lack tooling or knowledge, and (ii) procedures that perform too many requests to SE databases may be “SE killers”.

Community Accounting

Today, the only accounting tool available grid-wide is the EGI Accounting Portal. It produces per site, per NGI, and per VO resource usage reports. VAPOR intends to provide accounting reports at levels of organisation closer to the user communities: per VRC resource usage, per VO sub-group resource usage.

In VOs with fragmented user groups, estimating the future needs for resources is hardly possible, as user groups remain independent. However, making such estimates would be useful for catch-all VOs, so that they can anticipate their needs and negotiate resources with sites and NGIs. In particular, the Resource Allocation Work Group currently works on the model of pools of resources contributed by NGIs. Such resources would be allocated to VOs in response to a "request for resources", in which VOs estimate the amount of CPU hours that they need.

VAPOR proposes projections of the future storage and computing needs of a VO, based on the extrapolation of past accounting data.
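The extrapolation method is not specified here. As one possible approach, the sketch below fits a straight line to past monthly usage by least squares and extends it into the future. The linear model, the function name `linear_projection` and the sample figures are assumptions made for illustration only.

```python
def linear_projection(history, months_ahead):
    """Project future monthly usage from a least-squares linear trend."""
    n = len(history)
    if n < 2:
        raise ValueError("at least two data points are needed")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Months are indexed 0..n-1; the projection starts at index n
    return [intercept + slope * (n - 1 + k) for k in range(1, months_ahead + 1)]


# Hypothetical monthly CPU hours for the last four months
history = [100.0, 120.0, 140.0, 160.0]
projection = linear_projection(history, months_ahead=2)
```

A linear trend is a crude model; seasonal effects or bursts of activity from individual user groups would call for a different approach.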
