The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other supports, new updates will be ignored and lost.
If needed you can get in touch with EGI SDIS team using operations @ egi.eu.

VT VAPOR:VAPOR features description

Revision as of 11:55, 6 May 2013

The following provides a preliminary description of the VAPOR features.

Introduction

What does 'VO Support' consist of?

The mission of VO Support teams is generally to act as a liaison between three actors: the EGI instances (UCST, UCB), the resource centres and the VO users. The VO Support function consists of several tasks, addressed to various extents depending on the VO. Below is a non-exhaustive list of VO Support tasks:

  • VO users administration: accepting or refusing subscription requests, managing the Acceptable Usage Policy (AUP), handling membership expiration.
  • Provide users with support and expertise on VO-specific applications or science gateways.
  • Monitor the services and resources allocated to the VO by NGIs and resource centres, submit a ticket to the resource centre responsible for a faulty resource and help investigate the issue. Monitoring may be performed using a dedicated VO Nagios instance, VO custom tools or any other test framework. Tests, such as Nagios probes, may be generic VO probes or VO-specific.
  • Negotiate with NGIs and resource centres the resources allocated to the VO.
  • Deal with the problems reported by supporting resource centres regarding VO users, such as excessive job submission or storage space usage, low job efficiency, etc. In such cases, the VO Support is in charge of contacting the user and helping him/her fix the problem.

Depending on the size of each VO (in terms of number of users), its experience and its funding model, the VO Support may be performed by teams ranging from a single VO manager to a strong dedicated IT support team. As a result, the tasks covered may span from only a fraction to the whole set (or more) of the tasks described above.

VAPOR primarily addresses VOs with little or no IT support, where the VO Support is performed by a single VO manager or a team of VO users contributing the effort on a volunteer basis.

Note: it is assumed that the members of VO Support teams are identified using a specific 'team' role in the VOMS server. It must be checked whether this applies to every VO.

VO users management

VAPOR proposes to implement a Users Database intended to store informative data about the VO users, in addition to the rather simple administrative data available in the VOMS. The users database has three main goals:

  • Manage and follow up on users registration life-cycle: registration (VO membership), group membership, membership expiration. The life cycle workflow integrates interactions with third party services: VOMS, LCG File Catalog, EGI Applications Database.
  • Track information about users "hidden behind" a robot certificate. This is necessary to have a realistic idea of the number of actual users in a VO.
  • Track information about scientific publications to encourage users to acknowledge the usage of EGI resources.

Below we provide some guidelines for the design of the database, and the functions that will exploit it.

Users database

The database is multi-VO, i.e. any number of VOs can be supported and managed in a single instance of VAPOR. When possible, duplicating existing information from other data sources into the VAPOR database should be avoided, although this may be unavoidable in several cases for implementation reasons.

Users data to be stored includes:

  • General administrative data (DN, email, affiliation, membership duration...): important administrative fields may be made mandatory.
  • Scientific publications, scientific collaborations
  • Robot certificate (if any)
  • Scientific application used (linked with the EGI Applications Database)
  • LCG File Catalog base directory

Related works and material:

  • Some Czech national VOs use the Perun user and resource management system. Jiri Chudoba (VO AUGER) can provide information about this portal, as to how this could fit and be reused in the context of VAPOR.

Identification of users behind robot certificates

  • Keep track of their contact information
  • Periodically ask robot certificate holders to enter information about their real users in the system: at least a number of users, at most individual data (email, etc.). Exact content of this information is to be detailed with robot users.
  • Add a check box that the robot certificate owner can tick to guarantee that he/she is allowed to enter such personal user information with regards to the legislation of his/her country.

Related works and material:

  • WeNMR (VAPOR partner) may already have set up solutions to deal with robot certificates.
  • The EGI Operations Portal identifies robot certificates according to a specific DN scheme.
  • Some Slovenian projects have also dealt with robot certificates already. Jan Jona Javorsek (Jozef Stefan Institute, Slovenia) proposed to have a discussion on this subject.

User life cycle management

Currently, a new user registers on the VOMS of the VO he/she wishes to join. The VOMS then sends an email notification to the VO administrator, who may ask the user to provide additional details about his/her activity and affiliation. Finally, the VO administrator manually approves the request.
Note: this is the process implemented in biomed; it must be checked with other partners whether this is also the case for them.

Main features

The goal is to keep the VOMS as the service that initiates the subscription of a user. Allowing an administrator to start the registration process through the VAPOR interface may also be considered; however, the benefit of such a feature must be discussed and prioritized.

The main features below automate the management of subscription requests received from the VOMS:

  • Filter registration emails sent automatically by the VOMS to the VO administrators. It must be checked whether some other API exists and would be more appropriate (e.g. the possibility to register an application to VOMS notifications).
  • Check the validity of the user's DN: no quotes or double quotes are allowed, as they are not supported by some services in the infrastructure. If the DN is invalid, automatically reply and ask the user to request a new certificate.
  • VAPOR automatically sends an email asking the user to provide details about his/her activity and affiliation.
  • VAPOR displays a list of pending subscriptions that the administrator can manually approve or reject. The action of the administrator is reflected on the VOMS server.
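
As an illustration of the DN validity check listed above, here is a minimal Python sketch. The function name `dn_is_valid` and the sample DNs are hypothetical; only the rule itself (no single or double quotes) comes from the description above.

```python
def dn_is_valid(dn: str) -> bool:
    """Return True if the DN contains neither single nor double quotes."""
    return "'" not in dn and '"' not in dn


# Hypothetical DNs, for illustration only.
for dn in ("/O=GRID-FR/C=FR/O=CNRS/CN=Jane Doe",
           "/O=GRID-FR/C=FR/O=CNRS/CN=Jane O'Doe"):
    print(dn, "->", "accepted" if dn_is_valid(dn) else "rejected")
```

A rejected DN would then trigger the automatic reply asking the user to request a new certificate.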

Note: The features proposed here depend on the VOMS API available and on the extent of the actions allowed through this API.

Integration with external third party services

  • VOMS:
    • registration request: described above.
    • filter user membership expiration email notifications automatically sent by the VOMS, and trigger the cleaning up (if technically feasible, see Data Management features) of files of expired users after a grace period.
  • LCG File Catalog (LFC): automatically create the user's root directory in the LFC with appropriate ACLs. ACLs may be manually relaxed later depending on the user's needs.
  • EGI Applications Database: link user with existing applications or give the ability to create a new application.
  • Maintain VO users mailing lists in sync with the database: add user's email address on subscription, remove on membership expiration. The integration with different mailing list systems should be studied starting with Mailman and GoogleGroups.

Additional features

  • Show a summary of the activity on the VO management mailing list: filter out VOMS email notifications that are already taken care of, and display other messages so that someone deals with them.
  • Collect feedback on the infrastructure, and scientific production (publications).

VO Operations management for VO Support teams

VO Nagios and VO Operations Dashboard

For VOs using a dedicated VO Nagios instance, the VO Support work may be based on the Nagios GUI itself, or on operation tools used to sort and display Nagios alarms in a more user-friendly manner. VAPOR focuses on complementing such tools with additional tasks, while remaining independent of any science gateway or scientific application.

The VO Operations Dashboard provides features that greatly help in the daily monitoring of the resources. For information, here are some of its features:

  • Filter and classify alarms from the VO Nagios box: keep track of alarms status, detailed log traces of the probe that triggered an alarm.
  • Cross data with GGUS: show open tickets for a given resource, in the scope of a specific VO. Specific marker for alarms that are currently managed (typically when a GGUS ticket has been submitted and is in "waiting for response" status).
  • Cross data with GOCDB status (downtimes, "not in production" and "not monitored") in order to filter out resources in such status from the alarms displayed.
  • Cross data with BDII status: show the status when it is not "production" along with the alarm.
  • Wizard to speed up the submission of TEAM tickets in GGUS, using a configurable template per type of resource (SE, CE, WMS, etc.).
  • Integrate hostname-to-sitename resolution to speed up the ticket submission process: based on a resource hostname (a SE or CE for instance), the ticket is automatically assigned to the resource centre responsible for that resource.
  • Store and share notes that can serve as take-over reports between teams that relieve each other from one duty shift to the next.
  • Allow sending an email to a VO user by giving their DN (this exists in the EGI Operations Portal and is to be included in the VO Operations Dashboard).

VO Dashboard

The VO Dashboard was a project started by LIP to integrate all the information relevant to VO management.

It provides:

  • A single entry point for viewing all the data.
  • Single-view snapshot of all the key information.


Resource status indicators, statistical reports

Report GOCDB and BDII status

Before they submit a new GGUS ticket to report a problem, members of a VO Support team should know if the faulty resource is in proper production status, or on the contrary if it is in a non-production status of any kind. Two different information sources must be checked:

  • The GOCDB provides administrative information as to the resource status: site suspended, resource or site downtime, resource not in production, resource not monitored
  • The BDII provides the resource dynamic status: not in production, closed (SE), drained (CE).

To avoid submitting tickets erroneously, operation tools filter out resources in downtime (GOCDB status). They generally do not report the BDII status of the resources, assuming that the BDII and GOCDB statuses are consistent, or at least that when the BDII status is not in production, the GOCDB status is not in production either. The converse does not hold: a resource may be in production according to the BDII while it is in downtime from an administrative point of view. This is fine.

However, it happens that the BDII status is not in production, although the GOCDB status is in proper production. Typically, a site administrator may decide to close a CE queue for a given VO to prevent the queue from being overloaded by a nasty user tool.

For this reason, VAPOR proposes that both statuses, GOCDB and BDII, be available to the VO Support team, in order to avoid submitting tickets regarding a resource that is not in proper production. This information must be easily accessible, in a summarised form.
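
The consolidated view could be sketched as follows in Python. This is only an illustration: the `ResourceStatus` record, the status labels and the hostnames are assumptions, and the real GOCDB and BDII queries are not shown. A resource is reported as soon as either source gives a non-production status.

```python
from dataclasses import dataclass

# Status labels are illustrative; actual GOCDB and BDII values may differ.
GOCDB_OK = "production"
BDII_OK = "production"


@dataclass
class ResourceStatus:
    hostname: str
    gocdb_status: str  # e.g. "production", "downtime", "not monitored"
    bdii_status: str   # e.g. "production", "closed", "draining"


def non_production(resources):
    """Return the resources for which GOCDB or BDII reports a non-production status."""
    return [r for r in resources
            if r.gocdb_status != GOCDB_OK or r.bdii_status != BDII_OK]


# Hypothetical resources supporting the VO.
resources = [
    ResourceStatus("ce01.example.org", "production", "production"),
    ResourceStatus("ce02.example.org", "production", "closed"),
    ResourceStatus("se01.example.org", "downtime", "production"),
]
for r in non_production(resources):
    print(r.hostname, "GOCDB:", r.gocdb_status, "BDII:", r.bdii_status)
```

Here `ce02.example.org` illustrates the case discussed above: the GOCDB reports production while the BDII does not, so no ticket should be submitted.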

Related works and material:

  • The biomed support tools developed by I3S provide a python script that collects the status of all resources (CE/SE/WMS) that support a given VO (configurable) and reports any resource in non-production status.
  • The VO Operations Dashboard provides views of Nagios alarms along with non-production status retrieved from the GOCDB. At the time of writing, it is proposed to also integrate the status from the BDII. In any case, VAPOR should be able to provide such a consolidated list, in order to provide VO Support teams with a quick overview of non-production statuses.

The VOMS and LFC are the most critical services of a VO. Scheduled and possibly unscheduled downtimes of such services must be advertised to the VO users in an automatic way.

A solution to implement this feature is to monitor downtime email broadcasts sent by the EGI Operations Portal, and automatically select those impacting the critical services of the VO. A warning email is then sent to the VO users. This action should preferably be validated by an administrator before the email is sent.
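
The selection step might look like the Python sketch below. It assumes that the downtime broadcasts have already been parsed into dictionaries with a `hostname` field, which is a hypothetical format; parsing the actual emails and sending the warning are left out.

```python
def impacting_downtimes(downtimes, critical_hosts):
    """Keep the downtime notices that concern one of the VO's critical hosts."""
    critical = {h.lower() for h in critical_hosts}
    return [d for d in downtimes if d["hostname"].lower() in critical]


# Hypothetical hostnames of the VO's VOMS and LFC servers.
critical_hosts = ["voms.example.org", "lfc.example.org"]

# Hypothetical downtime notices, already extracted from the broadcasts.
downtimes = [
    {"hostname": "ce01.example.org", "start": "2013-05-10", "end": "2013-05-11"},
    {"hostname": "LFC.example.org", "start": "2013-05-12", "end": "2013-05-12"},
]

# Notices kept here would wait for an administrator's validation.
pending = impacting_downtimes(downtimes, critical_hosts)
for d in pending:
    print("To validate:", d["hostname"], d["start"], "to", d["end"])
```

Hostnames are compared case-insensitively, since broadcasts and configuration may not use the same capitalisation.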

Monitor resources availability

VO managers need tools that help them have a clear overview of the activity of their VO in terms of storage and computing resources. VAPOR proposes to generate on-demand reports of the following types:

  • Storage resources: total, used and free space on SEs supporting the VO, globally and by individual SE.
  • Computing resources:
    • Follow-up of running vs. waiting jobs on the CEs supporting the VO, either globally or by individual CE: this helps monitor peaks of activity by the VO users, and possibly identify bottlenecks.
    • Status of jobs submitted to CEs supporting the VO: in order to monitor CEs more precisely than what Nagios provides, it is proposed to submit jobs very regularly (several times a day) to all CEs supporting the VO, and report the ratio of successful jobs, failed jobs, timed out jobs (waiting for more than a configurable duration), and the average waiting time for successful jobs.
  • Quality of service of storage and computing resources (number of events, percentage of time down, number of jobs successful/failed/timed out, etc.), to help identify resources with recurring issues, and eventually consider corrective actions.
  • Handle black-list/white-list of resources (?)
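
The job status indicators listed above can be computed as in the following Python sketch. The input format, a list of `(status, wait_seconds)` pairs with statuses `ok`, `failed` and `timeout`, is an assumption made for the example and not the format of any existing tool.

```python
from collections import Counter


def summarise_jobs(jobs):
    """Compute ratios per status and the average waiting time of successful jobs.

    jobs: list of (status, wait_seconds), status in {"ok", "failed", "timeout"}.
    """
    total = len(jobs)
    counts = Counter(status for status, _ in jobs)
    ok_waits = [wait for status, wait in jobs if status == "ok"]
    return {
        "success_ratio": counts["ok"] / total if total else 0.0,
        "failed_ratio": counts["failed"] / total if total else 0.0,
        "timeout_ratio": counts["timeout"] / total if total else 0.0,
        # None when no job succeeded, so that no average is invented.
        "avg_wait_ok": sum(ok_waits) / len(ok_waits) if ok_waits else None,
    }


# Hypothetical test jobs submitted to one CE during a day.
jobs = [("ok", 120), ("ok", 240), ("failed", 30), ("timeout", 3600)]
print(summarise_jobs(jobs))
```

Computing the same summary per CE and over all CEs gives the individual and global views mentioned above.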

Related works and material:

  • The GStat portal already provides some of the reports described above. However, data in GStat cannot be exported, only the last year is reported, and the accuracy of the stored data decreases with time. Having the monitoring data stored in VAPOR allows for a more flexible use of the data in order to compute specific statistical reports later on.
  • The biomed support tools developed by I3S provide python scripts to collect data about storage resources (total, used, free space) and about computing resources (number of waiting and running jobs). The data is collected in csv files. In addition, some analysis scripts produce csv files ready to be drawn as charts.
  • The MonCE tool, developed at IPHC, is a framework that submits jobs to any resource supporting a given VO and reports the number of successful jobs, failed jobs, timed out jobs, and the average waiting time for successful jobs. This tool is already running for biomed, and the data is stored in a MySQL database. Some analysis scripts have been written to start exploiting this data.

VO Data Management: file migration and cleaning procedures

Storage Elements are often critical resources during the execution of scientific workflows on the grid infrastructure. It is therefore critical to handle common issues like storage element filling up or decommissioning. Taking care of such issues requires specific procedures related to the tracking, cleaning up or migration of VO users' data. Overall this is called "VO Data Management".

HEP VOs have set up some procedures to address those issues, thus enforcing a data management policy at the VO level. This relies on the fact that users manage their data through a controlled set of tools. In VOs with fragmented user groups, such overall control is hardly possible, therefore the data management must rely on procedures that are agnostic of users' tools.

In the following, we describe the issues to be tackled in VAPOR, and describe strategies adopted by other communities.

Remove old files

The fact that a user has left a VO is generally detected only when the user's membership expires, either because they were suspended automatically or because they did not sign the Acceptable Usage Policy (AUP) in time. To prevent SEs from filling up with useless data of former users, the files left behind need to be tracked and deleted.

VAPOR helps to automate this process. The envisaged procedure is described below:

  1. Periodically, the VO Manager produces a list of DNs of users whose membership has been expired for more than a given duration, e.g. 2 months: they may have failed to sign the AUP, or were suspended for some other reason.
  2. The LFC administrator (i) retrieves LFC entries belonging to those users, possibly filtering only files older than some minimum age; (ii) changes the ownership of those entries to the VO Manager DN; (iii) provides the VO Manager with this list of entries.
  3. In turn, the VO Manager can remove (lcg-del) the files from the LFC and the SEs simultaneously.
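
Step 1 of this procedure can be sketched in Python as follows. The `members` mapping from DN to membership expiry date is a hypothetical export of the VOMS data, and the 60-day default merely stands for the "2 months" example above.

```python
from datetime import date, timedelta


def expired_dns(members, today, grace_days=60):
    """Return the DNs whose membership expired more than grace_days ago."""
    cutoff = today - timedelta(days=grace_days)
    return sorted(dn for dn, expiry in members.items() if expiry < cutoff)


# Hypothetical membership data: DN -> expiry date.
members = {
    "/CN=Alice": date(2013, 1, 15),   # expired long ago
    "/CN=Bob": date(2013, 4, 20),     # expired, still within the grace period
    "/CN=Carol": date(2013, 12, 31),  # still valid
}
print(expired_dns(members, today=date(2013, 5, 6)))
```

The resulting list is what the VO Manager would hand over to the LFC administrator in step 2.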

It must be studied whether this procedure would meet the needs of different VOs, and whether it is acceptable for VO managers and LFC managers when they are not the same person or group of persons.

Deal with SEs filling up

To prevent SEs from filling up, it is necessary to (i) monitor SEs with little remaining free space, and (ii) trigger a procedure to track VO files and contact the file owners so that they migrate their files to some other SE. Ideally, the procedure should propose alternative SEs with enough free space, which would help users in the migration.
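
Both parts can be illustrated with a short Python sketch. The data layout (SE name mapped to total and free space), the 10% threshold and the SE names are assumptions chosen for the example; how the figures are collected is not shown.

```python
def nearly_full(ses, min_free_ratio=0.1):
    """Return the SEs whose free space is below min_free_ratio of their total."""
    return sorted(name for name, (total, free) in ses.items()
                  if total > 0 and free / total < min_free_ratio)


def alternatives(ses, needed, exclude=()):
    """Return SEs with at least `needed` free space, largest free space first."""
    candidates = [(free, name) for name, (_, free) in ses.items()
                  if name not in exclude and free >= needed]
    return [name for free, name in sorted(candidates, reverse=True)]


# Hypothetical SEs: name -> (total, free), in TB.
ses = {"se01": (100, 5), "se02": (200, 120), "se03": (50, 30)}

full = nearly_full(ses)
print("Nearly full:", full)
print("Alternatives for 40 TB:", alternatives(ses, needed=40, exclude=full))
```

The list of alternatives could be included in the message sent to the file owners.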

Related works and material:

  • The LFCBrowseSE tool, developed by GRyCAP (LSGC member), is able to list all LFC entries which have a replica on a SE given by its hostname. Optionally it provides the owners' DNs and the files' SURLs.
  • On top of LFCBrowseSE, the biomed support tools, developed by I3S, provide the scan-se tool that scans all SEs with little remaining space. For each SE it returns a list of users along with the space they use and their email addresses.

Consistency between LFC and SEs: zombie & ghost files

Zombie files, also known as dark data, are physical file replicas that no longer have an entry in any file catalog of the VO. Such files waste space on SEs and are likely to remain forever unless some procedure points them out.

Conversely, ghost files are entries of the file catalog that no longer have a corresponding physical replica. Such entries load the file catalog with useless data that may hamper the overall catalog performance.

Not all users of a VO may use the same LFC or any other kind of file catalog. In this case, the procedure set up to delete ghost files should allow for the configuration of files to be ignored: for instance, a set of regular expressions describing file names on SEs with well known pattern.
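
The two definitions above boil down to set differences between an SE dump and the catalog content, as in the Python sketch below. The function name, the file names and the pattern are hypothetical, and applying the ignore patterns to the SE-side file names is an assumption about how the configuration would be used.

```python
import re


def compare(se_dump, catalog_replicas, ignore_patterns=()):
    """Return (zombies, ghosts) for one SE.

    se_dump: file names physically present on the SE.
    catalog_replicas: file names the catalog expects on that SE.
    ignore_patterns: regular expressions for SE files that must not be reported.
    """
    ignored = [re.compile(p) for p in ignore_patterns]
    on_se = set(se_dump)
    in_catalog = set(catalog_replicas)
    zombies = {f for f in on_se - in_catalog
               if not any(rx.search(f) for rx in ignored)}
    ghosts = in_catalog - on_se
    return sorted(zombies), sorted(ghosts)


# Hypothetical dump and catalog content.
se_dump = ["/biomed/a.dat", "/biomed/b.dat", "/biomed/other-catalog/x.dat"]
catalog = ["/biomed/a.dat", "/biomed/c.dat"]

zombies, ghosts = compare(se_dump, catalog, [r"^/biomed/other-catalog/"])
print("Zombies:", zombies)
print("Ghosts:", ghosts)
```

Holding both lists as in-memory sets is only suitable for small SEs; large dumps would need a more scalable comparison.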

Related works and material:

The LHCb and ATLAS VOs use the following approach: (i) supporting sites periodically produce a full dump of their SE (file names); (ii) the VO data manager compares this content with the file catalog content, and removes zombie replicas (scalability issues with large SEs), and removes ghost entries from the catalog. The procedure and format of the storage dumps is generic and reusable. The comparison tools for LHCb are very experiment-specific and probably hardly adaptable. But the ATLAS comparison scripts are probably reusable.

VAPOR needs to assess how this procedure could be adapted and automated in the context of other VOs. A test phase should involve a limited set of sites willing to participate. First, the SE dump should be done on demand to check the procedure validity, then dumps may be automated.

QMUL (UK) is ok to start this way. GRIF (FR) has proposed to test/tune the procedures before transmitting to sites, because: (i) some sites lack tooling/knowledge, (ii) procedures that perform too many requests to SE databases may be “SE killer”.

Other possible features

Set up a counter of the number of tickets submitted per site and of the time tickets remain open, so that VO managers can eventually identify problematic sites (if any), and consider corrective actions with them.

Community Accounting

Today, the only accounting tool available grid-wide is the EGI Accounting Portal. It produces per site, per NGI, and per VO resource usage reports. VAPOR intends to provide accounting reports at levels of organisation closer to the user communities: per VRC resource usage, per VO sub-group resource usage.

In VOs with fragmented user groups, estimating the future needs for resources is hardly possible as user groups remain independent. However, making such estimates would be useful for catch-all VOs, so that they can anticipate their needs and negotiate resources with sites and NGIs. In particular, the Resource Allocation Work Group is currently working on a model of pools of resources contributed by NGIs. Such resources would be allocated to VOs in response to a "request for resources", in which VOs estimate the amount of CPU hours that they need.

VAPOR proposes projections of future needs for storage and computing resources of a VO, based on the extrapolation of past accounting data.
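
The extrapolation method is not specified here. As one possible illustration, the Python sketch below fits a straight line to past usage by ordinary least squares and extends it; the choice of a linear model, the function name and the sample figures are all assumptions.

```python
def linear_forecast(history, steps_ahead):
    """Extrapolate equally spaced past values with a least-squares line."""
    n = len(history)
    if n < 2:
        raise ValueError("at least two past values are needed")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The last observation sits at x = n - 1; forecasts follow it.
    return [intercept + slope * (n - 1 + k) for k in range(1, steps_ahead + 1)]


# Hypothetical CPU hours (in thousands) used over four past quarters.
history = [100, 120, 140, 160]
print(linear_forecast(history, steps_ahead=2))
```

Such a projection could feed the estimate of CPU hours required in a "request for resources"; real usage with seasonal peaks would call for a more careful model.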