VT VAPOR:VAPOR features description
The following provides a preliminary description of the VAPOR features.
Status of this document
This document is the result of discussions held between VAPOR partners during the first 3 months of the project (see Indico meetings). Its goal is to scope the features to be implemented in VAPOR and to provide a high-level description of each of them.
What does 'VO Support' consist of?
The mission of VO Support teams is generally to liaise between three actors: the EGI instances (UCST, UCB), the resource centres, and the VO users. The VO Support function consists of several tasks, addressed to varying extents depending on the VO. A non-exhaustive list of VO Support tasks follows:
- VO user administration: accept or refuse subscription requests, manage the Acceptable Usage Policy (AUP), handle membership expiration.
- Provide users with support and expertise on VO-specific applications or science gateways.
- Monitor the services and resources allocated to the VO by NGIs and resource centres, submit GGUS tickets to the resource centre responsible for a faulty resource and help investigate the issue. Monitoring may be performed either using a dedicated VO Nagios instance, VO custom tools or any other test framework. Tests, such as Nagios probes, may be generic VO probes or VO specific.
- Negotiate with NGIs and resource centres the resources allocated to the VO.
- Deal with the problems reported by supporting resource centres regarding VO users, such as excessive job submission, excessive use of storage space, low job efficiency, etc. In such cases, the VO Support is in charge of contacting the user and helping him/her fix the problem.
Depending on the size of each VO (in terms of number of users), and on its experience and funding model, the VO Support may be performed by teams ranging from a single VO manager to a strong dedicated IT support team. As a result, the tasks covered may span from only a fraction to the whole set (or more) of the tasks described above.
VAPOR preferably addresses VOs with little or no IT support, where the VO Support is performed by a single VO manager or by a team of VO users contributing effort on a voluntary basis. VAPOR remains independent of any science gateway or scientific application.
Note: it is assumed that the members of VO Support teams are identified using a specific 'team' role in the VOMS server. It must be checked whether this applies to every VO.
VO users management
VAPOR proposes to implement a Users Database intended to store informative data about the VO users, in addition to the rather simple administrative data available in the VOMS. The users database has three main goals:
- Manage and follow up on users registration life-cycle: registration (VO membership), group membership, membership expiration. The life cycle workflow integrates interactions with third party services: VOMS, LCG File Catalog, EGI Applications Database.
- Track information about users "hidden behind" a robot certificate. This is necessary to have a realistic idea of the number of actual users in a VO.
- Track information about scientific publications to encourage users to acknowledge the usage of EGI resources.
Below we provide some guidelines for the design of the database, and the functions that will exploit it.
The database is multi-VO, i.e. any number of VOs can be supported and managed in VAPOR. Users should keep using the VOMS interface to register; only VO administrators are allowed to access and interact with the VAPOR users database.
Users data to be stored includes:
- General administrative data (DN, email, affiliation, membership duration...): important administrative fields may be made mandatory.
- Free text field: VO administrators must be able to add free text information regarding the user's research field, why the user was accepted into the VO, what their scientific collaborations are, etc. Possibly, during the registration process, the user should be asked to list research fields from the discipline classification recently published by the VT Scientific Discipline Classification.
- Scientific publications: keep track of published works that used the infrastructure, by means of a URL or a free text field.
- DN of robot certificate (if any): users behind a robot certificate may or may not have their own certificate. If they do not, it is important to be able to register them in the database anyway.
- Scientific application used (linked with the EGI Applications Database)
- User's LCG File Catalog base directory
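For illustration only, the fields listed above could be grouped into a record such as the following Python sketch. The class name `VoUser`, the field names and the choice of mandatory fields are assumptions; the actual schema is still to be designed.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class VoUser:
    """One user record of the VAPOR users database (hypothetical schema)."""
    vo: str                           # VO name: the database is multi-VO
    email: str                        # assumed mandatory
    affiliation: str                  # assumed mandatory
    membership_expires: date
    dn: Optional[str] = None          # None if the user has no personal certificate
    robot_dn: Optional[str] = None    # DN of the robot certificate, if any
    notes: str = ""                   # free text added by VO administrators
    publications: List[str] = field(default_factory=list)  # URLs or free text
    applications: List[str] = field(default_factory=list)  # EGI Applications Database entries
    lfc_base_dir: Optional[str] = None

    def is_identifiable(self) -> bool:
        """A user is linked to a personal DN, a robot DN, or both."""
        return self.dn is not None or self.robot_dn is not None
```

Making `dn` optional reflects the requirement that users behind a robot certificate can be registered even when they have no certificate of their own.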
When possible, information that already exists in other data sources (specifically the VOMS) should not be duplicated in the VAPOR database, although this may be unavoidable in several cases for implementation reasons. A process should be able to track and repair inconsistencies between the VOMS and the VAPOR user database. A synchronization mechanism can be developed, based on the VOMS APIs available (GUI, CLI, any other?).
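A minimal sketch of such a consistency check is given below, assuming that the list of member DNs can be obtained from both sides as plain strings. How the DNs are retrieved from the VOMS is left open, as in the text above; the function name and the result keys are hypothetical.

```python
def compare_members(voms_dns, vapor_dns):
    """Report DNs known to only one of the two sources.

    voms_dns:  DNs of the VO members according to the VOMS.
    vapor_dns: DNs stored in the VAPOR users database (users without a
               personal certificate have no DN and are not part of this check).
    """
    voms, vapor = set(voms_dns), set(vapor_dns)
    return {
        "missing_in_vapor": sorted(voms - vapor),  # registered in VOMS, no VAPOR record yet
        "unknown_to_voms": sorted(vapor - voms),   # VAPOR record without VOMS membership
    }
```

Each of the two lists would then be handled differently: the first by creating the missing records, the second by checking whether the membership has expired.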
Related works and material:
- Some Czech national VOs use the Perun user and resource management system. Jiri Chudoba (VO AUGER) can provide information about this portal, as to how this could fit and be reused in the context of VAPOR.
- The VT Scientific Publications Repository Implementation Resources addresses the problem of acknowledging the resource usage.
Identification of users behind robot certificates
Periodically, robot certificate holders must be asked to enter information about real users in the system: at least a number of users, at most individual data (email, etc.). The exact content of this information is to be detailed with robot users. By ticking a check box, the robot certificate owner confirms that he/she is allowed to enter such personal user information with regard to the legislation of his/her country.
Related works and material:
- The EGI Operations Portal identifies robot certificates according to a specific DN scheme.
- Some Slovenian projects have also dealt with robot certificates already. Jan Jona Javorsek (Jozef Stefan Institute, Slovenia) proposed to add attributes into proxy certificates so that the different intermediaries may be traced back to the original job submitter.
User life cycle management
Currently, a new user registers on the VOMS of the VO he/she wishes to join. Then, the VOMS sends an email notification to the VO administrator who may ask the user to provide additional details about his/her activity and affiliation. Then, the VO administrator manually approves the request.
The goal is to keep the VOMS as the service to initiate the subscription of a user. The features below automate the management of subscription requests received from the VOMS:
- Registration: filter email notifications sent by the VOMS to the VO administrators (no other API seems to be available that would, e.g., allow an application to register for VOMS notifications).
- Check the validity of the user's DN: no quotes or double quotes are allowed, as they are not supported by some services in the infrastructure. If the DN is invalid, automatically reply and ask the user to request a new certificate.
- Automatically send an email asking the user to provide details about their activity and affiliation, and the applications they use (pointer to the EGI Applications Database).
- Display a list of pending subscriptions that the administrator can manually approve or reject. The action of the administrator is reflected on the VOMS server.
- Automatically create the user's root directory in the LCG File Catalog (LFC) with appropriate ACLs. ACLs may be manually relaxed later depending on the user's needs.
- Link user with the existing applications that they use in the EGI Applications Database.
- Add the user's email address to the VO users mailing list.
- Membership expiration: filter email notifications sent by the VOMS, and trigger:
- the cleaning up of files of expired users after a grace period (see Data Management features),
- the removal of the user's email address from the VO users mailing list.
- Mailing list management:
- The integration with different mailing list systems should be studied starting with Mailman and GoogleGroups, to keep a mailing list in sync with the current list of users.
- A function must allow the list of email addresses of the VO users to be exported easily.
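The DN validity check and the resulting automatic reply could be sketched as follows. The function names and the action labels are hypothetical; only the rule itself (no single or double quotes in the DN) comes from the list above.

```python
def dn_is_valid(dn: str) -> bool:
    """Reject DNs containing single or double quotes."""
    return "'" not in dn and '"' not in dn


def triage_request(dn: str) -> str:
    """Decide which automatic email follows a VOMS registration notification."""
    if not dn_is_valid(dn):
        return "ask_new_certificate"  # reply asking for a certificate with a clean DN
    return "ask_details"              # ask for activity, affiliation and applications used
```

In both cases the request would still appear in the list of pending subscriptions, since the final approval remains a manual action of the administrator.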
Optional additional features:
- Show a summary of the activity on the VO management mailing list: filter out VOMS email notifications that are already taken care of, and display other messages so that someone deals with them.
- Collect feedback on the infrastructure, and on scientific production (publications). For this purpose, different alternatives may be considered:
- Use an existing platform such as Google Scholar or ResearchGate, and add the publications from the members of the VO, after requesting their consent, to a specific joint profile. This will create H-index factors immediately, and the profile can be easily referenced and shared.
- Create a database with the entries in different databases and directly link to them. This will be more tedious but less prone to issues due to changes on the interfaces or procedures in the third party tools.
VO Operations management for VO Support teams
Below we provide a short review of existing VO operation tools, and we make a specific focus on the VO Operations Dashboard, as it has been agreed that VAPOR be integrated into this existing dashboard (see MoM with the dashboard developers team).
For VOs using a dedicated VO Nagios instance, the VO Support work may be based on the Nagios GUI itself. Additionally, operation tools are able to sort and display Nagios alarms in a more user-friendly manner. VAPOR does not aim to develop such a tool; rather, it focuses on complementing existing tools with additional tasks.
VO Operations Dashboard
The VO Operations Dashboard provides features that greatly help in the daily monitoring of the resources. For information, here are some of its features:
- Filter and classify alarms from the VO Nagios box: keep track of alarms status, detailed log traces of the probe that triggered an alarm.
- Cross data with GGUS: show open tickets for a given resource, in the scope of a specific VO. Specific marker for alarms that are currently managed (typically when a GGUS ticket has been submitted and is in "waiting for response" status).
- Cross data with GOCDB status (downtimes, "not in production" and "not monitored") in order to filter out resources in such status from the alarms displayed.
- Wizard to speed up the submission of TEAM tickets in GGUS: use of configurable template per type of resource (SE, CE, WMS, etc.)
- Integrate hostname-to-sitename resolution to speed-up ticket submission process: based on a resource hostname (a SE or CE for instance), the ticket is automatically assigned to the resource centre responsible for that resource.
- Store and share notes that can serve as take-over reports between teams that relay each other during duty shifts.
Some other features are not implemented yet but under study:
- Send an email to a VO user by giving their DN (already exists in the EGI Operations Portal).
- Cross data with BDII status: show the status when it is not "production" along with the alarm (to be implemented)
- To avoid submitting duplicate tickets, search for open GGUS tickets for the same service/hostname, submitted by other VOs.
- Advertise critical resources downtimes: the EGI Operations Portal sends email notifications based on information filtered from the downtimes declared in the GOCDB. It could also allow filters to be defined that select only a subset of the services supporting a VO, e.g. to notify only critical services downtimes.
VO Admin Dashboard
The VO Dashboard project is developed by LIP to integrate all the relevant information to the VO management. It provides a single entry point for viewing all the data, providing all the key information at a glance.
Resource status indicators, statistical reports
Report GOCDB and BDII status
Before they submit a new GGUS ticket to report a problem, members of a VO Support team should know if the faulty resource is in proper production status, or on the contrary if it is in a non-production status of any kind. Two different information sources must be checked:
- The GOCDB provides administrative information as to the resource status: site suspended, resource or site downtime, resource not in production, resource not monitored
- The BDII provides the resource dynamic status such as production, not in production, closed, drained.
Operation tools filter out resources in downtime (GOCDB status). They generally do not report the BDII status of the resources, assuming that the BDII and GOCDB statuses are consistent. However, it happens that the BDII status is not in production, although the GOCDB status is in proper production. For instance, a site administrator may decide to close a CE queue for a VO to prevent the queue from being overloaded by a nasty user tool, although the CE remains in production for other VOs and thus remains in production status in the GOCDB. Conversely, when a resource centre (site) is suspended, the site is marked as suspended in the GOCDB although resources may remain in production in the BDII.
For this reason, VAPOR proposes that both statuses, GOCDB and BDII, be displayed to the VO Support team, in order to avoid submitting tickets about a resource that is not in proper production. This information must be easily accessible, in a concise form.
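The consolidated view could be built along the lines of the sketch below. It assumes that both statuses have already been collected and normalised into plain strings, with "production" denoting the nominal state; the collection itself (GOCDB and BDII queries) is outside the scope of this example, and the function name is hypothetical.

```python
def non_production_report(statuses):
    """List resources for which at least one source reports a non-production status.

    statuses: dict mapping a resource hostname to a (gocdb_status, bdii_status)
              pair of normalised strings (an assumed format).
    """
    report = []
    for host in sorted(statuses):
        gocdb, bdii = statuses[host]
        if gocdb != "production" or bdii != "production":
            report.append((host, gocdb, bdii))
    return report
```

Showing both columns side by side makes the two inconsistent cases described above visible: a closed queue on a CE that is in production in the GOCDB, and a suspended site whose resources still appear in production in the BDII.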
Related works and material:
- The biomed support tools developed by I3S provide a python script that collects the status of all resources (CE/SE/WMS) that support a given VO (configurable) and reports any resource in non-production status.
- The VO Operations Dashboard provides views of Nagios alarms along with non-production status retrieved from the GOCDB. At the time of writing, it is proposed to also integrate the status from the BDII. In any case, VAPOR should be able to provide such a consolidated list, in order to provide VO Support teams with a quick overview of non-production statuses.
Advertise critical resources downtimes
The VOMS and LFC are the most critical services of a VO. To be informed of scheduled and unscheduled downtimes of such services, any user can register for the notifications sent automatically by the EGI Operations Portal. However, most users do not do so and probably do not even know about this service. Therefore, such downtimes should be advertised to the VO users automatically.
To implement this feature, it is necessary to check whether an API allows an application to subscribe to notifications using a defined filter. If no API is available, a solution is to monitor the downtime email broadcasts sent by the EGI Operations Portal, and automatically filter those impacting the critical services of the VO. A warning email is then sent to the VO users. This action should preferably be validated by an administrator before the email is sent.
Related works and material: this could be done in the Operations Portal itself, by filtering only a subset of services.
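The filtering step could look like the following sketch. The structure of a downtime record (a dictionary with `hostname`, `start` and `end` keys) is an assumption, as is the way the records are obtained; the example only shows the selection of critical services and the preparation of draft warnings awaiting validation.

```python
def downtime_warnings(downtimes, critical_hosts):
    """Prepare draft warning emails for downtimes that hit critical services.

    downtimes:      iterable of dicts with 'hostname', 'start' and 'end' keys
                    (hypothetical format).
    critical_hosts: set of hostnames of the critical services of the VO,
                    e.g. its VOMS and LFC servers.
    """
    drafts = []
    for d in downtimes:
        if d["hostname"] in critical_hosts:
            drafts.append({
                "subject": "Downtime of %s" % d["hostname"],
                "body": "%s will be unavailable from %s to %s."
                        % (d["hostname"], d["start"], d["end"]),
                "approved": False,  # set by an administrator before sending
            })
    return drafts
```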
Monitor resources availability
VO Support teams need tools that help them have a clear overview of the activity of their VO in terms of storage and computing resources. VAPOR proposes to generate on-demand reports of the following types:
- Computing resources:
- Follow up of running vs. waiting jobs of the CEs supporting the VO, either globally (chart) or by individual CE (table): this helps monitor peaks of activity by the VO users, and possibly figure out bottlenecks.
- Quality of service on CEs supporting the VO: to monitor CEs more accurately than Nagios probes, it is proposed to submit jobs several times a day on all CEs supporting the VO, and report the ratio of successful jobs, failed jobs, timed out jobs (waiting for more than a configurable duration), average waiting time for successful jobs.
- Handle a white-list of resources: publish a list of storage and computing resources that "work fine now". Based on the reports described above, it should be possible to produce a frequently updated list of resources with a good quality of service, e.g. storage elements with large free space and good response time, or computing elements with a high rate of successful jobs and a low average waiting time. The value of such a list depends on whether job submission systems can rely on it to choose the CEs where jobs are submitted: WMS, pilot job systems, etc. all have their own format expectations. It must also be studied whether such a list would indeed improve the efficiency of users' applications as compared to existing pilot job systems, for instance.
- Storage resources:
- Show total, used and free space on SEs supporting the VO, globally and by individual SE.
- Show the variety of SEs by implementation type (DPM, dCache, Storm...) and version
- Overall quality of service indicators (lower priority): number of events, number of tickets submitted (by resource or by site, average time tickets remain open etc.), percentage of time down. Such indicators could help identify resources with recurring issues, and eventually consider corrective actions.
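The CE quality-of-service indicators and the white-list derived from them could be computed as in the sketch below. The outcome labels ("ok", "failed", "timeout"), the function names and the default thresholds are assumptions chosen for the example.

```python
def ce_quality(outcomes):
    """Compute the indicators of one CE from its test job outcomes.

    outcomes: list of (status, wait_seconds) pairs, where status is "ok",
              "failed" or "timeout" and wait_seconds is only used for "ok".
    """
    total = len(outcomes)
    if total == 0:
        return None
    waits = [wait for status, wait in outcomes if status == "ok"]
    return {
        "success": len(waits) / total,
        "failed": sum(1 for status, _ in outcomes if status == "failed") / total,
        "timed_out": sum(1 for status, _ in outcomes if status == "timeout") / total,
        "avg_wait": sum(waits) / len(waits) if waits else None,
    }


def white_list(results, min_success=0.9, max_wait=600):
    """Select CEs with a high success ratio and a low average waiting time."""
    selected = []
    for ce in sorted(results):
        quality = ce_quality(results[ce])
        if (quality is not None
                and quality["avg_wait"] is not None
                and quality["success"] >= min_success
                and quality["avg_wait"] <= max_wait):
            selected.append(ce)
    return selected
```

The output format of the white-list is deliberately left as a plain list of hostnames, since each job submission system has its own format expectations.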
Related works and material:
- The GStat portal already provides some of the reports described above. However data in GStat cannot be exported, only the last year period is reported, and the accuracy of the data stored decreases with time. Having the monitoring data stored in VAPOR allows for a more flexible use of the data in order to compute specific statistical reports later on.
- The biomed support tools developed by I3S provide python scripts to collect data about storage resources (total, used, free space), data about computing resources (number of waiting and running jobs). The data is collected in csv files. In addition, some analysis scripts produce csv files ready to be drawn as charts.
- The JobMonitor tool, developed at IPHC (CNRS), is a framework that submits jobs to any resource supporting a given VO and reports the number of successful jobs, failed jobs, timed out jobs, and the average waiting time for successful jobs. This tool is already running for biomed, and the data is stored in a MySQL database. Some analysis scripts have been written to start exploiting this data.
VO Data Management: file migration and cleaning procedures
Storage Elements are often critical resources during the execution of scientific workflows on the grid infrastructure. It is therefore critical to handle common issues like storage element filling up or decommissioning. Taking care of such issues requires specific procedures related to the tracking, cleaning up or migration of VO users' data. Overall this is called "VO Data Management".
HEP VOs have set up some procedures to address those issues, thus enforcing a data management policy at the VO level. This relies on the fact that users manage their data through a controlled set of tools and documented procedures. In VOs with fragmented user groups, it is hardly possible to make this overall control, therefore the data management must rely on procedures that are agnostic of users tools and scientific applications.
In the following, we describe the issues to be tackled in VAPOR, and describe strategies adopted by other communities.
Remove old files
The fact that a user has left a VO is generally detected only when the user's membership expires, either because they were suspended automatically or because they did not sign the Acceptable Usage Policy (AUP) in time. To prevent SEs from filling up with useless data of former users, the files left behind must be tracked and deleted.
VAPOR helps to automate this process. The envisaged procedure is described below:
- Periodically, the VO Manager produces a list of DNs of users with membership expired for more than a given duration, e.g. 12 months: they may have failed to sign the AUP, or were suspended for some other reason.
- The LFC administrator (i) retrieves LFC entries belonging to those users, possibly filtering only files older than some minimum age; (ii) changes the ownership of those entries to the VO Manager DN; (iii) provides the VO Manager with this list of entries.
- In turn, the VO Manager can remove (lcg-del) the files from the LFC and the SEs simultaneously.
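The first step of this procedure, selecting the DNs whose membership has been expired for longer than the grace period, could be sketched as follows. The input format (a mapping from DN to expiration date) and the function name are assumptions; the LFC operations of the later steps are not covered.

```python
from datetime import date, timedelta


def expired_dns(expirations, today, grace_days=365):
    """Return the DNs whose membership expired more than grace_days ago.

    expirations: dict mapping a user DN to the membership expiration date.
    """
    cutoff = today - timedelta(days=grace_days)
    return sorted(dn for dn, expired_on in expirations.items() if expired_on < cutoff)
```

For example, with `today=date(2014, 1, 1)` and the default grace period, a membership that expired on 1 June 2012 is selected, whereas one that expired on 1 June 2013 is not.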
It must be studied if this procedure would meet the needs of different VOs, and is acceptable for VO managers and LFC managers when they are not the same person or group of persons.
Deal with SEs filling up
To prevent SEs from filling up, it is necessary to (i) monitor SEs with little remaining free space, and (ii) trigger a procedure to track VO files and contact the file owners so that they migrate their files to some other SE. Ideally, the procedure should propose alternative SEs with enough free space, to help users with the migration.
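The monitoring step and the proposal of alternative SEs could be sketched as follows, assuming that the free space of each SE supporting the VO is already known (for instance from the storage reports described earlier). The function name and the two thresholds are hypothetical.

```python
def plan_migration(free_space, low_threshold, min_target_free):
    """Identify SEs that are filling up and candidate SEs to migrate files to.

    free_space:      dict mapping an SE hostname to its free space in bytes.
    low_threshold:   an SE with less free space than this is considered full.
    min_target_free: minimum free space for an SE to be proposed as a target.
    """
    full = sorted(se for se, free in free_space.items() if free < low_threshold)
    # Propose the SEs with the most free space first.
    targets = sorted(
        (se for se, free in free_space.items() if free >= min_target_free),
        key=lambda se: -free_space[se],
    )
    return full, targets
```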
Related works and material:
- The LFCBrowseSE tool, developed by GRyCAP (LSGC member), is able to list all LFC entries which have a replica on an SE given by its hostname. Optionally it provides the owners' DNs and the files' SURLs.
- On top of LFCBrowseSE, the biomed support tools, developed by I3S, provide the scan-se tool that scans all SEs with little remaining space. For each SE it returns a list of users along with the space they use and their email addresses.
In case of SE decommissioning or resource centre decommissioning, VOs must start a decommissioning procedure to track VO files and contact the file owners so that they migrate their files to some other SE. Ideally, the procedure should propose alternative SEs with enough free space, to help users with the migration.
This is very close to the management of full SEs. The same related tools apply.
Consistency between LFC and SEs: zombie & ghost files
Zombie files, also known as dark data, are physical file replicas that no longer have an entry in any file catalog of the VO. Such files waste space on SEs and are likely to remain forever. A procedure must take care of this issue by regularly (typically once every 1 or 2 months) tracking and removing the zombie replicas.
Conversely, ghost files are entries of the file catalog that no longer have a corresponding physical replica. Such entries load the file catalog with useless entries that may hamper the overall catalog performance.
Users of a VO may not all use the same LFC or may use any other kind of file catalog. In this case, the procedure set up to delete ghost files should allow for the configuration of (i) file name patterns, described by regular expressions, that should be ignored by the clean-up procedure, (ii) a minimum age, e.g. only files older than 1 year should be deleted.
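The comparison between an SE dump and the catalog content can be sketched as two set differences, with the configurable exclusions applied on top. The input formats and names are assumptions. In this sketch the ignore patterns are applied to both zombie and ghost candidates, and the minimum age only to the physical replicas, since the age is taken from the SE dump.

```python
import re


def compare_dump_with_catalog(se_dump, catalog, ignore_patterns=(), min_age_days=0):
    """Return (zombies, ghosts) for one SE.

    se_dump: dict mapping the SURL of each physical replica to its age in days.
    catalog: set of SURLs that the file catalog records for this SE.
    """
    ignored = [re.compile(pattern) for pattern in ignore_patterns]

    def skip(surl):
        return any(pattern.search(surl) for pattern in ignored)

    # Zombies: on the SE, not in the catalog, old enough, not excluded.
    zombies = sorted(
        surl for surl, age in se_dump.items()
        if surl not in catalog and age >= min_age_days and not skip(surl)
    )
    # Ghosts: in the catalog, not on the SE, not excluded.
    ghosts = sorted(
        surl for surl in catalog
        if surl not in se_dump and not skip(surl)
    )
    return zombies, ghosts
```

The minimum age also protects files that were written to the SE after the catalog content was read, which would otherwise be reported as zombies.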
Related works and material:
The LHCb, CMS and ATLAS VOs use the following approach: (i) supporting sites periodically produce a full dump of their SE (file names); (ii) the VO data manager compares this content with the file catalog content, removes zombie replicas from the SE (scalability issues with large SEs), and removes ghost entries from the catalog. The procedure and format of the storage dumps is generic and reusable. The comparison tools for LHCb are very experiment-specific and probably hardly adaptable. But the ATLAS comparison scripts are probably reusable. Very interesting material is described in this page. More particularly:
- ATLAS sites produce xml format storage dumps for DPM following procedures explained in this DDM operations twiki, this includes the dpm_dump.py script.
- CMS sites use a very similar DPM dump script dpmdump.py.
VAPOR needs to assess how this procedure could be adapted and automated in the context of other VOs. A test phase should involve a limited set of sites willing to participate. First, the SE dump should be done on demand to check the procedure validity, then dumps may be automated. QMUL (UK) is ok to start this way. GRIF (FR) has proposed to test/tune the procedures before transmitting to sites, because: (i) some sites lack tooling/knowledge, (ii) procedures that perform too many requests to SE databases may be “SE killer”.
Today, the only accounting tool available grid-wide is the EGI Accounting Portal. It produces per site, per NGI, and per VO resource usage reports. VAPOR intends to provide accounting reports at levels of organisation closer to the user communities: per VRC resource usage, per VO sub-group resource usage.
In VOs with fragmented user groups, estimating the future needs for resources is hardly possible as user groups remain independent. However, making such estimates would be useful for catch-all VOs, to be able to anticipate their needs and negotiate resources with sites and NGIs. In particular, the Resource Allocation Work Group currently works on the model of pools of resources contributed by NGIs. Such resources would be allocated to VOs in response to a "request for resources", in which VOs estimate the amount of CPU hours that they need.
VAPOR proposes projections of future needs for storage and computing resources of a VO, based on the extrapolation of past accounting data from.
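As one possible form of extrapolation, the sketch below fits a straight line to past usage figures by least squares and extends it to future periods. The choice of a linear model is an assumption made for the example; the text above does not specify the extrapolation method, and the function name is hypothetical.

```python
def linear_projection(history, periods_ahead):
    """Extrapolate a usage series with a least-squares straight line.

    history:       past usage per period (e.g. CPU hours per month), oldest first.
    periods_ahead: number of future periods to project.
    """
    n = len(history)
    if n < 2:
        raise ValueError("at least two past periods are needed")
    mean_x = (n - 1) / 2          # periods are numbered 0 .. n-1
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n - 1 + k) for k in range(1, periods_ahead + 1)]
```

A linear trend is only a first approximation: it ignores seasonal peaks of activity, so the resulting figures would need to be reviewed before being used in a request for resources.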