
Integrating Reference Datasets Workplan

Revision as of 08:34, 28 November 2014 by Nunolf (talk | contribs)





Tasks

Overview

ID | Task | Leader | Participants | Effort
T1 | Identify existing life science datasets in EGI | Gergely Sipos | Fotis E. Psomopoulos, Afonso Duarte | 2 months
T2 | Identify reference datasets for replication | Fotis E. Psomopoulos | Giacinto Donvito, Nuno Ferreira | 3 months
T3 | EGI AppDB extension to a dataset registry | William Karageorgos | Marios Chatziangelou | 7 months
T4 | Tools for data replication | Giacinto Donvito | Fotis E. Psomopoulos, Afonso Duarte, Jona Javorsek, Marios Chatziangelou, Lukasz Dutka | 6 months
T5 | Analysis tools to work with data replicas | Afonso Duarte | Fotis E. Psomopoulos, Nuno Ferreira, ELIXIR delegates | 3 months
T6 | Integration with ELIXIR Registry | Marios Chatziangelou | Jon Ison | 2 months

Description

The project between EGI and ELIXIR consists of the following tasks:

Identify existing life science datasets in EGI

Identify existing biological reference dataset replicas within the EGI infrastructure, together with the key characteristics that make them usable for analysis (such as dataset version, source, access mode, related analysis tools, size, update frequency, tools used for replication, etc.). The task will survey resource providers and life science users of EGI and will look for datasets in the EGI information system and/or other EGI registries. The expected output of the task is an informative table about the datasets that are available on EGI and their key characteristics for users and resource providers. → Milestone 1

  • Leader: Gergely Sipos (EGI.eu)
  • Contributors: Fotis E. Psomopoulos (AUTH), Afonso Duarte (ITQB)
  • Estimated length: 2 months

Identify reference datasets for replication

Identify key biological reference datasets from life sciences that would benefit from replication to EGI sites, for example to increase their availability or the scalability of access. The task will identify, engage with and survey life science data providers and data users, including developers of the ELIXIR tools registry. The expected output of this task is an informative table about life science reference datasets that should be made available on EGI, together with the key characteristics that resource providers and users need to replicate and use them (i.e. metadata describing, for example, the size, update frequency, preferred access mode, related tools, etc.). → Milestone 2

  • Leader: Fotis E. Psomopoulos (AUTH)
  • Contributors: Giacinto Donvito (INFN), Nuno Ferreira (EGI.eu)
  • Estimated length: 3 months

EGI AppDB extension to a dataset registry

Extend the EGI Applications Database (AppDB) with new capabilities to expose information about biological reference datasets and their replicas across EGI. Key characteristics of these datasets should be made available by AppDB in the form of metadata for life science users. The initial dataset metadata schema should consist of basic attributes such as name, locations, size, and type; when input from tasks 1 & 2 becomes available, the schema should be revisited in order to identify any additional characteristics that may need to be included. A new access group should also be created, in order to allow particular individuals to input the actual initial metadata, once tasks 1 & 2 are complete. → Deliverable 1

  • Leader: William Karageorgos (IASA)
  • Contributors: Marios Chatziangelou (IASA)
  • Estimated length: 7 months
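
The initial metadata schema named in this task (name, locations, size, type) can be pictured with a minimal sketch. This is not the AppDB data model: the class name `DatasetRecord`, the field names, the byte unit for size and the example values are all assumptions made for illustration, and the schema is expected to change once input from tasks 1 & 2 is available.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical registry entry; field names are illustrative, not AppDB's."""
    name: str           # human-readable dataset name
    dataset_type: str   # free-text type label (assumed; no vocabulary is defined yet)
    size_bytes: int     # size of one complete copy, in bytes (unit is an assumption)
    locations: List[str] = field(default_factory=list)  # sites or URLs holding a replica

    def add_location(self, location: str) -> None:
        """Register a replica location, ignoring duplicates."""
        if location not in self.locations:
            self.locations.append(location)

    def replica_count(self) -> int:
        """Number of distinct locations currently recorded."""
        return len(self.locations)


# Placeholder values only; no real dataset or site is implied.
record = DatasetRecord(
    name="example-dataset",
    dataset_type="reference",
    size_bytes=5 * 1024**3,
)
record.add_location("site-a.example.org")
record.add_location("site-b.example.org")
record.add_location("site-a.example.org")  # duplicate, ignored
print(record.replica_count())
```

Keeping locations as a list on the dataset record reflects the task's wording that one dataset may have several replicas across EGI; additional characteristics from tasks 1 & 2 would become further fields.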

Tools for data replication

Identify and propose suitable software tools, software configurations, operational practices and documentation for those who want to replicate key biological reference datasets to the EGI infrastructure. The tools can be relevant for resource providers who replicate complete datasets for groups of users, and for life science researchers who replicate parts of reference datasets for custom analysis. The task will also set up a distributed testbed where the proposed tools and configurations can be tested and validated with real reference datasets and applications by life science communities. The expected outputs of the task are:

  1. Recommended services to replicate reference biological datasets to EGI (software, software configurations, operational practices, documentation) → Deliverable 2.
  2. A distributed testbed where the recommended service portfolio for replication is deployed and where reference life science datasets are replicated → Deliverable 3
  3. An evaluation of the recommended services on the testbed by resource providers and by life science users (e.g. an online survey or a face-to-face workshop) → Milestone 3

  • Leader: Giacinto Donvito (INFN/Bari)
  • Contributors: Fotis E. Psomopoulos (AUTH), Afonso Duarte (ITQB), Jona Javorsek (JSI), Marios Chatziangelou (IASA/GR), Lukasz Dutka (CYFRONET)
  • Estimated length: 6 months

Analysis tools to work with data replicas

Identify and provide guidance for the use of key life science software applications and tools that can be used to work with reference datasets on EGI. Life science researchers can use these tools to define and execute custom analyses that work on reference datasets hosted on EGI. The task will review the identified tools on the distributed testbed of Task 4, and will provide information about these tools to users at a central location, ideally as software profiles in EGI AppDB. → Deliverable 4

  • Leader: Afonso Duarte (ITQB)
  • Contributors: Fotis E. Psomopoulos (AUTH), Nuno Ferreira (EGI.eu), ELIXIR (delegates to be involved)
  • Estimated length: 3 months

Integration with ELIXIR Registry

This task is a collaboration between the developers of the EGI AppDB and the ELIXIR service registry to federate information about ‘biological reference datasets’ from AppDB to the ELIXIR registry. The task will make content from the EGI AppDB visible to the broader life sciences community. The output of this task is a technical integration between the ELIXIR Registry and the EGI AppDB, so that content about reference datasets hosted on EGI can be federated from the EGI AppDB into the ELIXIR registry. → Deliverable 5

  • Leader: Marios Chatziangelou (IASA)
  • Contributors: Jon Ison (EBI)
  • Estimated length: 2 months
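
One way to picture the federation step is an export of dataset records in a machine-readable form that another registry could ingest. The sketch below is only an illustration: this page does not describe the exchange format used by AppDB or the ELIXIR registry, so the function name `to_export_payload`, the JSON layout and the record keys are all hypothetical.

```python
import json
from typing import Dict, List, Sequence

# Attribute names follow the basic schema listed for Task 3; the keys are assumed.
REQUIRED_KEYS = ("name", "locations", "size", "type")


def to_export_payload(records: List[Dict], required: Sequence[str] = REQUIRED_KEYS) -> str:
    """Serialise complete records to JSON, skipping any record that lacks a required key."""
    complete = [
        {key: rec[key] for key in required}
        for rec in records
        if all(key in rec for key in required)
    ]
    return json.dumps({"datasets": complete}, indent=2, sort_keys=True)


# Placeholder records only; no real dataset or site is implied.
records = [
    {"name": "example-dataset", "locations": ["site-a.example.org"],
     "size": 1024, "type": "reference"},
    {"name": "incomplete-dataset", "size": 2048},  # no locations or type, so skipped
]
payload = to_export_payload(records)
print(payload)
```

Filtering out incomplete records before export is a design choice of this sketch, made so that the receiving registry only sees entries carrying every basic attribute.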


Milestones / Deliverables

Overview

Description

M1

Task 1 will provide an informative table about the datasets that are available on EGI, together with their key characteristics for users and resource providers (i.e. metadata describing the datasets, such as size, update frequency, preferred access mode, etc.). This milestone will be used by Task 3 and Task 6 to implement metadata structures in AppDB and the ELIXIR registry that provide useful information about datasets. The milestone will also be used by Task 5 to identify analysis tools that can work with the existing reference dataset replicas.

M2

Task 2 will provide an informative table about life science reference datasets that should be made available on EGI, together with the key characteristics that resource providers and users need to replicate and use them (i.e. metadata describing, for example, the size, update frequency, preferred access mode, related tools, etc.). This milestone will be used by Task 3 and Task 6 to implement the data structure that AppDB and the ELIXIR registry should use to provide information about datasets, and to populate these registries with content.

D1

Task 3 will deliver an extended version of the EGI Applications Database that exposes information about biological reference datasets and their replicas across EGI.

D2

Task 4 will deliver recommended services for those who want to replicate key biological reference datasets to the EGI infrastructure. The tools can be relevant for resource providers who replicate complete datasets for groups of users, and for life science researchers who replicate parts of reference datasets for custom analysis. The services are expected to consist of software, software configurations, operational practices and documentation.

D3

Task 4 will deliver a distributed testbed where the recommended services (D2) are deployed and where reference life science datasets are replicated.

M3

Task 4 will provide an evaluation of the recommended services for dataset replication (D2). The evaluation will be performed by resource providers and life science users on the distributed testbed (D3) in the most suitable way, e.g. an online survey or a face-to-face workshop.

D4

Task 5 will provide information about key life science software applications and tools that life science researchers can use to define and execute custom analyses that work on reference datasets hosted on EGI. The information will be published at a central location, ideally as software profiles in EGI AppDB.

D5

Task 6 will deliver technical integration between the ELIXIR Registry and the EGI AppDB, so content about reference datasets hosted on EGI can be federated from the EGI AppDB into the ELIXIR registry.