Integrating Reference Datasets

General Project Information

Project title : Integrating life science reference datasets within EGI
Proposers : Fotis E. Psomopoulos, Giacinto Donvito
Coordinator : Fotis E. Psomopoulos
Mailing list : lsrdi-vt .at. mailman.egi.eu
Start Date : 1st of December, 2014
Foreseen length : 9 months

Motivation

There has been significant work in EGI to help the deployment and discovery of services, where “services” can be either computationally oriented (such as batch queues) or application oriented (such as web services, ready-to-use applications embedded in portal gateways, or applications encapsulated in Virtual Machine Images). However, in bioinformatics many services used for analysis rely on public reference datasets. Reference datasets are growing large, and users struggle to discover, download and compute with them. There is an increasing demand to run computations where the reference datasets are located. EGI members already host some biological reference datasets across the infrastructure; however, EGI currently provides neither discovery capabilities for the available datasets nor guidelines for those who wish to use these datasets or to replicate additional datasets onto EGI sites. The project will facilitate the discovery of existing reference datasets in EGI, and will develop and deploy services that allow data providers, resource providers and researchers to replicate life science reference datasets, and life science researchers to use these datasets in analysis applications.

Tasks

Resources

EGI and ELIXIR will contribute equally to cover the cost of this pilot. Contributions will initially be covered from already running projects (such as EGI-InSPIRE), but opportunities for additional funding will be explored during the work. The partners may organise a joint workshop during the project to help it achieve specific goals.

Benefits

The project will benefit ELIXIR by establishing:

A set of tools and recommendations that would help ELIXIR members and partners:
- Achieve a more balanced load on storage resources across their sites
- Offload user analysis jobs from large centres to partner sites (with data replicas)
- Perform data processing at national or home institute resources
A pilot infrastructure that includes:
- Key datasets for life science analysis workflows
- Information about applications and tools that researchers can choose from to work with reference datasets
- A registry that provides information for users about the reference datasets and about the tools that are available to interact with these data
A group of experts who can:
- Guide the setup of production infrastructures based on the pilot infrastructure
- Themselves become providers in production systems

The project will benefit EGI by:

Capturing and documenting knowledge about tools, methods and solutions to replicate large datasets onto the e-infrastructures (grid, cloud)
Producing tools and best practices that are reusable and relevant to other scientific domains (not only life sciences)
Broadening EGI's scope from a compute infrastructure to a data infrastructure
Strengthening the social network between the EGI and ELIXIR communities
