Integrating Reference Datasets

From EGIWiki

General Project Information

Motivation

There has been significant work in EGI to help the deployment and discovery of services, where “services” can be either computationally oriented (such as batch queues) or application oriented (such as web services, ready-to-use applications embedded in portal gateways, or applications encapsulated in Virtual Machine Images). However, in bioinformatics many services used for analysis rely on public reference datasets. These reference datasets are growing large, and users struggle to discover, download and compute with them. There is an increasing demand to run computations where the reference datasets are located. EGI members already host some biological reference datasets across the infrastructure; however, EGI currently provides neither discovery capabilities for the available datasets nor guidelines for those who wish to use these datasets or replicate additional datasets onto EGI sites. The project will facilitate the discovery of existing reference datasets in EGI, and will develop and deploy services that allow data providers, resource providers and researchers to replicate life science reference datasets, and life science researchers to use these datasets in analysis applications.

Tasks

  1. Identify existing life science datasets in EGI
  2. Identify reference datasets for replication
  3. EGI AppDB extension to a dataset registry
  4. Tools for data replication
  5. Analysis tools to work with data replicas
  6. Integration with ELIXIR Registry

Resources

EGI and ELIXIR will contribute equally to cover the cost of this pilot. Contributions will initially be covered from already running projects (such as EGI-InSPIRE), but opportunities for additional funding will be explored during the work. The partners may organise a joint workshop during the project to help it achieve specific goals.

Benefits

The project will benefit ELIXIR by establishing:

  1. A set of tools and recommendations that would help ELIXIR members and partners
    • Achieve more balanced load on storage resources across their sites
    • Unload user analysis jobs from large centres to partner sites (with data replicas)
    • Perform data processing at national or home institute resources
  2. A pilot infrastructure that includes
    • Key datasets for life science analysis workflows
    • Information about applications and tools that researchers can choose from to work with reference datasets
    • A registry that provides information for users about the reference datasets and about the tools that are available to interact with these data
  3. A group of experts who can
    • Guide the setup of production infrastructures based on the pilot infrastructure
    • Become providers in production systems themselves

The project will benefit EGI by:

  1. Capturing and documenting knowledge about tools, methods and solutions to replicate large datasets onto the e-infrastructures (grid, cloud)
  2. Producing tools and best practices that are reusable and relevant to other scientific domains (not only life sciences)
  3. Broadening EGI's scope from a compute infrastructure to a data infrastructure
  4. Strengthening the social network between the EGI and ELIXIR communities