VT Scalable Access to Federated Data

From EGIWiki



General Project Information

Motivation

Several solutions exist for federated storage management in High Throughput Computing, whether grid-style or cloud-based, but they are not yet widely available in EGI as validated platforms capable of meeting the performance requirements of Research Infrastructures. The problem to be addressed is the processing and visualization of large datasets whose volume makes transfer unfeasible and therefore requires moving the computation to the data.

For example, "large amounts of image stacks or volumetric data are produced daily at brain research sites around the world. This includes human brain imaging data in clinics, connectome data in research studies, whole brain imaging with light-sheet microscopy and tissue clearing methods or micro-optical sectioning techniques, two-photon imaging, array tomography, and electron beam microscopy." Similar requirements are emerging from other areas like structural biology and life sciences.

A key challenge in making such data available is to make it accessible without moving large amounts of data. Typical dataset sizes can reach the terabyte range, while a researcher may want to view or access only a small subset of the entire dataset.
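The subset-access idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that an image stack is stored as one contiguous raw file of equally sized 2D slices and that the storage service honours HTTP Range requests. The function names and the file layout are hypothetical and are not taken from any of the platforms considered by this virtual team.

```python
def slice_byte_range(width, height, bytes_per_voxel, z_index, header_bytes=0):
    """Return the inclusive (first_byte, last_byte) range of one 2D slice.

    Assumes a hypothetical layout: an optional fixed-size header followed by
    slices stored back to back, each width * height * bytes_per_voxel bytes.
    """
    if z_index < 0:
        raise ValueError("z_index must be non-negative")
    slice_size = width * height * bytes_per_voxel
    first_byte = header_bytes + z_index * slice_size
    return first_byte, first_byte + slice_size - 1


def range_header(first_byte, last_byte):
    """Build the standard HTTP Range header for an inclusive byte range."""
    return {"Range": f"bytes={first_byte}-{last_byte}"}


if __name__ == "__main__":
    # Slice 3 of a stack of 1024 x 1024 slices with 16-bit voxels.
    first, last = slice_byte_range(1024, 1024, 2, z_index=3)
    print(range_header(first, last))
```

Under these assumptions, viewing one slice of a 1024 x 1024 x 4096 stack of 16-bit voxels requires transferring 2 MiB rather than the full 8 GiB file.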


Objectives

  • collect use cases and respective requirements for an "active repository platform" offering the capability to easily deploy an active repository that combines large data storage with a set of computational services (high throughput computing and cloud compute IaaS) for accessing and viewing large volume datasets
  • collect use cases for accessing and depositing data with PID identifiers
  • implement a distributed infrastructure offering different test environments for testing scalability of big data access in the EGI federated cloud/Grid infrastructure

Tasks

TASK1 (DONE) Invite TCB, OMB, competence centres and user communities to participate; identify infrastructure providers contributing resources to the testbed

TASK2 (DONE) Define a list of relevant use cases for scalable big data access requiring co-location of compute and data

  • Human Brain Project active repository testbed
  • life science dataset replication

TASK3 (IN PROGRESS) Performance testing in different test scenarios

  • END OF APRIL. Draft technical specifications of the use cases including performance requirements -> Lukasz team to produce a draft for revision with HBP and discussion in the virtual team.
  • By MAY 14 (PHASE 1)
    • Local data access cloud test of HBP use case 1. Requires setting up a testbed that meets the technical specifications.
    • Collect technical specifications for the life science data replication use case (Fotis' virtual team)
    • Discuss use case 2 with the Life Science Community and assess whether it can be easily gridified
  • JUNE: Distributed data access with data replication across (say) 3 cloud sites. Requires a brokering capability for dispatching workloads next to the data
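
The brokering requirement in the June milestone can be sketched as follows. This is only an illustration under stated assumptions: the replica catalogue, the free-slot counts, and the site and dataset names are hypothetical placeholders, and the selection rule (the replica-holding site with the most free slots) is one simple policy among many, not the approach chosen by the virtual team.

```python
def pick_site(dataset, replicas, free_slots):
    """Pick a site that holds a replica of `dataset` and has a free slot.

    Among eligible sites, the one with the most free slots wins; ties are
    broken alphabetically. Returns None if no eligible site exists.
    """
    candidates = sorted(
        site for site in replicas.get(dataset, ()) if free_slots.get(site, 0) > 0
    )
    if not candidates:
        return None
    # max() returns the first maximal element, so sorting gives a stable tie-break.
    return max(candidates, key=lambda site: free_slots[site])


def dispatch(jobs, replicas, free_slots):
    """Assign each (job_id, dataset) pair to a site next to its data."""
    slots = dict(free_slots)  # work on a copy so the caller's view is unchanged
    plan = {}
    for job_id, dataset in jobs:
        site = pick_site(dataset, replicas, slots)
        plan[job_id] = site
        if site is not None:
            slots[site] -= 1
    return plan


if __name__ == "__main__":
    # Hypothetical catalogue: one dataset replicated on 3 sites, one on a single site.
    replicas = {"stack-a": ["site1", "site2", "site3"], "stack-b": ["site3"]}
    free_slots = {"site1": 1, "site2": 2, "site3": 1}
    jobs = [("j1", "stack-a"), ("j2", "stack-b"), ("j3", "stack-b")]
    print(dispatch(jobs, replicas, free_slots))
```

A job whose dataset has no replica on a site with free capacity is left unassigned (None) rather than being sent to a remote site, which keeps the sketch faithful to the goal of avoiding bulk data transfer.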

Outcomes/Deliverables

  • May 2015. Distributed testbed, which can be incrementally developed with new technical solutions as needed by use cases
  • Dec 2015. Report on use cases and performance results

Members

  • Infrastructure providers
    • France. IN2P3-IRES/J. Pansanel
    • Germany. GWDG/P. Kasprzak
    • Germany. DESY/C. Bernardt, P. Fuhrmann
    • Greece. GRNET/K. Koumantaros
    • Italy. INFN Bari/ M. Antonacci, G. Donvito
    • Poland. CYFRONET/ L. Dutka
    • Spain. CESGA/C. Fernandez, J. Cacheiro
  • Technology providers, including members of the Technology Coordination Board
    • OneData/L. Dutka - CYFRONET
    • dynamic HTTP federation ("dynafed")/O. Keeble, F. Furano, CERN Data Management group
    • invenio/T. Smith - CERN
    • iRODS/J. Pansanel, E. Medernach - CNRS
    • dCache/C. Bernardt - DESY
    • GLOBUS/H. Heller - LRZ
  • Use cases
    • ELIXIR Data Replication use cases/F. Psomopoulos (AUTH)
    • Human Brain Project - Neuroscience/S. Hill, J. Muller (EPFL) (use case under discussion)

Resources

  • unfunded participation
  • NGIs and user communities contributing to EGI-Engage competence centres