Competence centre LifeWatch Workflows and VLabs

From EGIWiki
Revision as of 01:42, 31 March 2015 by Marcoj (talk | contribs)

Workflows & Virtual Laboratories

Objectives

The objective of this task is to integrate and, where necessary, develop within the EGI FedCloud framework the services required to support workflows oriented to the deployment of Virtual Labs for LifeWatch. A key objective for LifeWatch is to help researchers implement workflows that integrate different data sources and models, possibly executing on different resources in the e-infrastructure. Virtual Labs, based on these workflows, aim to provide an integrated framework for researchers, including access not only to multidisciplinary data catalogues but also to simulation, modelling, processing and analytic tools. This mini-project will set up the basis for deploying Virtual Labs for LifeWatch researchers using FedCloud resources, offering not only a wide range of integrated tools but also guidance and experience on how to implement an extensible framework.

Biodiversity analysis requires understanding how widely different species spread and which features differentiate individuals within a species, allowing them to survive under different conditions and environments. Knowing the ecological features that define suitable niches for the development of a species is key to understanding the risk posed by the introduction of alien species, to identifying new areas for the reintroduction of species, and to identifying species that would become endangered if climate conditions change. One well-known method is Ecological Niche Modelling, which requires layers describing the conditions of the environment (climate, geographic conditions, human interactions, etc.) and evidence of species occurrences to create predictive models. Creating models for high-resolution maps and a wide spectrum of species requires intensive computational resources, as demonstrated in the EUBrazilOpenBio project.
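To make the inputs of Ecological Niche Modelling concrete, the following is a minimal sketch of a rectangular environmental envelope: for each layer it records the range of values seen at the occurrence sites, then marks as suitable every cell whose values fall inside all ranges. This is a deliberately simplified stand-in and not the algorithm set shipped with openModeller; the function names, layer names and toy grid values are all hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]                   # (row, col) index into the raster grid
Layers = Dict[str, List[List[float]]]    # layer name -> 2D grid of values
Envelope = Dict[str, Tuple[float, float]]


def fit_envelope(layers: Layers, occurrences: List[Cell]) -> Envelope:
    """Return, per layer, the (min, max) of the values at the occurrence cells."""
    if not occurrences:
        raise ValueError("at least one occurrence is required")
    envelope = {}
    for name, grid in layers.items():
        values = [grid[r][c] for r, c in occurrences]
        envelope[name] = (min(values), max(values))
    return envelope


def predict(layers: Layers, envelope: Envelope) -> List[List[bool]]:
    """Mark a cell suitable when every layer value lies inside the envelope."""
    first = next(iter(layers.values()))
    rows, cols = len(first), len(first[0])
    return [
        [
            all(lo <= layers[name][r][c] <= hi
                for name, (lo, hi) in envelope.items())
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical toy data: two 2x3 environmental layers and two occurrences.
toy_layers = {
    "temperature": [[10.0, 14.0, 18.0], [12.0, 16.0, 22.0]],
    "rainfall": [[800.0, 600.0, 400.0], [700.0, 500.0, 200.0]],
}
toy_occurrences = [(0, 1), (1, 1)]

toy_envelope = fit_envelope(toy_layers, toy_occurrences)
suitability = predict(toy_layers, toy_envelope)
```

Every cell is evaluated independently of the others, which is why the cost grows with map resolution and number of species, and why this kind of workload splits naturally across many cloud instances.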
On the other hand, knowing the different strains and families of species and their genetic relations is very important for understanding the adaptation capabilities within a species, as well as the evolution of populations. Phylogenetic techniques create relation maps among different species, or among individuals of the same species, to understand how widely specific individuals or species can spread. Phylogenetics uses multiple alignment, Bayesian networks and Monte Carlo chains, among other computing-intensive techniques. The design and integration of data mobilisation and analysis for the organisation, analysis and dissemination of information on the interrelationships and interdependences between organisms will be a clear contribution towards the Network of Life.

This task will analyse the requirements for addressing both cases and for implementing a suitable environment to tackle these challenges. It will also address the integration of popular tools (openModeller, PHYLIP, MUSCLE, T-Coffee, etc.) and frameworks (such as Galaxy) so that they run efficiently on the EGI Federated Cloud.
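As an illustration of the kind of computation behind a phylogenetic relation map, the sketch below computes pairwise p-distances (the proportion of differing sites) between sequences that are assumed to be already aligned, for example by a tool such as MUSCLE. It is only the first step of a distance-based analysis and not the method of any specific tool named above; the sequence labels and sequences are hypothetical toy data.

```python
from itertools import combinations
from typing import Dict, Tuple


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(aligned: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Pairwise p-distances, keyed by the (sorted) pair of sequence labels."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(sorted(aligned), 2)
    }


# Hypothetical toy alignment of three individuals.
toy_alignment = {
    "ind_1": "ACGTACGT",
    "ind_2": "ACGTACGA",
    "ind_3": "ACGTTCGA",
}
toy_distances = distance_matrix(toy_alignment)
```

The number of pairs grows quadratically with the number of sequences, and each pair can be computed independently, which is one reason such analyses benefit from distributed resources.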

Description of work

Task 3.1 (UPV as JRU-LW-ES, INRA as NGI-FR): Integration of popular bioinformatics interfaces (Galaxy, other pipelines) on the EGI Federated Cloud in the SaaS model. This task has several subtasks:
- Adapt a Galaxy portal, in particular the instance from INRA in France, to run jobs on EGI, benefiting from the provisioning and deployment capabilities of the EGI Federated Cloud and executing a set of tools customized for different problems.
- Link the public part of INRA's numerical taxonomy database (R-Syst) so that reference data and tools are connected.
- Create a repository of configurations for addressing different biocomputing problems: phylogenetics, RNA-Seq, homology search, etc.
The task will enlarge the user community by seamlessly integrating the same tools that its members already use. The approach (on-the-fly configuration of Virtual Appliances) will also help to manage a wider range of configurations and to keep them up to date.

Task 3.2 (BSC, UPV as JRU-LW-ES, INFN as NGI-IT): An extensible framework for executing biodiversity pipelines on the EGI Federated Cloud. Currently, a prototype for running biodiversity workflows on the EGI Federated Cloud is available through the openModeller HTC service developed in the frame of EUBrazilOpenBio by BSC, UPV and CRIA. The service will be enriched to address the community requirements identified through the prototype:
- The current Niche Modelling Service developed by EUBrazilOpenBio is based on an openModeller workflow implemented with the COMPSs programming framework and available in the EGI AppDB. COMPSs provides scaling and elasticity features that adapt the number of available resources to the actual needs of the execution. BSC will extend the service to exploit new data management functionalities, overcoming the current limitations that arise when data has to be shared by instances deployed over different providers. Because its implementation is generic and easily extensible, the service will be used to execute other biodiversity pipelines identified during the task.
- INFN will optimize data management to provide the service with proper mechanisms for integrating both users' own data and reference data. Special focus will be given to large-scale data. New cloud storage technologies will be exploited so that the biodiversity tools can benefit from emerging technologies. This will also increase the portability of those applications over a distributed environment based on cloud resources. The solution will give applications seamless access to data and give users an easy framework for importing their data into the computational infrastructure. COMPSs will be adopted to develop the applications and to optimize their execution on the EGI Federated Cloud through automatic parallelization techniques.

Task 3 (CIBIO, CSIC as JRU-LW-ES): Implementation of the Network of Life. After an analysis of the standards, protocols and tools available within GBIF, the adaptations and extensions needed to support species relationship data will be defined. The storage and organization needs of geo-referenced information on species interactions, extracted from the primary literature, will be considered. The system implemented will be able to build networks of potential interactions based on the species that have been reported in a given area. Social network algorithms will be used to provide constraints on the network built by the researcher. The relationships should be connected with the environmental information layers, making it possible to predict changes due to climate and land use change, to estimate the impact of invasive species or the spread of agricultural and forest pests, and to test theory on the structure and dynamics of ecological networks.
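The idea of building a network of potential interactions from the species reported in an area can be sketched as follows, assuming the known interactions are available as simple (species, species, type) records. The record layout, the function name and the placeholder species labels are hypothetical and do not reflect any GBIF standard or the design of the eventual system.

```python
from typing import Dict, Iterable, Set, Tuple

Interaction = Tuple[str, str, str]   # (species_a, species_b, interaction_type)
Network = Dict[str, Set[Tuple[str, str]]]


def potential_network(known: Iterable[Interaction], present: Set[str]) -> Network:
    """Keep only the known interactions whose two species both occur in the area.

    Returns an undirected adjacency map: species -> {(partner, interaction_type)}.
    """
    network: Network = {species: set() for species in present}
    for a, b, kind in known:
        if a in present and b in present:
            network[a].add((b, kind))
            network[b].add((a, kind))
    return network


# Hypothetical interaction records, standing in for literature-derived data.
known_interactions = [
    ("pollinator_1", "plant_1", "pollination"),
    ("herbivore_1", "plant_1", "herbivory"),
    ("pollinator_2", "plant_2", "pollination"),
]
# Hypothetical list of species reported in the area of interest.
species_in_area = {"pollinator_1", "plant_1", "plant_2"}

area_network = potential_network(known_interactions, species_in_area)
```

In this toy case `plant_2` ends up with no partners because `pollinator_2` has not been reported in the area; network measures such as node degree could then be computed on the resulting adjacency map.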

Deliverables/milestones (brief description and month of delivery)

- D3.1 Analysis of existing biodiversity informatics standards, tools and protocols and needs to support species relationship data. Type: Report. Due: M3.
- D3.2 Galaxy portal successfully connected to EGI Federated Cloud. Type: Prototype. Due: M6.
- D3.3 Portfolio of configurations. A set of configurations and Virtual Appliances for different biocomputing problems. Type: Report+Other. Due: two releases, one in M12 and one in M18.
- D3.4 Report on the extensions of the framework to execute new biodiversity workflows through COMPSs. Type: Report+Prototype. Due: M24.
- D3.5 Report on the usage of the Biodiversity services. A report describing the impact on the user community of the developments in the task. Due: M24.