VT Genome analysis and protein folding support

General Project Information

Leader: Afonso Duarte (ITQB-UNL, Portugal)
Meetings: via teleconference
Mailing List: vt-gapf(_at_)mailman.egi.eu
Status: Implementation
Start Date: 24-05-2014
End Date: 20-05-2015
Meeting dates:

Implementation phase: 22nd July 2014; 17th September 2014; 8th October 2014; final face-to-face VT project meeting during the EGI Conference in Lisbon, 20th May 2015 at 13:30.

Scoping phase: 22nd January 2014; 5th February 2014; 20th February 2014; 12th March 2014; 27th March 2014.

Project Initiation document
Minutes and Presentations of the GAPF meetings: https://documents.egi.eu/document/2149
Project final report

Motivation

The communities of researchers working with applications in Protein Structural Biology and Sequence Analysis (Protein/DNA/RNA) are growing within the EU research map. This growth is evident in the several FP7 projects that have been set up and in the increasing number of national and European infrastructures in these fields (e.g. national ELIXIR nodes and the PATHSEEK, NGS-PTL and RESPONSIFY projects).

Within the European Grid Infrastructure, several VRCs (like WeNMR and LSGC) and VOs (like enmr.eu, biomed, …) have been set up to tackle the computational needs of these fields of science. However, the information on how new users can approach and use such applications and tools is dispersed and hard to find for users who work in the Biological Sciences and are inexperienced with large-scale distributed computing systems.

The goal of this VT is to bridge the gap between end users and the panoply of powerful tools, applications, workflows and knowledge existing within EGI. This will be done via a) the setup of new training activities and outreach documents based on existing applications and b) the identification and integration of new tools and applications in EGI that can attract new users to the infrastructure by advancing science. This VT also aims to trigger the setup of new knowledge networks between users and EGI, thereby fostering future collaboration in research projects (e.g. future Horizon 2020 proposals).

Objectives

The beneficiaries of the GAPF VT are the Protein Structural Biology and Sequencing (Protein/DNA/RNA) communities.
The main objective of the VT is to increase awareness within these communities of the existing EGI services and applications that fall within their area of expertise, and to make EGI more attractive to researchers in these fields through further development of the e-infrastructure.
The following supporting aims will help to achieve this main objective.

1. Identify tools available in the EGI e-infrastructure that are relevant to the VT's target community.

2. Identify reusable tools and scientific applications that are relevant to the VT's target community but not yet supported by EGI, and make these available on the EGI production infrastructure.

3. Develop outreach materials to disseminate relevant applications to the target community.

4. Identify synergies (knowledge networks) among users in order to improve the experience of using EGI and increase the number of users.

5. Organize training and promotional sessions to disseminate the services, tools and applications to potential users.

Tasks

T1 - Identify relevant applications, within the scope of the VT, that are already available on the EGI e-infrastructure;

T2 - Identify relevant tools and applications, within the scope of the VT, that are not yet available on the EGI e-infrastructure but would benefit from integration;

T3 - Promote the outcome of the VT to the target communities.

Members

EGI.eu

Gergely Sipos, EGI.eu, Netherlands
Nuno Ferreira, EGI.eu, Netherlands
Neasan O'Neill, EGI.eu, Netherlands

Portugal

Afonso Duarte, EGI Champion
João Pina, LIP, IberGrid

Greece

Fotis E. Psomopoulos, EGI Champion
Kostas Koumantaros, GRNET

Italy

Daniele Cesini, INFN
Alessandro Costantini, INFN

France

Johan Montagnat, CNRS, LSGC
Tiphaine Martin, KCL

Spain

Jesus Marco de Lucas, CSIC
Beatriz Ranz Ribeiro, CSIC
Ignacio Blanquer, UPVLC

Finland

Kimmo Mattila, CSC

Germany

Konrad Förstner, IMIB

UK

Rafael Jimenez, ELIXIR

The Netherlands

Alexandre Bonvin, WeNMR

Others? You! (If you are interested in this Virtual Team, contact us: aduarte(+at+)itqb.unl.pt)

Resources

NGI-Community engagement table: https://documents.egi.eu/document/2074.

Progress

List of tools used/required by the sequencing and protein folding communities (see table below).

To add/change entries please use: https://docs.google.com/spreadsheet/ccc?key=0Ama69JoAAogvdHFzQi1UamwxN0MtVS1GUEV4ZmVGWXc&usp=sharing. If you don't have permission to do so please let us know!

Presentations of the different NGIs with information on use cases: https://drive.google.com/folderview?id=0B2a69JoAAogvWUw2UG9KTWpydFE&usp=sharing.

Presentation of the GAPF VT objectives and aims at the EGI Community Forum, Helsinki, 22 May 2014: "Support for genome analysis and protein folding within the e-infrastructure" (https://indico.egi.eu/indico/contributionDisplay.py?sessionId=38&contribId=151&confId=1994).

Output: Community Use Cases

READemption

READemption is a pipeline for the computational evaluation of RNA-Seq data. The use case consists of running the analysis workflow on the EGI Cloud Federation. (Concluded use case)

Konrad Förstner (University of Würzburg, Germany)

Application: http://pythonhosted.org/READemption/

A webinar on READemption will take place on 27th November 2014 (access here: https://indico.egi.eu/indico/conferenceDisplay.py?confId=2345).

The webinar will be recorded and the video will subsequently be made available via the EGI YouTube channel: https://www.youtube.com/user/EuropeanGrid.

TRUFA

TRUFA (Transcriptomes User-Friendly Analysis) is a web server designed to help genomics researchers perform de novo RNA-seq analysis. The goal is to exploit Cloud Federation resources from the TRUFA portal. (Ongoing use case)

Jesus Marco de Lucas (Instituto de Fisica de Cantabria, Spain)

Application: https://trufa.ifca.es/web

FedCloud Wiki: https://wiki.egi.eu/wiki/FedCloudTRUFA

Chipster

Chipster is user-friendly analysis software for high-throughput data. It contains over 300 analysis tools for next generation sequencing (NGS), microarray, proteomics and sequence data. Chipster's client software uses Java Web Start to install itself automatically, and it connects to computing servers for the actual analysis. Chipster is open source and the server environment is available as a virtual machine image. (Ongoing use case)

Application: http://chipster.csc.fi/

FedCloud Wiki: https://wiki.egi.eu/wiki/FedCloudChipster

RSAT

RSAT provides a series of modular computer programs specifically designed for the detection of regulatory signals in non-coding sequences. (Ongoing use case)

Application: http://rsat.ulb.ac.be/

FedCloud Wiki: https://wiki.egi.eu/wiki/FedCloudRSAT

List of tools used/required by the sequencing and protein folding communities

This list was contributed by the different NGIs, users, developers and others.

If you want to add or change entries, please use the spreadsheet (https://docs.google.com/spreadsheet/ccc?key=0Ama69JoAAogvdHFzQi1UamwxN0MtVS1GUEV4ZmVGWXc&usp=sharing).

If you don't have permission to do so, please contact us!

P.S.: This table is a work in progress.

Name	Type of tool (application, workflow, ...)	Target community(ies)	Description	Available in AppDB?	AppDB URL	Available online?	Website	Developer	Open source

GROMACS	Application	Protein structure, protein folding and protein dynamics	Performs molecular dynamics, i.e. simulates the Newtonian equations of motion for systems with hundreds to millions of particles. GROMACS can work with different biochemical molecules (e.g. proteins, lipids and nucleic acids).	yes	https://appdb.egi.eu/store/software/gromacs.wenmr	yes	http://www.gromacs.org/	http://www.gromacs.org/	yes
BLAST	Application	Bioinformatics, General Application, Basic Local Sequence Alignment	BLAST is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.	yes	https://appdb.egi.eu/store/software/blast
BiG (BLAST in Grids)	Application	Bioinformatics, General Application, Basic Local Sequence Alignment	The objective of this project is to launch BLAST processes on the Grid infrastructure provided by the e-Science network. More precisely, it compares by BLAST all sequences of prokaryotes, fungi, plants and animals. The results are then analyzed using scripts to estimate the degree of horizontal gene transfer between prokaryotes and plants. To do so, proteins that are found in prokaryotes and plants but not in animals are identified, using different similarity thresholds to assess to what extent the similarity between two proteins implies a common origin.	yes	https://appdb.egi.eu/store/software/big.blast.in.grids
BWA Burrows-Wheeler Aligner	Application	Bioinformatics, General Application, Next Generation Sequence Assembling	Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome. It implements two algorithms, bwa-short and BWA-SW. The former works for query sequences shorter than 200bp and the latter for longer sequences up to around 100kbp. Both algorithms do gapped alignment. They are usually more accurate and faster on queries with low error rates.	yes	https://appdb.egi.eu/store/software/bwa.burrows.wheeler.aligner
VELVET	Application	Bioinformatics, General Application, Next Generation Sequence Assembling	Sequence assembler for very short reads. Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454, developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, in the United Kingdom. Velvet currently takes in short read sequences, removes errors, then produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.	yes	https://appdb.egi.eu/store/software/velvet
SOAP-denovo	Application	Bioinformatics, General Application, Next Generation Sequence Assembling	SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for the human-sized genomes. The program is specially designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost effective way.	yes	https://appdb.egi.eu/store/software/soap.denovo
ClustalW-MPI	Application	Bioinformatics, General Application, Multiple Sequence Alignment	Parallel implementation of ClustalW. ClustalW is a tool for aligning multiple protein or nucleotide sequences. The alignment is achieved via three steps: pairwise alignment, guide-tree generation and progressive alignment. ClustalW-MPI is a distributed and parallel implementation of ClustalW. All three steps have been parallelized to reduce the execution time.	yes	https://appdb.egi.eu/store/software/clustalw.mpi
CLUSTALW	Application	Bioinformatics, General Application, Multiple Sequence Alignment	ClustalW is a program to perform multiple alignment of nucleic acid and protein sequences	yes	https://appdb.egi.eu/store/software/clustalw.mpi
MAFFT	Application	Bioinformatics, General Application, Multiple Sequence Alignment	Multiple Alignment using Fast Fourier Transform. MAFFT is a multiple alignment program for amino acid or nucleotide sequences.	yes	https://appdb.egi.eu/store/software/mafft
MUSCLE	Application	Bioinformatics, General Application, Multiple Sequence Alignment	MUltiple Sequence Comparison by Log-Expectation. MUSCLE is public domain multiple alignment software for protein and nucleotide sequences.	yes	https://appdb.egi.eu/store/software/muscle
HMMER	Application	Bioinformatics, General Application, Multiple Sequence Alignment	Profile hidden Markov models (profile HMMs) can be used to do sensitive database searching using statistical descriptions of a sequence family’s consensus. HMMER is a freely distributable implementation of profile HMM software for protein sequence analysis.	yes	https://appdb.egi.eu/store/software/hmmer
MrBayes	Application	Bioinformatics, General Application, Phylogenetics	A program for the Bayesian estimation of phylogeny. The reconstruction of the evolutionary history of a group of organisms (phylogeny) is used throughout the life sciences, as it offers a structure around which to organize the knowledge and data accumulated by researchers. The inference of phylogenies with computational methods is widely used in medical and biological research and has many important applications, such as gene prediction, drug discovery and conservation biology. The most commonly used methods to infer phylogenies include cladistics, phenetics, maximum likelihood and Markov Chain Monte Carlo based Bayesian inference. The last two depend upon a mathematical model describing the evolution of the characters observed in the species included, and are usually used for molecular phylogeny, where the characters are aligned nucleotide or amino acid sequences. Due to the nature of Bayesian inference, the simulation can be prone to entrapment in local optima. To overcome this problem and achieve better estimation, the MrBayes program has to run for millions of iterations (generations), which requires a large amount of computation time. For multiple sessions with different models or parameters, it takes a long time before the results can be analyzed and summarized. Since phylogenetic tools are widely used by the bioinformatics community, a Grid service for the parallelised version of the MrBayes application has been deployed to allow bioinformaticians to perform phylogenetic studies on a large scale.	yes	https://appdb.egi.eu/store/software/mrbayes
CD-HIT-Grid	Application	Bioinformatics, General Application, Sequence Analysis	Protein clustering on the Grid with CD-HIT. CD-HIT performs protein clustering on a protein or genome sequence database. This consists of removing redundant sequences at a given sequence similarity level and generating a new database with the representatives only. As protein and genome databases grow day after day, clustering interesting datasets on a single machine is not feasible due to memory constraints. A Grid environment allows an adaptive database distribution in order to optimize the overall analysis. This activity was proposed by CNIO (Spanish National Cancer Research Centre) and started in the context of the BioGridNet Program.	yes	https://appdb.egi.eu/store/software/cd.hit.grid
InterProScan	Application	Bioinformatics, General Application, Sequence Analysis	InterProScan is a tool that combines several types of analysis in order to assign one or more functional signatures to a particular protein. It is implemented as a wrapper for some applications that can be executed simultaneously and combined in a computational analysis. The application was ported within the framework of the FP7 EDGeS project.	yes	https://appdb.egi.eu/store/software/interproscan
Grid Bio Portal (GISELA)	Platform	Bioinformatics, Sequence Analysis	Grid Portal of Bioinformatics Applications. The application consists of a Grid Portal in which different bioinformatics applications can be deployed. Different Grid services implement basic tools such as multiple alignments, phylogenetic inference, etc., which run on a Grid infrastructure, along with more complex compound workflows that implement widely used analysis sequences. The portal will provide authentication, load balancing, session management, workflow, user interface, reliability, fault tolerance, data management and accounting. This software layer will increase the productivity of the software production cycle and will offer a uniform framework to host Grid services. The portal will run on a computer equipped with a Grid User Interface, acting as a bridge between the Web and the Grid sides. It makes use of myproxy servers and all the other global Grid Services (data catalogue, resource brokering, file transfer, etc.).	yes	https://appdb.egi.eu/store/software/grid.bio.portal.gisela
EMBOSS_in_JST	Application, Platform	Bioinformatics, Sequence Analysis	The European Molecular Biology Open Software Suite in the JST framework. EMBOSS, “The European Molecular Biology Open Software Suite”, is a free Open Source software analysis package well established in the world-wide bioinformatics community. The tool has been adapted to be executed on the EGEE grid infrastructure within the JST framework in order to perform large scale analyses. Compared to other ports of the EMBOSS package to the Grid environment, such as GrEMBOSS (http://cimi.ccg.unam.mx/ccg-OrganicG/en/GrEMBOSS), a gridified version of EMBOSS developed inside the EELA project, EMBOSS_in_JST appears to perform well in the management of large data flows. The EMBOSS_in_JST approach was validated on a case study on viroids. These are circular RNAs that infect plants. They show compact secondary structures and are unable to code for any protein. The infectivity of these RNAs relies exclusively on their ability to interact directly with host factors (proteins and/or RNAs) and to redirect cellular machinery and biosynthetic pathways for their replication and spread in the host. Viroids likely accomplish this by mimicking some structural property of host RNAs. Therefore, viroid RNAs may unveil structural motifs with functional properties that are also contained in cellular RNAs. Bioinformatics approaches in viroid research are impaired by the fact that the complete genome of most natural viroid hosts is still unknown. To overcome this difficulty we decided to run a secondary structure analysis on sub-sequences of the whole plant sequence data set available in EMBL. As a first test, we analysed 231’000 intron regions for the secondary structure of interest by using the vrnalfold algorithm (search for local folding patterns) from the EMBOSS/EMBASSY package.	yes	https://appdb.egi.eu/store/software/emboss.in.jst
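
The CD-HIT-Grid entry in the table above describes clustering as removing redundant sequences at a given similarity level and keeping only the representatives. A minimal Python sketch of that idea follows. It is an illustration only, not CD-HIT's implementation: the function names greedy_cluster and similarity are hypothetical, and the similarity measure is a crude stand-in from the Python standard library rather than the sequence identity CD-HIT computes.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity (assumption: not CD-HIT's measure)."""
    return SequenceMatcher(None, a, b).ratio()


def greedy_cluster(sequences, threshold=0.9):
    """Return {representative_id: [member_ids]} for a dict of id -> sequence.

    Sequences are visited longest first. Each one joins the first existing
    cluster whose representative is at least `threshold` similar to it;
    otherwise it becomes the representative of a new cluster.
    """
    clusters = {}
    by_length = sorted(sequences, key=lambda k: len(sequences[k]), reverse=True)
    for seq_id in by_length:
        for rep_id in clusters:
            if similarity(sequences[seq_id], sequences[rep_id]) >= threshold:
                clusters[rep_id].append(seq_id)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[seq_id] = [seq_id]
    return clusters


if __name__ == "__main__":
    # Toy sequences, invented for illustration.
    toy = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFSR",
        "c": "GGGGGGGGGGCCCCCCCCCC",
    }
    for rep, members in greedy_cluster(toy, threshold=0.9).items():
        print(rep, members)
```

The keys of the returned dictionary form the reduced, non-redundant database. On a Grid, the comparison of each sequence against the representatives is the part that would be distributed, since the full database does not fit in the memory of a single machine.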
