EGI-Engage:Data Plan
Latest revision as of 15:33, 14 April 2016
Help and support: quality@egi.eu
Last update: March 2016
This page describes the data management plan for the research data that will be generated within EGI-Engage. For each dataset, it describes the type of data and their origin, the related metadata standards, the approach to sharing and target groups, and the approach to archival and preservation.
Deliverable 2.4 Data management plan
This document will be further developed before the mid-term and final project reviews:
- February 2016
- February 2017
- August 2017
with more detailed information related to the discoverability, accessibility and exploitation of the data.
Rules
The Open Research Data Pilot applies to two types of data:
- the data, including associated metadata, needed to validate the results presented in scientific publications as soon as possible;
- other data (e.g. curated data not directly attributable to a publication, or raw data), including associated metadata.
The obligations arising from the Grant Agreement of the projects are (see article 29.3):
Regarding the digital research data generated in the action (‘data’), the beneficiaries must:
- deposit in a research data repository and take measures to make it possible for third parties to access, mine, exploit, reproduce and disseminate — free of charge for any user — the following: the data, including associated metadata, needed to validate the results presented in scientific publications as soon as possible; other data, including associated metadata, as specified and within the deadlines laid down in the 'data management plan';
- provide information — via the repository — about tools and instruments at the disposal of the beneficiaries and necessary for validating the results (and — where possible — provide the tools and instruments themselves).
Note: As an exception, the beneficiaries do not have to ensure open access to specific parts of their research data if the achievement of the action's main objective, as described in Annex 1, would be jeopardised by making those specific parts of the research data openly accessible. In this case, the data management plan must contain the reasons for not giving access.
Datasets
Task | Contact | Data description | Standards and metadata | Data sharing | Archiving and preservation |
---|---|---|---|---|---|
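SA2.1/SA2.2 | Gergely Sipos (gergely.sipos@egi.eu) | | The data are not in any standard format. Survey data are textual data, structured data (typically CSV or XLS) or graphics (usually a survey summary or analysis). | Target groups: technology provider, service developer and service provider teams that contribute to the EGI service portfolio.<br>Scientific impact: used for the further development of IT services offered by the EGI Community. These services are often the result of technological R&D and the subject of publications in conference proceedings and peer-reviewed journals.<br>Approach to sharing: a public version of the collected requirements will be shared in the EGI-Engage milestones and deliverables. Some of the most important documents in this respect will be: M6.5 Joint training program for the second period (M15, May 2016); intermediate and annual project reports (every 6 months); first milestones and deliverables of the ELIXIR, EPOS and BBMRI competence centres. | Based on the nature of the data, these can be:<br>- Google Drive of EGI.eu<br>- EGI Document Database (visibility can be restricted to certain user groups): https://documents.egi.eu/public/DocumentDatabase<br>- EGI Wiki or public documents (typically for derived and analysed user survey data): https://wiki.egi.eu/wiki/EGI-Engage<br>- Tickets in RT, typically in the technical-support-cases queue |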
SA2.3 | Kimmo Mattila (kimmo.mattila@csc.fi) | The use cases of the ELIXIR Competence Centre, as well as ELIXIR as an infrastructure, manage life science data produced by life scientists. The description below refers to that data. The ELIXIR Competence Centre itself does not create new scientific data and therefore does not actively carry out scientific data management.<br>Types of data: life science data; the management of genomics data: marine metagenomics, plant genomics and phenotype, and human sensitive data.<br>Origin of data: produced and submitted by scientists. ELIXIR repositories collect, integrate and provide access to the data. | Some standards, such as the standard formats in the marine or plant domains, are still under development. Some of the standards for capturing and exchanging genomic data that might be used in the use cases are described in BioSharing [R3]. Part of the data may be stored in public data repositories (e.g. ENA) that have clearly defined metadata models. | | Services for archiving and preservation within ELIXIR are listed at https://www.elixir-europe.org/services. |
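SA2.4 | Petr Holub (petr.holub@bbmri-eric.eu) | Deals primarily with human-related data, most of which can be considered privacy-sensitive.<br>Types of data: privacy-sensitive human data, limited data sets of non-human data.<br>Origin of data: health care (for patients participating in research), biobanks (collection of information for non-patient research participants, as well as analysis of samples), scientists (analysis of samples and data, and return of the resulting data to the biobanks).<br>Scale of data: depends on the specific data types collected in the given setting, but various omics data and large 2D/3D imaging data can be on the order of more than 1 TB per research participant. | Standards are related to the origin of data: for clinical biobanks, they follow the healthcare standards of the given country (not homogeneous across Europe); for population biobanks, they are subject to various ongoing efforts to make data structures more standardized and interoperable (e.g., ISO TC276 Working Group 5). | Target groups: researchers in medical and biomedical research.<br>Scientific impact: health care improvement, public health improvement.<br>Approach to sharing: FAIR (findable, accessible, interoperable, reusable) access, while complying with regulatory frameworks related to privacy-sensitive data. | Part of the core business of biobanks and hence of BBMRI-ERIC. |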
SA2.5 | Alexandre Bonvin (a.m.j.j.bonvin@uu.nl) | | The end results are typically deposited into public databases like the PDB or EMDB for cryo-EM data. | | From a university perspective, data are to be kept for 10 years. Currently, there is no proper archiving mechanism in place at the particular site (Utrecht University). At the moment, policies and services rely on what is provided by the database service providers where data are deposited. |
SA2.6 | Davor Davidović (davor.davidovic@irb.hr) | | The community does not promote any specific metadata standard. The adopted metadata formats vary from case to case. Also, there is no recommendation about any long-term preservation format and thus no domain-specific data format is used or recommended. Thus, an individual approach for each use case is required. | | The implementation of the repositories, safety guarantees, number of copies, etc. is the responsibility of the individual data/repository providers. The plan is to implement several digital repositories for specific DARIAH use cases (e.g. Bavarian dialects) using the gLibrary framework, which allows storing the data on different storage elements (local, grid and cloud storage elements). |
SA2.7 | Jesus Marco de Lucas (marco@ifca.unican.es) | | For the data gathered in the water reservoir we will mainly use text-based formats such as CSV. However, other data products will be generated from the results of models run over these input data, and formats such as NetCDF or HDF5 will be used for them. For sharing these datasets, the metadata standard used will be WaterML, an OGC (Open Geospatial Consortium) standard information model for the representation of water observations data. To guarantee reproducibility of the experiments, the goal is to set up an ontology, including WaterML attributes, that establishes relationships between the different components of the case study: instrumentation, datasets, software, models, etc. | Target groups: the data can be of interest to other research teams that perform similar analyses at other water reservoirs.<br>Scientific impact: the data can potentially underpin scientific publications. | Copies are kept on WORM tapes and on a separate server (400 km away) of the company. The main repository uses RAID technology and has not lost any data in the last 10 years. The data are automatically synchronised across the servers. |
SA2.8 | Ingemar Häggström (ingemar.haggstrom@eiscat.se) | | A mixture of standards is adopted depending on the data type. For long-term preservation, the HDF5 format will be used. | | Data are stored on a few e-Infrastructures, mirrored and synchronised. There are two levels of storage: a large short-term one and a reduced long-term one. |
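SA2.9 | Daniele Bailo (daniele.bailo@ingv.it) | Types of data: seismological waveforms.<br>Origin of data: world-wide data archives providing FDSN interfaces for raw and earthquake parametric data (EIDA, IRIS, USGS, NCEC), and synthetics produced and post-processed on HPC and local resources.<br>Scale of data: real streams from FDSN, including IRIS (US), are pre-staged on demand. These are limited only by specific authorisation policies and mechanisms, which are not yet fully handled in the current system. The size of pre-staged raw data and processed synthetics therefore depends on usage. | FDSN standards for disseminating data. Data formats: SEED and mSEED. Also VTK, W3C-PROV, QuakeML, mp4 and KML, each for the specific purpose the standard was created for. | Target groups: mainly environmental researchers.<br>Scientific impact: this research data can underpin scientific publications and the development of a pan-European computational earth science e-Infrastructure.<br>Approach to sharing: produced data can be shared according to users' requirements. Authorisation policies can be configured by users via the generic iRODS authorisation system and its GUI client. The default setting treats all data as private. Metadata, provenance and lineage are currently publicly available. | Synthetic and pre-staged raw data are handled and stored within a local federation and data management system, which preserves lineage and provenance information. Original raw data are stored across the nodes of the FDSN service providers. |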
SA2.10 | Eric Yen (Eric.Yen@twgrid.org) | | The ISO 19156 standard for the Observations and Measurements data model was selected. For weather and climate data, the centre will also comply with the Climate and Forecast (CF) convention (e.g. NetCDF). Both of these specifications are included in the new metadata model called the ADAGUC data format standard. | | The data will be organised and managed in a repository over the distributed infrastructure. The CC plans to have no fewer than three copies of the data set at different sites. Academia Sinica (Taiwan) is in charge of long-term data preservation. |