The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

AgINFRA

From EGIWiki
Revision as of 18:03, 30 July 2015 by Ychen (talk | contribs) (→‎Requirements for EGI Testbed Establishments)


Community Information

Community Name

EGI Federated Cloud services for the agri-food research

Community Short Name

agINFRA

Community Website

http://www.aginfra.eu

Community Description

The agINFRA project, supported by the Agriculture Information Management Standards of the Food and Agriculture Organization of the United Nations (AIMS FAO) and the CIARD global initiative, introduces a set of recommendations applying to agri-food research community for data management, sharing and dissemination. Additionally, these recommendations aim to provide a framework for the research community of European agri-food research institutions that need to follow the H2020 Open Access mandate and share their metadata with their thematic aggregator in order to publish them in OpenAire. (from www.aginfra.eu)

Community Objectives

agINFRA aims to function as the thematic aggregator of the agri-food research domain and act as the main research community for OpenAire.

Main Contact Institutions

Agro-Know, FAO

Main Contact

  • Effie Tsiflidou, effie@agroknow.gr
  • Nikolaos Marianos, n.marianos@agroknow.gr

Prior requirement capture activities

  • agINFRA D2.2 Revised stakeholders needs deliverable: http://www.aginfra.eu/project/images/DELIVERABLES/aginfra_d2.2_revised-review-of-stakeholder-needs_final_20131025.pdf
  • agINFRA D5.5 Report on agricultural data sources/repositories integration

Science Viewpoint

Scientific Challenges

  • High volume storage
  • Impossible to use centralized storage
  • Large, live, constantly updated data streams
  • Handling of heterogeneous data

Objectives

  • Raw data resources with agricultural data must be publicly available, using a unified search and discovery platform
  • Making such resources more broadly discoverable by humans and machines by registering them in shared public directories and providing all the technical information that allows applications to process those data
  • Reach out to entrepreneurs who can put their data to work in new services
  • Invite commercial entities into the conversation around the future of data

KPI inputs

  • Access: Increased access and usage of e-Infrastructures by scientific communities, simplifying the “embracing” of e-Science. Number of users of the web portals: 10000 monthly. Number of sites providing the services: 20.
  • Visibility: Visibility of the project among scientists, technology providers and resource managers at high level. Number of portal cloud installations/usage: 4.

User Stories

Use cases taken from agINFRA public deliverable D1.3.3 agINFRA Scientific Vision: Part A

  • Data provider who needs to host and store a small-scale CMS

In this case, the data provider asks the system to set up their own CMS instance to cover the needs of a small-scale CMS, e.g. Open Educational Resources (http://www.oercommons.org/), which provides access to hundreds of course-related materials and collections on several themes.

  • Data provider who needs to host and store a large-scale hosting & replication CMS

In this case, the data provider asks the system to allocate space or to set up accounts in a large-scale CMS, e.g. Consiglio per la Ricerca e la Sperimentazione in Agricoltura - CRA (http://sito.entecra.it/portale/index2.php), which includes thousands of data sources in several research fields in agriculture and related domains.

  • Data provider who needs to host a CMS on their own or on an external / commercial infrastructure

In this case, the content provider is interested in exposing (meta)data to the e-infrastructure, e.g. Turkish Agricultural Learning Objects Repository - TrAgLOR (http://traglor.cu.edu.tr/), which serves as an organized collection of learning objects, stored on servers and delivered through networks.

Information Viewpoint

Data

Data Object types

Germplasm data

Data size

~ 10KB

Data collection size

~ 1PB

Data format

XML

Standards in use

MCPD (for Germplasm data)

Data management plan

  • agINFRA collects freely accessible data in order to make it publicly available
  • agINFRA should ensure long-term preservation

Privacy policy

  • publicly available, free of access

Metadata

Metadata object types

  • AGRIS bibliographic information: metadata for publications (scientific articles, theses, dissertations, journals)
  • GLN metadata for educational resources.
  • VocBench instances
  • VEST Registry
  • CIARD RING

Metadata Identifiers

ARN

Metadata Size

~10KB

Metadata format

RDF, OWL, XML

Standards in use

RDF, OWL, SKOS, OAI-PMH
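
Of the standards listed above, OAI-PMH is the one used for harvesting: a harvester issues plain HTTP GET requests carrying a verb and its arguments. The sketch below only builds the request URLs for a ListRecords harvest. The class name OaiPmhRequest, the base URL http://example.org/oai and the oai_dc metadata prefix are assumptions made for illustration; the source does not describe agINFRA's actual harvester.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds OAI-PMH ListRecords request URLs (hypothetical helper, not agINFRA code). */
final class OaiPmhRequest {

    /** First request of a harvest: verb=ListRecords plus the wanted metadataPrefix. */
    static String listRecords(String baseUrl, String metadataPrefix) {
        return baseUrl + "?verb=ListRecords&metadataPrefix=" + encode(metadataPrefix);
    }

    /**
     * Follow-up request for the next page of a large result set.
     * In OAI-PMH, resumptionToken is an exclusive argument, so
     * metadataPrefix is not repeated here.
     */
    static String resume(String baseUrl, String resumptionToken) {
        return baseUrl + "?verb=ListRecords&resumptionToken=" + encode(resumptionToken);
    }

    /** URL-encodes a query argument value as UTF-8. */
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Example endpoint and token are placeholders.
        System.out.println(listRecords("http://example.org/oai", "oai_dc"));
        System.out.println(resume("http://example.org/oai", "token 1/2"));
    }
}
```

A harvester would call listRecords once, then keep calling resume with each resumptionToken returned by the repository until the response no longer contains one.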

Metadata generation

Custom Java code based on XML transformations
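
As an illustration of what an XML transformation step can look like in Java, the sketch below applies an XSLT stylesheet with the JDK's standard javax.xml.transform API. The stylesheet, the element names (record, name, metadata, title) and the class MetadataTransformer are hypothetical; the source does not say whether agINFRA uses XSLT or hand-written mappings.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Maps a provider-specific XML record to a common schema (illustrative only). */
final class MetadataTransformer {

    // Hypothetical mapping: <record><name> becomes <metadata><title>.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/record'>"
        + "<metadata><title><xsl:value-of select='name'/></title></metadata>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies STYLESHEET to the given XML and returns the result as a string. */
    static String transform(String sourceXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            // Leave out the <?xml ...?> header so the output is just the element.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(sourceXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<record><name>Wheat trial</name></record>"));
    }
}
```

Keeping the mapping in a stylesheet rather than in Java code means a new provider schema only needs a new stylesheet, which is one plausible reason to base such code on XML transformations.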

Other aspects

Triple store with RDF files in order to preserve linked open data

Data Lifecycle

  • Data acquisition level (including manually sent raw XML files or harvesting via protocols like OAI-PMH)
  • Metadata records evaluation and mappings
  • Data transformation
  • Data identification – deduplication
  • Data triplification (XML to RDF)
  • Upload of the RDF files to the AllegroGraph triple store
  • Data indexing
  • Data publishing to the AGRIS portal, plus an FTP server offering the XML records and RDF files
  • Data curation
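
Two of the lifecycle steps above, deduplication and triplification (XML to RDF), can be sketched together with a SAX parser. This is a minimal illustration under stated assumptions, not the project's code: the input layout (record elements with an id attribute and a title child), the base IRI http://example.org/record/ and the class name Triplifier are hypothetical, record ids are assumed to be safe to embed in an IRI, and only the Dublin Core title property is emitted, as N-Triples lines.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Streams XML records, drops repeated ids and emits N-Triples (illustrative only). */
final class Triplifier extends DefaultHandler {

    static final String BASE = "http://example.org/record/";          // hypothetical base IRI
    static final String DC_TITLE = "http://purl.org/dc/terms/title";  // Dublin Core title

    private final Set<String> seenIds = new HashSet<String>();
    private final List<String> triples = new ArrayList<String>();
    private final StringBuilder text = new StringBuilder();
    private String currentId; // null outside a record or inside a duplicate

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("record".equals(qName)) {
            String id = atts.getValue("id");
            // Set.add returns false when the id was already seen: skip that record.
            currentId = (id != null && seenIds.add(id)) ? id : null;
        }
        text.setLength(0); // start collecting character data for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName) && currentId != null) {
            triples.add("<" + BASE + currentId + "> <" + DC_TITLE + "> \""
                + escape(text.toString().trim()) + "\" .");
        } else if ("record".equals(qName)) {
            currentId = null;
        }
    }

    /** Escapes the characters that N-Triples string literals require escaping. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r");
    }

    /** Parses the XML and returns one N-Triples line per unique titled record. */
    static List<String> triplify(String xml) {
        try {
            Triplifier handler = new Triplifier();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.triples;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse records", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<records>"
            + "<record id='r1'><title>Soil map</title></record>"
            + "<record id='r1'><title>Soil map (copy)</title></record>"
            + "<record id='r2'><title>Wheat</title></record>"
            + "</records>";
        for (String triple : triplify(xml)) {
            System.out.println(triple);
        }
    }
}
```

SAX is used here because it streams the input instead of loading whole files into memory, which suits large harvested batches. Real deduplication would likely compare more than an identifier; matching on id alone is a simplification.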

Technology Viewpoint

System Architecture

agINFRA Architecture

In the context of the agINFRA project, there are a number of data providers providing access to different data types, such as educational, bibliographic, germplasm, statistical, soil maps, cultural and other. The aggregation of metadata from these data sources, which use different metadata schemas in order to meet the specific requirements of each data type, would traditionally be carried out by individually transforming and then harvesting each data source. This approach would be most appropriate for serving the data integration as well as other services deployed by the agINFRA project. A more state-of-the-art methodology should apply the current advances in the context of the Semantic Web, including the publication of all available data as linked and open data. The first step in this process would be the development of a metadata model for each resource type, which would accommodate the most common and / or essential elements of the metadata schemas used in agINFRA by the data providers. (from agINFRA D5.4 public deliverable)

agINFRA infrastructure pays special attention in topics like the efficient metadata management (checking for mappings and transformation of the targeted metadata schemas to a common schema), storage issues for hosting data components and scaling up the handled metadata aggregations and their versions, computing issues in terms of time and resources that are needed for harvesting and often recurring for the coverage similar workflows that are needed (for validation, transformation, harvesting, auto-tagging and indexing). (from agINFRA D1.3.3 public deliverable)

Community data access protocols

web interface & FTP

Data management technology

Custom

Data access control

POSIX

Public data access protocol

HTTP

Public authentication mechanism

anonymous access

Computing capacities

  • CPU: 3500 CPUs
  • GPU: no
  • RAM: 4GB
  • Storage: 30GB
  • e-Infrastructure: Cloud
  • Client: Desktop, laptop, mobile device

Software and applications in use

Software/ applications/services

  • Software name: Apache Tomcat, Solr, custom Java code
  • Software licensing: open source
  • Configuration: the web app is packaged as a Java WAR application and runs on Tomcat 6
  • Dependencies needed to run the application, indicating origin and requirements: cloud infrastructure, OpenJDK 6, Apache Tomcat 6

Operating system

CentOS 5 Linux

Runtime libraries/APIs

Java, SAX parser, Solr 1.4

Typical processing time

15h

e-Infrastructure in use

EGI, GEANT through the GRNET cloud (okeanos and ViMa)

Requirements for EGI Testbed Establishments

  • Does the case include preferences on specific tools and technologies to use? Cloud infrastructure, such as virtual machine instances
  • Does the user have preferences on specific resource providers? No
  • Approximately how much compute and storage capacity is needed, and for how long? 2.2 GHz, long-term preservation, 100 GB
  • Does the user need access to an existing allocation, or does he/she need a new allocation? No
  • Does the user (or those he/she represents) have the resources, time and skills to manage an EGI VO? No