
CANFAR



Community Information

Community Name

Canadian Advanced Network for Astronomical Research

Community Short Name (if any)

CANFAR

Community website

http://www.canfar.phys.uvic.ca/canfar/

Community Description

The Canadian Advanced Network for Astronomical Research (CANFAR) is a computing infrastructure for astronomers. CANFAR aims to provide its users with easy access to very large storage and processing resources through a cloud-based framework. CANFAR allows astronomers to run processing jobs on a set of computing clusters and to store data at a set of data centres. (From http://www.canfar.net/about)

Community Objectives

The main objectives of the community include:

  • Manage large astronomical and astrophysical data sets,
  • Allow users to share the data sets between European and Canadian infrastructures,
  • Provide means for querying data sets by their FITS metadata,
  • Enable running computations on large data sets.

Main Contact Institutions

Istituto Nazionale di Astrofisica (INAF), Via G.B. Tiepolo 11, I-34143 Trieste, Italy - Tel. +39 040 3199 111 - infoats@oats.inaf.it

Main Contact

Giuliano Taffoni <taffoni@oats.inaf.it>

Science Viewpoint

Scientific challenges

The main problem in the CANFAR case study is that the European A&A community has only a storage infrastructure, without the computing resources that are available in the Canadian A&A cloud. Typical observation files are very large, and thus very expensive to transfer to computational sites. After the data is made public (typically after 1 year), it should be replicated between the European and Canadian cloud storage.

Objectives

The main objectives of this community with respect to the Open Data Cloud include:

  • Establish close collaboration between European and Canadian astronomy and astrophysics (A&A) communities,
  • Enable sharing large volumes of astronomical observation and simulation data according to agreed policies (e.g. data should become public 1 year after creation),
  • Enable replication of data between Canadian and European Cloud storage infrastructures.

Use Cases

UC1: User data is made public automatically after 1 year. Actors:

  • Principal Investigator who created the original data.

Action:

  • Access to the data is automatically enabled by the Open Data Platform 1 year after creation
  • Data is replicated between CANFAR and EGI infrastructures
  • Data is available through EGI Open Data Platform portal

Current solutions:

  • Data transfers are initiated manually
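
A minimal sketch of this time-based policy, assuming a hypothetical catalogue record with a creation timestamp and a public flag (names such as Dataset and make_public_if_due are illustrative, not CANFAR or EGI APIs):

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

EMBARGO = timedelta(days=365)  # data becomes public 1 year after creation

@dataclass
class Dataset:
    """Illustrative catalogue record, not an actual CANFAR/EGI structure."""
    name: str
    created: datetime
    public: bool = False
    readers: set = field(default_factory=set)  # PI-managed ACL during embargo

def make_public_if_due(ds: Dataset, now: datetime) -> bool:
    """Flip the ACL once the 1-year embargo has elapsed."""
    if not ds.public and now - ds.created >= EMBARGO:
        ds.public = True   # anyone may read from now on
        return True        # caller can then trigger CANFAR<->EGI replication
    return False

# A data set created two years ago becomes public on the next policy sweep.
ds = Dataset("obs-2013-07-16", datetime(2013, 7, 16, tzinfo=timezone.utc))
if make_public_if_due(ds, datetime.now(timezone.utc)):
    print(ds.name, "is now public; schedule replication and portal listing")
```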

UC2: User wants to find publicly available data set. Actors:

  • Community user interested in accessing particular observation data set.

Action:

  • The user enters a query in the community portal specifying selected FITS metadata key/value pairs. Matching data sets are located and filtered based on the privacy ACLs set for the data (all public data sets matching the query are returned to any user).

Problems to be solved:

  • Enable automatic ACL modification based on time of data creation
  • Enable public access to the public files over specified protocols (e.g. HTTP, FTP)
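
As an illustration of the UC2 query flow, here is a minimal sketch against a relational metadata index; the schema and the sqlite backend are illustrative assumptions, not the CANFAR database:

```python
import sqlite3

# Illustrative schema: one row per (data set, FITS keyword) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE datasets (name TEXT PRIMARY KEY, public INTEGER);
    CREATE TABLE fits_meta (dataset TEXT, key TEXT, value TEXT);
    INSERT INTO datasets VALUES ('obs-001', 1), ('obs-002', 0);
    INSERT INTO fits_meta VALUES
        ('obs-001', 'ORIGIN', 'CADC'), ('obs-001', 'NAXIS', '2'),
        ('obs-002', 'ORIGIN', 'CADC'), ('obs-002', 'NAXIS', '3');
""")

def query(pairs, anonymous=True):
    """Return data sets matching all key/value pairs, filtered by the ACL."""
    matches = None
    for key, value in pairs.items():
        rows = {r[0] for r in conn.execute(
            "SELECT dataset FROM fits_meta WHERE key = ? AND value = ?",
            (key, value))}
        matches = rows if matches is None else matches & rows
    matches = matches or set()
    if anonymous:  # anonymous users only see public data sets
        matches &= {r[0] for r in conn.execute(
            "SELECT name FROM datasets WHERE public = 1")}
    return sorted(matches)

print(query({"ORIGIN": "CADC", "NAXIS": "2"}))  # -> ['obs-001']
```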

Information Viewpoint

Data

  • Astronomical and astrophysical observation raw data (FITS format, with an ASCII header and binary CCD data)
  • Astronomical and astrophysical observation pre-processed data (e.g. optimized volume)
  • Astronomical and astrophysical simulation data

Data size

~1 TB (one night of observations)

Data collection size

1 PB

Data format

FITS (Flexible Image Transport System)

Data Identifiers

Metadata is located in the headers of the FITS files and is also indexed in an external SQL database for lookup.

Standards in use

FITS

Data locations

Italian sites and Canadian sites

Data management plan

Data is typically owned by the Principal Investigator for 1 year, after which it should be made public. The PI can also process the data, e.g. pre-process it to reduce its volume.

Privacy policy

For 1 year after creation the policy is defined by the Principal Investigator, i.e. she can decide who can access the data. After 1 year the data should be publicly available.

Metadata

Metadata for FITS files is stored in the ASCII header of each file as a simple list of key/value pairs with optional comments.

Metadata Identifiers

Metadata is stored as key/value pairs; the identifiers are simple abbreviated strings, e.g. ORIGIN, LPKTTIME, NAXIS.
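
A minimal sketch of reading these header keywords, assuming the astropy library is available and observation.fits is a local file:

```python
from astropy.io import fits  # assumed available: pip install astropy

# A FITS header is a list of 80-character "cards": KEY = value / comment
with fits.open("observation.fits") as hdul:     # hypothetical local file
    header = hdul[0].header                     # primary HDU header
    for key in ("ORIGIN", "NAXIS"):             # identifiers named above
        if key in header:
            # header.comments holds the optional comment of each card
            print(key, "=", header[key], "/", header.comments[key])
```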

Metadata size

Small in comparison to the actual data; typically up to ~100 key/value pairs per file.

Metadata format

An ASCII text header at the beginning of each FITS file, with key/value pairs and optional comments.

Standards in use

FITS

Metadata locations

Metadata is located in the header of each FITS file, as well as indexed in relational databases for data discovery.

Technology Viewpoint

System Architecture

CANFAR (Canadian Advanced Network for Astronomical Research) is composed of:

  • Canadian National Research Network (CANARIE)
  • Cloud processing and storage (Cloud Canada)
  • Canadian Astronomy Data Centre (CADC)

Together they provide a platform for the distribution, processing and storage of astronomical and astrophysical data sets. The cloud infrastructure is based on OpenStack technology. The main services provided to users include:

  • VOSpace – Virtual Observatory user storage,
  • VMOD – Virtual Machines on Demand,
  • GMS – Batch processing and group management.

All services are based on RESTful protocols maintained by CADC. VOSpace provides a web-based user interface for finding data sets through FITS metadata queries. The metadata from FITS file headers is indexed in a relational database. User information is stored in an LDAP catalogue.

Figure: CANFAR architecture

Community data access protocols

  • REST or SOAP for data management control
  • HTTP, FTP for data transfers

Data management technology

The CANFAR data management system is based on VOSpace, an implementation of the IVOA VOSpace specification draft (http://www.ivoa.net/documents/VOSpace/20150601/VOSpace.pdf). Data management control is available through a RESTful interface.
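
For illustration, listing a node through the RESTful interface might look like the sketch below; the base URL and node path are hypothetical placeholders, while the nodes resource and its XML representation follow the VOSpace specification draft cited above:

```python
import xml.etree.ElementTree as ET
import requests  # any HTTP client would do

BASE = "https://vospace.example.org/vospace"  # hypothetical service endpoint

# A GET on the nodes resource returns an XML representation of that node.
resp = requests.get(BASE + "/nodes/mydata", timeout=30)
resp.raise_for_status()

node = ET.fromstring(resp.content)
print("node uri:", node.get("uri"))  # e.g. vos://example!vospace/mydata
```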

Data access control

The GMS service plays the role of a Policy Information Point during authorization requests, returning information about user groups, capabilities and capacities. VOSpace permissions are similar to POSIX rights.

Non-functional requirements

Each requirement is listed with its level and description:

  • Availability: High.
  • Accessibility: High. The public data should be easily accessible to all users.
  • Throughput: High. Data transfers should use all available bandwidth whenever possible. This can be achieved by striping data into blocks and serving them simultaneously from several nodes in the cluster; a client-side sketch follows this list.
  • Response time: Middle. Response time for metadata queries should be short in terms of typical user experience; response time for large data set transfers is not critical (such transfers take several minutes or hours anyway).
  • Security: High. Only publicly available data should be accessible to non-authorized users.
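
A minimal client-side sketch of the striping idea above, using HTTP Range requests against hypothetical replica URLs (the URLs, block size and worker count are illustrative assumptions):

```python
import concurrent.futures as cf
import requests

# Hypothetical replicas of the same large observation file on several nodes.
REPLICAS = ["https://node1.example.org/obs.fits",
            "https://node2.example.org/obs.fits"]
CHUNK = 64 * 1024 * 1024  # 64 MiB blocks

def fetch(job):
    """Fetch one block from one replica via an HTTP Range request."""
    url, start, end = job
    r = requests.get(url, headers={"Range": "bytes=%d-%d" % (start, end)},
                     timeout=300)
    r.raise_for_status()
    return start, r.content

def download(size):
    """Stripe the file into blocks and pull them from replicas in parallel."""
    jobs = [(REPLICAS[i % len(REPLICAS)], s, min(s + CHUNK, size) - 1)
            for i, s in enumerate(range(0, size, CHUNK))]
    buf = bytearray(size)
    with cf.ThreadPoolExecutor(max_workers=8) as pool:
        for start, data in pool.map(fetch, jobs):
            buf[start:start + len(data)] = data
    return bytes(buf)
```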

e-Infrastructure in use

The CANFAR infrastructure is based on the OpenStack cloud platform.

Requirements for EGI Testbed Establishments

Does the case include preferences on specific tools and technologies to use?

  • Automatic provision of public data sets to users (based on predefined policies, e.g. 1 year after creation)
  • Due to the large size of the data sets, transferring data from a storage site to a computation site can be very expensive. Either computation should be moved close to the data or, if that is not possible, a local mount of the remote storage should be provided on the computational nodes

Approximately how much compute and storage capacity and for how long time is needed?

The current data collection is over 1 PB in size.

Does the user (or those he/she represents) have access to a Certification Authority?

This will be resolved as part of the EGI FedCloud project. Authentication will be based on X.509 certificates and, in the future, possibly on the eduGAIN service.
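
With X.509-based authentication, a client can present the user certificate directly on each call; a minimal sketch with the Python requests library, where the URL and the certificate/key paths are placeholders:

```python
import requests

# Hypothetical protected endpoint; certificate and key paths are placeholders.
resp = requests.get(
    "https://data.example.org/protected/dataset",
    cert=("/home/user/.globus/usercert.pem",  # X.509 user certificate
          "/home/user/.globus/userkey.pem"),  # matching private key
    timeout=60,
)
resp.raise_for_status()
```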

Meetings and minutes

Reference