Difference between revisions of "Competence centre DARIAH"
* [https://cdstar-prod04.gwdg.de/dev/null/ DARIAH-CC test deployment] (Please note: Access via CDSTAR API only)
* [https://info.gwdg.de/docs/doku.php?id=en:services:storage_services:gwdg_cdstar:start Brief usage info]
* [https://github.com/clld/pycdstar Python client]
Revision as of 23:33, 2 October 2016
The DARIAH Competence Centre (CC) is a virtual research group consisting of computer scientists and researchers from the Arts and Humanities. Its mission is to broaden the usage of advanced research infrastructures and technologies, such as cloud-oriented services and computational and storage resources, in Arts and Humanities research. The main activities of the DARIAH CC are:
- Bringing together relevant stakeholders from the EGI and DARIAH communities who help DARIAH members to better understand their research infrastructure needs, and who operate as a ‘knowledge hub’ concerning the use of e-Infrastructures in the domain. >> Read about the CC members.
- Providing customised e-Infrastructure services for Arts and Humanities, by extending generic EGI e-Infrastructure services according to the unique needs of Arts and Humanities users. >> Try the e-infrastructure services
- Creating demonstrator applications from the Arts and Humanities domain that use e-Infrastructure services. The demonstrators display state-of-the-art capabilities of what's available in digital science with the use of EGI. >> Learn about and try our demonstrators
- Raising awareness within the Arts and Humanities research community about the benefits of e-Infrastructure technologies. The CC organises outreach, training and workshop events, and contributes to high-impact events in the field. >> Check our event calendar
THE REST OF THE PAGE, INCLUDING THE REFERENCED SERVICES, IS UNDER FINALISATION
|Ruđer Bošković Institute (RBI), Croatia (Leader institute)||Davor Davidović (CC coordinator), Karolj Skala|
|Hungarian Academy of Science, Institute for Computer Science and Control (MTA SZTAKI), Hungary||Zoltán Farkas (CC technical coordinator)|
|Italian National Institute of Nuclear Physics (INFN), Italy||
|Gesellschaft für wissenschaftliche Datenverarbeitung mbH (GWDG), Germany||
|Data Archiving and Networked Services (DANS), Netherlands||Rene van Horik|
|Austrian Academy of Science (AAS), Austria||Eveline Wandl-Vogt|
Generic services for Digital Arts and Humanities
This section describes the services developed and offered by the DARIAH Competence Centre to the researchers coming from the Arts and Humanities research domain.
Cloud-based compute and storage infrastructure
The most basic service provided by the DARIAH-CC is a federated 'Infrastructure as a Service' (IaaS) cloud. The system is open to any researcher, research project or institute from the Digital Arts and Humanities domain. The infrastructure is based on the EGI Federated Cloud technology and currently connects OpenStack clouds from two providers: INFN-Catania and INFN-Bari. Three more sites are expected to join later in 2016 (GWDG, IRB, SZTAKI). The participating clouds are joined into a single virtual system through an identity federation that enables researchers and their applications to access the different clouds with a single identity and single sign-on. Access to the infrastructure is controlled by the Competence Centre.
The infrastructure is offered as a scalable system for data and compute intensive applications. These applications can be deployed on the DARIAH-CC cloud in the form of Virtual Machine images (VMIs). Existing VMIs can be browsed, and new VMIs can be deployed on the DARIAH-CC cloud through the AppDB Cloud Marketplace portal. The distinct clouds of the DARIAH-CC cloud federation can all be accessed through programming APIs, command line tools and the high level gateways and applications offered by the DARIAH-CC.
The list of resources provided by the DARIAH Competence Centre is available here
It is our pleasure to inform you that the EGI.eu and DARIAH Competence Centre have signed an agreement on providing the EGI Cloud resources for the DARIAH community.
DARIAH Science Gateway
The gateway provides various web-based applications and services for the Digital Humanities researchers, institutes and communities:
- Simple Semantic Search Engine (SSE): Allows users to search in the e-Infrastructure Knowledge Base (Open Access Document Repositories and Data Repositories).
- Parallel Semantic Search Engine (PSSE): A parallelised version of SSE enabling simultaneous search across the e-Infrastructure Knowledge Base, Europeana, Cultura Italia, Isidore, OpenAgris, PubMed and DBpedia platforms.
- DBO@Cloud: A Cloud-based repository presenting a 100+ year old collection of Bavarian dialects.
- Cloud Access service: Single-job applications and parameter-sweep applications can be run on the DARIAH VO clouds without porting efforts.
- Workflow Development service: Complex workflow applications can be developed and run on all the resources of the DARIAH VO.
- File transfer service: Enables transferring data from, to and between storage services providing HTTP, HTTPS, SFTP, GSIFTP, SRM, iRODS and S3 protocols.
The gateway is still under development. To access DARIAH Science Gateway (beta version), please click here.
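As an illustration of the parameter-sweep applications mentioned for the Cloud Access service, the sketch below expands a set of parameter ranges into one job description per combination. It is a generic illustration only: the function name `expand_sweep` and the OCR-style parameters are assumptions made for this example and are not part of the gateway's interface.

```python
from itertools import product


def expand_sweep(parameters):
    """Return one job description (a dict) per combination of parameter values.

    `parameters` maps a parameter name to the list of values to sweep over.
    """
    names = sorted(parameters)  # fixed order so the result is reproducible
    value_lists = [parameters[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical sweep: two languages x three scan resolutions = six jobs.
jobs = expand_sweep({"language": ["deu", "hun"], "dpi": [150, 300, 600]})
for job in jobs:
    print(job)
```

Each resulting dictionary would correspond to one independent run of the same application, which is what makes such sweeps a natural fit for cloud resources.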
Semantic Search Engine
The e-Infrastructure Knowledge Base was originally developed in the context of the CHAIN-REDS project and is now maintained by the Sci-GaIA project. Information is presented to visitors through geographic maps and tables that show and list e-Infrastructures, sites, services, and applications. In addition, the Knowledge Base covers Open Access Document Repositories (OADRs), Data Repositories (DRs) and Open Educational Repositories (OERs) in both geo- and table-view. Many of the OADRs, DRs and OERs contain material related to Cultural Heritage.
On top of the e-Infrastructure Knowledge Base, the Division of Catania of the Italian National Institute of Nuclear Physics (INFN) has developed a Semantic Search Engine (SSE) to semantically enrich the metadata of the more than 30 million resources included in the Knowledge Base and to allow the discovery of new correlations between documents and data and, ultimately, the creation of new knowledge. In this sense, the SSE goes well beyond normal Google/Google Scholar searches. Queries to the Semantic Search Engine can be made in more than 110 languages (including those not based on the Latin alphabet) and the results are ranked according to the latest issue of the Ranking Web of Repositories. Moreover, whenever the information is available, results are connected to Google Scholar and Altmetric in order to provide users with additional information about versions and citations of a given resource found by the query. Last but not least, the SSE exploits the LodLive API to allow users to navigate and explore the Linked Data graph of each resource found by a search.
The SSE may be very useful for the Cultural Heritage community in the following respects:
- Discover connections between several types of Cultural Heritage resources, e.g. documents, data, images, other multimedia materials, etc.;
- Discover connections and correlations between Cultural Heritage topics and topics belonging to other disciplines/domains;
- Connect the SSE with specific vocabularies and thesauri to make the results of searches more relevant for a given Cultural Heritage sub-domain.
Parallel Semantic Search Engine
The Parallel Semantic Search Engine (PSSE) service has been developed by the Division of Catania of the Italian National Institute of Nuclear Physics (INFN) as an extension of the Semantic Search Engine (SSE). The PSSE allows parallel searches across different online repositories of open and linked data. Using the PSSE, users can search and semantically correlate contents in geographically distributed digital repositories across several different domains. The service is currently configured to search digital contents in the e-Infrastructure Knowledge Base, OpenAgris, Europeana, Cultura Italia, Isidore, PubMed and DBpedia, but other repositories can easily be added if the need arises.
In order to use the PSSE, click here.
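The idea of sending one query to several repositories at the same time can be sketched as follows. This is not the PSSE implementation: the repository names are taken from the text above, while the sample titles, the `search_one` matching rule and the `parallel_search` helper are made-up stand-ins so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in repositories: names come from the text, contents are invented.
REPOSITORIES = {
    "Europeana": ["Bavarian dialect atlas", "Roman coins"],
    "Isidore": ["Dialect lexicography in Austria"],
    "DBpedia": ["Bavarian language", "Kyoto"],
}


def search_one(name, query):
    """Hypothetical per-repository search: case-insensitive substring match."""
    hits = [t for t in REPOSITORIES[name] if query.lower() in t.lower()]
    return name, hits


def parallel_search(query):
    """Query every repository concurrently and collect the hits per repository."""
    with ThreadPoolExecutor(max_workers=len(REPOSITORIES)) as pool:
        futures = [pool.submit(search_one, name, query) for name in REPOSITORIES]
        return dict(future.result() for future in futures)


results = parallel_search("dialect")
for repository, hits in results.items():
    print(repository, hits)
```

In this sketch, adding a repository amounts to adding one more entry that `search_one` knows how to query; each repository is queried independently of the others.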
EGI Training Infrastructure
The EGI training infrastructure is an 'Infrastructure as a Service' type cloud system, that is open for face-to-face and online training tutorials and schools for the EGI community and their partners. The infrastructure can basically support two types of tutorials:
- Events that demonstrate services of the EGI federated cloud. In this operational mode the infrastructure can accommodate courses that focus on the usage of the EGI cloud services themselves. Such courses typically target programmers or other technical members of scientific communities or projects.
- Training courses about scientific software and/or scientific services. In this operational mode the representatives of the community (the trainers) deploy custom Virtual Machine images on the training infrastructure before the training, and these images offer the training environment for the students. Because of the cloud-based operational model the students can have dedicated training environments, and the community can benefit from the easy deployment, predictability and repeatability of courses.
Do you want to use the training infrastructure for an event? Please email email@example.com. Further information and sample exercises about the training infrastructure are available at https://wiki.egi.eu/wiki/Training_infrastructure.
Services for developers and service providers in Digital Humanities
gLibrary (v2.0) is a framework, designed and developed by the Division of Catania of the Italian National Institute of Nuclear Physics (INFN), that offers both access to existing (both closed and open) data repositories and the creation of new ones via a simple REST API.
A repository in the gLibrary lingo is a virtual container of one or more data collections. A collection provides access to a relational DB table or to a non-relational (NoSQL) DB collection. Currently, gLibrary supports MySQL, PostgreSQL, Oracle and MongoDB.
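The repository/collection relationship described above can be pictured with a small data model. This is only a conceptual sketch, not gLibrary code or its REST API: the class names, the `add` method and the example repository and collection names are assumptions; only the list of supported back-ends comes from the text.

```python
from dataclasses import dataclass, field

# Back-ends named in the text; everything else in this sketch is hypothetical.
SUPPORTED_BACKENDS = {"MySQL", "PostgreSQL", "Oracle", "MongoDB"}


@dataclass
class Collection:
    """Access to one relational table or one NoSQL collection."""
    name: str
    backend: str

    def __post_init__(self):
        if self.backend not in SUPPORTED_BACKENDS:
            raise ValueError(f"unsupported backend: {self.backend}")

    @property
    def is_relational(self):
        # MongoDB is the only non-relational back-end in the list above.
        return self.backend != "MongoDB"


@dataclass
class Repository:
    """A virtual container of one or more collections."""
    name: str
    collections: dict = field(default_factory=dict)

    def add(self, collection):
        self.collections[collection.name] = collection
        return collection


repo = Repository("demo_repository")
repo.add(Collection("headwords", "PostgreSQL"))
repo.add(Collection("audio_metadata", "MongoDB"))
print(sorted(repo.collections))
```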
gLibrary is intended for researchers, scholars and research groups in the Arts and Humanities, as well as in other disciplines, who wish to organize their digital assets and the accompanying metadata on distributed storage systems based on different technologies. The system provides non-IT experts with an easy-to-use service that features browsing, searching, downloading and replicating of digital objects.
In the context of the EGI DARIAH Competence Centre, gLibrary has been used to build, with data provided by the Austrian Academy of Sciences, a demonstrative repository of Bavarian dialects within the Austrian-Hungarian monarchy. To access the repository, called DBO@Cloud, follow this link.
The Common Data Storage ARchitecture (CDSTAR) is a customizable object storage solution for science and research. CDSTAR addresses the specific requirements of research data management according to good scientific practice. The system can store metadata alongside the research data in a flexible metadata schema that can be tailored for specific use in different scientific disciplines. Additionally, the data objects stored in CDSTAR can be registered automatically with the EPIC Persistent Identifier (PID) service. The EPIC service gives data sets a unique, globally resolvable identifier as an additional abstraction layer that allows data sets to be cited in scientific publications. A role-based security concept is also integrated into CDSTAR, which allows data sets to be protected with an individual set of permissions and rights for each user. Additionally, CDSTAR is capable of using SAML infrastructures, such as the one provided by DARIAH, to verify data access. This makes it possible to manage data permissions at a central point provided by DARIAH and to use the DARIAH access credentials.
CDSTAR offers the following features:
- Integration of user-defined metadata along with the research data using flexible metadata schemata that can be tailored for the specific requirements of different scientific disciplines.
- Implementation of different storage back-ends.
- Provision of a stable RESTful Interface that transfers data over HTTP.
- Integration of an enterprise-grade search engine based on Elasticsearch that operates on metadata as well as full text and indexes a wide range of file formats.
- A role-based security concept that allows the protection of data sets with an individual set of permissions and rights for each user.
- Use of a central permission system like the DARIAH Policy Decision Point (PDP).
- Support for SAML-usage in decentralized data access scenarios.
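To make the role-based security concept in the feature list more concrete, the sketch below attaches user-defined metadata and per-user roles to a stored object and checks whether an action is permitted. It does not use or reproduce the CDSTAR API: the role names, the `ROLE_PERMISSIONS` table and the `DataObject` class are assumptions chosen purely for illustration.

```python
# Hypothetical roles and the actions each one permits (not CDSTAR's own names).
ROLE_PERMISSIONS = {
    "reader": {"read"},
    "editor": {"read", "write"},
    "owner": {"read", "write", "delete"},
}


class DataObject:
    """A stored object with user-defined metadata and per-user roles."""

    def __init__(self, object_id, metadata):
        self.object_id = object_id
        self.metadata = dict(metadata)  # flexible, discipline-specific fields
        self.roles = {}                 # user name -> role name

    def grant(self, user, role):
        if role not in ROLE_PERMISSIONS:
            raise ValueError(f"unknown role: {role}")
        self.roles[user] = role

    def is_allowed(self, user, action):
        """Return True only if the user's role permits the action."""
        role = self.roles.get(user)
        return role is not None and action in ROLE_PERMISSIONS[role]


obj = DataObject("obj-1", {"title": "Bavarian headwords", "language": "bar"})
obj.grant("alice", "owner")
obj.grant("bob", "reader")
print(obj.is_allowed("bob", "read"), obj.is_allowed("bob", "write"))
```

In a setup with a central permission system, the `is_allowed` decision would be delegated to that external service instead of a local table.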
CDSTAR is developed by EGI DARIAH-CC partner GWDG as Open Source Software under the Apache2 Licence. CDSTAR is delivered as a container (Docker/LXC) or can be deployed from source. In most application scenarios, CDSTAR is the basis for research-specific infrastructures.
DBO@Cloud - The Virtual Dialect Dictionary
DBO@Cloud is a Cloud-based repository presenting the work on a 100+ year old collection of Bavarian dialects within the Austrian-Hungarian monarchy, from the beginnings of the German language to the present day. The datasets are provided by the Austrian Academy of Science. The service is based on gLibrary and demonstrates the functionalities of the framework, especially its hierarchical searching functionality.
To access the repository click here.
This short video demonstrates the functionalities of the DBO@Cloud repository and describes how to browse and manage collections.
Optical Character Recognition (OCR)
General description of the service
Target users/beneficiaries of the service
Why to use service, what are the main benefits
- 38th MIPRO conference, 25-29 May 2015, Opatija, Croatia (poster)
- 12th ESWC2015 conference, 31st May-4th June 2015, Portorož, Slovenia (presentation, poster, abstract)
- JaDH 2015 (The Japanese Association for Digital Humanities annual meeting), 1-3 Sep 2015, Kyoto, Japan
- EGI Community Forum, 9-13 Nov 2015, Bari, Italy (presentation "DARIAH requirements and roadmap in EGI")
- EGI Community Forum, 9-13 Nov 2015, Bari, Italy (presentation "The SADE mini-project of the EGI DARIAH Competence Centre")
- EGI Conference, 6-8 April 2016, Amsterdam, Netherlands (presentation)
- e-Infrastructure Days at University Computing Centre, 23-25 May 2016, Zagreb, Croatia (presentation in Croatian)
- Digital Humanities 2016, 11-16 July 2016, Kraków, Poland (poster, abstract)
- EURALEX Conference, 6-10 September 2016, Tbilisi, Georgia (workshop, paper and oral presentation)
- Digital Infrastructure for Research (DI4R), 28-30 September 2016, Kraków, Poland (DARIAH CC presentation)
- DARIAH Annual Meeting, Ghent, Belgium, 10-12 October 2016 (dedicated session in the afternoon of the 11th of October)
CC project background information
CC mailing list: cc-dariah AT mailman.egi.eu
Type of Competence Centre: Science-oriented
Target user communities: Digital Arts, Humanities, and Social Sciences
List of organizations representing the user communities: DARIAH-EU
Duration of the CC: 30 M
Starting at Project Month: 1 M
Ending at Project Month: 30 M
Task 1: User Support and Training
Duration: M6-M30, Leader: RBI
Objectives: Raise awareness of Arts and Humanities (A&H) researchers about the necessity of digital research and to qualify them to work with EGI DARIAH CC services
- to organize Workshops and Training Courses
- to provide ICT consulting services to A&H community
- to provide technical support services
Task 2: DARIAH eScience Gateway on EGI
Duration: M1-M24, Leader: SZTAKI
This task develops a DARIAH eScience Gateway that provides a user friendly web-based environment capable of exploiting EGI cloud and grid resources transparently with the help of WS-PGRADE/gUSE and its customization methodologies (Remote API or Application Specific Module). The specific objectives of this task are:
- To enable the connection between the existing DARIAH virtual research environments and the EGI infrastructure;
- To EGI-enable new identified A&H applications (e.g. SADE from Task 3) by creating new custom web-based interfaces.
The work of this task can basically be split into five different topics: requirements collection, application interface adaptation and implementation, testing, documentation and eScience gateway operation. WS-PGRADE/gUSE offers different approaches for integrating EGI resources into an application:
- If the application already has a user interface, then it is possible to set up a WS-PGRADE/gUSE gateway, develop workflow(s) running on EGI for the given application on this WS-PGRADE/gUSE gateway, and use the gateway’s Remote API to submit workflows from the already existing user interface. The advantage of this approach is that users will still be able to use the interface they are familiar with, but with the possibility to use EGI resources transparently;
- If the application doesn’t have a user interface yet, one may create a custom interface for that using some technology (PHP, Ruby on Rails, etc.), which can rely on the Remote API mentioned above to run workflows;
- If the application doesn’t have a user interface yet, another possibility is to use the Application Specific Module (ASM API) of WS-PGRADE/gUSE to develop a portlet-based user interface.
Task 3: Storing and Accessing DARIAH contents on EGI (SADE)
Duration: M1-M12, Leader: INFN
The overall goal of this mini-project is to create a digital repository of DARIAH contents using gLibrary, the framework developed by the Italian National Institute of Nuclear Physics (INFN) to create and manage archives of digital assets (data and metadata) on local, Grid and Cloud storage resources. The digital repository will be created taking into account the requirements of the DARIAH end-users.
For this specific mini-project the data-sets will be provided by the Austrian Academy of Sciences (AAS), one of the leading Austrian research institutions with a very long-running experience and interest in the Arts and Humanities domain. The AAS datasets represent the work on a 100+ years old collection on Bavarian dialects within the Austrian-Hungarian monarchy from the beginnings of German language to nowadays. Several data types are taken into account: text, multimedia (images, audio files etc.), URIs; as well as primary collection data, interpreted data, secondary background data and geo-data with different license opportunities.
An extract is available at the website of the Database of Bavarian dialects in Austria  electronically mapped.
- Headwords (about 50,000 A-Z);
- Records (about 40,000 plants; about 70,000 in general);
- Multimedia with Link to Audio-file (examples; to be improved);
- Multimedia with Collection (about 3,000; planned to be published within the mini-project);
- Multimedia connected to Headword (about 3,000; planned to be digitized);
- Project specific biographies;
The AAS datasets will be orchestrated by the INFN gLibrary Digital Repository System whose high-level overview is shown in the following figure:
The repositories will be exposed to end-user through two channels:
- As a (series of) portlet(s) integrated both in one of the already existing Science Gateways implemented with the Catania Science Gateway Framework  and in the Science Gateway developed by the lighthouse project;
- As native apps for mobile appliances based on Android and iOS operating systems and downloadable from the official App Stores. The mobile apps will be coded using a cross-platform development environment so that other mobile operating systems could be supported, if needed. Furthermore, the apps could exploit geo-localisation services available on smartphones and tablets to find “near” contents.
Task 4: Multi-Source Distributed Real-Time Search and Information Retrieval (SIR)
Duration: M1-M12, Leader: GWDG
The Multi-Source Distributed Real-Time Search and Information Retrieval (SIR) pathfinding mini-project investigates and implements the possibility of using distributed real-time search engines built on top of a big-data search and analytics platform to offer A&H users information retrieval and search functionality on heterogeneous data sets, commonly found in the domain of A&H, similar to industry-like platforms such as Google Search or commercial data analytics dashboards. The multi-source distributed real-time search and analytics aims at bringing a next-generation search and data retrieval platform, sourced by different systems and heterogeneous data, to users from the Arts and Humanities community. For this reason, a data hub is implemented based on big data stream and batch processing techniques. The data hub is the foundation for sourcing the distributed search engine based on Elasticsearch, which is driven by a software called Common Data Storage Architecture (CDSTAR). On top of this stack, as an interface for the user, a portlet for the DARIAH infrastructure is created that uses the REST interface of CDSTAR for interacting with the big-data-enabled search and analytics stack. The portlet will form the user interface for searching and browsing data, and for displaying the search results as lists, interactive graphics or dashboards that allow a refinement of data retrieval queries.
Task 5: Exploitation
Duration: M7-M30, Leader: DANS
Objectives: Ensure a successful transfer of the mini-projects’ results to its targeted user community and researchers and to increase the applicability and impact of the mini-projects on research conducted.
- to increase the awareness of exploiting EGI infrastructure in the domain of A&H
- to define the usage policy, increase applicability and impact of the mini-projects
- to integrate RI resources and services into EGI
- to disseminate the mini-project results
DARIAH-EU aims to develop and maintain an eInfrastructure that supports A&H research practices. The EGI DARIAH CC will foster this aim by promoting the exploitation of the EGI Grid and Cloud infrastructure to the DARIAH user community. A systematic plan for approaching the DARIAH user community and beyond must be put in place by applying and integrating national roadmaps that rely on different eInfrastructure and eScience technologies. In this regard, exploitation activities will include horizontal and vertical user support. Horizontal support will be based on the dissemination of knowledge and skills, and vertical support on establishing new technological solutions and environments. The groundwork for this vertical component will be laid by the mini-project activities and used for designing a roadmap for sustainable A&H community involvement in EGI.
Milestones (M) and Deliverables (D)
The following gives an overview of deliverables/milestones scheduled. The 'Scope' of the deliverable defines the applicability and visibility of a deliverable. The deliverables marked as 'internal' are internal deliverables for EGI DARIAH CC and if 'external' then it is a deliverable of the EGI-Engage project that is presentable to the EC.
|Code||Title||Lead task||Lead participant||Type||Scope (Internal=Not sent to the EC; External=Sent to EC)||Delivery PM||Delivery CM|
||User support and training plan
|D1.2a||Report on dissemination and training activities
||Report on dissemination and training activities
|D2.2||Initial technical concept and integration plan
|D2.3a||Mini-projects progress report
|D2.3b||Mini-projects progress report
|D5.1||Sustainability and user involvement plan DRAFT
||Data repository for DARIAH||SA2.6 DARIAH||INFN
||Final version of Multi-Source Distributed Real-Time Search and Information Retrieval application
||Production level gateway for Arts and Humanities
||Prototype version of the EGI-enabled application
||First version of the repository in place
See the list of deliverables in the table above, together with links to the already produced deliverables.
The Competence Centre has established and operates a Virtual Organisation in EGI to serve users from arts and humanities. Further information is available at: EGI Virtual Organisation for arts and humanities: vo.dariah.eu
The competence centre is in the process of setting up three science gateway environments to provide user friendly environments to carry out data and compute intensive simulations in arts and humanities.
- WS-PGRADE: To create and run compute intensive applications. The gateway is integrated with the EGI Federated Cloud technology, and is connected to the Virtual Organisation for arts and humanities.
- gLibrary: To create searchable digital data repositories
- CDSTAR: Common Data Storage Architecture
- The competence centre is working on integrating the 'Multi-Source Distributed Real-Time Search and Information Retrieval' application with the EGI Federated Cloud, using CDSTAR.
- The competence centre is creating a searchable digital repository of Bavarian dialects with the gLibrary science gateway.
- DARIAH-CC Parallel Semantic Search Engine (PSSE)
The Competence Centre has launched the survey to collect information from the digital arts and humanities communities about their experiences, expectations and needs concerning e-Infrastructures. The responses collected through this survey will help the EGI DARIAH Competence Centre select, deploy and sustainably operate appropriate online tools, training and user support services for digital arts and humanities groups within and beyond Europe.
- DARIAH e-Infrastructure survey: link
- e-Infrastructure survey results (presentation given at EGI Community Forum, Bari, Nov. 2015)
- Poster presentation at 38. International convention on information and communication technology, electronics and microelectronics (MIPRO), 25-29 May 2015, Opatija, Croatia
- Poster and 2-min oral presentation at 12th ESWC2015 conference, May 31st to June 4th 2015, Portorož, Slovenia (Presentation, Poster, Abstract)
- Presentation: How to innovate Lexicography by means of Research Infrastructures – The European example of DARIAH, JaDH 2015 (The Japanese Association for Digital Humanities annual meeting), 1-3 Sep 2015, Kyoto, Japan
- Presentation: DARIAH requirements and roadmap in EGI, EGI Community Forum, 9-13 Nov 2015, Bari, Italy
- Presentation: The SADE mini-project of the EGI DARIAH Competence Centre, EGI Community Forum, 9-13 Nov 2015, Bari, Italy
- Presentation: e-Infrastructure demonstrators by the DARIAH Competence Centre for digital Arts and Humanities, EGI Conference, 6-8 April 2016, Amsterdam, Netherlands
- Presentation: EGI in Digital Humanities (in Croatian), e-Infrastructure Days at University Computing Centre, 23-25 May 2016, Zagreb, Croatia
- TODO: Poster presentation: "TITLE", XVII EURALEX Congress, 6-10 Sep 2016, Tbilisi, Georgia
- TODO: Workshop "Open Up! Introducing the new DARIAH CC Science Gateway for Lexicographer" @ XVII EURALEX Congress, 6-10 Sep 2016, Tbilisi, Georgia
- TODO: Register DARIAH CC services (once established) in this registry: http://www.civic-epistemologies.eu/outcomes/registry-of-resources/v1/
List of publications
The list of publications, presentations and posters is available on google docs (access permission required)
- Announcements, meetings agendas, and materials associated with DARIAH CC activities (CC meetings, events, presentations, etc...): https://indico.egi.eu/indico/categoryDisplay.py?categId=141
- EGI trainings and webinars: https://indico.egi.eu/indico/categoryDisplay.py?categId=114
- DARIAH-EU home page: https://www.dariah.eu/
- Easy-to-use platform for researchers to access compute, storage and application services (EGI Platform for the Long-tail of science)
- EGI-Engage template for milestones and deliverables: https://documents.egi.eu/document/2501. Two documents: One for DOC type milestones and deliverables, the second (software) for OTHER and DEM type milestones and deliverables.
Available resources in the DARIAH CC
Currently available resources in the Competence Centre
|Cloud and storage
INFN-CATANIA-STACK site capacity:
INFN-BARI site capacity:
GWDG site capacity:
IRB site capacity:
MTA-SZTAKI site capacity: