VT Speech processing

From EGIWiki
General Project Information

  • Leader: Ing. Milan Rusko <milan.rusko@savba.sk>, IISAS, Slovakia (Administration: Gergely Sipos <gergely.sipos@egi.eu>)
  • Mailing List: vt-speech-processing@mailman.egi.eu
  • Status: Complete
  • Start Date: 7/Mar/2011
  • End Date: 14 May 2013
  • Meetings: 1st telcon: 23/April - 1st_SPEED_VT_meeting


Current automatic speech processing technology is strongly oriented towards data-driven approaches that demand huge computational power, especially in the training and testing phases. Evaluating an automatic speech recognition (ASR) system with one setting typically requires several hours of computing on a one-hundred-core computer cluster. Since there are tens of parameters and settings, most iteration-based optimization appears to be too computationally expensive. Moreover, the optimization of one part of the recognizer is not independent of the settings of the other parts. The speech processing community should therefore take the opportunity to exploit grid technology and its enormous computing power in an effort to achieve satisfactory optimization of contemporary ASR systems. Furthermore, approaches useful for ASR can be readily extended to modern speech synthesis systems, since both problems are commonly based on very similar modeling principles.
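To give a feeling for the scale of the problem, the following minimal sketch estimates the cost of exhaustively evaluating all combinations of settings, which is what a joint optimization would need in the worst case because the parts of the recognizer cannot be tuned independently. The concrete numbers (10 parameters, 3 candidate values each, 3 hours per evaluation, 100 cores) and the function name are illustrative assumptions, not figures from the project.

```python
from math import prod


def exhaustive_cost(values_per_parameter, hours_per_evaluation, cores):
    """Return (number of settings, total core-hours) for a full grid search.

    values_per_parameter: one entry per tunable parameter, giving how many
    candidate values are tried for that parameter.
    """
    n_settings = prod(values_per_parameter)
    core_hours = n_settings * hours_per_evaluation * cores
    return n_settings, core_hours


# Hypothetical example: 10 parameters with 3 candidate values each,
# one evaluation taking 3 hours on a 100-core cluster.
settings, core_hours = exhaustive_cost([3] * 10, 3, 100)
print(f"{settings} settings, {core_hours} core-hours")
```

Even with only three candidate values per parameter, the number of settings grows exponentially with the number of parameters, which is why exhaustive joint optimization is out of reach for a single cluster.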


The expected output was two-fold. First, through a dedicated user interface, Grid computing would become available to a wide scientific community of researchers dealing with speech processing. Second, a set of methods for optimization and diagnostics specific to speech processing, together with tools implementing these methods on the grid platform, would be developed.

The eventual results of the work are reported in the [Final Report], a copy of which can be found in the EGI DocDB.


The required output of the project was to be achieved through the following tasks:

  1. Establishment of contacts, investigation of the state of the art, formation of a consortium
  2. Methodology development for
    1. holistic optimization
      1. ASR (may include speaker identification, speaker recognition and language recognition)
      2. Text to Speech (TTS) systems
    2. holistic diagnostics
      1. ASR
      2. TTS
  3. Implementation aspects
    1. porting the computations in the Automatic Speech Processing domain to the Grid platform
    2. solving particular domain-dependent problems of using Grid computing in automatic speech processing
      1. The need for large data transfers and its influence on Grid computing speed
      2. Data security and program security
  4. Storage possibilities for large databases in Grid
  5. Porting commercial applications to Grid


If you wish to join the project, then please send an email to ucst@egi.eu.

  • Countries:
    • Austria
      • Michael Pucher (Telecommunications Research Center Vienna (FTW))
    • Finland
      • Ville Savolainen (CSC - IT Centre for Science)
      • Mikko Kurimo (Aalto University)
    • Ireland
      • Nick Campbell (SFI Stokes Professor of Speech & Communication Technology at Trinity College Dublin (The University of Dublin))
    • Republic of South Africa
      • Bruce Becker (South African National Grid)
      • Nic de Vries (CSIR Meraka Institute)
    • Switzerland
      • Miloš Cerňak (ETH ZURICH)
    • Slovakia:
      • Milan Rusko (IISAS - Institute of Informatics of the Slovak Academy of Sciences (Leader))
        • Speech processing group
      • Ladislav Hluchy (IISAS - Institute of Informatics of the Slovak Academy of Sciences (NIL))
        • Grid computing group
      • Technical University in Košice, Slovak Republic
    • Switzerland:
      • Milos Cernak, Idiap research institute
    • UK:
      • Martin Wynne (University of Oxford)
      • John Coleman (Phonetics Laboratory at Oxford University)
      • Claire Devereux (STFC)
      • Ladan Baghai-Ravary (Phonetics at Oxford University)
    • US:
      • Jiahong Yuan (Department of Linguistics, University of Pennsylvania)
    • Netherlands:
      • Paul Boersma (University of Amsterdam)
  • EGI.eu institute:
    • Nuno Ferreira
    • Gergely Sipos
    • Karolis Eigelis



VT teleconference

A Skype teleconference of the SPEED EGI Virtual Team project was held on the 23rd of April 2012 at 11:00.

Participants were:

EGI: Gergely Sipos
UI SAV (Slovakia): Ladislav Hluchý, Milan Rusko, Jolana Sebestyénová, Peter Kurdel, Marian Trnka, Marian Ritomský.
TU Košice (Slovakia): Jozef Juhár, Matúš Pleva
IDIAP (Switzerland): Milos Cernak
CSC (Finland): Ville Savolainen

After the greeting and introduction by Ladislav Hluchý, Milan Rusko presented his idea of the SPEED project. The other participants responded and presented their opinions. Communication was hampered by the very poor technical quality of the Skype connection; some of the voices were barely intelligible. It was decided that e-mail communication would be preferred in future.

Meeting of part of the VT

A two-day meeting of SPEED partners from Košice and Bratislava (Slovakia) and IDIAP (Switzerland, via Skype) took place at II SAS Bratislava on the 6th and 7th of June 2012. Twelve participants from four teams (II SAS speech processing, II SAS grid computing, TU Košice speech processing, and IDIAP) discussed problems of speech processing and their parallelization and portability to the grid. Some of the presentations were:

  • Matúš Pleva (TU Košice) presented the current state of his broadcast news transcription system and identified the parts where parallelization and large computing power would bring the biggest improvement.
  • Daniel Hládek, Ján Staš and Jozef Juhár (TU Košice) presented a paper, “Building Organized Text Corpus for Speech Technologies for the Slovak language”. The possibility of porting the text corpora to the grid was discussed. It was concluded that experts from CLARIN and META-NET should be contacted and consulted; their presence in the SPEED team would be very welcome.
  • Stanislav Ondáš (TU Košice) presented a paper, “A New Architecture of the Multimodal Dialogue System with distributed Dialogue Manager”. The main challenges are continuous input stream processing, asynchronous system processes combined with synchronous interaction, the ability to take turns immediately, and an agent-based architecture.
  • Miloš Cerňak (IDIAP) presented the parallelization of the ASR training using HTK.
  • Ján Staš (TU Košice) presented the progress in language modeling at their university.
  • Marian Trnka and Sakhia Darjaa (II SAS) presented the most problematic issues to be solved in the Slovak large-dictionary spontaneous speech recognizer and in the automatic acquisition of speech and text databases from freely available sources. They concluded that ROVER-type cooperation of multiple speech recognizers, which may be needed for this task, is extremely computationally demanding, and that grid computing would be one possible way to obtain results in reasonable time.
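
To illustrate the ROVER-type cooperation mentioned in the last item, here is a deliberately simplified sketch of the voting step only. ROVER (Recognizer Output Voting Error Reduction) first aligns the outputs of several recognizers and then votes on each aligned word slot. The sketch assumes the hypotheses are already aligned to equal length, marks an empty slot with a placeholder symbol chosen for this example, and ignores the confidence scores that a full system may use. The function name and the sample sentences are hypothetical.

```python
from collections import Counter

NULL = "@"  # placeholder for "no word in this slot" (an assumption of this sketch)


def vote(aligned_hypotheses):
    """Majority vote per slot over hypotheses already aligned to equal length."""
    lengths = {len(h) for h in aligned_hypotheses}
    if len(lengths) != 1:
        raise ValueError("hypotheses must be pre-aligned to the same length")
    result = []
    for slot in zip(*aligned_hypotheses):
        # Pick the most frequent entry in this slot across all recognizers.
        word, _count = Counter(slot).most_common(1)[0]
        if word != NULL:
            result.append(word)
    return result


# Three hypothetical recognizer outputs for the same utterance.
hypotheses = [
    ["good", "morning", NULL],
    ["good", "morning", "everyone"],
    ["wood", "morning", "everyone"],
]
print(vote(hypotheses))
```

The voting itself is cheap; the cost lies in the fact that every recognizer has to decode the complete audio material before any voting can take place, so the total computation grows with the number of cooperating recognizers.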


Preliminary experiments with parallel acoustic model training on a computer cluster were made to test the behavior of the algorithm when ported to a highly parallel computing environment.

Miloš Cerňak (IDIAP) and Ján Astaloš (II SAS) have built a framework for parallel acoustic model training based on HTK (Hidden Markov Model Toolkit, http://htk.eng.cam.ac.uk/) data parallelism:

  • A single task (1 CPU) takes 15-20 hours
  • 6 CPUs (IBM server) accelerate the task to 6-8 hours
  • The “SMART” cluster (located at II SAS) with 26 CPUs computes the task in 2-3 hours
  • The “SIVVP” cluster (also located at II SAS) with 244 CPUs computes the task in 45 minutes to 1 hour
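
The timings above can be turned into rough speedup and parallel efficiency figures. The sketch below uses the midpoint of each reported range, which is an assumption made only for illustration. The runs were made on different machines, so the results indicate a trend rather than a precise measurement.

```python
# Midpoints of the reported timing ranges, in hours, keyed by CPU count.
# Using midpoints is an assumption of this sketch.
REPORTED_HOURS = {1: 17.5, 6: 7.0, 26: 2.5, 244: 0.875}


def speedup_and_efficiency(hours_by_cpus):
    """Return {cpus: (speedup, efficiency)} relative to the 1-CPU run."""
    baseline = hours_by_cpus[1]
    table = {}
    for cpus, hours in sorted(hours_by_cpus.items()):
        speedup = baseline / hours
        table[cpus] = (speedup, speedup / cpus)
    return table


for cpus, (speedup, efficiency) in speedup_and_efficiency(REPORTED_HOURS).items():
    print(f"{cpus:4d} CPUs: speedup {speedup:5.1f}x, efficiency {efficiency:6.1%}")
```

With these midpoints, the speedup rises from 2.5x on 6 CPUs to 20x on 244 CPUs, while the efficiency per CPU falls from about 42% to about 8%.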

As expected, adding CPUs does not automatically make the computation faster. The SIVVP cluster that was used has 524 working nodes; however, we found it quite ineffective to run the task directly on all the nodes. This is because the computation time on a single node then drops below 1 minute, at which point the job management time of the cluster is too high compared with the computation time.
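
The effect can be illustrated with a very simple model in which every job pays a fixed management cost on top of its share of the computation. The model, the function names and the numbers (7200 s of single-CPU work per pass, 30 s of management time per job) are assumptions for illustration only; they were not measured on the SIVVP cluster.

```python
def overhead_fraction(total_compute_s, n_nodes, overhead_per_job_s):
    """Share of a job's wall time spent on management rather than computing.

    Assumes the work splits evenly across nodes and that each job pays the
    same fixed management cost (a simplification).
    """
    compute_per_node_s = total_compute_s / n_nodes
    return overhead_per_job_s / (compute_per_node_s + overhead_per_job_s)


def max_nodes_for_chunk(total_compute_s, min_chunk_s):
    """Largest node count that keeps per-node computation at or above min_chunk_s."""
    return max(1, int(total_compute_s // min_chunk_s))


# Hypothetical pass: 7200 s of single-CPU work, 30 s management time per job.
for nodes in (26, 244, 524):
    share = overhead_fraction(7200, nodes, 30)
    print(f"{nodes:4d} nodes: {share:.0%} of each job is management time")

# Node count that still leaves at least 60 s of computation per node.
print(max_nodes_for_chunk(7200, 60))
```

Under these assumptions, management time is under a tenth of each job on 26 nodes but more than half of it on 524 nodes, which matches the qualitative observation above.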

Therefore we have proposed a new framework in which one job allocates its own set of working nodes. On top of the standard HTK data parallelism, we plan to manage the data on the cluster effectively, and later also to use MPI parallelism (http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html). We aim to choose the number of jobs based on the amount of processed data; within a single job, an MPI master will distribute the data to the available working nodes. Such a 'hybrid' parallelization could bring further acceleration of the task. The new framework is intended to be tested on a big cluster and later on a grid (if available).
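
The two levels of the proposed scheme can be sketched as follows. This is not the framework itself: a thread pool stands in for the MPI master and its working nodes, a trivial sum stands in for the accumulation of training statistics on a shard of data, and all names and parameters are hypothetical. It only shows the structure: the number of jobs follows from the amount of data, and within each job a master hands shards to workers and merges their partial results.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def plan_jobs(utterances, utterances_per_job):
    """Outer level: derive the number of jobs from the amount of data."""
    n_jobs = max(1, math.ceil(len(utterances) / utterances_per_job))
    return [utterances[i::n_jobs] for i in range(n_jobs)]


def accumulate(shard):
    """Stand-in for accumulating training statistics over one shard."""
    return sum(shard), len(shard)


def run_job(job_data, n_workers):
    """Inner level: a master distributes shards to workers and merges results.

    ThreadPoolExecutor is only a stand-in for MPI processes in this sketch.
    """
    shards = [job_data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(accumulate, shards))
    return sum(p[0] for p in partials), sum(p[1] for p in partials)


def train_iteration(utterances, utterances_per_job, n_workers):
    """Merge the per-job accumulators into one global estimate (here: a mean)."""
    if not utterances:
        raise ValueError("no training data")
    totals = [run_job(job, n_workers) for job in plan_jobs(utterances, utterances_per_job)]
    total_sum = sum(t[0] for t in totals)
    total_count = sum(t[1] for t in totals)
    return total_sum / total_count


# Toy data: 1000 "utterances" represented by numbers, 100 per job, 4 workers per job.
print(train_iteration(list(range(1000)), 100, 4))
```

Because each job is large enough to amortize the management time of the cluster, and the fine-grained distribution happens inside the job, the scheduler sees far fewer and longer jobs than with one job per working node.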

This experiment represents only a partial, small-scale computation task in comparison with ASR or TTS optimization; it was not meant to demonstrate the full complexity or the memory and computing power demands of speech processing computations.

Presentation of the project – LREC Istanbul 2012

Milan Rusko (II SAS) presented a paper on the SPEED EGI VT project at the CoCoFLaRE Workshop "Reinforcing International Collaboration in LRE" at the Eighth International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, on May 26th 2012 (workshop program).

The discussion showed that reputable experts from the speech community feel a certain skepticism about the non-commercial model of EGI and the NGIs, and that they have more faith in the commercial cluster computing models already widely used by the speech community. Practically none of the participants had ever heard of EGI. We are therefore convinced that a focused information strategy and an easy-to-approach policy could open wide possibilities for the use of grid computing in speech processing.

Analysis of Requirements

It became apparent that the focus, scope and goals of the VT had to be changed after the project held its first teleconference and after the meeting of the Slovak participants took place. The authors of the project learned that the speech community is generally not aware of the existence of EGI and the national grid infrastructures. It was therefore unrealistic to expect that members of the established laboratories and companies would respond to our call for participation in the SPEED VT project.

An exhaustive list of speech laboratories with their e-mail addresses was therefore sent to the EGI coordinator. The idea was that these addresses would be passed to the NILs in the corresponding countries, who would help establish contact between the researchers in their country and the SPEED team. The NILs could inform the potential partners about the existence, capacity and possibilities of exploitation of the local Grid and about the possibilities of coordinated use with other countries' national Grids. This approach did not bring the expected results.

Investigation of the state of the art

Solving the tasks