Amharic Speech Corpus

DOI

tba

Abstract

This Amharic speech corpus is suitable for developing and evaluating speech recognition and retrieval systems. It contains 110 hours of speech data with syllable- and grapheme-based transcriptions, collected from resources that are either in the public domain or released under specific permissive licenses.

License

The corpus is prepared from audiobooks, read speech from the news domain, and multi-genre radio programs, all of which are either in the public domain or released under permissive licenses. We have also used publicly available datasets. By downloading this corpus, you agree to use it for research purposes only.

Citation

When using this data, please cite the original publication:

Nirayo Hailu Gebreegziabher and Andreas Nürnberger. "An Amharic Syllable-Based Speech Corpus for Continuous Speech Recognition." In Statistical Language and Speech Processing. SLSP 2019, Ljubljana, Slovenia. Lecture Notes in Computer Science, vol. 11816. Springer, Cham, 2019. Available at: https://link.springer.com/content/pdf/10.1007/978-3-030-31372-2.pdf#page=180

Download

Amharic Speech Corpus

Readme

Description

The corpus is partitioned into a training set and a validation set, each containing short audio segments no longer than 28 seconds. Utterances in each partition are resampled to a sampling frequency of 16 kHz with a sample size of 16 bits and a single (mono) channel, which gives a bitrate of 256 kbit/s, and are stored as WAV files. The syllable- and grapheme-based transcriptions are provided in plain text and, together with the audio details, as JSON and CSV files. For more details about the corpus, refer to the original publication.
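
The audio format described above can be checked on a downloaded file with a short script. The following is a minimal sketch that uses only Python's standard `wave` module; the function names `describe_wav` and `matches_corpus_format` are hypothetical helpers introduced here for illustration and are not part of the corpus distribution. It also shows where the 256 kbit/s figure comes from: 16,000 samples per second x 16 bits x 1 channel.

```python
import wave

# Format values taken from the description above.
EXPECTED_RATE_HZ = 16_000
EXPECTED_BITS_PER_SAMPLE = 16
EXPECTED_CHANNELS = 1
MAX_DURATION_S = 28.0


def describe_wav(path):
    """Read the header of a PCM WAV file and return its basic properties."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "rate_hz": rate,
        "bits_per_sample": bits,
        "channels": channels,
        "duration_s": frames / rate,
        # Uncompressed PCM bitrate: rate x bits per sample x channels.
        "bitrate_bps": rate * bits * channels,
    }


def matches_corpus_format(info):
    """Return True if the properties agree with the description above."""
    return (
        info["rate_hz"] == EXPECTED_RATE_HZ
        and info["bits_per_sample"] == EXPECTED_BITS_PER_SAMPLE
        and info["channels"] == EXPECTED_CHANNELS
        and info["duration_s"] <= MAX_DURATION_S
    )
```

Calling `describe_wav` on a file path and passing the result to `matches_corpus_format` gives a quick sanity check before training; the actual file names and directory layout of the corpus are not specified here, so any path used is an assumption.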