Amharic Speech Corpus




This Amharic speech corpus is suitable for developing and evaluating speech recognition and retrieval systems. It contains 110 hours of speech with syllable-based and grapheme-based transcriptions, collected from sources that are either in the public domain or released under permissive licenses.


The corpus was prepared from audiobooks, read speech from the news domain, and multi-genre radio programs, all of which are either in the public domain or released under permissive licenses. Publicly available datasets were also used. By downloading this corpus, you agree to use it for research purposes only.


When using this data, please cite the original publication:

Nirayo Hailu Gebreegziabher, and Andreas Nürnberger. "An Amharic Syllable-Based Speech Corpus for Continuous Speech Recognition." In Statistical Language and Speech Processing. SLSP 2019, Ljubljana, Slovenia. Lecture Notes in Computer Science, vol 11816. Springer, Cham, 2019. Available at:





The corpus is partitioned into a training set and a validation set, each containing short audio segments no longer than 28 seconds. Utterances in each partition are resampled to 16 kHz with 16-bit samples and a single (mono) channel, which corresponds to a bitrate of 256 kbit/s, and are stored as WAV files. The syllable-based and grapheme-based transcriptions are provided in plain text, and the audio details are included as JSON and CSV files. For more details about the corpus, refer to the original publication.
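The audio format described above can be checked after download with a short script. The following is a minimal sketch in Python, using only the standard-library wave module. It is not part of the corpus distribution; the function name check_wav and the constant names are chosen here for illustration only, and the expected values are taken from the description above.

```python
import wave

# Expected values, taken from the corpus description (not from any official tool).
EXPECTED_RATE_HZ = 16_000          # 16 kHz sampling frequency
EXPECTED_SAMPLE_WIDTH_BYTES = 2    # 16-bit samples
EXPECTED_CHANNELS = 1              # mono
MAX_DURATION_S = 28.0              # segments are no longer than 28 seconds

# 16,000 samples/s * 16 bits * 1 channel = 256,000 bit/s
EXPECTED_BITRATE = EXPECTED_RATE_HZ * EXPECTED_SAMPLE_WIDTH_BYTES * 8 * EXPECTED_CHANNELS


def check_wav(path):
    """Return a list of deviations from the documented format (empty if none)."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        frames = wav.getnframes()

    problems = []
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ}")
    if width != EXPECTED_SAMPLE_WIDTH_BYTES:
        problems.append(f"sample width is {width * 8} bits, expected 16")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected mono")

    # Duration in seconds is the frame count divided by the frame rate.
    duration = frames / rate if rate else 0.0
    if duration > MAX_DURATION_S:
        problems.append(f"duration is {duration:.2f} s, expected at most {MAX_DURATION_S}")
    return problems
```

Calling check_wav on the path of a downloaded utterance returns an empty list when the file matches the documented format, and a list of human-readable messages otherwise.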

Last Modification: 26.11.2019 - Contact Person: