Language Datasets
Description: This dataset is the first large-scale human-annotated Twitter sentiment dataset for Hausa, Igbo, Nigerian-Pidgin, and Yorùbá, the four most widely spoken languages in Nigeria.
Authors: Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Saheed Salahudeen Abdullahi, Anuoluwapo Aremu, Alipio Jeorge, and Pavel Brazdil
Languages: Hausa, Igbo, Nigerian-Pidgin, and Yorùbá
Dataset: access here
Description: This evaluation dataset is used to automatically quantify the quality of machine translation systems for Afar, Amharic, Oromo, Somali, and Tigrinya.
Authors: Asmelash Teka Hadgu, Gebrekirstos G. Gebremeskel, Abel Aregawi
Translators: Afar – Mohammed Deresa, Yasin Nur; Amharic – Tigist Taye, Selamawit Hailemariam, Wako Tilahun; Oromo – Gemechis Melkamu, Galata Girmaye; Somali – Abdiselam Mohamed, Beshir Abdi; Tigrinya – Michael Minassie, Berhanu Abadi Weldegiorgis, Nureddin Mohammedshiek
Languages: Afar, Amharic, Oromo, Somali and Tigrinya
Dataset: access here
Description: This project collected text and speech corpora for three languages in Kenya: Kiswahili, Dholuo, and Luhya (three dialects: Lumarachi, Logooli, and Lubukusu). Primary data was collected from the respective language communities and included Indigenous stories and narratives from student compositions, native language media stations, and publishers, so that the corpus covers genres representative of everyday language use in the communities. A total of 4,442 texts were collected: 2,909 for Swahili, 546 for Dholuo, 483 for Lumarachi, 135 for Lubukusu, and 359 for Logooli. A total of 1,152 files containing spontaneous speech data were collected, totaling 176 hours, 29 minutes, and 46 seconds: 104 files (19 hours, 10 minutes, 57 seconds) for Swahili, 512 files (99 hours, 3 minutes, 8 seconds) for Dholuo, 138 files (15 hours, 37 minutes, 46 seconds) for Lumarachi, 354 files (30 hours, 11 minutes) for Lubukusu, and 44 annotated files (12 hours, 26 minutes, 55 seconds) for Lulogooli.
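The speech totals above can be cross-checked with a few lines of Python. This is a minimal sketch: the per-language durations are copied from the description, while the dictionary layout and the helper name `to_seconds` are illustrative and not part of the dataset.

```python
# Per-language spontaneous-speech durations as (hours, minutes, seconds),
# copied from the description above.
durations = {
    "Swahili": (19, 10, 57),
    "Dholuo": (99, 3, 8),
    "Lumarachi": (15, 37, 46),
    "Lubukusu": (30, 11, 0),
    "Lulogooli": (12, 26, 55),
}

def to_seconds(h, m, s):
    """Convert an (hours, minutes, seconds) triple to seconds."""
    return h * 3600 + m * 60 + s

# Sum everything in seconds, then convert back to h/min/s.
total = sum(to_seconds(*hms) for hms in durations.values())
hours, rest = divmod(total, 3600)
minutes, seconds = divmod(rest, 60)
print(f"{hours} h {minutes} min {seconds} s")  # prints "176 h 29 min 46 s"
```

The result agrees with the stated total of 176 hours, 29 minutes, and 46 seconds.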
Authors: Owen McOnyango (Maseno University), Florence Indede (Maseno University), Lilian D.A. Wanzare (Maseno University), Barack Wanjawa (University of Nairobi), Edward Ombui (Africa Nazarene University), Lawrence Muchemi (University of Nairobi)
Languages: Kiswahili, Dholuo, Luhya-Lubukusu, Luhya-Logooli, Luhya-Lumarachi
Dataset: access here
Description: This project developed a part-of-speech (POS) tagged dataset for two languages in Kenya: Dholuo and Luhya (three dialects: Lumarachi, Lulogooli, and Lubukusu). The project tagged approximately 143,000 words: about 50,000 for Dholuo, 27,900 for Lumarachi, 34,300 for Lulogooli, and 30,900 for Lubukusu.
Authors: Florence Indede (Maseno University), Owen McOnyango (Maseno University), Lilian D.A. Wanzare (Maseno University), Barack Wanjawa (University of Nairobi), Edward Ombui (Africa Nazarene University), Lawrence Muchemi (University of Nairobi)
Languages: Dholuo, Luhya-Lumarachi, Luhya-Lulogooli, Luhya-Lubukusu
Dataset: access here
Description: This project produced a speech dataset that includes both read and spontaneous speech recordings, recorded in Kenya with native Swahili speakers, and corresponding transcripts. In total, the dataset includes 27 hours, 31 minutes, 50 seconds of speech data from 26 speakers (19 female and 7 male). The recordings are in .wav format: 16-bit, 16 kHz, mono, little-endian. Of the total recordings, 26 hours, 32 minutes, and 37 seconds are read speech, while 59 minutes, 13 seconds are spontaneous speech. Additionally, this dataset includes a phone-list file containing all the Swahili phones as used by KenCorpus. This phone-list file is crucial, as its contents have been used to create the KenCorpus Swahili lexicon-phone dictionary, which contains all the words in the KenCorpus transcripts with their corresponding pronunciations as per the Swahili phones in the phone-list. The lexicon-phone dictionary contains about 30,000 words.
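A recording can be checked against the stated audio format (16-bit, 16 kHz, mono WAV) using Python's standard `wave` module. This is a minimal sketch, not a tool shipped with the dataset: the function name `matches_stated_format` is hypothetical, and the path passed to it is whatever local file the reader wants to inspect.

```python
import wave

def matches_stated_format(path):
    """Return True if the WAV file at `path` is 16-bit, 16 kHz, mono."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getsampwidth() == 2          # 2 bytes per sample = 16 bits
            and wav.getframerate() == 16000  # 16 kHz sampling rate
            and wav.getnchannels() == 1      # mono
        )
```

Endianness is not checked explicitly here: the `wave` module reads RIFF WAV files, which store PCM samples in little-endian order.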
Authors: Dorcas Awino (University of Nairobi), Lawrence Muchemi (University of Nairobi), Lilian D.A. Wanzare (Maseno University), Edward Ombui (Africa Nazarene University), Barack Wanjawa (University of Nairobi), Owen McOnyango (Maseno University), Florence Indede (Maseno University)
Language: Swahili
Dataset: access here
Description: This project produced a parallel corpus between Swahili and two other Kenyan languages: Dholuo and Luhya (three dialects: Lumarachi, Logooli, and Lubukusu). A total of about 12,400 sentences were translated to Kiswahili from a sample of Dholuo and Luhya texts (1,500 Dholuo-Kiswahili sentence pairs and 10,900 Luhya-Kiswahili sentence pairs).
Authors: Lilian D.A. Wanzare (Maseno University), Florence Indede (Maseno University), Owen McOnyango (Maseno University), Edward Ombui (Africa Nazarene University), Barack Wanjawa (University of Nairobi), Lawrence Muchemi (University of Nairobi)
Languages: Dholuo, Luhya-Lumarachi, Luhya-Lubukusu, Luhya-Lulogooli
Dataset: access here
Description: This project produced a large machine reading comprehension dataset for the Kiswahili language. A total of 7,526 question-answer (QA) pairs were developed based on 1,445 Swahili story texts. Each text has at least 5 QA pairs, where the questions were written based on the story, and the answers are either a single word or a short text.
Authors: Barack Wanjawa (University of Nairobi), Lilian D.A. Wanzare (Maseno University), Florence Indede (Maseno University), Owen McOnyango (Maseno University), Lawrence Muchemi (University of Nairobi), Edward Ombui (Africa Nazarene University)
Language: Swahili
Dataset: access here
Description: MasakhaNER 2.0 is the largest human-annotated named entity recognition dataset for 20 African languages. Each language has between 4,800 and 11,000 parallel sentences for training and/or evaluation. The languages covered span West, Central, East and Southern Africa, and include Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu. More information about the data can be found in their EMNLP paper here.
Contact: David Ifeoluwa Adelani, D.ADELANI@UCL.AC.UK
Authors: available here.
Languages: Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu
Dataset: access here.
Description: The MAFAND-MT dataset consists of a few thousand high-quality, human-translated parallel sentences for 16 African languages in the news domain. Each language has between 1,466 and 7,838 parallel sentences for training and/or evaluation. The languages covered span West, Central, East and Southern Africa, and include Bambara, Ghomala, Ewe, Fon, Hausa, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Twi, Wolof, and isiXhosa. Further details on this dataset can be found in the team’s NAACL 2022 paper: https://arxiv.org/abs/2205.02022
Contact: David Ifeoluwa Adelani, D.ADELANI@UCL.AC.UK
Authors: available here.
Languages: Bambara, Ghomala, Ewe, Fon, Hausa, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Twi, Wolof, and isiXhosa
Dataset: access here.
Description: MasakhaPOS is the largest human-annotated part-of-speech tagging dataset for 20 African languages. Each language has between 1,200 and 1,500 sentences for training and/or evaluation. The languages covered span West, Central, East and Southern Africa, and include Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu.
Contact: David Ifeoluwa Adelani, D.ADELANI@UCL.AC.UK
Authors: available here.
Languages: Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu
Dataset: access here.
Description: This speech dataset for the Ghanaian languages Akan (Akuapem Twi, Asante Twi, Fante) and Ga includes 104,000 utterances across the four dialects/languages, with approximately 200 speakers per dialect/language. This amounts to about 148 hours of speech in total. The dataset was developed to support financial applications in native Ghanaian languages so that illiterate and semi-literate people can fully benefit from digital financial services. It also aims to answer research questions related to domain-specific vs. general-purpose dataset development, dialects, and NLP system development in low-resource settings.
Contact: Dennis Asamoah Owusu, DOWUSU@ASHESI.EDU.GH
Authors: available here.
Languages: Akan (Akuapem Twi, Asante Twi, Fante) and Ga
Dataset: access here.
Description: This dataset is the first spoken corpus of labeled and unlabeled data for Igbo Natural Language Processing (NLP) tasks. It consists of approximately 40 hours of naturally occurring Igbo speech that is representative of all the dialects of Igbo. The dataset lays the foundation for Igbo NLP tasks and resources such as machine translation, treebanks, speech-to-text, automatic POS tagging, digital dictionaries, and automatic spelling checkers.
Contacts: Gerald Nweya (GERALDNWEYA@GMAIL.COM) and Emeka Onwuegbuzia (EONWUEGBUZIA@GMAIL.COM)
Authors:
Language: Igbo
Dataset: access here.
Description: The Bayelemabaga dataset consists of 46,976 parallel machine translation-ready Bambara-French sentence pairs, originating from the Bambara Reference Corpus from INALCO’s LLACAN Lab. The text in the dataset is extracted from 264 text files, ranging from periodicals, books, short stories, blog posts, to parts of the Bible and the Quran.
Contact: Christopher Homan, christopher.m.homan.phd@gmail.com
Authors: Allahsera Auguste Tapo, Michael Leventhal, Valentin Vydrin, Sebastian Diarra, Marcos Zampieri, Emily Prud’Hommeaux, Jean Jacque Méric,
Languages: Bambara, French
Dataset: access here.
Description: Makerere University has created text and speech datasets for low-resourced East African Languages in Uganda, Tanzania, and Kenya. This dataset contains 10,000 parallel sentiment-tagged sentences, 100,000 Kiswahili sentences, 100,000 Luganda sentences, 40,037 Acoli sentences, and 39,999 Lumasaaba sentences. On Common Voice, the text dataset comprises 100,000 Luganda sentences and 100,000 Swahili sentences. The text datasets can be used for building machine translation, next-word predictor/auto-completion, topic modeling and classification, sentiment analysis, and language models. The Luganda and Swahili voice datasets can empower entrepreneurs to innovate around existing gaps in their communities to build systems for visually impaired or physically handicapped people, native language tutors, medical transcription tools, and more. Application developers interested in translation engines, text editors, and text and grammar spelling systems in the East African community will benefit from the datasets.
Contact: Andrew Katumba | andrew.katumba@mak.ac.ug
Datasets:
- Text data: access here.
- Luganda voice data: access here.
- Swahili voice data: access here.
Authors and Affiliations:
- Makerere University: Katumba Andrew, Nakatumba-Nabende Joyce, Babirye Claire, Mukiibi Jonathan, Tusubira Jeremy, Bateesa Tobias, Wairagala Eric Peter, Fridah Katushemererwe, Mutebi Chodrine, Nabende Peter, Sentanda Medadi, Ssenkungu Ivan
- Wanzare Lilian (Maseno University)
- Davis David (TYD Innovation Incubator)
- Okidi George
- Ayugi Carolyne
- Muzaki Naomi
Contact: Claytone Sikasote | claytonsikasote@gmail.com
The BIG-C (Bemba Image Grounded Conversations) dataset consists of multi-turn dialogues between Bemba speakers grounded in images, transcribed and translated to English. Specifically, there are over 92,000 sentences, amounting to over 180 hours of speech data with corresponding Bemba transcriptions and English translations.
Authors and Affiliations:
- Claytone Sikasote, University of Zambia, Zambia
- Eunice Mukonde-Mulenga, University of Zambia, Zambia
- Md Mahfuz Ibn Alam, George Mason University, USA
- Antonios Anastasopoulos, George Mason University, USA
Dataset: https://github.com/csikasote/bigc
Publication: https://aclanthology.org/2023.acl-long.115
Contact: Aminata Ndiaye | amina.ndiaye@jokalante.com and Elodie Gauthier | elodie.gauthier@orange.com
This dataset will strengthen natural language processing resources for Wolof, Pulaar, and Serer, the three most widely spoken languages in Senegal. The repository includes over 55 hours (12 files) of transcribed speech in Wolof, 38 hours (105 files) in Serer, and 31 hours (83 files) in Pulaar. It also includes over 12 hours of verified recordings in each language, textual data containing over 947,000 words in Wolof and 593,000 in Pulaar, and a pronunciation lexicon of over 54,000 phonetized entries in Wolof.
Authors and Affiliations:
- Project Leader: Aminata Ndiaye Diallo (Jokalante, Dakar, Senegal)
- Stakeholders: Elodie Gauthier (Orange Innovation, Lannion, France), Abdoulaye Guissé (Ecole Polytechnique de Thiès, Senegal)
- Intern: Boubacar Diallo (Assane Seck University, Ziguinchor, Senegal) – Collection of textual dataset
- Trainees: Maimouna Diallo (Cheikh Anta Diop University, Dakar, Senegal) – Wolof transcription, Houleye Amadou Kane (Cheikh Anta Diop University, Dakar, Senegal) – Pulaar transcription, Fatou Diouf (Cheikh Anta Diop University, Dakar, Senegal) – Serer transcription
Dataset:
Languages: Hausa, Igbo, and Yoruba
Contact: For partnerships, collaborations, or questions, reach out to info@naijavoices.com
The NaijaVoices project has curated 1,867 hours of speech and text data featuring over 5,000 speakers in the three major Nigerian languages — Hausa, Igbo, and Yoruba. As of its release, it is the largest ever multi-speaker African speech dataset. The dataset consists of circa 1,917,686 instances – each instance is made up of audio, a transcript, the language of the transcript, the speaker ID, gender, and age bracket. The dataset enables audio-based NLP tasks like automatic speech recognition (ASR) and text-to-speech (TTS). Additionally, the authentic sentences in the dataset can enhance text-based natural language processing (NLP) tasks, including language modeling, part-of-speech tagging, and named entity recognition.
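The per-instance structure described above can be pictured as a simple record type. The following is only an illustrative sketch: the class name, field names, and placeholder values are assumptions made for this example and do not reflect the dataset's actual schema or file layout.

```python
from dataclasses import dataclass, fields

@dataclass
class Instance:
    """One record, with the six components listed in the description."""
    audio_path: str   # hypothetical: location of the audio clip
    transcript: str   # text of what is spoken
    language: str     # language of the transcript
    speaker_id: str   # identifier of the speaker
    gender: str       # speaker gender
    age_bracket: str  # speaker age bracket

# Placeholder values only; none of these come from the real dataset.
example = Instance(
    audio_path="<path/to/clip.wav>",
    transcript="<transcript text>",
    language="Hausa",
    speaker_id="<speaker id>",
    gender="<gender>",
    age_bracket="<age bracket>",
)
print([f.name for f in fields(example)])
```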
Linguistic applications of this dataset include understanding sociolinguistic profiles, analyzing pronunciation variations, studying phonetic and phonemic differences, and advancing natural language processing (NLP) capabilities for the three Nigerian languages. The NaijaVoices method intentionally incorporated discourse about marginalized populations, such as women, children, and people living with disabilities, as well as underrepresented topic areas, such as traditional counting systems and agriculture. The dataset also represents diverse voices, with over 5,000 participants with unique speaker patterns and dialects.
Authors and Affiliations: The NaijaVoices Community (https://naijavoices.com/)
Dataset: https://naijavoices.com/membership#membership_tiers
- Access to this dataset is free and openly available upon filling out the registration form at the link above (Click the “Explorer” tier). Once you fill out the form, you will receive an email that gives you a link to the dataset.
Languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu
Contact: Jesujoba O. Alabi | jalabi@lsv.uni-saarland.de
AFRIDOC-MT is a document-level and multi-way translation dataset from English into five African languages — Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all of which were human-translated from English to these languages. Each domain has at least 10,000 parallel sentences per language pair and supports multiway translation, allowing translation not only between English and the African languages but also among the African languages themselves.
This dataset can be used to evaluate the ability of existing neural machine translation (NMT) models and large language models (LLMs) to translate at the document level and to train such models. Recently, there has been interest in document-level translation with multiple sentences, where sentences are translated with their context rather than in isolation. Previously, efforts were focused on high-resource languages, where document-level datasets are readily available, and not on low-resource African languages. In addition, it can be used for sentence-level translation and a couple of other language tasks if properly annotated.
Authors and Affiliations:
- Saarland University: Jesujoba O. Alabi, Israel Abebe, Miaoran Zhang, Dawei Zhu, Dietrich Klakow
- German Research Center for Artificial Intelligence (DFKI): Cristina España-Bonet
- INRIA: Rachel Bawden
- McGill University and Mila: David Adelani
- University of Ibadan: Clement Oyeleke Odoje, Idris Akinade
- National Institute of Informatics (NII): Iffat Maab
- Selcom: Davis David
- Imperial College, London: Shamsuddeen Hassan Muhammad
- University of KwaZulu-Natal: Nokwanda Putini
- Loughborough University, U.K.: David Oluwajoju Ademuyiwa
- University of Cambridge: Andrew Caines
Languages: Amharic, Ewe, Hausa, Igbo, Lingala, Luganda, Oromo, Kinyarwanda, Shona, Sesotho, Swahili, Twi, Wolof, Xhosa, Yoruba and Zulu
Contact: David Adelani | david.adelani@mila.quebec
This team has developed five conversational AI and benchmark datasets for 16 languages across the African continent: Amharic, Ewe, Hausa, Igbo, Lingala, Luganda, Oromo, Kinyarwanda, Shona, Sesotho, Swahili, Twi, Wolof, Xhosa, Yoruba and Zulu. The first dataset, AfriXNLI, is a natural language inference dataset used to determine the relationship (entailment, neutral, or contradiction) between two sentences; it has 1,050 sentence pairs per language. The second dataset, AfriMMLU, is a knowledge-based multiple-choice question-answering dataset covering five subjects: elementary mathematics, high-school geography, international law, global facts, and high-school microeconomics. The team collected 608 question-answer pairs per language. The third dataset, AfriMGSM, is a free-form grade-school mathematics question-answering dataset with 258 question-answer pairs. AfriIntent, for which the team collected 3,200 sentences per language, is an intent classification dataset covering domains such as banking (e.g., “pay bill”), home (e.g., “play music”), kitchen and dining (e.g., “confirm reservation”), travel (e.g., “plug type”), and utility (e.g., “make call”). Finally, using 3,200 sentences per language, the team developed AfriSlot for slot classification in categories such as food items and language names.
These five text-only datasets are useful for conversational chatbots in real-life applications such as banking, restaurants, travel agencies, and more. The team has created strong benchmarks for evaluating the performance of large language models such as GPT-4o on African languages.
Authors and Affiliations:
- McGill University & Mila: David Ifeoluwa Adelani, Hao Yu
- SADiLaR: Andiswa Bukula, Mmasibidi Setaka, Rooweither Mabuya
- OntarioTech University: En-Shiun Annie Lee
- Saarland University: Israel Abebe Azime, Jesujoba O. Alabi
- University of Toronto: Jian Yun Zhuang
- Princeton University: Happy Buzaaba
- Masakhane: Blessing Sibanda, Godson Kalipe, Jonathan Mukiibi, Salomon Kabongo, Lolwethu Ndolela, Nkiruka Odu, Salomey Osei, Sokhar Samb, Tadesse Kebede Guge, Juliet Murage
- Imperial College: Shamsuddeen Hassan Muhammad
Languages: Luganda, Lumasaba, Hausa, and Kanuri
Contact:
- Andrew Katumba | katumba@mak.ac.ug
- Milena Haykowska | milena.haykowska@clearglobal.org
- Peter Nabende | nabende@gmail.com
This dataset contains annotated sentences with personally identifiable information (PII) in Luganda, Lumasaba, Hausa, and Kanuri. These four languages span Central and Eastern Uganda, Nigeria, Ghana, and Northern Cameroon. The team collected 3,000 sentences for both Kanuri and Hausa, 5,000 for Lumasaba, and 4,000 for Luganda. Potential use cases for these datasets include named entity recognition (NER), text classification, privacy-preserving data analysis and research, language modeling, machine translation, and linguistic research.
The team aimed to curate a dataset that is gender inclusive, and their work highlighted the need for standardized guidelines for annotating low-resourced languages. Having these guidelines would help to avoid common pitfalls and errors when labeling text data in these low-resource languages.
Authors and Affiliations:
- Marconi Research and Innovations Lab, Makerere University: Andrew Katumba, Jenifer Winfred Namuyanja, Nakakande Bridget Cecile
- Makerere Artificial Intelligence Lab: Joyce Nakatumba-Nabende, Ann Lisa Nabiryo, Peter Nabende, Eric Peter Wairagala
- Clear Global: Milena Haykowska, Andrew Bredenkamp, Mariam Mohanna, Alp Öktem, Etienne de Crecy
Dataset: https://doi.org/10.7910/DVN/CGHWZE
Languages: Hausa, Yoruba, Igbo, Nigerian Pidgin, Algerian Arabic, Moroccan Arabic, Swahili, IsiXhosa, IsiZulu, Kinyarwanda, Twi, Amharic, Oromo, Somali, Tigrinya
Contact:
- Abinew Ali Ayele | abinewaliayele@gmail.com
- Seid Muhie Yimam | muhie.yimam@uni-hamburg.de
- Shamsuddeen Hassan Muhammad | muhammad@imperial.ac.uk
AfriHate is a hate and offensive speech corpus for 15 African languages: Hausa, Yoruba, Igbo, Nigerian Pidgin, Algerian Arabic, Moroccan Arabic, Swahili, IsiXhosa, IsiZulu, Kinyarwanda, Twi, Amharic, Oromo, Somali, Tigrinya. The AfriHate dataset annotated tweets using “offensive,” “hateful,” and “normal” classes, with specific target classes (topics) such as politics, ethnicity, gender, religion, and disability. Within this project, the team created another dataset, AfriEmotion, a new corpus for the detection of emotion, including the intensity of emotions such as joy, sadness, fear, anger, surprise, and disgust. Overall, the team collected and annotated 10,000 instances each for hate and offensive speech and emotion detection per language, making a total of 150,000 annotated observations.
This project is the first to develop and publicly release a dataset for hate and offensive speech and emotion detection in the target languages. To ensure a representative dataset, the target languages were chosen to span all regions of Africa. Similarly, for each language, the team collected texts using a diverse set of strategies to ensure even representation across the corpus and used annotators of diverse backgrounds in terms of gender, status, and educational level.
The AfriHate dataset supports various Natural Language Processing (NLP) tasks and applications for African languages, including hate speech detection, abusive language identification, contextual analysis, and language modeling. It serves several use cases, such as psychological research, policy making, and content moderation. The dataset helps to detect hate speech effectively in low-resource language settings, identify linguistic patterns of hate speech, understand contextual influences, and improve NLP tools for nuanced content moderation in African languages.
Similarly, the AfriEmotion dataset facilitates various NLP tasks and applications for African languages, including emotion detection, analysis, and synthesis. Its use cases include social media monitoring to understand public sentiment and emotion, mental health support with early detection of distress, educational tools promoting emotional intelligence, literary analysis through an emotional lens, and policy insights for informed decision-making. The dataset addresses questions regarding linguistic and cultural influences on emotional expression, similarities and differences across languages and cultures, adaptation of NLP models for low-resource languages, and challenges and opportunities of cross-lingual emotion processing in African contexts.
Authors and Affiliations:
- ICT4D, Bahir Dar University: Esubalew Alemneh Jalew, Abinew Ali Ayele
- Bayero University Kano, Department of Computing: Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad
- Imperial College London: Shamsuddeen Hassan Muhammad
- Ahmadu Bello University, Department of Computer Science: Idris Abdulmumin
- University of Hamburg, Language Technology Group, Department of Informatics: Seid Muhie Yimam
Dataset: Site is currently under construction – this dataset will be available soon. Thank you for your patience!
Languages: Amharic, Tigrigna, Oromo, Somali, Afar, Sidama
Contact: Solomon Teferra Abate | solomon.teferra@aau.edu.et
Ethio Speech Corpora comprises over 391 hours of recorded audio in six Ethiopian languages: Amharic (68 hours), Tigrigna (62 hours), Oromo (70 hours), Somali (56 hours), Afar (68 hours), and Sidama (68 hours). This project will be a valuable resource for developing well-performing automatic speech recognition (ASR) systems for these six languages (in a monolingual setup) and for other related languages (in a multilingual and/or cross-lingual setup) that are useful in various aspects of daily life.
Use cases of speech recognition systems using this dataset include dictation systems, transcription systems, assistive technologies, spoken dialogue systems, speech translation, and other similar speech technologies. To make the data set representative, the team selected six working languages that are used across regional states of Ethiopia while also maintaining the gender and age balance of readers.
Authors and Affiliations:
- School of Information Science of the Addis Ababa University: Solomon Teferra Abate (PhD), Martha Yifiru Tachbelie (PhD), Michael Melese Woldeyohannes (PhD), Hafte Abera, Bantegize Addis Alemayehu, Wondwossen Mulugeta (PhD)
Website: https://ethiospeech.com/
Dataset: Site is currently under construction – this dataset will be available soon. Thank you for your patience!
Languages: Kidaw’ida, Kalenjin, Dholuo, and Kiswahili
Contact: Audrey Mbogho | ambogho@usiu.ac.ke
This team collected parallel text corpora for three Kenyan indigenous languages, Kidaw’ida, Kalenjin, and Dholuo, alongside Kiswahili, resulting in approximately 90,000 sentence pairs in total. After collection, the team separated out the Kidaw’ida, Kalenjin, and Dholuo sentences and used them as monolingual datasets for crowd-sourcing speech data, facilitated by uploading the sentences to Mozilla Common Voice. A total of 109 members of the three language communities were recruited to read and record sentences from their respective native languages. Emphasizing gender balance and including different ages and regional variants helped to make the datasets more representative. The voice datasets offer a substantial amount of speech data, comprising 56 hours of Kidaw’ida, 92 hours of Kalenjin, and 120 hours of Dholuo, for a total of 268 hours.
Use cases for these parallel corpora include training models to translate text between Kiswahili and Kidaw’ida, Kalenjin, and Dholuo. The speech data on Mozilla Common Voice, along with its associated text data, is intended to be used for the development of speech recognition applications. The languages that comprise this dataset are low-resource, especially Kidaw’ida, which has only around 400,000 speakers and faces a more immediate risk of loss. By collecting the text and speech data, this team contributed to the preservation of these languages. They hope that once enough data has been collected to train accurate models and create NLP applications for these three languages, the languages will become more relevant in the modern digital age, thus mitigating the risk of loss.
Authors and Affiliations:
- USIU-Africa: Audrey Mbogho, Quin Awuor
- Maseno University: Lilian Wanzare, Vivian Oloo
- Andrew Kipkebut (Kabarak University)
- Rose Lugano (University of Florida)
Dataset:
Languages: Emakhuwa, Portuguese
Contact: Felermino D. M. A. Ali | felermino.ali@unilurio.ac.mz or felerminoali@gmail.com
This dataset includes the translation of 1,897 news articles comprising 660,242 words from Portuguese to Emakhuwa, an indigenous language of Mozambique. Each article includes the news headline, content, and label for topic classification. For news topic classification, the articles were divided into three splits: training (1,337 articles), development (185 articles), and testing (375 articles). The articles were then further categorized by topic: politics, economy, culture, sports, health, society, and world news.
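The training, development, and testing sizes above add up to the full set of 1,897 articles. The sketch below shows one way a partition with those sizes could be produced. It is an assumption-laden illustration: the source does not say how the articles were assigned, so the seeded random shuffle, the function name `split_articles`, and the stand-in article IDs are all hypothetical.

```python
import random

TRAIN, DEV, TEST = 1337, 185, 375  # sizes stated in the description

def split_articles(articles, seed=0):
    """Shuffle a list of articles and cut it into train/dev/test parts."""
    if len(articles) != TRAIN + DEV + TEST:
        raise ValueError("expected exactly 1,897 articles")
    shuffled = list(articles)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    return (
        shuffled[:TRAIN],
        shuffled[TRAIN:TRAIN + DEV],
        shuffled[TRAIN + DEV:],
    )

article_ids = list(range(TRAIN + DEV + TEST))  # stand-in IDs, not real data
train, dev, test = split_articles(article_ids)
print(len(train), len(dev), len(test))  # prints "1337 185 375"
```

Anyone working with the released data should use the splits shipped with the dataset rather than re-splitting, so that results stay comparable.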
The intended use cases for this dataset include topic classification, translation, and loanword recognition. To ensure that the dataset was representative, the team translated different categories of news articles and prioritized Mozambique-related news and articles, contributing to lexicon diversity. The datasets have shown promising outcomes when fine-tuning multilingual models like ByT5, M2M100, and NLLB200. This team’s work has already generated improvements in translation quality when using loanword information as additional data. They plan to continue refining models and ensuring high-quality outputs for all use cases.
Authors and Affiliations:
- Felermino Dário Mário António Ali: Lurio University, Faculty of Engineering; Artificial Intelligence and Computer Science Lab (LIACC); Centre of Linguistics (CLUP) of the University of Porto
- Henrique Lopes Cardoso: Faculty of Engineering of the University of Porto (FEUP), Artificial Intelligence and Computer Science Lab (LIACC)
- Rui Sousa Silva: Faculty of Arts and Humanities, Centre of Linguistics (CLUP) of the University of Porto
Dataset: https://huggingface.co/collections/LIACC/makhuwa-nlp-66a93ea22df7f4b31e96a5ab
Papers:
All Lacuna Fund datasets are licensed under the CC-BY 4.0 International license unless otherwise noted.