Language Datasets
Description: This dataset is the first large-scale human-annotated Twitter sentiment dataset for Hausa, Igbo, Nigerian-Pidgin, and Yorùbá, the four most widely spoken languages in Nigeria.
Authors: Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Saheed Salahudeen Abdullahi, Anuoluwapo Aremu, Alipio Jeorge, and Pavel Brazdil
Languages: Hausa, Igbo, Nigerian-Pidgin, and Yorùbá
Dataset: access here
Description: This evaluation dataset supports automatic quality assessment of machine translation systems for Afar, Amharic, Oromo, Somali, and Tigrinya.
Authors: Asmelash Teka Hadgu, Gebrekirstos G. Gebremeskel, Abel Aregawi
Translators: Afar – Mohammed Deresa, Yasin Nur; Amharic – Tigist Taye, Selamawit Hailemariam, Wako Tilahun; Oromo – Gemechis Melkamu, Galata Girmaye; Somali – Abdiselam Mohamed, Beshir Abdi; Tigrinya – Michael Minassie, Berhanu Abadi Weldegiorgis, Nureddin Mohammedshiek
Languages: Afar, Amharic, Oromo, Somali and Tigrinya
Dataset: access here
Description: This project collected text and speech corpora for three languages in Kenya: Kiswahili, Dholuo, and three Luhya dialects (Lumarachi, Logooli, and Lubukusu). Primary data was collected from the respective language communities and includes Indigenous stories and narratives from student compositions, native language media stations, and publishers, so that the genres of text are representative of everyday language use in the communities. A total of 4,442 texts were collected: 2,909 texts for Swahili, 546 for Dholuo, 483 for Lumarachi, 135 for Lubukusu, and 359 for Logooli. A total of 1,152 files of spontaneous speech were collected, amounting to 176 hours, 29 minutes, and 46 seconds: 104 files (19 hours, 10 minutes, 57 seconds) for Swahili, 512 files (99 hours, 3 minutes, 8 seconds) for Dholuo, 138 files (15 hours, 37 minutes, 46 seconds) for Lumarachi, 354 files (30 hours, 11 minutes) for Lubukusu, and 44 annotated files (12 hours, 26 minutes, 55 seconds) for Lulogooli.
Authors: Owen McOnyango (Maseno University), Florence Indede (Maseno University), Lilian D.A. Wanzare (Maseno University), Barack Wanjawa (University of Nairobi), Edward Ombui (Africa Nazarene University), Lawrence Muchemi (University of Nairobi)
Languages: Kiswahili, Dholuo, Luhya-Lubukusu, Luhya-Logooli, Luhya-Lumarachi
Dataset: access here
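The speech totals quoted for the Kenyan text and speech corpora above can be re-derived from the per-language figures. The following is a minimal sketch of that arithmetic; the dictionary and function names are illustrative and not part of the dataset, and the figures are copied from the description.

```python
from datetime import timedelta

# Per-language spontaneous speech figures copied from the description:
# (number of files, hours, minutes, seconds).
SPEECH = {
    "Swahili": (104, 19, 10, 57),
    "Dholuo": (512, 99, 3, 8),
    "Lumarachi": (138, 15, 37, 46),
    "Lubukusu": (354, 30, 11, 0),
    "Lulogooli": (44, 12, 26, 55),
}


def totals(speech):
    """Return (total file count, total duration in whole seconds)."""
    files = sum(n for n, _, _, _ in speech.values())
    duration = sum(
        (timedelta(hours=h, minutes=m, seconds=s) for _, h, m, s in speech.values()),
        timedelta(),
    )
    return files, int(duration.total_seconds())


def format_hms(total_seconds):
    """Format a number of seconds as hours, minutes, and seconds."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours} h {minutes} min {seconds} s"


files, seconds = totals(SPEECH)
print(files, format_hms(seconds))  # 1152 176 h 29 min 46 s
```

The result agrees with the stated totals of 1,152 files and 176 hours, 29 minutes, and 46 seconds.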
Description: This project developed a part-of-speech (POS) tagged dataset for two languages in Kenya: Dholuo and three Luhya dialects (Lumarachi, Lulogooli, and Lubukusu). The project tagged approximately 143,000 words: about 50,000 for Dholuo, 27,900 for Lumarachi, 34,300 for Logooli, and 30,900 for Lubukusu.
Authors: Florence Indede (Maseno University), Owen McOnyango (Maseno University), Lilian D.A. Wanzare (Maseno University), Barack Wanjawa (University of Nairobi), Edward Ombui (Africa Nazarene University), Lawrence Muchemi (University of Nairobi)
Languages: Dholuo, Luhya-Lumarachi, Luhya-Lulogooli, Luhya-Lubukusu
Dataset: access here
Description: This project produced a speech dataset of read and spontaneous speech, recorded in Kenya with native Swahili speakers, together with the corresponding transcripts. In total, the dataset includes 27 hours, 31 minutes, and 50 seconds of speech from 26 speakers (19 female and 7 male). The recordings are in .wav format: 16-bit, 16 kHz, mono, little-endian. Of the total, 26 hours, 32 minutes, and 37 seconds are read speech and 59 minutes, 13 seconds are spontaneous speech. The dataset also includes a phone-list file containing all the Swahili phones as used by KenCorpus. This phone-list file is crucial because it was used to create the KenCorpus Swahili lexicon-phone dictionary, which lists every word in the KenCorpus transcripts with its pronunciation in terms of the Swahili phones in the phone-list. The lexicon-phone dictionary contains about 30,000 words.
Authors: Dorcas Awino (University of Nairobi), Lawrence Muchemi (University of Nairobi), Lilian D.A. Wanzare (Maseno University), Edward Ombui (Africa Nazarene University), Barack Wanjawa (University of Nairobi), Owen McOnyango (Maseno University), Florence Indede (Maseno University)
Language: Swahili
Dataset: access here
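The audio format stated for the Swahili speech dataset above (16-bit, 16 kHz, mono WAV) can be checked with Python's standard `wave` module. This is a minimal sketch, not the project's own tooling: the function names are hypothetical, and the silent files written to a temporary directory only stand in for real recordings. Byte order is not checked separately, since RIFF WAV files store PCM samples little-endian.

```python
import os
import tempfile
import wave

# Format stated in the dataset description: 16-bit, 16 kHz, mono.
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_SAMPLE_RATE_HZ = 16_000
EXPECTED_CHANNELS = 1  # mono


def matches_stated_format(path):
    """Return True if the WAV file at `path` has the stated parameters."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getframerate() == EXPECTED_SAMPLE_RATE_HZ
            and wav.getnchannels() == EXPECTED_CHANNELS
        )


def duration_seconds(path):
    """Return the duration of the WAV file at `path` in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_silence(path, seconds=1, rate=16_000, channels=1):
    """Write a silent 16-bit WAV file (a stand-in for a real recording)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds * channels)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good)            # 16 kHz, matches the stated format
    write_silence(bad, rate=8_000)  # 8 kHz, does not match
    good_ok = matches_stated_format(good)
    bad_ok = matches_stated_format(bad)
    good_duration = duration_seconds(good)

print(good_ok, bad_ok, good_duration)  # True False 1.0
```

Summing `duration_seconds` over a directory of recordings would give a way to compare against the stated total of 27 hours, 31 minutes, and 50 seconds.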
Description: This project produced a parallel corpus between Swahili and two other Kenyan languages: Dholuo and three Luhya dialects (Lumarachi, Logooli, and Lubukusu). About 12,400 sentences were translated into Kiswahili from a sample of Dholuo and Luhya texts (1,500 Dholuo-Kiswahili sentence pairs and 10,900 Luhya-Kiswahili sentence pairs).
Authors: Lilian D.A. Wanzare (Maseno University), Florence Indede (Maseno University), Owen McOnyango (Maseno University), Edward Ombui (Africa Nazarene University), Barack Wanjawa (University of Nairobi), Lawrence Muchemi (University of Nairobi)
Languages: Dholuo, Luhya-Lumarachi, Luhya-Lubukusu, Luhya-Lulogooli
Dataset: access here
Description: This project produced a large machine reading comprehension dataset for the Kiswahili language. A total of 7,526 question-answer (QA) pairs were developed from 1,445 Swahili story texts. Each text has at least 5 QA pairs; the questions are based on the story, and each answer is either a single word or a short text.
Authors: Barack Wanjawa (University of Nairobi), Lilian D.A. Wanzare (Maseno University), Florence Indede (Maseno University), Owen McOnyango (Maseno University), Lawrence Muchemi (University of Nairobi), Edward Ombui (Africa Nazarene University)
Language: Swahili
Dataset: access here
Description: MasakhaNER 2.0 is the largest human-annotated named entity recognition dataset for 20 African languages. Each language has between 4,800 and 11,000 parallel sentences for training and/or evaluation. The languages covered span West, Central, East, and Southern Africa, and include Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu. More information about the data can be found in their EMNLP paper here.
Contact: David Ifeoluwa Adelani, D.ADELANI@UCL.AC.UK
Authors: available here.
Languages: Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu
Dataset: access here.
Description: The MAFAND-MT dataset comprises a few thousand high-quality, human-translated parallel sentences in the news domain for each of 16 African languages. Each language has between 1,466 and 7,838 parallel sentences for training and/or evaluation. The languages covered span West, Central, East, and Southern Africa, and include Bambara, Ghomala, Ewe, Fon, Hausa, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Twi, Wolof, and isiXhosa. Further details on this dataset can be found in the team’s NAACL 2022 paper: https://arxiv.org/abs/2205.02022
Contact: David Ifeoluwa Adelani, D.ADELANI@UCL.AC.UK
Authors: available here.
Languages: Bambara, Ghomala, Ewe, Fon, Hausa, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Twi, Wolof, and isiXhosa
Dataset: access here.
Description: MasakhaPOS is the largest human-annotated part-of-speech tagging dataset for 20 African languages. Each language has between 1,200 and 1,500 sentences for training and/or evaluation. The languages covered span West, Central, East, and Southern Africa, and include Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu.
Contact: David Ifeoluwa Adelani, D.ADELANI@UCL.AC.UK
Authors: available here.
Languages: Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu
Dataset: access here.
Description: This speech dataset for the Ghanaian languages Akan (Akuapem Twi, Asante Twi, Fante) and Ga includes 104,000 spoken utterances across the four dialects/languages, with approximately 200 speakers per dialect/language. This amounts to about 148 hours of speech in total. The dataset was developed first to support financial applications in native Ghanaian languages, so that illiterate and semi-literate people can fully benefit from digital financial services. Second, it aims to answer research questions about domain-specific versus general-purpose dataset development, dialects, and NLP system development in low-resource settings.
Contact: Dennis Asamoah Owusu, DOWUSU@ASHESI.EDU.GH
Authors: available here.
Languages: Akan (Akuapem Twi, Asante Twi, Fante) and Ga
Dataset: access here.
Description: This dataset is the first spoken corpus of labelled and unlabelled data for Igbo natural language processing (NLP) tasks. It consists of approximately 40 hours of naturally occurring Igbo speech that is representative of all the dialects of Igbo. The dataset lays the foundation for Igbo NLP tasks such as machine translation, treebanks, speech-to-text, automatic POS tagging, digital dictionaries, and automatic spell checking.
Contacts: Gerald Nweya (GERALDNWEYA@GMAIL.COM) and Emeka Onwuegbuzia (EONWUEGBUZIA@GMAIL.COM)
Authors:
Language: Igbo
Dataset: access here.
Description: The Bayelemabaga dataset consists of 46,976 parallel machine translation-ready Bambara-French sentence pairs, originating from the Bambara Reference Corpus from INALCO’s LLACAN Lab. The text in the dataset is extracted from 264 text files, ranging from periodicals, books, short stories, blog posts, to parts of the Bible and the Quran.
Contact: Christopher Homan, christopher.m.homan.phd@gmail.com
Authors: Allahsera Auguste Tapo, Michael Leventhal, Valentin Vydrin, Sebastian Diarra, Marcos Zampieri, Emily Prud’Hommeaux, Jean Jacque Méric,
Languages: Bambara, French
Dataset: access here.
Description: Makerere University has created text and speech datasets for low-resourced East African languages in Uganda, Tanzania, and Kenya. This dataset contains 10,000 parallel sentiment-tagged sentences, 100,000 Kiswahili sentences, 100,000 Luganda sentences, 40,037 Acoli sentences, and 39,999 Lumasaaba sentences. On Common Voice, the text dataset comprises 100,000 Luganda sentences and 100,000 Swahili sentences. The text datasets can be used for building machine translation, next-word prediction and auto-completion, topic modeling and classification, sentiment analysis, and language models. The Luganda and Swahili voice datasets can empower entrepreneurs to address existing gaps in their communities by building systems for visually impaired or physically handicapped people, native language tutors, medical transcription tools, and more. Application developers interested in translation engines, text editors, and spelling and grammar checking systems in the East African community will benefit from the datasets.
Contact: Andrew Katumba | andrew.katumba@mak.ac.ug
Datasets:
- Text data: access here.
- Luganda voice data: access here.
- Swahili voice data: access here.
Authors and Affiliations:
- Makerere University: Katumba Andrew, Nakatumba-Nabende Joyce, Babirye Claire, Mukiibi Jonathan, Tusubira Jeremy, Bateesa Tobias, Wairagala Eric Peter, Fridah Katushemererwe, Mutebi Chodrine, Nabende Peter, Sentanda Medadi, Ssenkungu Ivan
- Wanzare Lilian (Maseno University)
- Davis David (TYD Innovation Incubator)
- Okidi George
- Ayugi Carolyne
- Muzaki Naomi
Contact: Claytone Sikasote | claytonsikasote@gmail.com
The BIG-C (Bemba Image Grounded Conversations) dataset consists of multi-turn dialogues between Bemba speakers grounded in images, transcribed and translated into English. Specifically, there are over 92,000 sentences, amounting to over 180 hours of speech data with corresponding Bemba transcriptions and English translations.
Authors and Affiliations:
- Claytone Sikasote, University of Zambia, Zambia
- Eunice Mukonde-Mulenga, University of Zambia, Zambia
- Md Mahfuz Ibn Alam, George Mason University, USA
- Antonios Anastasopoulos, George Mason University, USA
Dataset: https://github.com/csikasote/bigc
Publication: https://aclanthology.org/2023.acl-long.115
Contact: Aminata Ndiaye | amina.ndiaye@jokalante.com and Elodie Gauthier | elodie.gauthier@orange.com
This dataset will strengthen natural language processing resources for Wolof, Pulaar, and Serer, the three most widely spoken languages in Senegal. Its repository of transcribed speech includes over 55 hours (12 files) in Wolof, 38 hours (105 files) in Serer, and 31 hours (83 files) in Pulaar. The repository also includes over 12 hours of verified recordings in each language, and textual data containing over 947,000 words in Wolof and 593,000 in Pulaar. It also includes a pronunciation lexicon of over 54,000 phonetized entries in Wolof.
Authors and Affiliations:
- Project Leader: Aminata Ndiaye Diallo (Jokalante, Dakar, Senegal)
- Stakeholders: Elodie Gauthier (Orange Innovation, Lannion, France), Abdoulaye Guissé (Ecole Polytechnique de Thiès, Senegal)
- Intern: Boubacar Diallo (Assane Seck University, Ziguinchor, Senegal) – Collection of textual dataset
- Trainees: Maimouna Diallo (Cheikh Anta Diop University, Dakar, Senegal) – Wolof transcription, Houleye Amadou Kane (Cheikh Anta Diop University, Dakar, Senegal) – Pulaar transcription, Fatou Diouf (Cheikh Anta Diop University, Dakar, Senegal) – Serer transcription
Dataset:
All Lacuna Fund datasets are licensed under the CC-BY 4.0 International license unless otherwise noted.