Language Domain

Lacuna Fund language datasets create openly accessible text and speech resources that fuel natural language processing technologies in diverse languages across low- and middle-income contexts globally. Explore and download released datasets below.

2020 Awards

Description: This dataset is the first large-scale human-annotated Twitter sentiment dataset for Hausa, Igbo, Nigerian-Pidgin, and Yorùbá, the four most widely spoken languages in Nigeria. 

Authors: Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Saheed Salahudeen Abdullahi, Anuoluwapo Aremu, Alipio Jeorge, and Pavel Brazdil

Languages: Hausa, Igbo, Nigerian-Pidgin, and Yorùbá

Dataset: access here

Description: This evaluation dataset supports automatic assessment of the quality of machine translation systems for Afar, Amharic, Oromo, Somali, and Tigrinya.

Authors: Asmelash Teka Hadgu, Gebrekirstos G. Gebremeskel, Abel Aregawi

Translators: Afar – Mohammed Deresa, Yasin Nur; Amharic – Tigist Taye, Selamawit Hailemariam, Wako Tilahun; Oromo – Gemechis Melkamu, Galata Girmaye; Somali – Abdiselam Mohamed, Beshir Abdi; Tigrinya – Michael Minassie, Berhanu Abadi Weldegiorgis, Nureddin Mohammedshiek

Languages: Afar, Amharic, Oromo, Somali and Tigrinya

Dataset: access here

Description: This project collected text and speech corpora for three languages in Kenya: Kiswahili, Dholuo, and Luhya (three dialects: Lumarachi, Logooli, and Lubukusu). Primary data were collected from the respective language communities and included Indigenous stories and narratives from student compositions, native-language media stations, and publishers, so that the corpora cover genres representative of everyday language use in the communities. A total of 4,442 texts were collected: 2,909 for Swahili, 546 for Dholuo, 483 for Lumarachi, 135 for Lubukusu, and 359 for Logooli. A total of 1,152 files of spontaneous speech were collected, amounting to 176 hours, 29 minutes, and 46 seconds: 104 files (19 hours, 10 minutes, 57 seconds) for Swahili, 512 files (99 hours, 3 minutes, 8 seconds) for Dholuo, 138 files (15 hours, 37 minutes, 46 seconds) for Lumarachi, 354 files (30 hours, 11 minutes) for Lubukusu, and 44 annotated files (12 hours, 26 minutes, 55 seconds) for Logooli.

Authors: Owen McOnyango (Maseno University), Florence Indede (Maseno University), Lilian D.A. Wanzare (Maseno University), Barack Wanjawa (University of Nairobi), Edward Ombui (Africa Nazarene University), Lawrence Muchemi (University of Nairobi)

Languages: Kiswahili, Dholuo, Luhya-Lubukusu, Luhya-Logooli, Luhya-Lumarachi

Dataset: access here

Description: This project developed a part-of-speech (POS) tagged dataset for two languages in Kenya: Dholuo and Luhya (three dialects: Lumarachi, Lulogooli, and Lubukusu). The project tagged approximately 143,000 words: about 50,000 for Dholuo, 27,900 for Lumarachi, 34,300 for Lulogooli, and 30,900 for Lubukusu.

Authors: Florence Indede (Maseno University), Owen McOnyango (Maseno University), Lilian D.A. Wanzare (Maseno University), Barack Wanjawa (University of Nairobi), Edward Ombui (Africa Nazarene University), Lawrence Muchemi (University of Nairobi)

Languages: Dholuo, Luhya-Lumarachi, Luhya-Lulogooli, Luhya-Lubukusu

Dataset: access here

Description: This project produced a speech dataset of read and spontaneous speech, recorded in Kenya with native Swahili speakers, together with the corresponding transcripts. In total, the dataset includes 27 hours, 31 minutes, and 50 seconds of speech from 26 speakers (19 female and 7 male). The recordings are .wav files: 16-bit, 16 kHz, mono, little-endian. Read speech accounts for 26 hours, 32 minutes, and 37 seconds of the recordings, and spontaneous speech for the remaining 59 minutes and 13 seconds. The dataset also includes a phone-list file containing all the Swahili phones used by KenCorpus. This file matters because it was used to create the KenCorpus Swahili lexicon-phone dictionary, which maps every word in the KenCorpus transcripts to its pronunciation in terms of the phones in the phone list. The lexicon-phone dictionary contains about 30,000 words.

Authors: Dorcas Awino (University of Nairobi), Lawrence Muchemi (University of Nairobi), Lilian D.A. Wanzare (Maseno University), Edward Ombui (Africa Nazarene University), Barack Wanjawa (Maseno University), Owen McOnyango (Maseno University), Florence Indede (Maseno University)

Language: Swahili

Dataset: access here
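
The audio format stated for the Swahili speech dataset above (.wav, 16-bit, 16 kHz, mono) can be checked per file before training or evaluation. The following is a minimal sketch using Python's standard-library wave module; the function names, the constants, and the example.wav file are illustrative assumptions and are not part of the released dataset.

```python
import os
import tempfile
import wave

# Expected parameters, taken from the dataset description.
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16 bits per sample
EXPECTED_SAMPLE_RATE_HZ = 16000   # 16 kHz
EXPECTED_CHANNELS = 1             # mono
# Note: the wave module reads RIFF WAV files, which store PCM samples
# little-endian, so byte order is not checked separately here.


def check_wav_format(path):
    """Return (matches, duration_seconds) for a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav_file:
        matches = (
            wav_file.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav_file.getframerate() == EXPECTED_SAMPLE_RATE_HZ
            and wav_file.getnchannels() == EXPECTED_CHANNELS
        )
        duration_seconds = wav_file.getnframes() / wav_file.getframerate()
    return matches, duration_seconds


def write_silent_wav(path, seconds=1):
    """Write a silent WAV file in the expected format (demo helper only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(EXPECTED_CHANNELS)
        wav_file.setsampwidth(EXPECTED_SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(EXPECTED_SAMPLE_RATE_HZ)
        # Two zero bytes per frame: one 16-bit mono sample of silence.
        wav_file.writeframes(b"\x00\x00" * (EXPECTED_SAMPLE_RATE_HZ * seconds))


def demo():
    """Create a stand-in file and check it; real use would pass a dataset file path."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "example.wav")
        write_silent_wav(path)
        return check_wav_format(path)
```

In practice, check_wav_format would be called on each downloaded recording, and the returned durations could be summed to compare against the totals reported in the description.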

Description: This project produced a parallel corpus between Swahili and two other Kenyan languages: Dholuo and Luhya (three dialects: Lumarachi, Logooli, and Lubukusu). About 12,400 sentences from a sample of Dholuo and Luhya texts were translated into Kiswahili (1,500 Dholuo-Kiswahili sentence pairs and 10,900 Luhya-Kiswahili sentence pairs).

Authors: Lilian D.A. Wanzare (Maseno University), Florence Indede (Maseno University), Owen McOnyango (Maseno University), Edward Ombui (Africa Nazarene University), Barack Wanjawa (University of Nairobi), Lawrence Muchemi (University of Nairobi)

Languages: Dholuo, Luhya-Lumarachi, Luhya-Lubukusu, Luhya-Lulogooli

Dataset: access here

Description: This project produced a large machine reading comprehension dataset for the Kiswahili language. A total of 7,526 question-answer (QA) pairs were developed from 1,445 Swahili story texts. Each text has at least 5 QA pairs; the questions were written based on the story, and each answer is either a single word or a short text.

Authors: Barack Wanjawa (University of Nairobi), Lilian D.A. Wanzare (Maseno University), Florence Indede (Maseno University), Owen McOnyango (Maseno University), Lawrence Muchemi (University of Nairobi), Edward Ombui (Africa Nazarene University)

Language: Swahili

Dataset: access here

All Lacuna Fund datasets are licensed under the CC-BY 4.0 International license unless otherwise noted.