Sharing New Lacuna-Funded Text and Speech Data Resources for Selected Languages in Kenya

3 June 2022

We are excited to share recently published Lacuna-funded datasets in the language domain! The KenCorpus team, a collaborative of researchers founded by Maseno University, the University of Nairobi, and Africa Nazarene University, has developed rich textual and speech data resources for selected languages spoken in Kenya. These datasets will foster equal opportunities, inclusivity, participation in decision-making, and accessibility. Learn about the new resources below!

  • Kencorpus: Kenyan Languages Corpus for Machine Learning and Natural Language Processing  |  This project collected text and speech corpora for three languages in Kenya: Kiswahili, Dholuo, and Luhya (in three dialects: Lumarachi, Lulogooli, and Lubukusu). The team collected primary data from the respective language communities, including Indigenous stories and narratives from student compositions, native-language media stations, and publishers, so that the texts represent the genres of everyday language use in those communities. 4,442 total texts were collected: 2,909 for Swahili, 546 texts for Dholuo, 483 texts for Lumarachi, 135 texts for Lubukusu, and 359 texts for Lulogooli. 1,152 files containing spontaneous speech data were collected, totaling over 176 hours across languages.

Languages: Kiswahili, Dholuo, Luhya-Lubukusu, Luhya-Lulogooli, Luhya-Lumarachi

  • KenPos: Kenyan Languages Part of Speech Tagged dataset  |  This project developed a Part of Speech tagged dataset for two languages in Kenya: Dholuo and Luhya (in three dialects: Lumarachi, Lulogooli, and Lubukusu). The project tagged approximately 143,000 words: about 50,000 words for Dholuo, 27,900 words for Lumarachi, 34,300 words for Lulogooli, and 30,900 words for Lubukusu.

Languages: Dholuo, Luhya-Lumarachi, Luhya-Lulogooli, Luhya-Lubukusu

  • KenSpeech: Swahili Speech Transcriptions  |  This project produced a speech dataset that includes both read and spontaneous speech recordings, recorded in Kenya with native Swahili speakers, and corresponding transcripts. The dataset includes over 27 hours of speech data from 26 speakers. Additionally, this dataset includes a file containing all the Swahili phones (speech sounds) as used by KenCorpus. This phone-list file is crucial, as its contents have been used to create the KenCorpus Swahili lexicon-phone dictionary, which contains all the words in the KenCorpus transcripts with their corresponding pronunciations as per the Swahili phones in the phone-list. The lexicon-phone dictionary contains approximately 30,000 words.

Language: Swahili

  • KenSwQuAD – A Question Answering Dataset for Swahili Low Resource Language  |  This project produced a large Machine Reading Comprehension dataset for the Kiswahili language. A total of 7,526 Question-Answer (QA) pairs were developed based on 1,445 Swahili story texts. Each text has at least 5 QA pairs; the questions are written based on the story, and each answer is either a single word or a short text.

Language: Swahili

We thank the KenCorpus team for their work to create these open, accessible resources. We are also grateful to our co-founders, whose support made these datasets possible: The Rockefeller Foundation, Google.org, Canada’s International Development Research Centre, and GIZ on behalf of the German Federal Ministry for Economic Cooperation and Development.

Learn more about these and other published Lacuna-funded datasets on our Datasets page!

We share released datasets on a quarterly basis on our website and social media platforms. Subscribe to the Lacuna Fund newsletter below and follow us on social media to stay updated on these announcements.

Meridian Institute serves as Secretariat for the Lacuna Fund.