
Announcing New Datasets for African Languages — 2020 Natural Language Processing (NLP) Awardees 

20 December 2022

Upcoming Calls for Proposals 

In 2023, Lacuna Fund will issue two new calls for proposals to build more equitable and accessible machine learning datasets. We will invite proposals to develop datasets in two domains:

  • Sexual and Reproductive Health and Rights  
  • Climate and Forests 

Look for details in the new year.


Announcing New Datasets for African Languages 

2020 Natural Language Processing (NLP) Awardees 

We are excited to announce our recently published datasets in the language domain! These datasets will foster equal opportunities, inclusivity, participation in decision-making, and accessibility. Together, they span more than 22 African languages, such as Bambara, Dholuo, Fon, Akan, and Wolof. We thank these teams for their work to create these inclusive, open data resources, which will make artificial intelligence resources more readily accessible on the African continent.

The Masakhane team and affiliates have created three datasets for multiple African languages, focused on named entity recognition, machine translation, and part-of-speech tagging.

  • MasakhaNER 2.0: Named Entity Recognition datasets for 20 African languages  
  • MAFAND-MT: Masakhane Anglo & Franco African News Corpus for Machine Translation 
  • MasakhaPOS: Part-of-Speech Tagging Dataset for 20 African Languages 

Ashesi University and Nokwary Technologies have created a financial inclusion speech dataset for the Ghanaian languages Akan (Akuapem Twi, Asante Twi, Fante) and Ga.

  • Financial Inclusion Speech Dataset for some Ghanaian Languages 

University of Ibadan and Afe-Babalola University have created the first spoken corpus of labelled and unlabeled datasets for Igbo Natural Language Processing (NLP) tasks.  

  • IgboSynCorp: Dataset for Igbo Natural Language Processing Tasks  

We are also grateful to our co-founders, whose support made these datasets possible: The Rockefeller Foundation, Google.org, Canada’s International Development Research Centre, and GIZ’s FAIR Forward programme on behalf of the German Federal Ministry for Economic Cooperation and Development (BMZ).

See below for links to these datasets and information about the teams that created them and potential use cases. 

Named Entity Recognition and Parts of Speech datasets for African languages  

CONTACT: DAVID IFEOLUWA ADELANI, D.ADELANI@UCL.AC.UK 

MasakhaNER 2.0: Named Entity Recognition datasets for 20 African languages 

MasakhaNER 2.0 is the largest human-annotated named entity recognition dataset for 20 African languages. Each language has between 4,800 and 11,000 annotated sentences for training and/or evaluation. The languages covered span West, Central, East and Southern Africa, and include Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu. More information about the data can be found in the team’s EMNLP paper.

AFFILIATIONS AND AUTHORS:  

Masakhane 

David Ifeoluwa Adelani | Michael Beukman | Shamsuddeen H. Muhammad | Peter Nabende | Bonaventure F. P. Dossou | Blessing Sibanda | Happy Buzaaba | Jonathan Mukiibi | Godson Kalipe | Derguene Mbaye | Fatoumata Kabore | Chris Chinenye Emezue | Anuoluwapo Aremu | Perez Ogayo | Catherine Gitau | Edwin Munkoh-Buabeng | Victoire M. Koagne | Allahsera Auguste Tapo | Tebogo Macucwa | Vukosi Marivate | Elvis Mboning | Tajuddeen Gwadabe | Tosin Adewumi | Orevaoghene Ahia | Joyce Nakatumba-Nabende | Neo L. Mokono | Ignatius Ezeani | Chiamaka Chukwuneke | Mofetoluwa Adeyemi | Gilles Q. Hacheme | Idris Abdulmumin | Odunayo Ogundepo | Oreen Yousuf | Tatiana Moteu Ngoli

Saarland University, Germany 

David Ifeoluwa Adelani | Jesujoba O. Alabi | Dietrich Klakow  

CMU, United States  

Graham Neubig | Shruti Rijhwani | Perez Ogayo  

Google Research  

Sebastian Ruder  

University of Witwatersrand, South Africa  

Michael Beukman  

Brandeis University, United States  

Chester Palen-Michel | Constantine Lignos  

LIAAD-INESC TEC, Portugal  

Shamsuddeen H. Muhammad  

Makerere University  

Peter Nabende | Jonathan Mukiibi | Joyce Nakatumba-Nabende  

University of Bergen, Norway  

Cheikh M. Bamba Dione  

SaDiLaR  

Andiswa Bukula | Rooweither Mabuya  

MILA, Canada 

Bonaventure F. P. Dossou  

RIKEN, Japan  

Happy Buzaaba 

Baamtu, Senegal  

Derguene Mbaye  

Malawi University of Business and Applied Science  

Amelia Taylor  

Uppsala University, Sweden  

Fatoumata Kabore  

Technical University of Munich, Germany  

Chris Chinenye Emezue  

TU Clausthal, Germany  

Edwin Munkoh-Buabeng  

RIT, United States  

Allahsera Auguste Tapo  

University of Pretoria, South Africa  

Tebogo Macucwa | Vukosi Marivate | Neo L. Mokono  

Luleå University of Technology, Sweden  

Tosin Adewumi  

University of Washington, United States  

Orevaoghene Ahia  

Lancaster University, UK  

Ignatius Ezeani | Chiamaka Chukwuneke  

University of Waterloo, Canada  

Mofetoluwa Adeyemi | Odunayo Ogundepo  

Ahmadu Bello University, Nigeria  

Idris Abdulmumin 

 

MAFAND-MT: Masakhane Anglo & Franco African News Corpus for Machine Translation 

The MAFAND-MT dataset consists of a few thousand high-quality, human-translated parallel sentences in the news domain for each of 16 African languages. Each language has between 1,466 and 7,838 parallel sentences for training and/or evaluation. The languages covered span West, Central, East and Southern Africa, and include Bambara, Ghomala, Ewe, Fon, Hausa, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Twi, Wolof, and isiXhosa. Further details on this dataset can be found in the team’s NAACL 2022 paper: https://arxiv.org/abs/2205.02022

AFFILIATIONS AND AUTHORS:  

Masakhane 

David Ifeoluwa Adelani | Jesujoba O. Alabi | Michael Beukman | Shamsuddeen H. Muhammad | Peter Nabende | Bonaventure F. P. Dossou | Blessing Sibanda | Happy Buzaaba | Jonathan Mukiibi | Godson Kalipe | Derguene Mbaye | Fatoumata Ouoba Kabore | Chris Chinenye Emezue | Anuoluwapo Aremu | Perez Ogayo | Edwin Munkoh-Buabeng | Victoire Memdjokam Koagne | Allahsera Auguste Tapo | Tajuddeen Gwadabe | Gilles Q. Hacheme | Idris Abdulmumin | Oreen Yousuf | Freshia Sackey | Colin Leong | Guyo Jarso | Andre Niyongabo Rubungo | Eric Peter Wairagala | Muhammad Umair Nasir | Benjamin Ajibade | Tunde Ajayi | Yvonne Gitau | Jade Abbott | Mohamed Ahmed | Millicent Ochieng | Valencia Wagner | Ayodele Awokoya

Inria 

Jesujoba O. Alabi  

Meta AI  

Angela Fan  

Amazon Alexa AI   

Xiaoyu Shen  

The University of Tokyo  

Machel Reid  

Jacobs University  

Bonaventure F. P. Dossou  

Saarland University, Germany 

David Ifeoluwa Adelani | Dietrich Klakow | Dana Ruiter | Ernie Chang

Google Research  

Julia Kreutzer  

Makerere University  

Peter Nabende | Jonathan Mukiibi | Eric Peter Wairagala  

Technical University of Munich, Germany  

Chris Chinenye Emezue  

University of Dayton

Colin Leong  

University of Witwatersrand, South Africa  

Michael Beukman

Universitat Politècnica de Catalunya  

Andre Niyongabo Rubungo  

Microsoft  

Mohamed Ahmed | Millicent Ochieng  

CMU, United States  

Perez Ogayo  

Uppsala University, Sweden  

Fatoumata Ouoba Kabore  

Baamtu, Senegal  

Derguene Mbaye  

Rochester Institute of Technology, United States  

Allahsera Auguste Tapo  

Ahmadu Bello University, Nigeria  

Idris Abdulmumin  

University of Ibadan  

Ayodele Awokoya  

University of Malawi  

Sam Manthalu  

LIAAD-INESC TEC, Portugal  

Shamsuddeen H. Muhammad  

RIKEN, Japan  

Happy Buzaaba  

SaDiLaR  

Andiswa Bukula 

 

MasakhaPOS: Part-of-Speech Tagging Dataset for 20 African Languages 

MasakhaPOS is the largest human-annotated part-of-speech tagging dataset for 20 African languages. Each language has between 1,200 and 1,500 sentences for training and/or evaluation. The languages covered span West, Central, East and Southern Africa, and include Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu.

AFFILIATIONS AND AUTHORS:  

Masakhane 

David Ifeoluwa Adelani | Shamsuddeen H. Muhammad | Peter Nabende | Bonaventure F. P. Dossou | Blessing Sibanda | Happy Buzaaba | Jonathan Mukiibi | Godson Kalipe | Derguene Mbaye | Fatoumata Kabore | Chris Chinenye Emezue | Anuoluwapo Aremu | Perez Ogayo | Catherine Gitau | Edwin Munkoh-Buabeng | Victoire M. Koagne | Allahsera Auguste Tapo | Tebogo Macucwa | Vukosi Marivate | Elvis Mboning | Tajuddeen Gwadabe | Cheikh M. Bamba Dione  

Saarland University, Germany 

David Ifeoluwa Adelani | Jesujoba O. Alabi | Dietrich Klakow  

CMU, United States  

Perez Ogayo 

LIAAD-INESC TEC, Portugal  

Shamsuddeen H. Muhammad  

Makerere University  

Peter Nabende | Jonathan Mukiibi

University of Bergen, Norway  

Cheikh M. Bamba Dione 

SaDiLaR  

Andiswa Bukula | Rooweither Mabuya  

MILA, Canada  

Bonaventure F. P. Dossou

RIKEN, Japan  

Happy Buzaaba

Baamtu, Senegal  

Derguene Mbaye  

Malawi University of Business and Applied Science  

Amelia Taylor  

Uppsala University, Sweden  

Fatoumata Kabore  

Technical University of Munich, Germany  

Chris Chinenye Emezue  

TU Clausthal, Germany  

Edwin Munkoh-Buabeng  

RIT, United States  

Allahsera Auguste Tapo  

University of Pretoria, South Africa  

Tebogo Macucwa | Vukosi Marivate   

University of Buea, Cameroon  

Gratien Atindogbe  

 

Financial Inclusion Speech Dataset for some Ghanaian Languages 

CONTACT: DENNIS ASAMOAH OWUSU (DOWUSU@ASHESI.EDU.GH)  

This speech dataset for the Ghanaian languages Akan (Akuapem Twi, Asante Twi, Fante) and Ga includes 104,000 utterances across the four dialects/languages, with approximately 200 speakers per dialect/language. This amounts to about 148 hours of speech in total. The dataset was developed with two aims. First, it supports the development of financial applications in native Ghanaian languages, so that illiterate and semi-literate people can fully benefit from digital financial services. Second, it aims to help answer research questions about domain-specific versus general-purpose dataset development, dialects, and NLP system development in low-resource settings.
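For a rough sense of scale, the totals quoted above can be turned into a few derived figures. The following is only a back-of-the-envelope sketch: the function name `summarize` is made up for illustration, and it assumes the utterances, hours, and speakers are spread evenly across the four dialects/languages, which the announcement does not state.

```python
# Totals as quoted in the announcement (all approximate).
UTTERANCES = 104_000        # total utterances
HOURS = 148                 # total hours of speech
VARIETIES = 4               # Akuapem Twi, Asante Twi, Fante, Ga
SPEAKERS_PER_VARIETY = 200  # approximate speakers per dialect/language


def summarize(utterances, hours, varieties, speakers_per_variety):
    """Derive rough per-utterance, per-variety, and per-speaker figures.

    Assumes an even split across varieties and speakers (an assumption
    made here for illustration, not a fact from the dataset release).
    """
    total_speakers = varieties * speakers_per_variety
    return {
        "avg_utterance_seconds": hours * 3600 / utterances,
        "utterances_per_variety": utterances / varieties,
        "utterances_per_speaker": utterances / total_speakers,
        "hours_per_variety": hours / varieties,
    }


stats = summarize(UTTERANCES, HOURS, VARIETIES, SPEAKERS_PER_VARIETY)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions, the average utterance is a little over five seconds long, and each speaker contributes on the order of 130 utterances. The actual distribution may differ, so the team's release notes linked below remain the authoritative source.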

Read more about the team’s dataset and approach here:  
https://ashesi-org.github.io/dataset/nlp/ai/ghana/africa/speech/2022/05/16/release-of-financial-inclusion-dataset-ghanaian-languages.html

AFFILIATIONS AND AUTHORS: 

Ashesi University

Dennis Asamoah Owusu  

Ayorkor Korsah  

David Sampah  

David Adjepon-Yamoah  

Stephane Nwolley Jnr.  

Nokwary Technologies 

Dennis Asamoah Owusu  

Benedict Quartey  

David Sampah  

Lily Omane Boateng  

 

 

IgboSynCorp: Dataset for Igbo Natural Language Processing Tasks 

CONTACTS: GERALD NWEYA (GERALDNWEYA@GMAIL.COM) AND EMEKA ONWUEGBUZIA (EONWUEGBUZIA@GMAIL.COM)  

This dataset is the first spoken corpus of labelled and unlabelled datasets for Igbo Natural Language Processing (NLP) tasks. It consists of approximately 40 hours of naturally occurring Igbo speech that is representative of all the dialects of Igbo. The dataset lays the foundation for Igbo NLP tasks and resources such as machine translation, treebanks, speech-to-text, automatic POS tagging, digital dictionaries, and automatic spell checking.

AFFILIATIONS AND AUTHORS: 

University of Ibadan, Ibadan, Nigeria 

Gerald Okey Nweya 

Amarachi Akudo Osuagwu 

Emeka Felix Onwuegbuzia

Samuel Obinna Ejinwa 

Anita Ifeoma Adiboshi 

Daniel Success Nwokwo 

Peter Ugochukwu Ihunna 

Afe-Babalola University, Ado-Ekiti, Nigeria 

Oluwole Solomon Akinola 

 

Learn more about these and other published Lacuna-funded datasets on our Datasets page! 

We share datasets on a quarterly basis on our website and social media platforms. Subscribe to the Lacuna Fund newsletter below and follow us on social media to stay updated on these announcements. 

Meridian Institute serves as Secretariat for the Lacuna Fund.