Announcing New Datasets for African Languages — 2020 Natural Language Processing (NLP) Awardees

20 December 2022

Upcoming Calls for Proposals

In 2023, Lacuna Fund will issue two new calls for proposals to build more equitable and accessible machine learning datasets. We will invite proposals to develop datasets in two domains:

Sexual and Reproductive Health and Rights
Climate and Forests

Look for details in the new year.

Announcing New Datasets for African Languages

2020 Natural Language Processing (NLP) Awardees

We are excited to announce our recently published datasets in the language domain! These datasets will foster equal opportunities, inclusivity, participation in decision-making, and accessibility. Together, they span more than 22 African languages, including Bambara, Dholuo, Fon, Akan, and Wolof. We thank these teams for their work to create these inclusive, open data resources, which will make artificial intelligence resources more readily accessible and available on the African continent.

The Masakhane team and affiliates have created three datasets for multiple African languages, focused on named entity recognition, machine translation, and part-of-speech tagging.

MasakhaNER 2.0: Named Entity Recognition datasets for 20 African languages

MAFAND-MT: Masakhane Anglo & Franco African News Corpus for Machine Translation
MasakhaPOS: Part-of-Speech Tagging Dataset for 20 African Languages

Ashesi University and Nokwary Technologies have created a financial inclusion speech dataset for the Ghanaian languages Akan (Akuapem Twi, Asante Twi, Fante) and Ga.

Financial Inclusion Speech Dataset for some Ghanaian Languages

University of Ibadan and Afe-Babalola University have created the first spoken corpus of labelled and unlabelled datasets for Igbo Natural Language Processing (NLP) tasks.

IgboSynCorp: Dataset for Igbo Natural Language Processing Tasks

We are also grateful to our co-founders, whose support made these datasets possible: The Rockefeller Foundation, Google.org, Canada’s International Development Research Centre, and GIZ’s FAIR Forward programme on behalf of the German Federal Ministry for Economic Cooperation and Development (BMZ).

See below for links to these datasets and information about the teams that created them and potential use cases.

Named Entity Recognition and Parts of Speech datasets for African languages 

CONTACT: DAVID IFEOLUWA ADELANI, D.ADELANI@UCL.AC.UK 

MasakhaNER 2.0: Named Entity Recognition datasets for 20 African languages 

MasakhaNER 2.0 is the largest human-annotated named entity recognition dataset for 20 African languages. Each language has between 4,800 and 11,000 sentences for training and/or evaluation. The languages covered span West, Central, East and Southern Africa, and include Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu. More information about the data can be found in the team’s EMNLP paper.
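As a rough illustration of how named entity annotations of this kind are often consumed, here is a minimal sketch that assumes a token-per-line layout with BIO tags. That layout is a common convention for NER corpora, but this announcement does not confirm it for MasakhaNER 2.0. The sample sentence, the tag names, and the helper functions `parse_bio` and `extract_entities` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical sample in a token-per-line BIO layout (an assumption, not
# copied from the released files). The PER/LOC/DATE tag names are illustrative.
SAMPLE = """\
Amina B-PER
Diallo I-PER
visited O
Dakar B-LOC
on O
Monday B-DATE
"""


def parse_bio(text: str) -> List[Tuple[str, str]]:
    """Split 'token tag' lines into (token, tag) pairs, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        token, tag = line.rsplit(" ", 1)
        pairs.append((token, tag))
    return pairs


def extract_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group a B- tag and the matching I- tags that follow it into one entity."""
    entities = []
    current_tokens: List[str] = []
    current_type = None
    for token, tag in pairs:
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if it is still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" tag (or a stray I- tag): close any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


entities = extract_entities(parse_bio(SAMPLE))
print(entities)
```

On the hypothetical sample, the two tokens tagged `B-PER` and `I-PER` are merged into a single person entity, while the location and date remain single-token entities.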

AFFILIATIONS AND AUTHORS: 

Masakhane 

Saarland University, Germany 

David Ifeoluwa Adelani | Jesujoba O. Alabi | Dietrich Klakow 

CMU, United States 

Graham Neubig | Shruti Rijhwani | Perez Ogayo 

Google Research 

Sebastian Ruder 

University of the Witwatersrand, South Africa

Michael Beukman 

Brandeis University, United States 

Chester Palen-Michel | Constantine Lignos 

LIAAD-INESC TEC, Portugal 

Shamsuddeen H. Muhammad 

Makerere University 

Peter Nabende | Jonathan Mukiibi | Joyce Nakatumba-Nabende 

University of Bergen, Norway 

Cheikh M. Bamba Dione 

SADiLaR

Andiswa Bukula | Rooweither Mabuya

MILA, Canada 

Bonaventure F. P. Dossou 

RIKEN, Japan 

Happy Buzaaba 

Baamtu, Senegal 

Derguene Mbaye 

Malawi University of Business and Applied Science 

Amelia Taylor 

Uppsala University, Sweden 

Fatoumata Kabore 

Technical University of Munich, Germany 

Chris Chinenye Emezue 

TU Clausthal, Germany 

Edwin Munkoh-Buabeng 

RIT, United States 

Allahsera Auguste Tapo 

University of Pretoria, South Africa 

Tebogo Macucwa | Vukosi Marivate | Neo L. Mokono 

Luleå University of Technology, Sweden 

Tosin Adewumi 

University of Washington, United States 

Orevaoghene Ahia 

Lancaster University, UK 

Ignatius Ezeani | Chiamaka Chukwuneke 

University of Waterloo, Canada 

Mofetoluwa Adeyemi | Odunayo Ogundepo 

Ahmadu Bello University, Nigeria 

Idris Abdulmumin 

MAFAND-MT: Masakhane Anglo & Franco African News Corpus for Machine Translation

The MAFAND-MT dataset consists of a few thousand high-quality, human-translated parallel sentences for each of 16 African languages in the news domain. Each language has between 1,466 and 7,838 parallel sentences for training and/or evaluation. The languages covered span West, Central, East and Southern Africa, and include Bambara, Ghomala, Ewe, Fon, Hausa, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Twi, Wolof, and isiXhosa. Further details on this dataset can be found in the team’s NAACL 2022 paper: https://arxiv.org/abs/2205.02022

AFFILIATIONS AND AUTHORS: 

Masakhane 

Inria 

Jesujoba O. Alabi 

Meta AI 

Angela Fan 

Amazon Alexa AI  

Xiaoyu Shen 

The University of Tokyo 

Machel Reid 

Jacobs University 

Bonaventure F. P. Dossou 

Saarland University, Germany 

David Ifeoluwa Adelani | Dietrich Klakow | Dana Ruiter | Ernie Chang

Google Research 

Julia Kreutzer 

Makerere University 

Peter Nabende | Jonathan Mukiibi | Eric Peter Wairagala 

Technical University of Munich, Germany 

Chris Chinenye Emezue 

University of Dayton

Colin Leong 

University of the Witwatersrand, South Africa

Michael Beukman

Universitat Politècnica de Catalunya 

Andre Niyongabo Rubungo 

Microsoft 

Mohamed Ahmed | Millicent Ochieng 

CMU, United States 

Perez Ogayo 

Uppsala University, Sweden 

Fatoumata Ouoba Kabore 

Baamtu, Senegal 

Derguene Mbaye 

Rochester Institute of Technology, United States 

Allahsera Auguste Tapo 

Ahmadu Bello University, Nigeria 

Idris Abdulmumin 

University of Ibadan 

Ayodele Awokoya 

University of Malawi 

Sam Manthalu 

LIAAD-INESC TEC, Portugal 

Shamsuddeen H. Muhammad 

RIKEN, Japan 

Happy Buzaaba 

SADiLaR

Andiswa Bukula

MasakhaPOS: Part-of-Speech Tagging Dataset for 20 African Languages

MasakhaPOS is the largest human-annotated part-of-speech tagging dataset for 20 African languages. Each language has between 1,200 and 1,500 sentences for training and/or evaluation. The languages covered span West, Central, East and Southern Africa, and include Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu.

AFFILIATIONS AND AUTHORS: 

Masakhane 

Saarland University, Germany 

David Ifeoluwa Adelani | Jesujoba O. Alabi | Dietrich Klakow 

CMU, United States 

Perez Ogayo

LIAAD-INESC TEC, Portugal 

Shamsuddeen H. Muhammad 

Makerere University 

Peter Nabende | Jonathan Mukiibi

University of Bergen, Norway 

Cheikh M. Bamba Dione

SADiLaR

Andiswa Bukula | Rooweither Mabuya 

MILA, Canada 

Bonaventure F. P. Dossou

RIKEN, Japan 

Happy Buzaaba

Baamtu, Senegal

Derguene Mbaye 

Malawi University of Business and Applied Science 

Amelia Taylor 

Uppsala University, Sweden 

Fatoumata Kabore 

Technical University of Munich, Germany 

Chris Chinenye Emezue 

TU Clausthal, Germany 

Edwin Munkoh-Buabeng 

RIT, United States 

Allahsera Auguste Tapo 

University of Pretoria, South Africa 

Tebogo Macucwa | Vukosi Marivate  

University of Buea, Cameroon 

Gratien Atindogbe 

Financial Inclusion Speech Dataset for some Ghanaian Languages

CONTACT: DENNIS ASAMOAH OWUSU (DOWUSU@ASHESI.EDU.GH) 

This speech dataset for the Ghanaian languages Akan (Akuapem Twi, Asante Twi, Fante) and Ga includes 104,000 spoken utterances across the four dialects/languages, with approximately 200 speakers per dialect/language. This amounts to about 148 hours of speech in total. First, the dataset was developed to support financial applications in native Ghanaian languages, so that illiterate and semi-literate people can fully benefit from digital financial services. Second, it aims to answer research questions related to domain-specific vs. general-purpose dataset development, dialects, and NLP system development in low-resource settings.

Read more about the team’s dataset and approach here:  
https://ashesi-org.github.io/dataset/nlp/ai/ghana/africa/speech/2022/05/16/release-of-financial-inclusion-dataset-ghanaian-languages.html 
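For a sense of scale, the figures quoted above can be turned into rough per-utterance and per-variety averages. The short calculation below is only a back-of-the-envelope sketch: it assumes utterances and hours are split evenly across the four dialects/languages, which the announcement does not state, and all variable names are illustrative.

```python
# Figures quoted in the announcement.
TOTAL_UTTERANCES = 104_000
TOTAL_HOURS = 148
NUM_VARIETIES = 4  # Akuapem Twi, Asante Twi, Fante, Ga

# Average utterance length in seconds across the whole dataset.
avg_seconds_per_utterance = TOTAL_HOURS * 3600 / TOTAL_UTTERANCES

# Assumption: an even split across the four varieties (the real split may differ).
utterances_per_variety = TOTAL_UTTERANCES / NUM_VARIETIES
hours_per_variety = TOTAL_HOURS / NUM_VARIETIES

print(f"Average utterance length: {avg_seconds_per_utterance:.1f} s")
print(f"Utterances per variety (even split): {utterances_per_variety:.0f}")
print(f"Hours per variety (even split): {hours_per_variety:.0f}")
```

Under these assumptions, a typical utterance lasts roughly five seconds, which is consistent with short spoken phrases rather than long passages.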

AFFILIATIONS AND AUTHORS:

Ashesi University

Dennis Asamoah Owusu 

Ayorkor Korsah 

David Sampah 

David Adjepon-Yamoah 

Stephane Nwolley Jnr. 

Nokwary Technologies 

Dennis Asamoah Owusu 

Benedict Quartey 

David Sampah 

Lily Omane Boateng 

IgboSynCorp: Dataset for Igbo Natural Language Processing Tasks 

CONTACTS: GERALD NWEYA (GERALDNWEYA@GMAIL.COM) AND EMEKA ONWUEGBUZIA (EONWUEGBUZIA@GMAIL.COM) 

This dataset is the first spoken corpus of labelled and unlabelled datasets for Igbo Natural Language Processing (NLP) tasks. It consists of approximately 40 hours of naturally occurring Igbo speech that is representative of all the dialects of Igbo. The dataset lays the foundation for Igbo NLP tasks such as machine translation, treebank construction, speech-to-text, automatic POS tagging, digital dictionaries, and automatic spell checking.

AFFILIATIONS AND AUTHORS:

University of Ibadan, Ibadan, Nigeria 

Gerald Okey Nweya 

Amarachi Akudo Osuagwu 

Emeka Felix Onwuegbuzia

Samuel Obinna Ejinwa 

Anita Ifeoma Adiboshi 

Daniel Success Nwokwo 

Peter Ugochukwu Ihunna 

Afe-Babalola University, Ado-Ekiti, Nigeria 

Oluwole Solomon Akinola 

Learn more about these and other published Lacuna-funded datasets on our Datasets page! 

We share datasets on a quarterly basis on our website and social media platforms. Subscribe to the Lacuna Fund newsletter below and follow us on social media to stay updated on these announcements. 

Meridian Institute serves as Secretariat for the Lacuna Fund.