Announcing New Datasets for African Languages — 2020 Natural Language Processing (NLP) Awardees
20 December 2022
Upcoming Calls for Proposals
Lacuna Fund will be issuing two new calls for proposals to build more equitable and accessible Machine Learning datasets in 2023. We will be inviting proposals to develop datasets in two domains:
- Sexual and Reproductive Health and Rights
- Climate and Forests
Look for details in the new year.
Announcing New Datasets for African Languages
2020 Natural Language Processing (NLP) Awardees
We are excited to announce our recently published datasets in the language domain! These datasets will foster equal opportunities, inclusivity, participation in decision-making, and accessibility. Together, they span more than 22 African languages, such as Bambara, Dholuo, Fon, Akan, and Wolof. We thank these teams for their work to create these inclusive, open data resources, which will allow for artificial intelligence resources to be more readily accessible and available on the African continent.
The Masakhane team and affiliates have created three datasets for multiple African languages, focused on named entity recognition, machine translation, and part-of-speech tagging.
- MasakhaNER 2.0: Named Entity Recognition datasets for 20 African languages
- MAFAND-MT: Masakhane Anglo & Franco African News Corpus for Machine Translation
- MasakhaPOS: Part-of-Speech Tagging Dataset for 20 African Languages
Ashesi University and Nokwary Technologies have created a financial inclusion speech dataset for the Ghanaian languages Akan (Akuapem Twi, Asante Twi, Fante) and Ga.
- Financial Inclusion Speech Dataset for some Ghanaian Languages
University of Ibadan and Afe-Babalola University have created the first spoken corpus of labelled and unlabeled datasets for Igbo Natural Language Processing (NLP) tasks.
- IgboSynCorp: Dataset for Igbo Natural Language Processing Tasks
We are also grateful to our co-founders, whose support made these datasets possible: The Rockefeller Foundation, Google.org, Canada’s International Development Research Centre, and GIZ’s FAIR Forward programme on behalf of the German Federal Ministry for Economic Cooperation and Development (BMZ).
See below for links to these datasets and information about the teams that created them and potential use cases.
Named Entity Recognition and Parts of Speech datasets for African languages
CONTACT: DAVID IFEOLUWA ADELANI, D.ADELANI@UCL.AC.UK
MasakhaNER 2.0: Named Entity Recognition datasets for 20 African languages
MasakhaNER 2.0 is the largest human-annotated named entity recognition dataset for 20 African languages. Each language has between 4,800 and 11,000 sentences for training and/or evaluation. The languages covered span West, Central, East, and Southern Africa and include Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu. More information about the data can be found in the team’s EMNLP paper.
AFFILIATIONS AND AUTHORS:
Masakhane
David Ifeoluwa Adelani | Michael Beukman | Shamsuddeen H. Muhammad | Peter Nabende | Bonaventure F. P. Dossou | Blessing Sibanda | Happy Buzaaba | Jonathan Mukiibi | Godson Kalipe | Derguene Mbaye | Fatoumata Kabore | Chris Chinenye Emezue | Anuoluwapo Aremu | Perez Ogayo | Catherine Gitau | Edwin Munkoh-Buabeng | Victoire M. Koagne | Allahsera Auguste Tapo
Saarland University, Germany
David Ifeoluwa Adelani | Jesujoba O. Alabi | Dietrich Klakow
CMU, United States
Graham Neubig | Shruti Rijhwani | Perez Ogayo
Google Research
Sebastian Ruder
University of Witwatersrand, South Africa
Michael Beukman
Brandeis University, United States
Chester Palen-Michel | Constantine Lignos
LIAAD-INESC TEC, Portugal
Shamsuddeen H. Muhammad
Makerere University
Peter Nabende | Jonathan Mukiibi | Joyce Nakatumba-Nabende
University of Bergen, Norway
Cheikh M. Bamba Dione
SaDiLaR
Andiswa Bukula | Rooweither Mabuya
MILA, Canada
Bonaventure F. P. Dossou
RIKEN, Japan
Happy Buzaaba
Baamtu, Senegal
Derguene Mbaye
Malawi University of Business and Applied Science
Amelia Taylor
Uppsala University, Sweden
Fatoumata Kabore
Technical University of Munich, Germany
Chris Chinenye Emezue
TU Clausthal, Germany
Edwin Munkoh-Buabeng
RIT, United States
Allahsera Auguste Tapo
University of Pretoria, South Africa
Tebogo Macucwa | Vukosi Marivate | Neo L. Mokono
Luleå University of Technology, Sweden
Tosin Adewumi
University of Washington, United States
Orevaoghene Ahia
Lancaster University, UK
Ignatius Ezeani | Chiamaka Chukwuneke
University of Waterloo, Canada
Mofetoluwa Adeyemi | Odunayo Ogundepo
Ahmadu Bello University, Nigeria
Idris Abdulmumin
MAFAND-MT: Masakhane Anglo & Franco African News Corpus for Machine Translation
The MAFAND-MT dataset consists of a few thousand high-quality, human-translated parallel sentences in the news domain for each of 16 African languages. Each language has between 1,466 and 7,838 parallel sentences for training and/or evaluation. The languages covered span West, Central, East, and Southern Africa and include Bambara, Ghomala, Ewe, Fon, Hausa, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Twi, Wolof, and isiXhosa. Further details on this dataset can be found in the team’s NAACL 2022 paper: https://arxiv.org/abs/2205.02022
AFFILIATIONS AND AUTHORS:
Masakhane
David Ifeoluwa Adelani | Jesujoba O. Alabi | Michael Beukman | Shamsuddeen H. Muhammad | Peter Nabende | Bonaventure F. P. Dossou | Blessing Sibanda | Happy Buzaaba | Jonathan Mukiibi | Godson Kalipe | Derguene Mbaye | Fatoumata Ouoba Kabore | Chris Chinenye Emezue | Anuoluwapo Aremu | Perez Ogayo | Edwin Munkoh-Buabeng | Victoire Memdjokam Koagne | Allahsera Auguste Tapo | Tajuddeen Gwadabe | Gilles Q. Hacheme | Idris Abdulmumin | Oreen Yousuf | Freshia Sackey | Colin Leong | Guyo Jarso | Andre Niyongabo Rubungo | Eric Peter Wairagala | Muhammad Umair Nasir | Benjamin Ajibade | Tunde Ajayi | Yvonne Gitau | Jade Abbott | Mohamed Ahmed | Millicent Ochieng | Valencia Wagner | Ayodele Awokoya
Inria
Jesujoba O. Alabi
Meta AI
Angela Fan
Amazon Alexa AI
Xiaoyu Shen
The University of Tokyo
Machel Reid
Jacobs University
Bonaventure F. P. Dossou
Saarland University, Germany
David Ifeoluwa Adelani | Dietrich Klakow | Dana Ruiter | Ernie Chang
Google Research
Julia Kreutzer
Makerere University
Peter Nabende | Jonathan Mukiibi | Eric Peter Wairagala
Technical University of Munich, Germany
Chris Chinenye Emezue
University of Dayton
Colin Leong
University of Witwatersrand, South Africa
Michael Beukman
Universitat Politècnica de Catalunya
Andre Niyongabo Rubungo
Microsoft
Mohamed Ahmed | Millicent Ochieng
CMU, United States
Perez Ogayo
Uppsala University, Sweden
Fatoumata Ouoba Kabore
Baamtu, Senegal
Derguene Mbaye
Rochester Institute of Technology, United States
Allahsera Auguste Tapo
Ahmadu Bello University, Nigeria
Idris Abdulmumin
University of Ibadan
Ayodele Awokoya
University of Malawi
Sam Manthalu
LIAAD-INESC TEC, Portugal
Shamsuddeen H. Muhammad
RIKEN, Japan
Happy Buzaaba
SaDiLaR
Andiswa Bukula
MasakhaPOS: Part-of-Speech Tagging Dataset for 20 African Languages
MasakhaPOS is the largest human-annotated part-of-speech tagging dataset for 20 African languages. Each language has between 1,200 and 1,500 sentences for training and/or evaluation. The languages covered span West, Central, East, and Southern Africa and include Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian-Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu.
AFFILIATIONS AND AUTHORS:
Masakhane
David Ifeoluwa Adelani | Shamsuddeen H. Muhammad | Peter Nabende | Bonaventure F. P. Dossou | Blessing Sibanda | Happy Buzaaba | Jonathan Mukiibi | Godson Kalipe | Derguene Mbaye | Fatoumata Kabore | Chris Chinenye Emezue | Anuoluwapo Aremu | Perez Ogayo | Catherine Gitau | Edwin Munkoh-Buabeng | Victoire M. Koagne | Allahsera Auguste Tapo | Tebogo Macucwa | Vukosi Marivate | Elvis Mboning | Tajuddeen Gwadabe | Cheikh M. Bamba Dione
Saarland University, Germany
David Ifeoluwa Adelani | Jesujoba O. Alabi | Dietrich Klakow
CMU, United States
Perez Ogayo
LIAAD-INESC TEC, Portugal
Shamsuddeen H. Muhammad
Makerere University
Peter Nabende | Jonathan Mukiibi
University of Bergen, Norway
Cheikh M. Bamba Dione
SaDiLaR
Andiswa Bukula | Rooweither Mabuya
MILA, Canada
Bonaventure F. P. Dossou
RIKEN, Japan
Happy Buzaaba
Baamtu, Senegal
Derguene Mbaye
Malawi University of Business and Applied Science
Amelia Taylor
Uppsala University, Sweden
Fatoumata Kabore
Technical University of Munich, Germany
Chris Chinenye Emezue
TU Clausthal, Germany
Edwin Munkoh-Buabeng
RIT, United States
Allahsera Auguste Tapo
University of Pretoria, South Africa
Tebogo Macucwa | Vukosi Marivate
University of Buea, Cameroon
Gratien Atindogbe
Financial Inclusion Speech Dataset for some Ghanaian Languages
CONTACT: DENNIS ASAMOAH OWUSU (DOWUSU@ASHESI.EDU.GH)
This speech dataset for the Ghanaian languages Akan (Akuapem Twi, Asante Twi, Fante) and Ga includes 104,000 spoken utterances across the four dialects/languages, with approximately 200 speakers per dialect/language. This amounts to about 148 hours of speech in total. The dataset was developed to support financial applications in native Ghanaian languages, so that illiterate and semi-literate people can fully benefit from digital financial services. It also aims to answer research questions about domain-specific versus general-purpose dataset development, dialects, and NLP system development in low-resource settings.
Read more about the team’s dataset and approach here:
https://ashesi-org.github.io/dataset/nlp/ai/ghana/africa/speech/2022/05/16/release-of-financial-inclusion-dataset-ghanaian-languages.html
AFFILIATIONS AND AUTHORS:
Ashesi University
Dennis Asamoah Owusu | Ayorkor Korsah | David Sampah | David Adjepon-Yamoah | Stephane Nwolley Jnr.
Nokwary Technologies
Dennis Asamoah Owusu | Benedict Quartey | David Sampah | Lily Omane Boateng
IgboSynCorp: Dataset for Igbo Natural Language Processing Tasks
CONTACTS: GERALD NWEYA (GERALDNWEYA@GMAIL.COM) AND EMEKA ONWUEGBUZIA (EONWUEGBUZIA@GMAIL.COM)
This dataset is the first spoken corpus of labelled and unlabeled datasets for Igbo Natural Language Processing (NLP) tasks. It consists of approximately 40 hours of naturally occurring Igbo speech that is representative of all the dialects of Igbo. The dataset lays the foundation for Igbo NLP tasks such as machine translation, tree bank, speech-to-text, automatic POS tagging, digital dictionary, and automatic spelling checker.
AFFILIATIONS AND AUTHORS:
University of Ibadan, Ibadan, Nigeria
Gerald Okey Nweya
Amarachi Akudo Osuagwu
Emeka Felix Onwuegbuzia
Samuel Obinna Ejinwa
Anita Ifeoma Adiboshi
Daniel Success Nwokwo
Peter Ugochukwu Ihunna
Afe-Babalola University, Ado-Ekiti, Nigeria
Oluwole Solomon Akinola
Learn more about these and other published Lacuna-funded datasets on our Datasets page!
We share datasets on a quarterly basis on our website and social media platforms. Subscribe to the Lacuna Fund newsletter below and follow us on social media to stay updated on these announcements.
Meridian Institute serves as Secretariat for the Lacuna Fund.