Lacuna Fund Releases 18 New AI Datasets Empowering Local Communities to Tackle Challenges in Agriculture, Climate, Health and Language
12 March 2025
18 New AI Datasets in Agriculture, Climate, Health and Language Domains
Today, we are excited to announce eighteen newly published datasets for training artificial intelligence (AI) in the domains of Agriculture, Climate, Health, and Natural Language Processing (NLP). These datasets harness the power of AI to address urgent social and economic problems in Africa, Asia, and Latin America, as well as in low-income communities in the United States.
Learn more about these datasets and how to access them below!
Lacuna Fund is a coalition of funders, data scientists, and data users committed to filling data gaps and making machine learning and AI more equitable, accurate, and accessible worldwide.
We extend our deep gratitude to our funders, including The Rockefeller Foundation, Google.org, Canada’s International Development Research Centre, German Federal Ministry for Economic Cooperation and Development (BMZ) and its FAIR Forward initiative, Wellcome, Gordon and Betty Moore Foundation, Patrick J. McGovern Foundation, and Robert Wood Johnson Foundation, who make the creation of these datasets possible.

Agriculture
Lacuna Fund agriculture datasets unlock the power of machine learning to alleviate food security challenges, spur economic opportunities, and give researchers, farmers, communities, and policymakers access to superior agricultural datasets.
CropHarvest: Informing decision-making around agricultural development, early warning systems, and trade in Sub-Saharan Africa
Countries: Kenya, Mali, Togo, Rwanda, Uganda, Ethiopia, Malawi, Zambia, Tanzania, Namibia, Sudan, and Nigeria
Contact: Catherine Nakalembe | cnakalem@umd.edu
CropHarvest increases the understanding of the main types of food production in Sub-Saharan Africa and can help inform decision-making around agricultural development, early warning systems, and regional trade. It is a global, open-source remote sensing dataset for crop-type classification in Sub-Saharan Africa – specifically in Kenya, Mali, Togo, Rwanda, Uganda, Ethiopia, Malawi, Zambia, Tanzania, Namibia, Sudan, and Nigeria.
The team expanded on an existing dataset published in 2021 to now include the following: new labeled data points using Collect Earth Online, ground data for crop type mapping, street-level images, crowdsourced labeled images, and price data. In addition, the Collect Earth Online data was randomly sampled to cover the entire country, filling critical data gaps in crop patterns and yields.
Authors and Affiliations:
- NASA Harvest: Tseng, G.
- University of Maryland, College Park: Zvonkov, I., Nakalembe, C.L. and Kerner, H.
Dataset: https://github.com/nasaharvest/cropharvest
Improving livelihoods in Ghana and Uganda: Drone-based Agricultural Dataset for Crop Yield Estimation of cashew, cocoa, and coffee
Countries: Ghana, Uganda
Contact: Darlington Akogo | darlington@gudra-studio.com
This dataset supports yield estimation, crop type detection and classification, fruit detection and counting, and fruit maturity stage detection (unripe, ripe, and spoiled) for three products that are important sources of livelihood for millions of households in Sub-Saharan Africa.
It contains 14,870 drone images with bounding box annotations of cashew, cocoa, and coffee trees collected across multiple farms in Ghana and Uganda. Conventional methods of yield estimation are expensive, require a lot of labor and time, and are prone to error due to incomplete ground observations. This results in poor crop yield estimations and hinders farmers’ ability to appropriately plan and manage their fields and production pipelines. This dataset will help transform African agriculture into agribusiness by allowing for the development of yield estimation solutions that enable farmers to make good business decisions. Having key details about agricultural production readily accessible enables a timely harvest, helping farmers ensure healthy, fresh produce and, in addition, better sales.
Authors and Affiliations:
- KaraAgro AI: Darlington Akogo, Cyril Akafia, Harriet Fiagbor, Stephen Torkpo, Christian Kusi
- Makerere AI Lab: Joyce Nakatumba-Nabende
To see all of Lacuna Fund’s agriculture datasets, visit: https://lacunafund.org/datasets/agriculture/
Health
Lacuna Fund health datasets close the gap in health disparities by providing accurate, robust machine learning datasets that help providers and patients make decisions that lead to more equitable healthcare outcomes.
Intraoperative Anesthesia and Outcomes Dataset: Improving patient outcomes by predicting risk of mortality and post-operative recovery
Region: Sub-Saharan Africa
Contact: Bhiken Naik | bin4n@uvahealth.org
This dataset can be used to identify patterns of intraoperative anesthesia practice and predict postoperative length of stay and risk of mortality based on intraoperative variables. It includes 2,066 intraoperative anesthesia records from two academic centers in sub-Saharan Africa. The team photographed completed intraoperative anesthesia records using a smartphone, de-identified the images, and securely uploaded them to a HIPAA-compliant server. Using a combination of computer vision AI and manual extraction techniques, the team collected the following comprehensive intraoperative data: demographic data, medication data, hemodynamic data, physiological data, anesthesia type, surgery type, postoperative length of stay, and 30-day postoperative mortality.
Intraoperative anesthesia data encompasses a wide range of information that is essential for patient care during surgical procedures. However, capturing this depth of information is particularly challenging in low- and middle-income countries (LMICs), where the current electronic intraoperative anesthesia datasets are often limited in scope. As a result, a significant number of key data elements, which could be vital for clinical decision-making and research, are either missing or not available. This limitation hinders the ability to fully understand and improve patient outcomes in LMICs, so this dataset fills a critical gap by developing a method to include all data elements from the intraoperative anesthesia records.
Authors and Affiliations:
- University of Virginia: Bhiken Naik
- School of Medicine and Pharmacy, University of Rwanda and King Faisal Hospital, African Health Sciences University: Paulin Banguti
- Safe Surgery South Africa: Hyla Kluyts
- University of Virginia: Ryan Folks
Dataset: https://portal.ithriv.org/#/public_commons/project/d9fc062c-64c9-4481-80e7-3db4aba17e00
Brain Tumor Segmentation Africa (BraTS-Africa) Dataset
Country: Nigeria
Contact: Udunna Anazodo | udunna.anazodo@mcgill.ca
The BraTS-Africa dataset is an aggregation of magnetic resonance imaging (MRI) scans from six centers in Nigeria aimed at providing a public dataset for the development of machine-learning solutions for the management of brain tumors in African patients. This dataset serves as a starting framework for future expansion in other regions of Africa. The team processed and annotated a total of 584 images from 146 patient scans. Ninety-five of these scans are presumed to have diffuse glioma, and 51 of them have other types of central nervous system (CNS) neoplasms. Expert radiologists annotated three distinct tumor sub-regions to delineate the enhancing tumor (ET), the necrotic tumor core (NCR), and the peritumoral oedematous/infiltrated tissue (ED) sub-regions.
Prior to this study, there was no known comprehensive annotated brain imaging dataset available to the public from Africa. This study filled that gap to ensure that novel machine-learning solutions for neurological disease management, such as brain tumors, can solve the unmet clinical needs in Sub-Saharan Africa.
Authors and Affiliations:
- Medical Artificial Intelligence (MAI) Lab (Lagos, Nigeria): Maruf Adewole, Abiodun Fatade, Oluyemisi Toyobo, Farouk Dako, Udunna Anazodo
- The National Hospital (Abuja, Nigeria): Feyisayo Daji, Chinasa Kalaiwo
- Lagos University Teaching Hospital: Olubukola Omidiji
- Lagos State University Teaching Hospital: Rachel Akinola
- NSIA-Kano Diagnostic Center: Mohammad Abba Suwaid
- Federal Medical Centre (Umuahia, Nigeria): Kenneth Aguh
- Lily Hospital (Benin, Nigeria): Mayomi Onuwaje
- University of Pennsylvania (Philadelphia, USA): Farouk Dako
- Indiana University (Indianapolis, USA): Spyridon Bakas
- Scripps Clinic Medical Group (San Diego, USA): Jeffery Rudie
- McGill University (Montreal, Canada): Udunna Anazodo
Dataset: https://www.cancerimagingarchive.net
AI-Assisted Smartphone Microscopy for detection of Diarrhea Parasites
Country: Nepal
Contact: Bishesh Khanal | bishesh.khanal@naamii.org.np
This dataset helps to detect diarrhea-causing parasites in resource-limited rural areas, particularly across the Global South, where access to expensive diagnostic tools is limited. It contains approximately 400,000 microscopic slide images from water, vegetable, and stool samples from four different provinces across Nepal, making it one of the largest datasets of its kind. The team collected water samples from different sources (i.e., tap water, bottled water, lake, river, pond, stream, spring water, wetland, well, and borewell) and used seven different types of vegetables. Using the dataset and annotations available, this team trained different deep-learning models to automatically detect parasites, specifically Giardia and Cryptosporidium cysts.
The sample images were captured using both smartphone and brightfield microscopes before being uploaded to an online data collection and annotation platform. This platform allows multiple users to upload images of samples with permission-based features for quality control. Permitted users can review the uploaded images, approve or reject them, add comments on individual images, and filter the view for samples based on a certain date range or province. As a first step, this dataset focused on Nepal, but it is designed to be applicable across similar regions worldwide.
Authors and Affiliations:
- Nepal Applied Mathematics and Informatics Institute for Research (NAAMII): Bishesh Khanal, Udit Chandra Aryal, Safal Thapaliya
- Kathmandu Institute of Applied Science (KIAS): Dr. Basant Giri, Dr. Susma Giri, Dr. Bhanu Neupane, Asmita Adhikari, Asmita Karki, Ramdeep Shrestha, Aayusha Upreti, Pramikshya Bagale, Deepa Prajapati, Prashamsa Shrestha, Celeus Baral
- Nyaya Health Nepal, Bayalpata: Mandeep Pathak, Ekendra Kunwar, Khadak Chaudhary, Sunil Buda, Tapendra Kunwar, Ramesh Badahit, Nim Prakash Sharma
- Provincial Public Health Laboratory (PPHL)-Janakpur: Shravan Kumar Mishra, Santosh Kumar Yadav, Jitendra Kumar Sah, Amrendra Kumar Mishra, Sarwajit Yadav, Ashish Jha
- Kathmandu Institute of Child Health (KIOCH)-Damak: Dr. Bhagawan Koirala, Dr. Sandeepa Karki, Dr. Jayamani Shrestha
Dataset: https://zenodo.org/records/13913469
To see all of Lacuna Fund’s health datasets, visit: https://lacunafund.org/datasets/health/
Climate
From understanding the impacts of climate change on health outcomes to strengthening electrification planning, Lacuna Fund climate datasets allow communities around the world to better mitigate and adapt to climate change.
Project Climate Change, Health, and Artificial Intelligence (CCHAIN): Public Health Data Insights for the Philippines
Country: Philippines
Contact: Thinking Machines Data Science | data-for-development@thinkingmachin.es
The Project CCHAIN dataset is an open, linked, analysis-ready dataset of validated health, climate, environmental, and socioeconomic variables collected at the village (“barangay”) level in 12 Philippine cities spanning 20 years (2003-2022). This dataset includes observations of about 17 diseases collected through field visits to the Philippines Department of Health (DOH) and the Philippine Statistical Authority (PSA). Another component of this dataset is “Open Buildings,” which also functions as a standalone dataset that the team created containing 12,000 building outlines that show neighborhood densities, terrains, and levels of urbanization, as well as areas not yet mapped in OpenStreetMap. Each outline was drawn using a combination of visual inspection of satellite imagery, local knowledge, and validation from household survey data to cover all buildings present in the tile.
In the Philippines, researchers and other end users in need of public health data from rural healthcare facilities, all the way to national-level programs, can request this information from the Philippine Department of Health (DOH), where a review committee makes the final approval decisions. However, this process does not yet meet the definition of truly open data, which should be proactively provided to the public to foster transparency, innovation, and collaboration without needing requests or permissions.
Privacy and security concerns remain significant barriers to data access as agencies try to balance public benefits with confidentiality risks. Another barrier to accessibility and availability is the scarcity of digitized data at the community level due to the lack of staff training and budget constraints. By assembling Project CCHAIN, an open, analysis-ready dataset, we alleviate the burden on users who would otherwise need to manage significant data logistics and multidisciplinary expertise to gather and process data from various sources, formats, and geographic specifications. Focusing on the village or “barangay,” the smallest administrative unit in the Philippines, also helps disaggregate health risks for vulnerable communities, particularly those in informal settlements, and provides actionable insights for local governments.
Authors and Affiliations:
- Thinking Machines Data Science, Inc.: Patricia Anne Faustino, JC Albert Peralta, Veronica Marie Araneta, Dafrose Camille Bajaro, Abigail Moreno
- Epimetrics, Inc.: John Q. Wong, Anne Kathlyn Baladad, Luis Antonio Desquitado, Matthew Limlengco, Carlos Miguel Resurreccion
- Manila Observatory: Faye Abigail Cruz, Dr. Julie Mae Dado, Leia Pauline Tonga
- Philippine Action for Community-led Shelter Initiatives, Inc.: Ericka Lynne Nava
Dataset: https://thinkingmachines.github.io/project-cchain/
Air quality dataset of abattoir centers in Southern Nigeria
Country: Nigeria
Contact: Emmanuel Chukwuma | emmanuel.chukwuma@apse-ngo.org
This air quality dataset is the first of its kind in the country from abattoir centers. The localized dataset is crucial in air quality monitoring and prediction, as well as accurate modeling of the air quality index for early warning signals and modeling of health risk. The data was obtained from abattoir centers in Southern Nigeria. The team collected data from representative samples of various states (i.e., Anambra, Enugu, Abia, Imo, Ebonyi, and Delta) within the research area. The team visited 27 stations and conducted on-site investigations, collecting over 200,000 numerical values of particulate matter (PM) concentrations using 10 air quality sensors for PM1, PM2.5, and PM10. Additionally, aerial view images were captured using a drone at varying heights (10m, 20m, 30m) during operational hours; these images will be combined with satellite imagery to train models for the prediction of PM values.
A preliminary survey indicates that abattoir centers in developing countries rely heavily on wood and sometimes discarded tires for meat processing. The use of these items for meat processing releases a significant amount of gaseous pollutants. Thick smoke is seen in the morning hours around these abattoirs as the meat processing takes place. Smoke from wood burning, coupled with minimal air movement from wind, can lead to elevated particle concentrations in abattoirs. Exposure to particulate matter and black carbon released in abattoirs has detrimental health outcomes with elevated morbidity and mortality, as shown by previous studies. This project was undertaken by the Alliance for Progressive and Sustainable Environment (APSE), a local NGO focused on environmental sustainability (see more details here: www.apse-ngo.org).
Authors and Affiliations:
- Alliance for Progressive and Sustainable Environment: Emmanuel Chukwuma, Uche Okonkwo, Chukwuemeka Umeobi, Jervis Okafor, Sixtus Ezenwankwo, Shadrach Ugwu, Awonge Precious, Cynthia Egdede, Esther Eyo
Dataset: https://drive.google.com/drive/folders/1BRrVgYN-O6s7EsnEgAUCGqINvvfiXZC8?usp=drive_link
Global Horizontal Irradiance Dataset for Mauritius, Rodrigues, and Agalega Islands
Country: Mauritius, Rodrigues, and Agalega Islands
This dataset includes 146,025 real-time solar irradiance data lines from different locations around Mauritius, Rodrigues, and Agaléga. The solar irradiance data (GHI in W/m²) spans 2017 to 2021, at an interval of one hour, and covers the hours of 07:00-18:00 each day. This dataset allows for the real-time visualization of the solar irradiance profile at the specified locations, helping with better assessment and planning of solar-generated power. The team is now collecting data (from 2023 on) at an interval of 15 minutes and plans to update this data repository to reflect that in the future.
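As an illustration of how hourly GHI readings like these might be used for planning, the minimal sketch below sums hourly values into a daily irradiation figure. It assumes each reading represents the mean irradiance over its hour, which the dataset description does not confirm; the function name, the timestamp format, and the sample values are all hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from datetime import datetime


def daily_irradiation_wh_per_m2(readings):
    """Sum hourly GHI readings (W/m2) into daily irradiation (Wh/m2).

    `readings` is an iterable of (ISO timestamp string, GHI value) pairs.
    Assumption: each value is the mean irradiance over a one-hour interval,
    so one reading contributes GHI * 1 h to the daily total.
    """
    totals = defaultdict(float)
    for timestamp, ghi in readings:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        totals[day] += ghi * 1.0  # W/m2 * 1 h = Wh/m2
    return dict(totals)


# Made-up sample values for illustration only.
sample_readings = [
    ("2021-01-01T07:00", 100.0),
    ("2021-01-01T08:00", 250.0),
    ("2021-01-02T07:00", 120.0),
]
daily_totals = daily_irradiation_wh_per_m2(sample_readings)
```

The same aggregation would need a different multiplier (0.25 h per reading) for the 15-minute data the team is now collecting.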
The targeted beneficiaries for this project are the Government of Mauritius, which has a goal of 60% electricity generation from renewable energy sources by the year 2030. Similarly, the Mauritius Renewable Energy Agency, which is tasked with ensuring the country’s energy demand is increasingly met by renewable energy and keeping up with international commitments, can use this data on solar irradiance and forecasting mechanisms to better manage the utility’s power plants, minimize carbon emissions, ensure no loss of loads (blackouts), and allow higher penetration of photovoltaic (PV) projects in the country. With free online solar maps and accuracy-enhanced solar energy data, local PV plant operators will also have accurate information for PV performance appraisal. Additionally, the public at large can benefit from a free online solar energy platform, improving acceptance of solar PV technology and increasing penetration of clean technologies in the country to further reduce greenhouse emissions. Finally, machine learning models can be trained for intra-day, daily, and even weekly predictions of the solar irradiance profiles.
Authors and Affiliations:
- University of Mauritius: Yogesh Beeharry, Ravish Gokool, Yatindra Kumar Ramgolam, Aatish Chiniah
Dataset: https://www.scidb.cn/en/detail?dataSetId=2b499b91a4464fffa9f60fc8b51da03e&version=V2
Labelled Open Solar Panel Data to measure solar energy adoption in Madagascar
Country: Madagascar
Contact: Fabienne Rafidiharinirina | f.rafidiharinirina@association-maidi.mg or assomaidi@gmail.com
This team annotated 2,125 Google Earth satellite images and 9,202 drone images, forming a combination of low and high-definition solar panel views in Madagascar. The Madagascar Initiatives for Digital Innovation (MAIDI) team performed field checks for up to 25% of satellite images and, in total, annotated 22,488 polygons.
This dataset will help data scientists and users develop a solar panel detection algorithm to measure solar energy adoption across Madagascar. Notably, this project represented all regions of the country; instead of focusing only on big cities, it also covered average and small villages as well as coasts and mountains.
Authors and Affiliations: Fabienne Rafidiharinirina (Madagascar Initiatives for Digital Innovation)
Climate Energy Dataset for Off-Grid Electricity Infrastructure
Country: Pakistan
Contact: Dr. Zeeshan Shafiq | zeeshanshafiq@uetpeshawar.edu.pk
This dataset comprises real-time electrical measurements of a specific climate zone in Pakistan, the Kalam Region, showcasing the energy generation and demand within an off-grid electricity infrastructure. It can be used for research in energy systems analysis, climate change studies, electrical engineering, and artificial intelligence applications. It includes voltages, currents, and power factors for three-phase and single-phase systems across generation, distribution, and consumption stages. Additionally, the dataset incorporates seven different climate parameters from the ERA5 dataset (provided by the Copernicus Climate Change Service), generating a total of 85,596 data points in areas such as temperature, dew point, wind components, precipitation, snowfall, and snow cover.
Collected every five minutes from June 3, 2023, to October 24, 2024, it includes over 45 million instances covering data from four micro-hydropower generators, 26 transformers (in addition to four data acquisition systems installed at Micro Hydro Power Plants (MHPs)), and 585 end users. With local support, the team will continue monitoring the data until June 2025.
Authors and Affiliations:
- CISNR UET Peshawar: Zeeshan Shafiq, Prof. Dr. Gul Muhammad Khan, Engr. Sarmad Rafique, Engr. Muhammad Bilal Khan, Engr. Umer Khan, Engr. Mansoor Khan, Engr. Niaz Khan, Engr. Musa Khan, Engr. Abdul Moiz
Dataset: https://zenodo.org/records/14195731
Natural Language Processing
Lacuna Fund language datasets create openly accessible text and speech resources that fuel natural language processing technologies in diverse languages across low- and middle-income contexts globally.
NaijaVoices: Our Language is Our Strength
Languages: Hausa, Igbo, and Yoruba
Contact: For partnerships, collaborations, or questions, reach out to info@naijavoices.com
The NaijaVoices project has curated 1,867 hours of speech and text data featuring over 5,000 speakers in the three major Nigerian languages — Hausa, Igbo, and Yoruba. As of its release, it is the largest ever multi-speaker African speech dataset. The dataset consists of circa 1,917,686 instances – each instance is made up of audio, a transcript, the language of the transcript, the speaker ID, gender, and age bracket. The dataset enables audio-based NLP tasks like automatic speech recognition (ASR) and text-to-speech (TTS). Additionally, the authentic sentences in the dataset can enhance text-based natural language processing (NLP) tasks, including language modeling, part-of-speech tagging, and named entity recognition.
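The instance structure described above can be sketched as a simple record type. This is only an illustration: the field names, the file path, and the sample values are hypothetical and may not match the column names or encodings in the released dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechInstance:
    """One instance as described in the text; field names are hypothetical."""
    audio_path: str    # location of the audio clip
    transcript: str    # text of what was spoken
    language: str      # "Hausa", "Igbo", or "Yoruba"
    speaker_id: str
    gender: str
    age_bracket: str


def count_by_language(instances):
    """Count how many instances are available per language."""
    return Counter(instance.language for instance in instances)


# Made-up placeholder records for illustration only.
sample = [
    SpeechInstance("clips/0001.wav", "<transcript 1>", "Hausa", "spk-001", "female", "18-29"),
    SpeechInstance("clips/0002.wav", "<transcript 2>", "Igbo", "spk-002", "male", "30-49"),
    SpeechInstance("clips/0003.wav", "<transcript 3>", "Hausa", "spk-003", "male", "18-29"),
]
language_counts = count_by_language(sample)
```

Grouping by `speaker_id` or `age_bracket` in the same way would give a quick view of how balanced a training split is.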
Linguistic applications of this dataset include understanding sociolinguistic profiles, analyzing pronunciation variations, studying phonetic and phonemic differences, and advancing natural language processing (NLP) capabilities for the three Nigerian languages. The NaijaVoices method intentionally incorporated discourse about marginalized populations, such as women, children, and people living with disabilities, as well as underrepresented topic areas, such as traditional counting systems and agriculture. The dataset also represents diverse voices, with over 5,000 participants with unique speaker patterns and dialects.
Authors and Affiliations: The NaijaVoices Community (https://naijavoices.com/)
Dataset: https://naijavoices.com/dataset
AFRIDOC-MT: Document-level MT Corpus for African Languages
Languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu
Contact: Jesujoba O. Alabi | jalabi@lsv.uni-saarland.de
AFRIDOC-MT is a document-level and multi-way translation dataset from English into five African languages — Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all of which were human-translated from English to these languages. Each domain has at least 10,000 parallel sentences per language pair and supports multiway translation, allowing translation not only between English and the African languages but also among the African languages themselves.
This dataset can be used to evaluate the ability of existing neural machine translation (NMT) models and large language models (LLMs) to translate at the document level and to train such models. Recently, there has been interest in document-level translation with multiple sentences, where sentences are translated with their context rather than in isolation. Previously, efforts were focused on high-resource languages, where document-level datasets are readily available, and not on low-resource African languages. In addition, it can be used for sentence-level translation and a couple of other language tasks if properly annotated.
Authors and Affiliations:
- Saarland University: Jesujoba O. Alabi, Israel Abebe, Miaoran Zhang, Dawei Zhu, Dietrich Klakow
- German Research Center for Artificial Intelligence (DFKI): Cristina España-Bonet
- INRIA: Rachel Bawden
- McGill University and Mila: David Adelani
- University of Ibadan: Clement Oyeleke Odoje, Idris Akinade
- National Institute of Informatics (NII): Iffat Maab
- Selcom: Davis David
- Imperial College, London: Shamsuddeen Hassan
- University of KwaZulu-Natal: Nokwanda Putini
- Loughborough University, U.K.: David Oluwajoju Ademuyiwa
- University of Cambridge: Andrew Caines
Dataset: https://github.com/masakhane-io/afridoc-mt
Masakhane-NLU: Conversational AI & Benchmark datasets for African languages
Languages: Amharic, Ewe, Hausa, Igbo, Lingala, Luganda, Oromo, Kinyarwanda, Shona, Sesotho, Swahili, Twi, Wolof, Xhosa, Yoruba and Zulu
Contact: David Adelani | david.adelani@mila.quebec
This team has developed five conversational AI and benchmark datasets for 16 languages across the African continent: Amharic, Ewe, Hausa, Igbo, Lingala, Luganda, Oromo, Kinyarwanda, Shona, Sesotho, Swahili, Twi, Wolof, Xhosa, Yoruba and Zulu. The first dataset, AfriXNLI, is a natural language inference dataset used to determine the linguistic relationship (entailment, neutral, and contradiction) between two sentences; it has 1,050 sentence pairs per language. The second dataset, AfriMMLU, is a knowledge-based multiple-choice question-answering dataset covering five subjects: elementary mathematics, high-school geography, international law, global facts, and high-school microeconomics. The team collected 608 question-answer pairs per language. The third dataset, AfriMGSM, was developed as a free-form grade school mathematics question-answering dataset, which was formed with 258 question-answer pairs. AfriIntent, which involves the collection of 3,200 sentences per language, is an intent classification dataset covering various domains such as banking (e.g., “pay bill”), home (e.g., “play music”), kitchen and dining (e.g., “confirm reservation”), travel (e.g., “plug type”), and utility (e.g., “make call”). Finally, using 3,200 sentences per language, the team developed AfriSlot for slot classification in categories such as food items, language names, etc.
These five text-only datasets are useful for conversational chatbots in real-life applications such as banking, restaurants, travel agencies, and more. The team has created strong benchmarks for evaluating the performance of large language models such as GPT-4o on African languages.
Authors and Affiliations:
- McGill University & Mila: David Ifeoluwa Adelani, Hao Yu
- SADiLaR: Andiswa Bukula, Mmasibidi Setaka, Rooweither Mabuya
- OntarioTech University: En-Shiun Annie Lee
- Saarland University: Israel Abebe Azime, Jesujoba O. Alabi
- University of Toronto: Jian Yun Zhuang
- Princeton University: Happy Buzaaba
- Masakhane: Blessing Sibanda, Godson Kalipe, Jonathan Mukiibi, Salomon Kabongo, Lolwethu Ndolela, Nkiruka Odu, Salomey Osei, Sokhar Samb, Tadesse Kebede Guge, Juliet Murage
- Imperial College: Shamsuddeen Hassan Muhammad
Dataset: https://github.com/masakhane-io/masakhane-nlu
Lacuna PII Multilingual Dataset
Languages: Luganda, Lumasaba, Hausa, and Kanuri
Contact:
- Andrew Katumba | katumba@mak.ac.ug
- Milena Haykowska | milena.haykowska@clearglobal.org
- Peter Nabende | nabende@gmail.com
This dataset contains annotated sentences with personally identifiable information (PII) in Luganda, Lumasaba, Hausa, and Kanuri. These four languages span Central and Eastern Uganda, Nigeria, Ghana, and Northern Cameroon. The team collected 3,000 sentences for both Kanuri and Hausa, 5,000 for Lumasaba, and 4,000 for Luganda. Potential use cases for these datasets include named entity recognition (NER), text classification, privacy-preserving data analysis and research, language modeling, machine translation, and linguistic research.
The team aimed to curate a dataset that is gender inclusive, and their work highlighted the need for standardized guidelines for annotating low-resourced languages. Having these guidelines would help to avoid common pitfalls and errors when labeling text data in these low-resource languages.
Authors and Affiliations:
- Marconi Research and Innovations Lab, Makerere University: Andrew Katumba, Jenifer Winfred Namuyanja, Nakakande Bridget Cecile
- Makerere Artificial Intelligence Lab: Joyce Nakatumba-Nabende, Ann Lisa Nabiryo, Peter Nabende, Eric Peter Wairagala
- Clear Global: Milena Haykowska, Andrew Bredenkamp, Mariam Mohanna, Alp Öktem, Etienne de Crecy
Dataset: https://doi.org/10.7910/DVN/CGHWZE
Hate and Offensive Speech Detection Dataset for African Languages
Languages: Hausa, Yoruba, Igbo, Nigerian Pidgin, Algerian Arabic, Moroccan Arabic, Swahili, IsiXhosa, IsiZulu, Kinyarwanda, Twi, Amharic, Oromo, Somali, Tigrinya
Contact:
- Abinew Ali Ayele | abinewaliayele@gmail.com
- Seid Muhie Yimam | muhie.yimam@uni-hamburg.de
- Shamsuddeen Hassan Muhammad | muhammad@imperial.ac.uk
AfriHate is a hate and offensive speech corpus for 15 African languages: Hausa, Yoruba, Igbo, Nigerian Pidgin, Algerian Arabic, Moroccan Arabic, Swahili, IsiXhosa, IsiZulu, Kinyarwanda, Twi, Amharic, Oromo, Somali, Tigrinya. The AfriHate dataset annotated tweets using “offensive,” “hateful,” and “normal” classes, with specific target classes (topics) such as politics, ethnicity, gender, religion, and disability. Within this project, the team created another dataset, AfriEmotion, a new corpus for the detection of emotion, including the intensity of emotions such as joy, sadness, fear, anger, surprise, and disgust. Overall, the team collected and annotated 10,000 instances each for hate and offensive speech and emotion detection per language, making a total of 150,000 annotated observations.
This project is the first to develop and publicly release a dataset for hate and offensive speech and emotion detection in the target languages. To ensure a representative dataset, the target languages cut across all regions of Africa. Similarly, for each language, the team collected texts using a diverse set of strategies to ensure even representation across the corpus and used annotators of diverse backgrounds in terms of gender, status, and educational level.
The AfriHate dataset supports various Natural Language Processing (NLP) tasks and applications for African languages, including hate speech detection, abusive language identification, contextual analysis, and language modeling. It serves several use cases, such as psychological research, policy making, and content moderation. The dataset helps to detect hate speech effectively in low-resource language settings, identify linguistic patterns of hate speech, understand contextual influences, and improve NLP tools for nuanced content moderation in African languages.
Similarly, the AfriEmotion dataset facilitates various NLP tasks and applications for African languages, including emotion detection, analysis, and synthesis. Its use cases include social media monitoring to understand public sentiment and emotion, mental health support with early detection of distress, educational tools promoting emotional intelligence, literary analysis through an emotional lens, and policy insights for informed decision-making. The dataset addresses questions regarding linguistic and cultural influences on emotional expression, similarities and differences across languages and cultures, adaptation of NLP models for low-resource languages, and challenges and opportunities of cross-lingual emotion processing in African contexts.
Authors and Affiliations:
- ICT4D, Bahir Dar University: Esubalew Alemneh Jalew, Abinew Ali Ayele
- Bayero University Kano, Department of Computing: Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad
- Imperial College London: Shamsuddeen Hassan Muhammad
- Ahmadu Bello University, Department of Computer Science: Idris Abdulmumin
- University of Hamburg, Language Technology Group, Department of Informatics: Seid Muhie Yimam
Dataset:
Ethio Speech Corpora
Languages: Amharic, Tigrigna, Oromo, Somali, Afar, Sidama
Contact: Solomon Teferra Abate | solomon.teferra@aau.edu.et
Ethio Speech Corpora comprises over 391 hours of recorded audio in six Ethiopian languages: Amharic (68 hours), Tigrigna (62 hours), Oromo (70 hours), Somali (56 hours), Afar (68 hours), and Sidama (68 hours). The corpus is a valuable resource for developing well-performing automatic speech recognition (ASR) systems for these six languages (in a monolingual setup) and for related languages (in a multilingual and/or cross-lingual setup), with applications in many aspects of daily life.
Use cases of speech recognition systems using this dataset include dictation systems, transcription systems, assistive technologies, spoken dialogue systems, speech translation, and other similar speech technologies. To make the dataset representative, the team selected six working languages that are used across regional states of Ethiopia while also maintaining the gender and age balance of readers.
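As a quick sanity check on the per-language figures quoted above, the following minimal Python sketch tallies the hours and each language's share of the corpus. The variable names are illustrative only, and the per-language hours are the rounded values from this announcement, so the tally may differ slightly from the exact total.

```python
# Per-language hours as quoted in the announcement (rounded values).
hours_by_language = {
    "Amharic": 68,
    "Tigrigna": 62,
    "Oromo": 70,
    "Somali": 56,
    "Afar": 68,
    "Sidama": 68,
}

# Sum of the rounded per-language figures.
total_hours = sum(hours_by_language.values())

# Share of the corpus per language, as a percentage of the rounded total.
share_by_language = {
    language: 100 * hours / total_hours
    for language, hours in hours_by_language.items()
}

for language, share in share_by_language.items():
    print(f"{language}: {hours_by_language[language]} h ({share:.1f}%)")
print(f"Total (from rounded figures): {total_hours} h")
```

The shares show that the six languages are fairly evenly balanced, with none of them far from one sixth of the recorded audio.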
Authors and Affiliations:
- School of Information Science of the Addis Ababa University: Solomon Teferra Abate (PhD), Martha Yifiru Tachbelie (PhD), Michael Melese Woldeyohannes (PhD), Hafte Abera, Bantegize Addis Alemayehu, Wondwossen Mulugeta (PhD)
Website: https://ethiospeech.com/
Dataset: https://github.com/EthioSpeech and https://huggingface.co/EthioSpeech
Building Parallel Corpora for Kenya’s Indigenous Languages and Kiswahili
Languages: Kidaw’ida, Kalenjin, Dholuo, and Kiswahili
Contact: Audrey Mbogho | ambogho@usiu.ac.ke
This team collected parallel text corpora for three Kenyan indigenous languages, Kidaw’ida, Kalenjin, and Dholuo, alongside Kiswahili, resulting in approximately 90,000 sentence pairs in total. After collection, the team separated out the Kidaw’ida, Kalenjin, and Dholuo sentences and used them as monolingual datasets for crowd-sourcing speech data, facilitated by uploading the sentences to Mozilla Common Voice. A total of 109 members of the three language communities were recruited to read and record sentences from their respective native languages. Emphasizing gender balance and including different ages and regional variants helped to make the datasets more representative. The voice datasets offer a substantial amount of speech data, comprising 56 hours of Kidaw’ida, 92 hours of Kalenjin, and 120 hours of Dholuo, for a total of 268 hours.
Use cases for these parallel corpora include training models to translate text between Kiswahili and Kidaw’ida, Kalenjin, and Dholuo. The speech data on Mozilla Common Voice, along with its associated text data, is intended for the development of speech recognition applications. The languages in this dataset are low-resource, especially Kidaw’ida, which has only around 400,000 speakers and faces a more immediate risk of loss. By collecting the text and speech data, this team contributed to the preservation of these languages. The team hopes that once enough data has been collected to train accurate models and create NLP applications, the three languages will become more relevant in the modern digital age, thus mitigating the risk of loss.
Authors and Affiliations:
- USIU-Africa: Audrey Mbogho, Quin Awuor
- Maseno University: Lilian Wanzare, Vivian Oloo
- Kabarak University: Andrew Kipkebut
- University of Florida: Rose Lugano
Dataset:
- https://zenodo.org/records/13355021
- Mozilla Common Voice:
Expanding a parallel corpus of Portuguese and the Bantu language Emakhuwa
Languages: Emakhuwa, Portuguese
Contact: Felermino D. M. A. Ali | felermino.ali@unilurio.ac.mz or felerminoali@gmail.com
This dataset includes translations of 1,897 news articles, comprising 660,242 words, from Portuguese into Emakhuwa, an indigenous language of Mozambique. Each article includes the news headline, content, and a label for topic classification. For news topic classification, the articles were divided into three splits: training (1,337 articles), development (185 articles), and testing (375 articles). The articles were also categorized by topic: politics, economy, culture, sports, health, society, and world news.
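The split sizes above can be checked with a short Python sketch that confirms they add up to the reported article count and computes each split's share. The names used here are illustrative and are not taken from the dataset's own files or loaders.

```python
# Split sizes and article count as reported in the announcement.
REPORTED_ARTICLES = 1897
split_sizes = {"training": 1337, "development": 185, "testing": 375}

# The three splits should account for every article.
total_articles = sum(split_sizes.values())
assert total_articles == REPORTED_ARTICLES

# Percentage of articles in each split.
split_shares = {
    name: 100 * count / total_articles
    for name, count in split_sizes.items()
}

for name, share in split_shares.items():
    print(f"{name}: {split_sizes[name]} articles ({share:.1f}%)")
```

The result is close to a conventional 70/10/20 partition of the articles.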
The intended use cases for this dataset include topic classification, translation, and loanword recognition. To ensure that the dataset was representative, the team translated different categories of news articles and prioritized Mozambique-related news and articles, contributing to lexicon diversity. The datasets have shown promising outcomes when fine-tuning multilingual models like ByT5, M2M100, and NLLB200. This team’s work has already generated improvements in translation quality when using loanword information as additional data. They plan to continue refining models and ensuring high-quality outputs for all use cases.
Authors and Affiliations:
- Felermino Dário Mário António Ali: Lurio University, Faculty of Engineering; Artificial Intelligence and Computer Science Lab (LIACC); Centre of Linguistics (CLUP) of the University of Porto
- Henrique Lopes Cardoso: Faculty of Engineering of the University of Porto (FEUP), Artificial Intelligence and Computer Science Lab (LIACC)
- Rui Sousa Silva: Faculty of Arts and Humanities, Centre of Linguistics (CLUP) of the University of Porto
Dataset: https://huggingface.co/collections/LIACC/makhuwa-nlp-66a93ea22df7f4b31e96a5ab
Papers:
- https://aclanthology.org/2024.emnlp-main.824
- https://aclanthology.org/2024.lrec-main.425
- https://aclanthology.org/2024.wmt-1.45