Two Newly Available Agriculture & Language Datasets, 2023 Grantee Convening

19 May 2023

2023 Lacuna Fund Grantee Convening

Each year, grantees have the opportunity to gather to network, share their projects, discuss lessons learned, and participate in workshops. Check out this video of Lacuna Fund grantees in action at last year’s inaugural gathering in Tunis, Tunisia!

The second annual Lacuna Fund grantee convening is in two weeks – we are so excited! Grantees in Agriculture, Natural Language Processing, and Equity & Health will gather in Kigali, Rwanda. Teams will share their datasets, participate in workshops about dataset utilization and sustainability, and discuss lessons learned. Grantees will also have the opportunity to attend the AfricAI Conference, organized by Canada’s International Development Research Centre (IDRC), Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) and Niyel.

Two New Datasets Enable Crop Pest and Disease Diagnosis and Machine Translation for Bambara Language

Today, we are excited to announce two recently published datasets to train artificial intelligence in the domains of Agriculture and Natural Language Processing (NLP). The first dataset focuses on five key food security crops in Sub-Saharan Africa: cassava, maize, beans, bananas, and cocoa. The dataset contains a large repository of images and spectral data. These can be used to identify and diagnose pests and diseases in crops. The second dataset contains a parallel text corpus for the Malian language Bambara and French, significantly expanding—as well as cleaning and correcting—the available bilingual pairs within an existing dataset. This translated corpus has increased both the quality and quantity of Bambara-language resources and has made the translations usable for machine translation purposes.

Read below for links to these datasets, along with more information about them, the teams who created them, and potential use cases.

We are grateful to our co-founders, without whom the creation of these inclusive and open machine learning datasets would not have been possible: The Rockefeller Foundation, Google.org, Canada’s International Development Research Centre, and GIZ’s FAIR Forward programme on behalf of the German Federal Ministry for Economic Cooperation and Development (BMZ).

Machine Learning Datasets for Crop Pest and Disease Diagnosis: Crop Imagery and Spectrometry Data

Contact: Joyce Nakatumba-Nabende | joyce.nabende@mak.ac.ug

Collaborators at Makerere Artificial Intelligence Lab, Nelson Mandela African Institute of Science and Technology, KaraAgro AI Foundation, and Namibia University of Science and Technology have created a repository of image and spectrometry datasets for five main food security crops in Sub-Saharan Africa: cassava, maize, beans, bananas, and cocoa. Collected and curated in collaboration with in-country agricultural experts, the datasets support a wide range of machine learning applications, including classification, object detection, early crop disease detection, and spatial analysis. The team collected and annotated 127,046 images and 39,300 spectral data points.

Authors and Affiliations:

Joyce Nakatumba-Nabende, Makerere University (Uganda)
Andrew Katumba, Makerere University (Uganda)
Claire Babirye, Makerere University Artificial Intelligence Lab (Uganda)
Jeremy Francis Tusubira, Makerere University Artificial Intelligence Lab (Uganda)
Godliver Owomugisha, Makerere University Artificial Intelligence Lab (Uganda)
Neema Mduma, Nelson Mandela African Institute of Science and Technology (Tanzania)
Darlington Akogo, KaraAgro AI Foundation (Ghana)
Blessing Sibanda, Namibia University of Science and Technology (Namibia)

Bayelemabaga Aligned Bambara-French Corpus for Machine Translation

Contact: Christopher Homan | christopher.m.homan.phd@gmail.com

Collaborators at Rochester Institute of Technology, RobotsMali, INALCO, George Mason University, and Boston College have created a parallel Bambara-French corpus for machine translation. Bambara is spoken by about 15 million people in West Africa, primarily in Mali, as well as in Senegal, Niger, Mauritania, Gambia, and Ivory Coast. The Bayelemabaga dataset consists of 46,976 machine translation-ready parallel Bambara-French sentence pairs, drawn from the Bambara Reference Corpus from INALCO’s LLACAN Lab. A smaller collection of bilingual texts already existed, but those texts had not been translated or paired in a manner suitable for machine translation. This dataset has allowed Bambara to move from a language with insignificant resources to one that now has moderate, high-quality resources.

Creating these 46,976 text units required a total of 72,000 French and Bambara sentences, which were curated by removing or correcting duplicates, unsuitable text, and poor translations. The text in the dataset is extracted from 264 text files, including periodicals, books, short stories, blog posts, and parts of the Bible and the Quran. The team’s efforts grew the bilingual section of the Bambara Reference Corpus from 19,000 pairs to approximately 80,000.
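For readers curious what this kind of curation can look like in practice, here is a minimal Python sketch of one possible cleaning pass over parallel sentence pairs. It is not the team's actual pipeline: the function names, the length-ratio heuristic, and its threshold are all assumptions made for illustration, and the sample phrases are toy examples rather than entries from the Bayelemabaga dataset.

```python
import unicodedata


def normalize(sentence: str) -> str:
    """Normalize Unicode, collapse whitespace, and casefold for comparison."""
    text = unicodedata.normalize("NFC", sentence)
    return " ".join(text.split()).casefold()


def curate_pairs(pairs, max_length_ratio=3.0):
    """Drop empty, duplicate, and badly length-mismatched sentence pairs.

    `pairs` is an iterable of (bambara, french) strings. The length-ratio
    threshold is a hypothetical heuristic, not the team's actual criterion.
    """
    seen = set()
    kept = []
    for bambara, french in pairs:
        key = (normalize(bambara), normalize(french))
        if not key[0] or not key[1]:
            continue  # one side is empty
        if key in seen:
            continue  # duplicate after normalization
        longer = max(len(key[0]), len(key[1]))
        shorter = min(len(key[0]), len(key[1]))
        if longer / shorter > max_length_ratio:
            continue  # likely misaligned or partial translation
        seen.add(key)
        kept.append((bambara.strip(), french.strip()))
    return kept


# Toy pairs for illustration only (not taken from the dataset).
sample = [
    ("I ni ce", "Bonjour"),
    ("I ni ce  ", "bonjour"),  # duplicate once normalized
    ("I ni sɔgɔma", "Bonjour"),
    ("", "Merci"),  # missing Bambara side
    ("I ni ce", "Bonjour, comment allez-vous aujourd'hui ?"),  # length mismatch
]

for bambara, french in curate_pairs(sample):
    print(f"{bambara}\t{french}")
```

A pass like this only catches mechanical problems. Judging whether a translation is actually poor, as the team did, still requires review by people who know both languages.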

Authors and Affiliations:

Allahsera Auguste Tapo, Rochester Institute of Technology (USA)
Michael Leventhal, RobotsMali (Mali)
Valentin Vydrin, INALCO (France)
Sebastian Diarra, RobotsMali (Mali)
Marcos Zampieri, George Mason University (USA)
Emily Prud’Hommeaux, Boston College (USA)
Jean Jacque Méric, INALCO (France)

Background:

Why do we need more open datasets in the domain of natural language processing (NLP)?

Timely and accurate access to information – spoken or written – in one’s own language is key to participating fully in the digital world. Translation, speech recognition and synthesis, and many other AI-enabled applications in the field of natural language processing (NLP) require training and evaluation data that does not exist for many low-resourced languages, some of which are spoken by millions of people around the world. Lacuna Fund therefore supports the creation of open training and evaluation datasets for NLP in underserved languages. Learn more here.

Why do we need more open datasets in the domain of Agriculture?

Lacuna Fund agriculture datasets unlock the power of machine learning to alleviate food security challenges, spur economic opportunities, and give researchers, farmers, communities, and policymakers access to superior agricultural datasets. Learn more here.