Announcing Our First Five Published Datasets

23 March 2022

Today, we are pleased to announce a major Lacuna Fund milestone: the publication of our first five funded datasets! From enabling machine learning researchers to predict fish yield, to developing the first large-scale human-annotated Twitter sentiment dataset for the most widely spoken languages in Nigeria, these Lacuna-funded datasets are poised for impact in agriculture and language. 

Learn more about the resources below:

  • A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis  |  This project produced the first large-scale human-annotated Twitter sentiment dataset for Hausa, Igbo, Nigerian-Pidgin, and Yorùbá, the four most widely spoken languages in Nigeria. It consists of approximately 30,000 annotated tweets per language, including a significant fraction of code-mixed tweets. 

This project was conducted by a team from Bayero University and Masakhane.

  • Eyes on the Ground Image Data  |  This project created a large machine learning dataset for crop phenology monitoring of smallholder farmers’ fields. It is a unique dataset of georeferenced and timestamped crop images, captured using smartphone cameras following a standardized “picture-based insurance” protocol, along with labels on input use, crop management, crop growth stages, crop damage, and yield estimates, collected across eight counties in Kenya.

This project was a collaboration between ACRE Africa, the International Food Policy Research Institute, Dvara E-Registry, and KALRO.

  • High-Accuracy Maize Plot Location and Yield Dataset in East Africa  |  This project improved the usability of the most expansive Eastern Africa crop-cut yield estimation datasets by correcting the geolocations of the fields. The data was collected by the non-profit One Acre Fund from 2015 to 2019, covers major crop-producing regions in Kenya, Rwanda, and Tanzania, and contains approximately 18,000 crop-cut yield data points for maize. The team also took initial steps to develop a method that can help correct inaccurate geolocations in other similar crop yield datasets.

This project was conducted by a team from Zindi and the Big Data Platform of the CGIAR.

  • Machine Translation Benchmark Dataset for Languages in the Horn of Africa  |  This project developed an evaluation dataset for automatically quantifying the quality of machine translation systems for Afar, Amharic, Oromo, Somali, and Tigrinya. This multi-way parallel corpus serves as a benchmark to accelerate progress in machine translation research and production systems for these five African languages.

This project was conducted by a team from Lesan AI and Mermru.

  • Sensor Based Aquaponics Fish Pond Datasets: IoT Fish Pond Monitoring Datasets  |  This project built a remotely monitored and controlled Internet of Things (IoT) fish pond water quality management system to generate labeled datasets for both conventional ponds and aquaponic pond systems. It will enable machine learning researchers to build models for predicting fish yield in the aquaponics production system in terms of weight gain, water quality parameters, and feed consumption.

This project was conducted by a team from the University of Nigeria Nsukka.

We thank all the project teams for their work to create these open, accessible resources. We are grateful to our co-founders, whose support made these datasets possible: The Rockefeller Foundation, Google.org, Canada’s International Development Research Centre, and GIZ on behalf of the German Federal Ministry for Economic Cooperation and Development. Our ask to everyone else is simple: put these resources to use and share them across your networks!

We plan to share released datasets on a quarterly basis on our website and social media platforms. Subscribe to the Lacuna Fund newsletter below and follow us on social media to stay updated on these announcements.

Meridian Institute serves as Secretariat for the Lacuna Fund.