Language Domain

Lacuna Fund language datasets create openly accessible text and speech resources that fuel natural language processing technologies in diverse languages across low- and middle-income contexts globally. Explore and download released datasets below.

2020 Awards

Description: This dataset is the first large-scale human-annotated Twitter sentiment dataset for Hausa, Igbo, Nigerian-Pidgin, and Yorùbá, the four most widely spoken languages in Nigeria. 

Authors: Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Saheed Salahudeen Abdullahi, Anuoluwapo Aremu, Alipio Jeorge, and Pavel Brazdil

Languages: Hausa, Igbo, Nigerian-Pidgin, and Yorùbá

Dataset: access here

Description: This evaluation dataset supports automatic measurement of the quality of machine translation systems for Afar, Amharic, Oromo, Somali, and Tigrinya.

Authors: Asmelash Teka Hadgu, Gebrekirstos G. Gebremeskel, Abel Aregawi

Translators: Afar – Mohammed Deresa, Yasin Nur; Amharic – Tigist Taye, Selamawit Hailemariam, Wako Tilahun; Oromo – Gemechis Melkamu, Galata Girmaye; Somali – Abdiselam Mohamed, Beshir Abdi; Tigrinya – Michael Minassie, Berhanu Abadi Weldegiorgis, Nureddin Mohammedshiek

Languages: Afar, Amharic, Oromo, Somali, and Tigrinya

Dataset: access here

All Lacuna Fund datasets are licensed under the CC-BY 4.0 International license unless otherwise noted.