Timely and accurate access to information – spoken or written – in one’s own language is essential for full participation in the digital world. Translation, the ability to understand and synthesize speech, and many other AI-enabled applications in the field of natural language processing (NLP) require labeled training data that does not exist for many languages, some spoken by millions of people around the world.
In the NLP for social good domain, rapid advances in machine translation (MT) are decreasing the amount of both parallel and monolingual data needed to train models.
Yet while the world’s largest languages, including several African languages, have at least some data coverage, a staggering number of languages outside of Europe and North America lack any text or speech data, data for fundamental tasks such as part-of-speech (POS) tagging, or data for common NLP tasks such as MT and automatic speech recognition (ASR).
Volunteer initiatives, such as Masakhane, and philanthropic initiatives, such as Gamayun and Common Voice, are attempting to address the lack of machine translation and speech recognition for underserved languages through open data and methods. The Africa NLP Challenge and the Africa NLP workshop at ICLR 2020 have also built a community of practice around African NLP.
To complement and expand these efforts, Lacuna Fund hopes to fund the creation, expansion and maintenance of labeled data. Types of datasets we would like to support are listed below, but the RFP is intentionally open, to encourage new and innovative ideas that we may not have identified.
- Benchmarking datasets to enable further NLP tasks in underserved languages.
- Creating new or unlocking existing data for easier inclusion of underserved languages in multilingual models.
- Datasets to enable advances in NLP tasks for code-switched text or speech (speech alternating between multiple languages, dialects, or registers).
- Smaller mono- or multilingual datasets optimized for specific use cases (e.g. digit or place name ASR datasets, or MT or metadata extraction for legal or medical records).
- Other ideas! See our Grantmaking Philosophy.
The 2020 request for proposals on underserved languages will be supported by The Rockefeller Foundation, Google.org, Canada’s International Development Research Centre, and the German development agency GIZ on behalf of the Federal Ministry for Economic Cooperation and Development (BMZ).