Announcing Our Second Round of Funding for Datasets for Low Resource Languages

28 April 2021

We are proud to share a selection of supported projects from our second cohort, whose ten teams will create openly accessible text and speech datasets that will fuel natural language processing (NLP) technologies in 29 languages in Eastern, Western, and Southern Africa. The training datasets produced will have significant downstream impacts on education, financial inclusion, healthcare, agriculture, communication, and disaster response. Check back in the coming weeks to learn more about each project and view the full portfolio!

Building an Annotated Spoken Corpus for Igbo NLP Tasks

This project addresses the gap in the availability of an Igbo spoken corpus for NLP tasks. Existing corpora, such as the Igbo web Corpus (IgWaC) and literary, religious and grammar texts, are either unannotated or not archived for research and NLP tasks. This study will create an annotated 1000-sentence corpus and 25 hours of unannotated audio data to launch an open access spoken corpus that will be available for research and NLP tasks.

To achieve these objectives, data will be gathered from two sources: oral narratives and live Igbo news. Ethnographic interviews will be used to collect data that covers several domains of Igbo life such as marriage, religion, language, burial, education, security, and trade. To ensure adequate representation, balance, and homogeneity, data collection will take place in the five south-eastern states where Igbo is predominantly spoken, and the team will recruit 50 different language speakers across the states to provide audio data. Igbo news recordings will be acquired from the Federal Radio Corporation of Nigeria across the five states.

“We are excited to embark on this project due to the impact it will have on the NLP community as it particularly concerns the Igbo language. The need to build an annotated corpus of contemporary Igbo is one that is long overdue. It could be very interesting to study the language from naturally occurring contexts such as narratives, stories and conversations. Therefore, we are both overjoyed and grateful to the Meridian Institute for giving us this unique opportunity through Lacuna Fund. We are very hopeful that this will serve as a springboard for the use of Igbo for NLP tasks and other applied linguistic research.”

Gerald Nweya
University of Ibadan, Building an Annotated Spoken Corpus for Igbo NLP Tasks Project Team

Building NLP Text and Speech Datasets for Low Resourced Languages in East Africa

The project will deliver open, accessible, and high-quality text and speech datasets for low-resource East African languages from Uganda, Tanzania, and Kenya. Taking advantage of the advances in NLP and voice technology requires large corpora of high-quality text and speech data. This project aims to provide this data for the following languages: Luganda, Runyankore-Rukiga, Acholi, Swahili, and Lumasaaba.

The speech data for Luganda and Swahili will be geared towards training a speech-to-text engine for an SDG-relevant use case and general-purpose ASR models that could be used in tasks such as driving aids for people with disabilities and development of AI tutors to support early education. Monolingual and parallel text corpora will be used in several NLP applications, including natural language classification, topic classification, sentiment analysis, spell checking and correction, and machine translation.

Development of Corpus, Sentiment, and Hate Speech Lexicon for Major Nigerian Languages

Sentiment analysis is a novel field of research in Natural Language Processing that deals with the identification and classification of people’s opinions and sentiments about products and services contained in a piece of text, usually in web data. While there are various resources and datasets proposed in the research community, most of them are for English, Chinese, and European languages. However, there are several under-resourced languages being used in Nigeria. For example, the Hausa, the Yoruba, and the Igbo languages are the most widely spoken languages in Nigeria, with over 150 million speakers in Nigeria alone, and widely used in other African countries—though there are few resources for sentiment analysis in these languages.

The sentiment lexicon is one of the most crucial resources for most sentiment analysis tasks, and the huge amount of data generated in these languages through social media remains untapped. Thus, the team will seek to develop a corpus, sentiment lexicon, and hate speech lexicon for the Hausa, Yoruba, and Igbo languages.

“Languages spoken in Africa are low-resource; they have insufficient resources like datasets for machine learning and other important AI tasks. In this project, we aim to produce the first large-scale high-quality datasets for machine learning from social media content written in three major Nigerian languages (Hausa, Igbo, and Yoruba). These datasets will be useful in Natural Language Processing tasks such as sentiment analysis, emotion analysis, hate speech detection, and fake news detection. We are a team of researchers from the HausaNLP research group of the Faculty of Computer Science and Information Technology, Bayero University Kano-Nigeria. We have international collaborations with Masakhane, the Sentiment Analysis Lab of the Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, and INESC TEC’s Artificial Intelligence and Decision Support Laboratory (LIAAD).”

Project Team
Development of Corpus, Sentiment, and Hate Speech Lexicon for Major Nigerian Languages

KenCorpus: Kenyan Languages Corpus

This project recognizes that language plays a central role in preserving identity and culture and in equalizing access to information. The team will build the KenCorpus (Kenyan Languages Corpus) with the goal of providing rich textual and speech data resources for selected languages spoken in Kenya.

The KenCorpus will be collected from Kiswahili, Luhyia, and Dholuo languages and is a deliberate effort to provide equal opportunities, inclusivity, participation in decision-making, and accessibility to information by providing a base dataset for building NLP tools (e.g., POS taggers, Machine Translation systems, Automatic Speech Recognition, Text to Speech, Question Answering and Conversational agents in African languages).

This project will have a great impact on the methodologies used in the rapid assembly of corpora for under-resourced languages, shed light on how to prepare and annotate speech and texts for use in multilingual communities, and inspire the growth of human language technology firms across Africa.

“Every language and culture has a story to tell, and one’s native language speaks to one’s soul. Nelson Mandela once said: “If you talk to a man in a language he understands, that goes to his head. If you talk to him in his language, that goes to his heart”. The official and national languages of Kenya are English and Kiswahili. Kenya is a multilingual nation with approximately 68 native languages. “KenCorpus”, a Native Languages of Kenya (NaLaKe) Dataset for NLP and ML, seeks to bring Kenyan languages into the NLP space. Collection of quality linguistic datasets is a first step to achieving our long-term goal of making available life-changing NLP tools for African languages as instruments of culture. Our ability to communicate new ideas and discoveries in Native Languages is crucial to scientific linguistic advancement. In this project, we will get the chance to work with selected Native speakers across Kenya, involve students in data collection and annotation, and mentor them into building NLP tools for African Languages.”

Project Team
KenCorpus: Kenyan Languages Corpus

Masakhane MT: Decolonizing Scientific Writing for Africa

When it comes to scientific communication and education, language matters. The ability for science to be discussed in local indigenous languages can not only help expand knowledge to those who do not speak English or French as a first language, but also can integrate the facts and methods of science into cultures that have been denied it in the past. Thus, the team will build a multilingual parallel corpus of African research, by translating African preprint research papers released on AfricArxiv into 6 diverse African languages.

“When it comes to scientific communication and education, language matters. The ability of science to be discussed in local indigenous languages not only can reach more people, but can open up African methodologies and research to the world. We’re exceptionally excited to bring African science to the global community and continue the journey of decolonization of scientific discourse.”

Jade Abbott
Masakhane MT: Decolonizing Scientific Writing for Africa Project Team

Multimodal Datasets for Bemba

This project will create the first multi-modal dataset for Bemba—the most populous language in Zambia, but one that lacks significant resources. The team will collect visually-grounded dialogues between native Bemba speakers, which will be diarized and transcribed in full. A sample of the data will also be translated into English. The dataset will enable the development of speech recognition and speech and text translation applications, as well as facilitate research in language grounding and multimodal model development.

“We are grateful to the Meridian Institute for giving us the opportunity through Lacuna Fund to create ‘Multimodal datasets for the Bemba language’. This will be the first multi-modal speech dataset created for any Zambian language. We are excited about this project because the dataset will enable the development of speech recognition and speech to text translation applications, as well as facilitate research in language grounding and multimodal model development.”

Claytone Sikasote
Multimodal Datasets for Bemba Project Team

Named Entity Recognition and Parts of Speech Datasets for African Languages

Currently, the majority of existing NER datasets for African languages are automatically annotated and noisy, since the text quality for African languages is not verified—only a few African languages have human-annotated NER datasets. Likewise, the only open-source POS datasets that exist are for a small subset of languages in South Africa, and Yoruba, Naija, Wolof, and Bambara.

This project will develop a Part-of-Speech (POS) and Named Entity Recognition (NER) corpus for 20 African languages based on news data. NER is a core NLP task in information extraction, and NER systems are a requirement for numerous products from spell-checkers to localization of voice and dialogue systems, conversational agents, and information retrieval necessary to identify African names, places, and people.

“We are grateful for the Lacuna Fund that will support our dataset creation initiatives. This project will lead to a better understanding of the linguistic structures of 20 African languages from four language families (Afro-Asiatic, English Creole, Niger-Congo, and Nilo-Saharan) and regions of Africa. It will also encourage benchmarking of African language datasets in natural language processing (NLP) research. We look forward to how this initiative will spur NLP research in African universities.”

Masakhane
Named Entity Recognition and Parts of Speech Datasets for African Languages Project Team

Open Source Datasets for Local Ghanaian Languages: A Case for Twi and Ga

This project will develop a new speech dataset that makes it possible for Twi (Asante, Akuapim, Fante dialects) and Ga speakers in Ghana with low English literacy to access digital financial services in their native language. Access to digital financial services will serve as the immediate use case—however, the bulk of the collected data will be additionally useful for other purposes. The team will build a phonetically balanced speech corpus (with transcriptions and rough English translations) that is focused on the financial domain, and since the speech corpus will be phonetically balanced, it should be useful in acoustic modeling for use cases beyond accessing digital financial services.

“Ashesi University and Nokwary Technologies are excited and grateful that the Lacuna Fund has chosen to fund our building of a speech dataset in native Ghanaian languages. English illiteracy and low literacy is a barrier that keeps many Ghanaians from accessing the full benefits of the digital age and, in particular, digital financial services. Advancement in speech and language technology can break this illiteracy barrier, but it is impossible to apply these advances to our native languages without datasets in these languages. The funding from Lacuna Fund will enable us to build a dataset in Twi and Ga that we believe will spur AI innovations that will help bring the full benefits of the digital age to all Ghanaians regardless of socio-economic status.”

Ashesi and Nokwary Team
Open Source Datasets for Local Ghanaian Languages: A Case for Twi and Ga