Language 2020 Awards
Announcing Our Second Round of Funding for Datasets for Low Resource Languages
We are proud to share a selection of supported projects from our second cohort, whose ten teams will create openly accessible text and speech datasets that will fuel natural language processing (NLP) technologies in 29 languages in Eastern, Western, and Southern Africa. The training datasets produced will have significant downstream impacts on education, financial inclusion, healthcare, agriculture, communication, and disaster response. Check back in the coming weeks to learn more about each project and view the full portfolio!
Building an Annotated Spoken Corpus for Igbo NLP Tasks
This project addresses the gap in the availability of an Igbo spoken corpus for NLP tasks. Existing corpora, such as the Igbo web Corpus (IgWaC) and literary, religious, and grammar texts, are either unannotated or not archived for research and NLP tasks. This study will create an annotated 1,000-sentence corpus and 25 hours of unannotated audio data to launch an open-access spoken corpus that will be available for research and NLP tasks.
To achieve these objectives, data will be gathered from two sources: oral narratives and live Igbo news. Ethnographic interviews will be used to collect data covering several domains of Igbo life, such as marriage, religion, language, burial, education, security, and trade. To ensure adequate representation, balance, and homogeneity, data collection will take place in the five south-eastern states where Igbo is predominantly spoken, and the team will recruit 50 different speakers across the states to provide audio data. Igbo news recordings will be acquired from the Federal Radio Corporation of Nigeria across the five states.
Building NLP Text and Speech Datasets for Low Resourced Languages in East Africa
The project will deliver open, accessible, and high-quality text and speech datasets for low-resource East African languages from Uganda, Tanzania, and Kenya. Taking advantage of the advances in NLP and voice technology requires large corpora of high-quality text and speech data. This project aims to provide this data for the following languages: Luganda, Runyankore-Rukiga, Acholi, Swahili, and Lumasaaba.
The speech data for Luganda and Swahili will be geared towards training a speech-to-text engine for an SDG-relevant use case and general-purpose ASR models that could be used in tasks such as driving aids for people with disabilities and the development of AI tutors to support early education. Monolingual and parallel text corpora will be used in several NLP applications, including natural language classification, topic classification, sentiment analysis, spell checking and correction, and machine translation.
Development of Corpus, Sentiment, and Hate Speech Lexicon for Major Nigerian Languages
Sentiment analysis is a novel field of research in Natural Language Processing that deals with the identification and classification of people’s opinions and sentiments about products and services contained in a piece of text, usually in web data. While the research community has proposed various resources and datasets, most of them are for English, Chinese, and European languages. Nigeria, however, has several under-resourced languages. For example, Hausa, Yoruba, and Igbo are the most widely spoken languages in Nigeria, with over 150 million speakers in Nigeria alone, and are widely used in other African countries, yet there are few resources for sentiment analysis in these languages.
The sentiment lexicon is one of the most crucial resources for most sentiment analysis tasks, and the huge amount of data generated in these languages through social media remains untapped. Thus, the team will seek to develop a corpus, a sentiment lexicon, and a hate speech lexicon for the Hausa, Yoruba, and Igbo languages.
KenCorpus: Kenyan Languages Corpus
This project recognizes that language plays a central role in preserving identity and culture and in equalizing access to information. The team will build the KenCorpus (Kenyan Languages Corpus) with the goal of providing rich textual and speech data resources for selected languages spoken in Kenya.
The KenCorpus will be collected from the Kiswahili, Luhyia, and Dholuo languages. It is a deliberate effort to provide equal opportunities, inclusivity, participation in decision-making, and accessibility to information by providing a base dataset for building NLP tools in African languages (e.g., POS taggers, machine translation systems, automatic speech recognition, text-to-speech, question answering, and conversational agents).
This project will have a great impact on the methodologies used in the rapid assembly of corpora for under-resourced languages, shed light on how to prepare and annotate speech and texts for use in multilingual communities, and inspire the growth of human language technology firms across Africa.
Masakhane MT: Decolonizing Scientific Writing for Africa
When it comes to scientific communication and education, language matters. The ability to discuss science in local indigenous languages can not only help expand knowledge to those who do not speak English or French as a first language, but also integrate the facts and methods of science into cultures that have been denied it in the past. Thus, the team will build a multilingual parallel corpus of African research by translating African preprint research papers released on AfricArxiv into six diverse African languages.
Multimodal Datasets for Bemba
This project will create the first multimodal dataset for Bemba, the most widely spoken language in Zambia, but one that lacks significant resources. The team will collect visually grounded dialogues between native Bemba speakers, which will be diarized and transcribed in full. A sample of the data will also be translated into English. The dataset will enable the development of speech recognition and speech and text translation applications, as well as facilitate research in language grounding and multimodal model development.
Named Entity Recognition and Parts of Speech Datasets for African Languages
Currently, the majority of existing NER datasets for African languages are automatically annotated and noisy, since the text quality for African languages is not verified; only a few African languages have human-annotated NER datasets. Likewise, the only open-source POS datasets that exist are for a small subset of South African languages and for Yoruba, Naija, Wolof, and Bambara.
This project will develop a Part-of-Speech (POS) and Named Entity Recognition (NER) corpus for 20 African languages based on news data. NER is a core NLP task in information extraction, and NER systems are a requirement for numerous products that need to identify African names, places, and people, from spell-checkers to localized voice and dialogue systems, conversational agents, and information retrieval.
Open Source Datasets for Local Ghanaian Languages: A Case for Twi and Ga
This project will develop a new speech dataset that makes it possible for Twi (Asante, Akuapim, Fante dialects) and Ga speakers in Ghana with low English literacy to access digital financial services in their native language. Access to digital financial services will serve as the immediate use case; however, the bulk of the collected data will also be useful for other purposes. The team will build a phonetically balanced speech corpus (with transcriptions and rough English translations) focused on the financial domain. Because the corpus will be phonetically balanced, it should also be useful in acoustic modeling for use cases beyond digital financial services.