Language Resources

Resources for Proposals in NLP 

The 2024 NLP Resources document (also listed below) is a collection of resources from the Technical Advisory Panel (TAP) that supplements those referenced in the RFP document. These resources are intended to help applicants obtain relevant background information, prepare a competitive proposal, and complete quality work.

These resources are intended to be neither exhaustive nor authoritative. This document does not represent an endorsement of work by the Lacuna Fund Secretariat, the TAP, or individual members.

ACADEMIC PAPERS (ENGLISH) 

  • Essentials of Language Documentation. A compilation of articles, edited by J. Gippert, N. Himmelmann, and U. Mosel, on various topics related to language documentation. These include the discipline’s specific workflow, fundamental ethical aspects, and its relationship with other fields of linguistic work. This compilation can serve as a valuable starting point for those interested in creating corpora that will later be used in NLP.
  • Decolonising Speech and Language Technology. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020). This paper reviews colonizing discourses in speech and language technology, suggests new ways of working with Indigenous communities, and seeks to open a discussion of a postcolonial approach to computational methods for supporting language vitality.
  • Datasheets for Datasets. In the electronics industry, every component, no matter how simple or complex, is accompanied by a datasheet that describes its operating characteristics, test results, recommended uses, and other information. By analogy, this paper proposes that every dataset be accompanied by a datasheet that documents its motivation, composition, collection process, recommended uses, and so on, in order to facilitate better communication between dataset creators and dataset consumers and to encourage the machine learning community to prioritize transparency and accountability.

ACADEMIC PAPERS (SPANISH) 

FRAMEWORKS (ENGLISH) 

  • Check Before You Tech: A guide for communities choosing language apps and software. While meant for technology users, this resource can also help developers take an ethical approach to data by showing them the questions communities are pondering when using language technology.
  • International Decade of Indigenous Languages (2022-2032). The goal of the International Decade is to guarantee the right of Indigenous peoples to preserve, revitalize, and promote their languages, and to integrate aspects of linguistic diversity and multilingualism in sustainable development efforts, with a particular focus on digital empowerment and language technologies.
  • Ethics in linguistics. An in-depth review of the existing literature on ethics in linguistics, both as it relates to research and as it relates to broader practices, which the authors then situate within ongoing conversations across subfields.

 

FRAMEWORKS (SPANISH) 

BOOKS 

  • Karën Fort on data annotation in NLP. This book presents a unique opportunity for constructing a consistent image of collaborative manual annotation for Natural Language Processing (NLP).  NLP has witnessed two major evolutions in the past 25 years: firstly, the extraordinary success of machine learning, which is now, for better or for worse, overwhelmingly dominant in the field, and secondly, the multiplication of evaluation campaigns or shared tasks. Both involve manually annotated corpora, for the training and evaluation of the systems. 
  • Bases de la documentación lingüística.  Besides being a manual of field techniques, this book offers a valuable set of reflections on linguistic fieldwork. It is an indispensable reference not only for those who work with indigenous languages and peoples, but also for those directly involved in the collection and management of linguistic data in general and those who work with linguistic practices of a given community, indigenous or not.  

DATABASES 

  • The South American Indigenous Language Structures (SAILS) is a large database of grammatical properties of languages gathered from descriptive materials (such as reference grammars) by a team directed by Pieter Muysken. SAILS Online was programmed by Harald Hammarström using the CLLD framework, with support from Robert Forkel.  
  • Sound of the Andes is a database containing lexical and phonological information on languages of the Quechua, Aymara, and Mapuche families. It is an interactive site that brings knowledge of these languages to the general public while presenting the information in a clear, transparent, and accessible manner for various computational analyses focused on comparing these languages. 
  • Glottolog 5.0 is a bibliographic database of the world’s lesser-known languages, compiled by Hammarström, Forkel, Haspelmath & Bank at the Max Planck Institute for Evolutionary Anthropology.

OTHER RESOURCES ON OPEN DATA 

  • Metatext: List of Translation Datasets for Machine Learning Projects. High-quality datasets are the key to good performance in natural language processing (NLP) projects. Metatext has collected a list of NLP datasets for the translation task to help get machine learning projects started.
  • Ancient Natural Language Processing aims to provide resources and tools for scholars, students, and enthusiasts who are interested in applying NLP techniques to ancient languages. It offers information about various projects that use NLP for ancient languages, such as machine translation, text analysis, and language learning. It also includes online courses and tutorials that teach how to use NLP tools for ancient languages.

Prior Resources – 2020 NLP RFP

Lacuna Fund issued calls for proposals in NLP in both 2020 and 2021. The resources issued for prior calls for proposals are available here: 2020 NLP Resources. Some key resources included:

PREVIOUS WORK AND RELEVANT BACKGROUND 

Relevant recent challenges and other efforts: 

  • Papers from the recent International Conference on Learning Representations (ICLR) AfricaNLP workshop. (There may also be upcoming workshops at additional conferences.)
  • Abundant resources on the website of the Widening ML Workshop at ACL, an organization that promotes and supports the ideas and voices of underrepresented groups in Natural Language Processing (NLP).
  • Common Voice, including ongoing efforts to create datasets for Luganda and Kinyarwanda. 
  • Masakhane, a grassroots African initiative to improve NLP in African languages. The group is undertaking many efforts related to African NLP. 

COMPILATIONS OF RESOURCES AND EXISTING DATASETS 

  • See papers from the recent International Conference on Learning Representations (ICLR) AfricaNLP workshop for information on active efforts and key considerations in a variety of languages. (There may also be upcoming workshops at additional conferences.)
  • Masakhane’s website (masakhane.io) has a strong listing of resources and existing efforts in many languages.
  • Search the ACL Anthology (including LREC and ACL conferences and workshops), OPUS, and other existing repositories for datasets in languages of interest.

GENERAL CONSIDERATIONS AND THE STATE OF THE FIELD 

  • Martinus, Laura, and Jade Z. Abbott. “A Focus on Neural Machine Translation for African Languages.” ArXiv:1906.05685 [Cs, Stat], June 14, 2019. http://arxiv.org/abs/1906.05685.
  • Nekoto, Wilhelmina, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Tajudeen Kolawole, Taiwo Fagbohungbe, Solomon Oluwole Akinola, et al. “Participatory Research for Low-Resourced Machine Translation: A Case Study in African Languages.” ArXiv:2010.02353 [Cs], October 5, 2020. http://arxiv.org/abs/2010.02353.
  • Joshi, Pratik, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. “The State and Fate of Linguistic Diversity and Inclusion in the NLP World.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 6282–93. Online: Association for Computational Linguistics, 2020. https://doi.org/10.18653/v1/2020.acl-main.560.
  • Tracey, Jennifer, Stephanie Strassel, Ann Bies, Zhiyi Song, Michael Arrigo, Kira Griffitt, Dana Delgado, et al. “Corpus Building for Low Resource Languages in the DARPA LORELEI Program.” In Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, 48–55. Dublin, Ireland: European Association for Machine Translation, 2019. https://www.aclweb.org/anthology/W19-6808.
  • Ruder, Sebastian. “Why You Should Do NLP Beyond English,” 2020. https://ruder.io/nlp-beyond-english/.
  • Neubig, Graham. “The Low Resource NLP Toolkit: 2020 Edition.” Presented at the Second AfricaNLP Workshop at ICLR 2020. http://www.phontron.com/slides/neubig20africanlp.pdf.

This is a rapidly evolving field, and new datasets and models are published almost weekly. 

PRIVACY AND ETHICS 

OTHER RESOURCES ON OPEN DATA