Resources for conservation, development, and documentation of low-resource (human) languages.
This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.
Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Speech synthesis (TTS) in low-resource languages by training from scratch with FastPitch and fine-tuning with HiFi-GAN
NLP pipelines for Tagalog using spaCy
CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates
SemEval2024-task 11: Bridging the Gap in Text-Based Emotion Detection
Python source code for EMNLP 2020 paper "Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT".
This is an ASR corpus for the Bemba language. It contains read speech from diverse publicly available Bemba sources: literature books, radio/TV show transcripts, YouTube video transcripts, and online sources. The corpus has 14,438 utterances, totaling over 24 hours of speech.
Exploring the Limits of Low-Resource Neural Machine Translation
This is the repository for NaijaSenti, a Lacuna-funded project to develop a sentiment corpus for four Nigerian languages: Igbo, Hausa, Yoruba, and Pidgin.
Curated list of publicly available parallel corpora for Indian languages
The EveryVoice TTS Toolkit - Text To Speech for your language
📖 LanMIT: A Toolkit for Improving Language Models in Low-resourced Speech Recognition based on Kaldi.
[ACL'24] MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)
[ACL'24 Findings] Teaching Large Language Models an Unseen Language on the Fly
My thesis on "Open Source Code and Low Resource Languages" for an MSc in Language Science and Technology at Saarland University