The Language Model, Resources, and Computational Pipelines for the Under-Resourced Iranian Azerbaijani

Abstract

Iranian Azerbaijani is a dialect of the Azerbaijani language spoken by more than 16% of the population in Iran (>14 million). Unfortunately, a lack of computational resources is one of the factors that puts this language and its rich culture at risk of extinction. This work aims to create fundamental natural language processing (NLP) resources and pipelines for the processing and analysis of Iranian Azerbaijani, introducing standard datasets and starter models for various NLP tasks such as language modeling, text classification, part-of-speech (POS) tagging, and machine translation. The proposed resources have been curated and preprocessed to facilitate the development of NLP models for Iranian Azerbaijani and to provide a strong baseline for further research and development. This study is an example of bridging the gap in NLP for low-resource languages and promoting the advancement of language technologies for underrepresented languages. To the best of our knowledge, this paper is the first to present major infrastructure for the processing and analysis of Iranian Azerbaijani, with the ultimate goal of improving communication and information access for millions of individuals.

Data Availability

We have provided sample data for each task; however, if you require access to the complete dataset, please do not hesitate to contact us via email at one of the following addresses: [email protected], [email protected], or [email protected]. We will be more than happy to provide you with the data you need.

Models

Our models can be found on our Hugging Face repository.
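
As a rough illustration of how a masked language model such as AzerBERT could be queried once downloaded, here is a minimal sketch using the `transformers` fill-mask pipeline. The model id shown in the usage comment is a placeholder, not the actual repository name, and the helper names `load_fill_mask` and `top_tokens` are hypothetical. The sketch also assumes the tokenizer uses the `[MASK]` token that appears in the examples in this README.

```python
from typing import Callable, List


def load_fill_mask(model_id: str) -> Callable:
    """Build a Hugging Face fill-mask pipeline for the given model id."""
    # Imported lazily so top_tokens() can be used without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_tokens(fill_mask: Callable, text: str, k: int = 3) -> List[str]:
    """Return the k most likely fillers for the single [MASK] in `text`."""
    # For an input with one mask, the pipeline returns a list of dicts,
    # each carrying the predicted token under the "token_str" key.
    predictions = fill_mask(text, top_k=k)
    return [p["token_str"].strip() for p in predictions]


# Usage (the model id below is a placeholder, NOT the real repository name):
# fill_mask = load_fill_mask("your-org/azerbert")
# print(top_tokens(fill_mask, "بو [MASK] کتابی ده."))
```

Passing the pipeline in as an argument keeps the prediction helper independent of any particular checkpoint, so the same function can be reused with whichever model is published.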

Model Output Examples

Each task below is shown with one good example and one bad example.

Language Modeling (AzerBERT)

Good example:
بو [MASK] کتابی ده.
Hypothesis: شعر

Bad example:
سن نجورسن [MASK]
Hypothesis: .

Text Classification

Good example:
خزر دنیزی بؤیوک‌لوگونه و بعضی فیزیکی جوغرافی علامتلرینه گؤره دونیانین ان بؤیوک گؤلودور .
Reference: Geography
Hypothesis: Geography

Bad example:
یوزیلین اوللرینده صفوی سولاله‌سی ضعیفله‌دیگیندن حاکیم کاظم خان مرکزی حاکیمیته تابئع‌چیلیکدن چیخدی.
Reference: History
Hypothesis: Literature

POS Tagging

Good example:
آلما آلیب گلرم، سن هئچ بیر شی آلما.
Hypothesis: آلما [NOUN] آلیب [VERB] گلرم ، [PUNC] [VERB] سن [PRON] هئچ [DET] بیر [DET] شی [NOUN] آلما [VERB] . [PUNC]

Bad example:
من بیلمیرم. سن؟
Hypothesis: من [PRON] بیلمیرم [VERB] . [PUNC] سن؟ [VERB]

Machine Translation fa2azb

Good example:
Source: و خدا از آسمان آبی فرود آورد
Reference: آللاه گؤیدن بیر یاغیش ائندیرر .
Hypothesis: آللاه گؤیدن یاغمور ائندیردی .

Bad example:
Source: چه کسی مرا راهنمایی خواهد کرد؟
Reference: منی کیمسه یول گؤسته‌رجک؟
Hypothesis: منی دوغرو یولا سالماق ایستهییرم .

Machine Translation azb2fa

Good example:
Source: ایسنانین اونا رحمی گلدی !
Reference: عیسی دلش به حال او سوخت .
Hypothesis: عیسی به او رحم کرد ،

Bad example:
Source: ایسنانین آناسێ وه قارداشلارێ اونون یانینا گلدی . آمما ازدههاما گؤره اونا یاخینلاشا بیلمهدیلر .
Reference: آنگاه مادر و برادران عیسی برای دیدنش آمدند ، اما به دلیل ازدحام جمعیت نتوانستند پیش او بروند .
Hypothesis: سپس مادر عیسی و برادرانش نزد او آمدند . اما وقتی او را دیدند ، نیافتند
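
The POS tagging hypotheses above are printed as tokens, each followed by a bracketed tag. For readers who want to post-process output in that inline format, the following is a minimal sketch that splits such a line into (token, tag) pairs. The function name `parse_tagged` is hypothetical, and the assumption that tags are uppercase ASCII labels in square brackets is inferred only from the examples shown here, not from the models' actual output interface.

```python
import re
from typing import List, Tuple

# A token is any run of non-space, non-bracket characters,
# followed by a tag written as [TAG] (optional spaces inside the brackets).
_PAIR = re.compile(r"([^\s\[\]]+)\s*\[\s*([A-Z]+)\s*\]")


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split 'token [TAG] token [TAG] ...' into a list of (token, tag) pairs."""
    return _PAIR.findall(line)


if __name__ == "__main__":
    hypothesis = "من [PRON] بیلمیرم [VERB] . [PUNC] سن؟ [VERB]"
    for token, tag in parse_tagged(hypothesis):
        print(tag, token)
```

A tag that is not directly preceded by a token is skipped by this pattern rather than raising an error.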