The Language Model, Resources, and Computational Pipelines for the Under-Resourced Iranian Azerbaijani
Iranian Azerbaijani is a dialect of the Azerbaijani language spoken by more than 16% of the population of Iran (>14 million). Unfortunately, a lack of computational resources is one of the factors that put this language and its rich culture at risk of extinction. This work aims to create fundamental natural language processing (NLP) resources and pipelines for the processing and analysis of Iranian Azerbaijani, introducing standard datasets and starter models for various NLP tasks such as language modeling, text classification, part-of-speech (POS) tagging, and machine translation. The proposed resources have been curated and preprocessed to facilitate the development of NLP models for Iranian Azerbaijani and to provide a strong baseline for further research and development. This study is an example of bridging the gap in NLP for low-resource languages and promoting the advancement of language technologies for underrepresented languages. To the best of our knowledge, this paper is the first to present major infrastructure for the processing and analysis of Iranian Azerbaijani, with the ultimate goal of improving communication and information access for millions of individuals.
We provide sample data for each task. If you need access to the complete dataset, please contact us by email at one of the following addresses: [email protected], [email protected], or [email protected]. We will be happy to provide the data you need.
Our models can be found on our Hugging Face repository.
Task | Good Example | Bad Example |
---|---|---|
Language Modeling (AzerBERT) | بو [MASK] کتابی ده. Hypothesis: شعر | سن نجورسن [MASK] Hypothesis: . |
Text Classification | خزر دنیزی بؤیوکلوگونه و بعضی فیزیکی جوغرافی علامتلرینه گؤره دونیانین ان بؤیوک گؤلودور . Reference: Geography Hypothesis: Geography | یوزیلین اوللرینده صفوی سولالهسی ضعیفلهدیگیندن حاکیم کاظم خان مرکزی حاکیمیته تابئعچیلیکدن چیخدی. Reference: History Hypothesis: Literature |
POS tagging | آلما آلیب گلرم، سن هئچ بیر شی آلما. Hypothesis: آلما [NOUN] آلیب [VERB] گلرم [VERB] ، [PUNC] سن [PRON] هئچ [DET] بیر [DET] شی [NOUN] آلما [VERB] . [PUNC] | من بیلمیرم. سن؟ Hypothesis: من [PRON] بیلمیرم [VERB] . [PUNC] سن؟ [VERB] |
Machine Translation fa2azb | Source: و خدا از آسمان آبی فرود آورد Reference: آللاه گؤیدن بیر یاغیش ائندیرر . Hypothesis: آللاه گؤیدن یاغمور ائندیردی . | Source: چه کسی مرا راهنمایی خواهد کرد؟ Reference: منی کیمسه یول گؤستهرجک؟ Hypothesis: منی دوغرو یولا سالماق ایستهییرم . |
Machine Translation azb2fa | Source: ایسنانین اونا رحمی گلدی ! Reference: عیسی دلش به حال او سوخت . Hypothesis: عیسی به او رحم کرد ، | Source: ایسنانین آناسێ وه قارداشلارێ اونون یانینا گلدی . آمما ازدههاما گؤره اونا یاخینلاشا بیلمهدیلر . Reference: آنگاه مادر و برادران عیسی برای دیدنش آمدند ، اما به دلیل ازدحام جمعیت نتوانستند پیش او بروند . Hypothesis: سپس مادر عیسی و برادرانش نزد او آمدند . اما وقتی او را دیدند ، نیافتند |
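
The POS tagging examples in the table show the tagger output as a flat string in which each token is followed by its tag in square brackets. Below is a minimal sketch of splitting such a string into (token, tag) pairs, assuming that this `token [TAG]` layout holds for all outputs. The helper name `parse_tagged` is hypothetical and is not part of the released pipeline.

```python
import re

# Assumed layout: a token, whitespace, then an upper-case tag in square brackets.
# Optional spaces inside the brackets (e.g. "[VERB ]") are tolerated.
_TAGGED_TOKEN = re.compile(r"(\S+)\s+\[\s*([A-Z]+)\s*\]")


def parse_tagged(hypothesis: str) -> list[tuple[str, str]]:
    """Split a 'token [TAG] token [TAG] ...' string into (token, tag) pairs."""
    return _TAGGED_TOKEN.findall(hypothesis)


if __name__ == "__main__":
    # Hypothesis string copied from the "Bad Example" column of the POS tagging row.
    bad_example = "من [PRON] بیلمیرم [VERB] . [PUNC] سن؟ [VERB]"
    for token, tag in parse_tagged(bad_example):
        print(f"{tag}\t{token}")
```

Parsing the output this way makes the error in the bad example easy to spot: the final pair is the unsplit token `سن؟` tagged as `VERB`, so the question mark was never separated from the pronoun.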