Merge pull request #24 from Divyajyoti02/main
Divyajyoti's profile
knmnyn authored Aug 3, 2024
2 parents 8cd7e0a + b9386d7 commit 50425b7
Showing 10 changed files with 127 additions and 0 deletions.
Binary file modified .DS_Store
Binary file modified content/.DS_Store
Binary file modified content/authors/.DS_Store
74 changes: 74 additions & 0 deletions content/authors/divyajyoti/_index.md
@@ -0,0 +1,74 @@
---
# Display name
title: Divyajyoti

# Full Name (for SEO)
first_name: Divyajyoti
last_name: Panda

# Is this the primary user of the site?
superuser: false

# Role/position
role: Intern (May '24)

# Organizations/Affiliations
organizations:
- name: University of Southern California
url: 'https://www.usc.edu/'

# Short bio (displayed in user profile at end of posts)
bio: Research Intern

interests:
- Machine Learning
- Natural Language Processing
- Machine Translation
- Code-Switching
- Transliteration Systems for Indian languages
- Large Language Models

education:
courses:
- course: MS in Computer Science
institution: University of Southern California
year: 2023-2025
- course: BTech in Computer Science
institution: National Institute of Technology, Rourkela
year: 2019-2023

# Social/Academic Networking
# For available icons, see: https://docs.hugoblox.com/getting-started/page-builder/#icons
# For an email link, use "fas" icon pack, "envelope" icon, and a link in the
# form "mailto:[email protected]" or "#contact" for contact widget.
social:
- icon: envelope
icon_pack: fas
link: 'mailto:[email protected]'
- icon: github
icon_pack: fab
link: https://github.com/Divyajyoti02
- icon: linkedin
icon_pack: fab
link: https://www.linkedin.com/in/divyajyoti-panda/
# Link to a PDF of your resume/CV from the About widget.
# To enable, copy your resume/CV to `static/files/cv.pdf` and uncomment the lines below.
# - icon: cv
# icon_pack: ai
# link: files/cv.pdf

# Enter email to display Gravatar (if Gravatar enabled in Config)
email: '[email protected]'

# Highlight the author in author lists? (true/false)
highlight_name: false

# Organizational groups that you belong to (for People widget)
# Set this to `[]` or comment out if you are not using People widget.
user_groups:
- Visitors / Interns
# - Researchers
---

Divyajyoti Panda is an intern in the WING research group, working with the group since May 2024. He is a master's student in Computer Science at USC, with interests in machine translation, transliteration, and large language models.

Binary file added content/authors/divyajyoti/avatar.jpg
Binary file modified content/publication/.DS_Store
8 changes: 8 additions & 0 deletions content/publication/das-et-al-2023/cite.bib
@@ -0,0 +1,8 @@
@article{das2023statistical,
title={Statistical machine translation for {I}ndic languages},
author={Das, Sudhansu Bala and Panda, Divyajyoti and Mishra, Tapas Kumar and Patra, Bidyut Kr},
journal={Natural Language Processing},
pages={1--18},
year={2023},
publisher={Cambridge University Press}
}
17 changes: 17 additions & 0 deletions content/publication/das-et-al-2023/index.md
@@ -0,0 +1,17 @@
---
title: 'Statistical machine translation for Indic languages'
authors:
- Sudhansu Bala Das
- divyajyoti
- Tapas Kumar Mishra
- Bidyut Kumar Patra
date: '2024-06-03'
publishDate: '2024-08-02T16:08:01.479Z'
publication_types:
- article
publication: '*Natural Language Processing*'
abstract: Statistical Machine Translation (SMT) systems use various probabilistic and statistical Natural Language Processing (NLP) methods to automatically translate from one language to another language while retaining the originality of the context. This paper aims to discuss the development of bilingual SMT models for translating English into fifteen low-resource Indic languages (ILs) and vice versa. The process to build the SMT model is described and explained using a workflow diagram. Samanantar and OPUS corpus are utilized for training, and Flores200 corpus is used for fine-tuning and testing purposes. The paper also highlights various preprocessing methods used to deal with corpus noise. The Moses open-source SMT toolkit is being investigated for the system’s development. The impact of distance-based reordering and Morpho-syntactic Descriptor Bidirectional Finite-State Encoder (msd-bidirectional-fe) reordering on ILs is compared in the paper. This paper provides a comparison of SMT models with Neural Machine Translation (NMT) for ILs. All the experiments assess the translation quality using standard metrics such as BiLingual Evaluation Understudy, Rank-based Intuitive Bilingual Evaluation Score, Translation Edit Rate, and Metric for Evaluation of Translation with Explicit Ordering. From the result, it is observed that msd-bidirectional-fe reordering performs better than the distance-based reordering model for ILs. It is also noticed that even though the IL-English and English-IL systems are trained using the same corpus, the former performs better for all the evaluation metrics. The comparison between SMT and NMT shows that across various languages, SMT performs better in some cases, while NMT outperforms in others.
links:
- name: URL
url: https://www.cambridge.org/core/journals/natural-language-processing/article/statistical-machine-translation-for-indic-languages/022C193C28525D1C88A731C35DF1C388
---
10 changes: 10 additions & 0 deletions content/publication/das-et-al-2024/cite.bib
@@ -0,0 +1,10 @@
@article{bala2024multilingual,
title={Multilingual Neural Machine Translation for Indic to Indic Languages},
author={Bala Das, Sudhansu and Panda, Divyajyoti and Kumar Mishra, Tapas and Kr. Patra, Bidyut and Ekbal, Asif},
journal={ACM Transactions on Asian and Low-Resource Language Information Processing},
volume={23},
number={5},
pages={1--32},
year={2024},
publisher={ACM New York, NY}
}
18 changes: 18 additions & 0 deletions content/publication/das-et-al-2024/index.md
@@ -0,0 +1,18 @@
---
title: 'Multilingual Neural Machine Translation for Indic to Indic Languages'
authors:
- Sudhansu Bala Das
- divyajyoti
- Tapas Kumar Mishra
- Bidyut Kumar Patra
- Asif Ekbal
date: '2024-05-10'
publishDate: '2024-08-02T16:12:33.521Z'
publication_types:
- article
publication: '*ACM Transactions on Asian and Low-Resource Language Information Processing*'
abstract: The method of translation from one language to another without human intervention is known as Machine Translation (MT). Multilingual neural machine translation (MNMT) is a technique for MT that builds a single model for multiple languages. It is preferred over other approaches, since it decreases training time and improves translation in low-resource contexts, i.e., for languages that have insufficient corpus. However, good-quality MT models are yet to be built for many scenarios such as for Indic-to-Indic Languages (IL-IL). Hence, this article is an attempt to address and develop the baseline models for low-resource languages i.e., IL-IL (for 11 Indic Languages (ILs)) in a multilingual environment. The models are built on the Samanantar corpus and analyzed on the Flores-200 corpus. All the models are evaluated using standard evaluation metrics i.e., Bilingual Evaluation Understudy (BLEU) score (with the range of 0 to 100). This article examines the effect of the grouping of related languages, namely, East Indo-Aryan (EI), Dravidian (DR), and West Indo-Aryan (WI) on the MNMT model. From the experiments, the results reveal that related language grouping is beneficial for the WI group only, while it is detrimental for the EI group and shows an inconclusive effect on the DR group. The role of pivot-based MNMT models in enhancing translation quality is also investigated in this article. Owing to the presence of large good-quality corpora from English (EN) to ILs, MNMT IL-IL models using EN as a pivot are built and examined. To achieve this, English-Indic Language (EN-IL) models are developed with and without the usage of related languages. Results show that the use of related language grouping is advantageous specifically for EN to ILs. Thus, related language groups are used for the development of pivot MNMT models. It is also observed that the usage of pivot models greatly improves MNMT baselines. Furthermore, the effect of transliteration on ILs is also analyzed in this article. To explore transliteration, the best MNMT models from the previous approaches (in most cases the pivot model using related groups) are determined and built on corpus transliterated from the corresponding scripts to a modified Indian language Transliteration script (ITRANS). The outcome of the experiments indicates that transliteration helps the models built for lexically rich languages, with the best increment of BLEU scores observed in Malayalam (ML) and Tamil (TA), i.e., 6.74 and 4.72, respectively. The BLEU score using transliteration models ranges from 7.03 to 24.29. The best model obtained is the Punjabi (PA)-Hindi (HI) language pair trained on PA-WI transliterated corpus.
links:
- name: URL
url: https://dl.acm.org/doi/full/10.1145/3652026
---
