Merge pull request #24 from Divyajyoti02/main
Divyajyoti's profile
knmnyn authored Aug 3, 2024
2 parents 8cd7e0a + b9386d7 commit 50425b7
Showing 10 changed files with 127 additions and 0 deletions.
Binary file modified .DS_Store
Binary file modified content/.DS_Store
Binary file modified content/authors/.DS_Store
74 changes: 74 additions & 0 deletions content/authors/divyajyoti/_index.md
@@ -0,0 +1,74 @@
---
# Display name
title: Divyajyoti

# Full Name (for SEO)
first_name: Divyajyoti
last_name: Panda

# Is this the primary user of the site?
superuser: false

# Role/position
role: Intern (May '24)

# Organizations/Affiliations
organizations:
- name: University of Southern California
url: 'https://www.usc.edu/'

# Short bio (displayed in user profile at end of posts)
bio: Research Intern

interests:
- Machine Learning
- Natural Language Processing
- Machine Translation
- Code-Switching
- Transliteration Systems for Indian languages
- Large Language Models

education:
courses:
- course: MS in Computer Science
institution: University of Southern California
year: 2023-2025
- course: BTech in Computer Science
institution: National Institute of Technology, Rourkela
year: 2019-2023

# Social/Academic Networking
# For available icons, see: https://docs.hugoblox.com/getting-started/page-builder/#icons
# For an email link, use "fas" icon pack, "envelope" icon, and a link in the
# form "mailto:[email protected]" or "#contact" for contact widget.
social:
- icon: envelope
icon_pack: fas
link: 'mailto:[email protected]'
- icon: github
icon_pack: fab
link: https://github.com/Divyajyoti02
- icon: linkedin
icon_pack: fab
link: https://www.linkedin.com/in/divyajyoti-panda/
# Link to a PDF of your resume/CV from the About widget.
# To enable, copy your resume/CV to `static/files/cv.pdf` and uncomment the lines below.
# - icon: cv
# icon_pack: ai
# link: files/cv.pdf

# Enter email to display Gravatar (if Gravatar enabled in Config)
email: '[email protected]'

# Highlight the author in author lists? (true/false)
highlight_name: false

# Organizational groups that you belong to (for People widget)
# Set this to `[]` or comment out if you are not using People widget.
user_groups:
- Visitors / Interns
# - Researchers
---

Divyajyoti Panda is an intern in the WING research group, working with the group since May 2024. He is a master's student in Computer Science at USC, with interests in machine translation, transliteration, and large language models.

Binary file added content/authors/divyajyoti/avatar.jpg
Binary file modified content/publication/.DS_Store
8 changes: 8 additions & 0 deletions content/publication/das-et-al-2023/cite.bib
@@ -0,0 +1,8 @@
@article{das2023statistical,
title={Statistical machine translation for {I}ndic languages},
author={Das, Sudhansu Bala and Panda, Divyajyoti and Mishra, Tapas Kumar and Patra, Bidyut Kr},
journal={Natural Language Processing},
pages={1--18},
year={2023},
publisher={Cambridge University Press}
}
17 changes: 17 additions & 0 deletions content/publication/das-et-al-2023/index.md
@@ -0,0 +1,17 @@
---
title: 'Statistical machine translation for Indic languages'
authors:
- Sudhansu Bala Das
- divyajyoti
- Tapas Kumar Mishra
- Bidyut Kumar Patra
date: '2024-06-03'
publishDate: '2024-08-02T16:08:01.479Z'
publication_types:
- article
publication: '*Natural Language Processing*'
abstract: Statistical Machine Translation (SMT) systems use various probabilistic and statistical Natural Language Processing (NLP) methods to automatically translate from one language to another language while retaining the originality of the context. This paper aims to discuss the development of bilingual SMT models for translating English into fifteen low-resource Indic languages (ILs) and vice versa. The process to build the SMT model is described and explained using a workflow diagram. Samanantar and OPUS corpus are utilized for training, and Flores200 corpus is used for fine-tuning and testing purposes. The paper also highlights various preprocessing methods used to deal with corpus noise. The Moses open-source SMT toolkit is being investigated for the system’s development. The impact of distance-based reordering and Morpho-syntactic Descriptor Bidirectional Finite-State Encoder (msd-bidirectional-fe) reordering on ILs is compared in the paper. This paper provides a comparison of SMT models with Neural Machine Translation (NMT) for ILs. All the experiments assess the translation quality using standard metrics such as BiLingual Evaluation Understudy, Rank-based Intuitive Bilingual Evaluation Score, Translation Edit Rate, and Metric for Evaluation of Translation with Explicit Ordering. From the result, it is observed that msd-bidirectional-fe reordering performs better than the distance-based reordering model for ILs. It is also noticed that even though the IL-English and English-IL systems are trained using the same corpus, the former performs better for all the evaluation metrics. The comparison between SMT and NMT shows that across various languages, SMT performs better in some cases, while NMT outperforms in others.
links:
- name: URL
url: https://www.cambridge.org/core/journals/natural-language-processing/article/statistical-machine-translation-for-indic-languages/022C193C28525D1C88A731C35DF1C388
---
10 changes: 10 additions & 0 deletions content/publication/das-et-al-2024/cite.bib
@@ -0,0 +1,10 @@
@article{bala2024multilingual,
title={Multilingual Neural Machine Translation for Indic to Indic Languages},
author={Bala Das, Sudhansu and Panda, Divyajyoti and Kumar Mishra, Tapas and Kr. Patra, Bidyut and Ekbal, Asif},
journal={ACM Transactions on Asian and Low-Resource Language Information Processing},
volume={23},
number={5},
pages={1--32},
year={2024},
publisher={ACM New York, NY}
}
18 changes: 18 additions & 0 deletions content/publication/das-et-al-2024/index.md
@@ -0,0 +1,18 @@
---
title: 'Multilingual Neural Machine Translation for Indic to Indic Languages'
authors:
- Sudhansu Bala Das
- divyajyoti
- Tapas Kumar Mishra
- Bidyut Kumar Patra
- Asif Ekbal
date: '2024-05-10'
publishDate: '2024-08-02T16:12:33.521Z'
publication_types:
- article
publication: '*ACM Transactions on Asian and Low-Resource Language Information Processing*'
abstract: The method of translation from one language to another without human intervention is known as Machine Translation (MT). Multilingual neural machine translation (MNMT) is a technique for MT that builds a single model for multiple languages. It is preferred over other approaches, since it decreases training time and improves translation in low-resource contexts, i.e., for languages that have insufficient corpus. However, good-quality MT models are yet to be built for many scenarios such as for Indic-to-Indic Languages (IL-IL). Hence, this article is an attempt to address and develop the baseline models for low-resource languages i.e., IL-IL (for 11 Indic Languages (ILs)) in a multilingual environment. The models are built on the Samanantar corpus and analyzed on the Flores-200 corpus. All the models are evaluated using standard evaluation metrics i.e., Bilingual Evaluation Understudy (BLEU) score (with the range of 0 to 100). This article examines the effect of the grouping of related languages, namely, East Indo-Aryan (EI), Dravidian (DR), and West Indo-Aryan (WI) on the MNMT model. From the experiments, the results reveal that related language grouping is beneficial for the WI group only, while it is detrimental for the EI group and shows an inconclusive effect on the DR group. The role of pivot-based MNMT models in enhancing translation quality is also investigated in this article. Owing to the presence of large good-quality corpora from English (EN) to ILs, MNMT IL-IL models using EN as a pivot are built and examined. To achieve this, English-Indic Language (EN-IL) models are developed with and without the usage of related languages. Results show that the use of related language grouping is advantageous specifically for EN to ILs. Thus, related language groups are used for the development of pivot MNMT models. It is also observed that the usage of pivot models greatly improves MNMT baselines. Furthermore, the effect of transliteration on ILs is also analyzed in this article. To explore transliteration, the best MNMT models from the previous approaches (in most cases the pivot model using related groups) are determined and built on corpus transliterated from the corresponding scripts to a modified Indian language Transliteration script (ITRANS). The outcome of the experiments indicates that transliteration helps the models built for lexically rich languages, with the best increment of BLEU scores observed in Malayalam (ML) and Tamil (TA), i.e., 6.74 and 4.72, respectively. The BLEU score using transliteration models ranges from 7.03 to 24.29. The best model obtained is the Punjabi (PA)-Hindi (HI) language pair trained on PA-WI transliterated corpus.
links:
- name: URL
url: https://dl.acm.org/doi/full/10.1145/3652026
---
