Commit

More normalization rules and tests
kavyamanohar committed Mar 23, 2024
1 parent 63da685 commit d3d0d54
Showing 3 changed files with 24 additions and 7 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -10,6 +10,7 @@ Malayalam language only.
- Changes combination chillus to atomic chillu characters
- Normalization of vowel signs
- Corrects some common typos in Malayalam (needs thorough review)
- Alternate spelling normalizations

## Installation

21 changes: 15 additions & 6 deletions libindic/normalizer/rules/normalizer_ml.rules
@@ -1,12 +1,6 @@
#This is comment
$remove_punctuation=true
$filter_lang=ml_IN
# Common Type Corrections
ൻറ=ന്റ
ന്‍പ=മ്പ
ററ=റ്റ
റ്‍=ർ

# Chillu normalization to atomic chillus
ണ്‍=ൺ
ന്‍=ൻ
@@ -27,3 +21,18 @@ $filter_lang=ml_IN
ഇൗ=ഈ
ഉൗ=ഊ
ഒൗ=ഔ

# Common Typo Corrections
ൻറ=ന്റ
ന്‍പ=മ്പ
ററ=റ്റ
റ്‍=ർ
ദു:ഖ=ദുഃഖ
നമ:=നമഃ

# Alternate written forms
ൎയ്യ=ര്യ #ഭാൎയ്യ, സൂൎയ്യൻ
അധ്യാപ=അദ്ധ്യാപ
ൎ=ർ
ൽപ=ല്പ
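
The rules above read as `source=target` replacements, with `#` comments and `$` directives. The following is a minimal sketch of how such a file could be parsed and applied, not the library's actual implementation: `parse_rules` and `apply_rules` are hypothetical names, and applying rules sequentially in file order is an assumption. Under that assumption, the specific `ൎയ്യ=ര്യ` rule has to come before the general `ൎ=ർ` rule, which matches the order in the file.

```python
def parse_rules(text):
    """Parse 'source=target' lines; skip blanks, comments and $directives.

    Hypothetical helper: assumes '#' always starts a comment, including
    trailing comments after a rule.
    """
    rules = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop full-line and trailing comments
        if not line or line.startswith("$") or "=" not in line:
            continue
        source, target = line.split("=", 1)
        rules.append((source.strip(), target.strip()))
    return rules


def apply_rules(rules, text):
    """Apply each replacement in the order given (assumed: file order)."""
    for source, target in rules:
        text = text.replace(source, target)
    return text


RULES_TEXT = """
$filter_lang=ml_IN
# Alternate written forms
ൎയ്യ=ര്യ #ഭാൎയ്യ, സൂൎയ്യൻ
ൎ=ർ
"""

rules = parse_rules(RULES_TEXT)
print(apply_rules(rules, "ഭാൎയ്യ"))                  # specific rule fires first
print(apply_rules(list(reversed(rules)), "ഭാൎയ്യ"))  # general rule consumes ൎ first
```

With the rules reversed, the dot reph is rewritten to the chillu before the `ൎയ്യ` pattern can match, so the two printed results differ.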

9 changes: 8 additions & 1 deletion libindic/normalizer/tests/test_normalizer.py
@@ -34,10 +34,17 @@ def test_normalize(self):

# Remove punctuations
self.assertEqual(normalize('1-ാം'), '1ാം')
self.assertEqual(normalize('കാൎത്തുമ്പി'), 'കാൎത്തുമ്പി')
self.assertEqual(normalize('1-ാം', keep_punctuations=True), '1-ാം')

# Common Typos
self.assertEqual(normalize('പൂമ്പാററ'), 'പൂമ്പാറ്റ')
self.assertEqual(normalize('ദു:ഖത്തിന്റെ'), 'ദുഃഖത്തിന്റെ')
self.assertEqual(normalize('ദു:ഖത്തിന്റെ', keep_punctuations=True),
'ദുഃഖത്തിന്റെ')

# Alternate Spellings
self.assertEqual(normalize('കാൎത്തുമ്പി'), 'കാർത്തുമ്പി')
self.assertEqual(normalize('ഭാൎയ്യ'), 'ഭാര്യ')

def test_multiline_string(self):
expected = """കുഞ്ചൻ നമ്പ്യാർ
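
The new typo tests expect `ദു:ഖ` to become `ദുഃഖ` whether or not punctuation is kept. One way to get that behaviour, sketched here as an assumption and not as the library's code, is to apply the replacement rules before stripping punctuation, so the colon is turned into a visarga before it can be removed. `normalize_sketch` and `TYPO_RULES` are hypothetical names, and `string.punctuation` stands in for whatever punctuation set the library uses.

```python
import string

# Two of the typo rules from the commit, as (source, target) pairs.
TYPO_RULES = [("ദു:ഖ", "ദുഃഖ"), ("നമ:", "നമഃ")]

# Translation table that deletes ASCII punctuation (an assumed punctuation set).
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize_sketch(text, keep_punctuations=False):
    """Apply typo rules first, then optionally strip ASCII punctuation."""
    for source, target in TYPO_RULES:
        text = text.replace(source, target)
    if not keep_punctuations:
        text = text.translate(_STRIP_PUNCTUATION)
    return text


print(normalize_sketch("ദു:ഖത്തിന്റെ"))
print(normalize_sketch("1-ാം", keep_punctuations=True))
```

If punctuation were stripped first, the colon would be deleted and the `ദു:ഖ` rule would never match, which is why the order matters in this sketch.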
