Commit
Use attacut module for Thai word tokenization
flyingleafe committed Dec 7, 2023
1 parent b869488 commit 10bd5d6
Showing 1 changed file with 5 additions and 9 deletions.
14 changes: 5 additions & 9 deletions lhotse/workflows/forced_alignment/mms_aligner.py
@@ -143,19 +143,15 @@ def _word_tokenize(text: str, language: Optional[str] = None) -> List[str]:
         return kss.split_morphemes(text, return_pos=False)

     elif language == "th":
-        # `pythainlp` is alive and much better, but it is a huge package bloated with dependencies
-        if not is_module_available("tltk"):
+        if not is_module_available("attacut"):
             raise ImportError(
-                "MMSForcedAligner requires the 'tltk' module to be installed to align Thai text."
-                "Please install it with 'pip install tltk'."
+                "MMSForcedAligner requires the 'attacut' module to be installed to align Thai text."
+                "Please install it with 'pip install attacut'."
             )

-        from tltk import nlp
+        import attacut

-        pieces = nlp.pos_tag(text)
-        return [
-            word if word != "<s/>" else " " for piece in pieces for word, _ in piece
-        ]
+        return attacut.tokenize(text)

     elif language == "my":
         if not is_module_available("pyidaungsu"):

Codecov / codecov/patch: added lines #L146, #L152, and #L154 in lhotse/workflows/forced_alignment/mms_aligner.py were not covered by tests.
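For readers who want to try the new Thai branch in isolation, the following is a minimal sketch of the same pattern: check that `attacut` is installed, import it lazily, and delegate to `attacut.tokenize`. The names `module_available` and `tokenize_thai` are hypothetical stand-ins introduced here for illustration; they are not part of lhotse, and `module_available` only approximates what `is_module_available` is assumed to do.

```python
import importlib.util
from typing import List


def module_available(name: str) -> bool:
    """Hypothetical stand-in for lhotse's is_module_available helper."""
    return importlib.util.find_spec(name) is not None


def tokenize_thai(text: str) -> List[str]:
    """Split Thai text into words with attacut, mirroring the new branch."""
    if not module_available("attacut"):
        raise ImportError(
            "Thai word tokenization requires the 'attacut' module. "
            "Please install it with 'pip install attacut'."
        )
    # Imported lazily so the dependency is only needed for Thai input.
    import attacut

    return attacut.tokenize(text)
```

Calling `tokenize_thai` on a Thai string should return a list of word strings when `attacut` is installed and raise `ImportError` otherwise. Compared with the removed `tltk` code, no post-processing of POS-tagged pieces or `<s/>` markers is needed, which is why the branch shrinks to a single return statement.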