From 57b6b202788b66d256123261e0222ff635b595f6 Mon Sep 17 00:00:00 2001
From: Rotem Dan
Date: Thu, 2 May 2024 14:02:55 +0300
Subject: [PATCH] Update docs

---
 docs/Tasklist.md | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/docs/Tasklist.md b/docs/Tasklist.md
index 2822112..ab2573b 100644
--- a/docs/Tasklist.md
+++ b/docs/Tasklist.md
@@ -2,7 +2,7 @@
 
 ## Bugs
 
-### Alignment
+### Alignment / DTW-RA
 
 * In DTW-RA, a recognized transcript including something like "Question 2.What does Juan", where "2.What" has a point in the middle, is breaking playback of the timeline
 * DTW-RA will not work correctly with Polish language texts, due to issues with the eSpeak engine pronouncing `|` characters, which are intended to be used as separators and ignored by all other eSpeak languages
@@ -83,7 +83,6 @@
 * `speak-youtube-subtitles`: To speak the subtitles of a YouTube video
 
 ### API
-* Accept full language names as language identifiers
 * Validate timelines to ensure timestamps are always increasing: no negative timestamps or timestamps over the duration of the audio. No sentences without words, etc. and correct if needed
 * See whether it's possible to detect and include / remove Emoji characters in timelines
 * Add support for phrases in timelines
@@ -120,7 +119,7 @@
 * Decide whether asterisk `*` should be spoken when using `speak-url` or `speak-wikipedia`
 * Add partial SSML support for all engines. In particular, allow changing language or voice using the `<lang>` and `<voice>` tags, `` and `` where possible
 * Try to remove reliance on `()` after `.` character hack in `EspeakTTS.synthesizeFragments`
-* eSpeak IPA output puts stress marks on vowels, not syllables - which is the standard for IPA. Consider how to make a conversion to and from these two approaches (possibly detect it automatically)
+* eSpeak IPA output applies stress marks on vowels, not syllables - which is usually the standard for IPA. Consider how to make a conversion to and from these two approaches (possibly detect it automatically), to provide users with more useful phonemizations
 * Decide if `msspeech` engine should be selected if available. This would require attempting to load a matching voice, and falling back if it is not installed
 * Speaker-specific voice option
 * Use VAD on the synthesized audio file to get more accurate sentence or word segmentation
@@ -149,7 +148,6 @@
 * Currently, when input is set to be SSML, it is wrapped in a `<speak>` tag. Handle the case where the user made their own SSML document wrapped with a `<speak>` tag as well. Currently, it may send invalid input to Azure
 
 ### Recognition
-* Recognized word entries that span VAD segment boundaries can be split
 * Show alternatives when playing in the CLI. Clear current line and rewrite already printed text for alternatives during the speech recognition process
 
 ### Recognition / Whisper
@@ -161,8 +159,6 @@
 
 ### Alignment
 
-* Aligned words entries that span VAD boundaries may be split
-
 ### Alignment / DTW-RA
 
 ### Alignment / Whisper
@@ -256,6 +252,15 @@
 * Allow `dtw` mode work with more speech synthesizers to produce its reference
 * Predict timing for individual letters (graphemes) based on phoneme timestamps (especially useful for Chinese and Japanese)
 
+### Translation
+* Add text-to-text translation with cloud translation APIs like Google Translate or DeepL, or offline models like OpenNMT or NLLB-200
+
+### Translation alignment
+* Perform word-level alignment of text-to-text translations from and to any language (not just English) using methods like multilingual embeddings, or specialized models, and then use the text-based alignment to align speech in any recognized language, to its translated transcript in any language supported by the text-to-text alignment approach
+
+### Speech-to-text translation
+
+* Hybrid approach: recognize speech in its language using any recognition model, then translate the resulting transcript using a text-to-text translation engine, and then align the translated transcript to the original one using text-to-text alignment, and map back to the original speech using the recognition timestamps, to get word-level alignment for the translated transcript
 
 ## Possible new engines or platforms
 
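
For illustration, here is a minimal TypeScript sketch of the timestamp-mapping step described by the new "Speech-to-text translation" item in the last hunk: given word-level timestamps from recognition and a word-to-word alignment between the recognized transcript and its translation, each translated word inherits the time range of the source word(s) it aligns to. All names here (`TimedWord`, `WordMapping`, `mapTimestampsToTranslation`) are hypothetical and not part of Echogarden's API; the recognition, translation, and text-to-text alignment steps are assumed to have produced their outputs already.

```ts
interface TimedWord {
	text: string
	startTime: number // seconds
	endTime: number   // seconds
}

// For each translated word, the indices of the recognized (source) words it aligns to.
// Assumed to come from a separate text-to-text word alignment step (e.g. multilingual
// embeddings or a specialized alignment model), which is not implemented here.
type WordMapping = number[][]

function mapTimestampsToTranslation(
	recognizedWords: TimedWord[], // word timeline from the speech recognition step
	translatedWords: string[],    // words of the machine-translated transcript
	mapping: WordMapping          // translated word index -> recognized word indices
): TimedWord[] {
	const result: TimedWord[] = []

	for (let index = 0; index < translatedWords.length; index++) {
		const sourceIndices = (mapping[index] ?? [])
			.filter(i => i >= 0 && i < recognizedWords.length)

		if (sourceIndices.length > 0) {
			// The translated word spans the earliest start and latest end of its source words
			const matched = sourceIndices.map(i => recognizedWords[i])

			result.push({
				text: translatedWords[index],
				startTime: Math.min(...matched.map(w => w.startTime)),
				endTime: Math.max(...matched.map(w => w.endTime)),
			})
		} else {
			// No aligned source word: use a zero-length range at the previous word's end time
			const previousEndTime = result.length > 0 ? result[result.length - 1].endTime : 0

			result.push({
				text: translatedWords[index],
				startTime: previousEndTime,
				endTime: previousEndTime,
			})
		}
	}

	return result
}
```

Unaligned translated words fall back to a zero-length range anchored at the previous word's end, so the resulting timeline stays monotonically non-decreasing, consistent with the timeline validation goals listed under the API section.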