
Update docs
rotemdan committed May 2, 2024
1 parent 27a027e commit 57b6b20
Showing 1 changed file with 11 additions and 6 deletions: docs/Tasklist.md
@@ -2,7 +2,7 @@

## Bugs

### Alignment
### Alignment / DTW-RA

* In DTW-RA, a recognized transcript including something like "Question 2.What does Juan", where "2.What" has a period in the middle with no following space, breaks playback of the timeline
* DTW-RA will not work correctly with Polish language texts, due to issues with the eSpeak engine pronouncing `|` characters, which are intended to be used as separators and ignored by all other eSpeak languages
@@ -83,7 +83,6 @@
* `speak-youtube-subtitles`: To speak the subtitles of a YouTube video

### API
* Accept full language names as language identifiers
* Validate timelines to ensure timestamps are always increasing: no negative timestamps, no timestamps beyond the duration of the audio, no sentences without words, etc., and correct them if needed (a sketch follows this list)
* See whether it's possible to detect and include / remove Emoji characters in timelines
* Add support for phrases in timelines
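
A minimal sketch of what such a validation pass could look like, assuming a simplified, hypothetical entry shape; the actual Echogarden timeline types and field names may differ:

```ts
// Hypothetical, simplified timeline entry shape (the real type may differ).
interface TimelineEntry {
  type: 'sentence' | 'word' | 'phone';
  text: string;
  startTime: number;
  endTime: number;
  timeline?: TimelineEntry[];
}

// Clamp timestamps into range, keep them non-decreasing, and drop empty sentences.
function validateTimeline(entries: TimelineEntry[], totalDuration: number): TimelineEntry[] {
  const result: TimelineEntry[] = [];
  let previousEndTime = 0;

  for (const entry of entries) {
    // Clamp to [0, totalDuration] and ensure times never move backward
    const startTime = Math.min(Math.max(entry.startTime, previousEndTime, 0), totalDuration);
    const endTime = Math.min(Math.max(entry.endTime, startTime), totalDuration);

    // Recursively validate nested entries (e.g. words within a sentence)
    const children = entry.timeline
      ? validateTimeline(entry.timeline, totalDuration)
      : undefined;

    // Drop sentence entries that ended up with no word entries
    if (entry.type === 'sentence' && (!children || children.length === 0)) {
      continue;
    }

    result.push({ ...entry, startTime, endTime, timeline: children });

    previousEndTime = endTime;
  }

  return result;
}
```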
@@ -120,7 +119,7 @@
* Decide whether asterisk `*` should be spoken when using `speak-url` or `speak-wikipedia`
* Add partial SSML support for all engines. In particular, allow changing language or voice using the `<voice>` and `<lang>` tags, and support `<say-as>` and `<phoneme>` where possible
* Try to remove reliance on the `()` after `.` character hack in `EspeakTTS.synthesizeFragments`
* eSpeak IPA output puts stress marks on vowels, not syllables - which is the standard for IPA. Consider how to make a conversion to and from these two approaches (possibly detect it automatically)
* eSpeak IPA output places stress marks on vowels rather than on syllables, while standard IPA places the stress mark before the stressed syllable. Consider how to convert between these two approaches (possibly detecting which one is used automatically), to provide users with more useful phonemizations (a sketch follows this list)
* Decide if `msspeech` engine should be selected if available. This would require attempting to load a matching voice, and falling back if it is not installed
* Speaker-specific voice option
* Use VAD on the synthesized audio file to get more accurate sentence or word segmentation
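
A rough heuristic sketch of converting eSpeak-style stress placement (mark before the stressed vowel) to conventional IPA placement (mark before the syllable). The vowel set and onset handling are simplified assumptions, not a real syllabifier:

```ts
// Rough heuristic: a small set of common IPA vowel symbols (not exhaustive;
// length marks and diacritics are treated as consonants here).
const ipaVowels = new Set([...'aeiouyæɑɒɔəɛɜɪʊʌœøɶɐɤɯɨʉ']);

// Move eSpeak-style stress marks (placed before the stressed vowel)
// to the approximate syllable onset, as conventional IPA expects.
// Note: this walks back over all preceding consonants, so it may over-shift
// when part of a cluster actually belongs to the previous syllable's coda.
function moveStressToSyllableOnset(ipa: string): string {
  const output: string[] = [];

  for (const char of [...ipa]) {
    if (char === 'ˈ' || char === 'ˌ') {
      // Walk back over onset consonants already emitted
      let insertIndex = output.length;

      while (insertIndex > 0) {
        const previousChar = output[insertIndex - 1];

        if (ipaVowels.has(previousChar) || previousChar === ' ' ||
            previousChar === 'ˈ' || previousChar === 'ˌ') {
          break;
        }

        insertIndex--;
      }

      output.splice(insertIndex, 0, char);
    } else {
      output.push(char);
    }
  }

  return output.join('');
}

// Example: eSpeak-style 'fənˈetɪks' becomes 'fəˈnetɪks'
```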
@@ -149,7 +148,6 @@
* Currently, when the input is set to be SSML, it is wrapped in a `<speak>` tag. Handle the case where the user's own SSML document is already wrapped in a `<speak>` tag; currently, this may send invalid input to Azure
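
A possible sketch of making the wrapping conditional; the helper name is hypothetical, and a robust version would use an actual XML parser rather than regular expressions:

```ts
// Hypothetical helper: only wrap the input in a <speak> element if it isn't
// already an SSML document with a <speak> root.
function ensureSpeakWrapper(ssmlInput: string): string {
  const trimmed = ssmlInput.trim();

  // Tolerate an XML declaration before the root element
  const withoutDeclaration = trimmed.replace(/^<\?xml[^>]*\?>\s*/i, '');

  if (/^<speak[\s>]/i.test(withoutDeclaration)) {
    return trimmed;
  }

  return `<speak>${trimmed}</speak>`;
}
```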

### Recognition
* Recognized word entries that span VAD segment boundaries can be split
* Show alternatives when playing in the CLI. Clear the current line and rewrite already-printed text for alternatives during the speech recognition process
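
A minimal sketch of in-place line rewriting with ANSI escape sequences, assuming a Node.js terminal that supports them; rewriting multiple already-printed lines would additionally need cursor-up sequences like `\x1b[1A`:

```ts
// Rewrite the current terminal line in place, e.g. to show an updated
// recognition alternative as it is refined.
function rewriteCurrentLine(text: string) {
  // '\r' returns the cursor to the start of the line, '\x1b[2K' erases it
  process.stdout.write('\r\x1b[2K' + text);
}

// Usage: call repeatedly as new alternatives arrive, then end the line when final
rewriteCurrentLine('partial: hello wor');
rewriteCurrentLine('partial: hello world');
process.stdout.write('\n');
```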

### Recognition / Whisper
@@ -161,8 +159,6 @@

### Alignment

* Aligned words entries that span VAD boundaries may be split

### Alignment / DTW-RA

### Alignment / Whisper
@@ -256,6 +252,15 @@
* Allow `dtw` mode to work with more speech synthesizers to produce its reference
* Predict timing for individual letters (graphemes) based on phoneme timestamps (especially useful for Chinese and Japanese)
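
A naive sketch of how grapheme timing could be derived from phoneme timestamps, assuming simplified, hypothetical timeline shapes; a real implementation would use an actual grapheme-to-phoneme mapping where one is available, rather than a proportional split:

```ts
// Hypothetical simplified shapes (the real timeline types may differ).
interface TimedPhoneme { phoneme: string; startTime: number; endTime: number }
interface TimedGrapheme { grapheme: string; startTime: number; endTime: number }

// Naive heuristic: assign each grapheme a proportional slice of the word's
// timed phoneme sequence.
function predictGraphemeTiming(graphemes: string[], phonemes: TimedPhoneme[]): TimedGrapheme[] {
  if (phonemes.length === 0) {
    return [];
  }

  return graphemes.map((grapheme, index) => {
    const firstPhonemeIndex = Math.floor((index / graphemes.length) * phonemes.length);
    const lastPhonemeIndex = Math.max(
      firstPhonemeIndex,
      Math.ceil(((index + 1) / graphemes.length) * phonemes.length) - 1);

    return {
      grapheme,
      startTime: phonemes[firstPhonemeIndex].startTime,
      endTime: phonemes[lastPhonemeIndex].endTime,
    };
  });
}

// Example: the characters of a Chinese word would each receive roughly one
// syllable's worth of the word's phoneme time span.
```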

### Translation
* Add text-to-text translation with cloud translation APIs like Google Translate or DeepL, or offline models like OpenNMT or NLLB-200

### Translation alignment
* Perform word-level alignment of text-to-text translations from and to any language (not just English), using methods like multilingual embeddings or specialized alignment models. Then use the text-based alignment to align speech in any recognized language to its translated transcript, in any language supported by the text-to-text alignment approach
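
A simplified sketch of embedding-based word alignment, assuming per-word multilingual embedding vectors have already been computed by some multilingual embedding model (how the vectors are obtained is outside the scope of the sketch); the greedy argmax matching shown is only one of several possible matching strategies:

```ts
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;

  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Greedy word alignment: each target word is mapped to its most similar source word.
function alignWordsByEmbedding(sourceEmbeddings: number[][], targetEmbeddings: number[][]) {
  return targetEmbeddings.map((targetVector, targetIndex) => {
    let bestSourceIndex = 0;
    let bestScore = -Infinity;

    sourceEmbeddings.forEach((sourceVector, sourceIndex) => {
      const score = cosineSimilarity(sourceVector, targetVector);

      if (score > bestScore) {
        bestScore = score;
        bestSourceIndex = sourceIndex;
      }
    });

    return { targetIndex, sourceIndex: bestSourceIndex, score: bestScore };
  });
}
```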

### Speech-to-text translation

* Hybrid approach: recognize speech in its own language using any recognition model, translate the resulting transcript using a text-to-text translation engine, align the translated transcript to the original one using text-to-text alignment, and map back to the original speech using the recognition timestamps, to get word-level alignment for the translated transcript
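
A sketch of how the hybrid pipeline could fit together; all function and type names below are hypothetical placeholders, not existing Echogarden APIs:

```ts
// Hypothetical placeholder types and functions, assumed to exist elsewhere.
interface TimedWord { text: string; startTime: number; endTime: number }
interface WordPair { sourceWordIndex: number; targetWord: string }

declare function recognize(audio: Float32Array, language: string):
  Promise<{ transcript: string; wordTimeline: TimedWord[] }>;
declare function translateText(text: string, from: string, to: string): Promise<string>;
declare function alignTranslatedText(source: string, target: string): Promise<WordPair[]>;

async function translateSpeechHybrid(audio: Float32Array, sourceLanguage: string, targetLanguage: string) {
  // 1. Recognize the speech, keeping word-level timestamps
  const { transcript, wordTimeline } = await recognize(audio, sourceLanguage);

  // 2. Translate the transcript with a text-to-text translation engine
  const translatedTranscript = await translateText(transcript, sourceLanguage, targetLanguage);

  // 3. Align the translated transcript to the original transcript (text-to-text)
  const wordPairs = await alignTranslatedText(transcript, translatedTranscript);

  // 4. Map each translated word back to the original speech via the recognition timestamps
  const translatedWordTimeline: TimedWord[] = wordPairs.map(({ sourceWordIndex, targetWord }) => ({
    text: targetWord,
    startTime: wordTimeline[sourceWordIndex].startTime,
    endTime: wordTimeline[sourceWordIndex].endTime,
  }));

  return { translatedTranscript, translatedWordTimeline };
}
```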

## Possible new engines or platforms

