- IPA -> Kirshenbaum translation is still not completely consistent with what eSpeak outputs. Also, in rare situations, it produces characters that are not accepted by eSpeak, causing eSpeak to error. Investigate when that happens and how to improve on this
- Phrase splitting may split on number separators, like the `,` in `100,000`. The new segmentation library would resolve that
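A minimal sketch of the kind of guard that could prevent such splits, assuming split candidates are character offsets into the text (the function and the example are hypothetical, not part of the current code):

```ts
// Returns true when the character at `index` is a digit-group separator
// (a ',' or '.' with digits on both sides), e.g. the ',' in "100,000".
// Such positions should not be treated as phrase boundaries.
function isDigitGroupSeparator(text: string, index: number): boolean {
	const char = text[index]

	if (char !== ',' && char !== '.') {
		return false
	}

	const previousChar = text[index - 1] ?? ''
	const nextChar = text[index + 1] ?? ''

	return /[0-9]/.test(previousChar) && /[0-9]/.test(nextChar)
}

// Example: only the comma after 'Hello' remains a valid split candidate here
const sample = 'Hello, the population is 100,000 people'
const splitCandidates = [...sample].flatMap((char, i) => (char === ',' ? [i] : []))
const validSplits = splitCandidates.filter(i => !isDigitGroupSeparator(sample, i))
```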
- Investigate why WebSpeech events sometimes completely stop working in the middle of an utterance for no apparent reason. Sometimes this is permanent, until the extension is restarted. Is this a browser issue?
- If a request is made and the server takes too much time to respond, the service worker may go to sleep and the request is never canceled
- Highlighting sometimes does not appear when the mouse is pressed over a handle while speech of the element starts
- `espeak-ng`: Marker right after sentence end is not reported as an event
- `espeak-ng`: On Japanese text, it says "Chinese letter" or "Japanese letter" for characters it doesn't support
- `espeak-ng`: Broken markers on the Korean voice
- `wtf_wikipedia`: Sometimes fails on `getResult.js` without throwing a humanly readable error
- `wtf_wikipedia`: Sometimes captures markup like `.svg` etc.
- `msspeech`: Initialization fails on Chinese and Japanese voices (but not Korean)
- `compromise`: Slow initialization time. Currently, it takes more than a second
- Browser extension: Chromium doesn't fire timer events when the cursor is positioned over a scrollbar or body margins
- Show names of files written to disk. This is useful for cases where a file is auto-renamed to prevent overwriting existing data
- Restrict input media file extensions to ensure that invalid files are not passed to FFmpeg
- Show a message when a new version is available
- Figure out which terminal outputs should go to stdout, or if that's a good idea at all
- Print available synthesis voices when no voice matches (or suggest near matches)
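One possible way to suggest near matches is a plain edit-distance comparison against the available voice names; the function names and the distance cutoff below are illustrative only:

```ts
// Plain Levenshtein edit distance between two strings
function editDistance(a: string, b: string): number {
	const distances = Array.from({ length: a.length + 1 }, (_, i) =>
		Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)))

	for (let i = 1; i <= a.length; i++) {
		for (let j = 1; j <= b.length; j++) {
			const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1

			distances[i][j] = Math.min(
				distances[i - 1][j] + 1,
				distances[i][j - 1] + 1,
				distances[i - 1][j - 1] + substitutionCost)
		}
	}

	return distances[a.length][b.length]
}

// Suggest the closest voice names when no exact match was found.
// `maxDistance` is an arbitrary cutoff chosen for illustration.
function suggestNearestVoices(requestedVoice: string, availableVoices: string[], maxDistance = 3): string[] {
	return availableVoices
		.map(voice => ({ voice, distance: editDistance(requestedVoice.toLowerCase(), voice.toLowerCase()) }))
		.filter(entry => entry.distance <= maxDistance)
		.sort((a, b) => a.distance - b.distance)
		.map(entry => entry.voice)
}
```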
- `transcribe`: accept `http://` and `https://` URLs and pull the remote media file
- Make `enum` options case-insensitive if possible
- More fine-grained intermediate progress report for operations
- Suggest a possible correction on the error of not using `=`, e.g. `speed 0.9` instead of `speed=0.9`
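One way that suggestion could work, sketched with a hypothetical list of known option names (not the actual CLI parsing code):

```ts
// Hypothetical set of known option names, for illustration only
const knownOptionNames = new Set(['speed', 'pitch', 'engine', 'voice'])

// If the user typed `speed 0.9` (two separate arguments) where `speed=0.9`
// was expected, produce a suggestion string to include in the error message.
function suggestMissingEquals(args: string[]): string | undefined {
	for (let i = 0; i < args.length - 1; i++) {
		const current = args[i]
		const next = args[i + 1]

		if (knownOptionNames.has(current) && !current.includes('=') && !next.startsWith('-')) {
			return `Did you mean '${current}=${next}'?`
		}
	}

	return undefined
}

// Example: suggestMissingEquals(['text.txt', 'speed', '0.9']) => "Did you mean 'speed=0.9'?"
```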
- Multiple configuration files in `--config=..`, taking precedence by order
- Generate JSON configuration file schema
- Use a file type detector like `file-type` that uses magic numbers to detect the type of a binary file regardless of its extension. This would help to give better error messages when the given file type is wrong (see the sketch below)
- Mode to print IPA words when speaking
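For the file type detection item above, a hand-rolled magic-number check along these lines illustrates the idea; a library like `file-type` covers far more formats and edge cases:

```ts
import { openSync, readSync, closeSync } from 'fs'

// Read the first few bytes of a file and try to identify a handful of
// common media container types by their magic numbers.
function sniffMediaType(filePath: string): string | undefined {
	const header = Buffer.alloc(12)
	const fd = openSync(filePath, 'r')

	try {
		readSync(fd, header, 0, header.length, 0)
	} finally {
		closeSync(fd)
	}

	if (header.subarray(0, 4).toString('latin1') === 'RIFF' && header.subarray(8, 12).toString('latin1') === 'WAVE') {
		return 'wav'
	}

	if (header.subarray(0, 4).toString('latin1') === 'OggS') {
		return 'ogg'
	}

	if (header.subarray(0, 4).toString('latin1') === 'fLaC') {
		return 'flac'
	}

	if (header.subarray(4, 8).toString('latin1') === 'ftyp') {
		return 'mp4'
	}

	if (header.subarray(0, 3).toString('latin1') === 'ID3') {
		return 'mp3'
	}

	if (header[0] === 0xff && (header[1] & 0xe0) === 0xe0) {
		return 'mp3'
	}

	return undefined
}
```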
- Option to set audio output device for playback
- Option to set playback volume
- Maybe find a way not to pre-normalize when the audio is silent (to prevent a 30 dB increase of what may be mostly noise)
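A sketch of such a silence check, assuming access to the decoded samples as a `Float32Array`; the -60 dBFS threshold is an arbitrary illustrative value:

```ts
// Compute the peak level of a mono sample buffer in dBFS, and decide whether
// normalization is worth applying at all.
function shouldNormalize(samples: Float32Array, silenceThresholdDb = -60): boolean {
	let peak = 0

	for (const sample of samples) {
		const magnitude = Math.abs(sample)

		if (magnitude > peak) {
			peak = magnitude
		}
	}

	// A peak of 0 yields -Infinity dB, which is also below any threshold
	const peakDb = 20 * Math.log10(peak)

	return peakDb > silenceThresholdDb
}
```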
- Add phone playback support
- Add support for sentence templates, like `echogarden speak-file text.txt /parts/[sentence].wav`
- Correctly detect language when a Wikipedia URL is passed instead of an article name
- Add option to set the language edition separately from the language, since Wikipedia language editions have their own code system, which is slightly different from the standard one in some cases
- Use the Wikipedia reader when the URL is detected to be from `wikipedia.org`
- When given a configuration file, see if you can fall back to take options from `speak` options, for example, to take API keys that are required for both the synthesis request and the voice list request
- Support filters
- `play-with-subtitles`: Preview subtitles in terminal
- `play-with-timeline`: Preview timeline in terminal
- `subtitles-to-text`, `subtitles-to-timeline`, `srt-to-vtt`, `vtt-to-srt`
- `text-to-ipa`, `arpabet-to-ipa`, `ipa-to-arpabet`
- `phonemize`
- `normalize-text`
- `transcribe-youtube`: Transcribe the audio in a YouTube video (requires fetching the audio somehow - which can't be done using the normal YouTube API)
- `speak-youtube-subtitles`: To speak the subtitles of a YouTube video
- Validate timelines to ensure timestamps are always increasing, with no negative timestamps or timestamps beyond the duration of the audio, no sentences without words, etc., and correct them if needed
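A rough sketch of such a validation pass, with the timeline entry shape approximated (the real type may differ):

```ts
// Approximate shape of a timeline entry (the real type may differ)
interface TimelineEntry {
	type: string
	text: string
	startTime: number
	endTime: number
	timeline?: TimelineEntry[]
}

// Clamp timestamps into [0, totalDuration], ensure they never decrease,
// and drop entries whose nested timeline ended up empty. Runs recursively.
function sanitizeTimeline(entries: TimelineEntry[], totalDuration: number): TimelineEntry[] {
	const result: TimelineEntry[] = []
	let previousEndTime = 0

	for (const entry of entries) {
		const startTime = Math.min(Math.max(entry.startTime, previousEndTime, 0), totalDuration)
		const endTime = Math.min(Math.max(entry.endTime, startTime), totalDuration)

		const timeline = entry.timeline ? sanitizeTimeline(entry.timeline, totalDuration) : undefined

		// Skip sentence-like entries that ended up with no words
		if (entry.timeline && (!timeline || timeline.length === 0)) {
			continue
		}

		result.push({ ...entry, startTime, endTime, timeline })
		previousEndTime = endTime
	}

	return result
}
```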
- See whether it's possible to detect and include / remove Emoji characters in timelines
- Add support for phrases in timelines
- Accept voice list caching options in `SynthesisOptions`
- Better error message when a package is not found remotely. Currently, it just gives a `404 not found` without any other information
- Retry on network failure
- Deploy and add the new n-gram based text language detection model
- Split long words if needed
- Clauses shouldn't be split in the middle of numbers, like the `,` in `123,456`
- Decide how many punctuation characters to allow before breaking to a new line (currently it's infinite)
- If a subtitle is too short and at the end of the audio, try to extend it back if possible (for example, if the previous subtitle is already extended, take back from it)
- Add more clause separators, for even more special cases
- Add option to output usable word or phoneme-level caption files (investigate how it's done on YouTube auto-captions)
- Parse VTT's language
- Option to disable alignment (only for some engines). Alternative: use a low granularity DTW setting that is very fast to compute
- Find places to add commas (",") to improve speech fluency. VITS voices don't normally add speech breaks if there is no punctuation
- An isolated dash " - " can be converted to a " , " to ensure there's a break in the speech
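A minimal sketch of the dash-to-comma conversion described above:

```ts
// Replace isolated dashes with commas so that TTS voices that rely on
// punctuation (such as VITS voices) insert an audible break.
function addSpeechBreaks(text: string): string {
	// " - " or " -- " surrounded by whitespace becomes ", "
	return text.replace(/\s+-{1,2}\s+/g, ', ')
}

// Example: "It was late - too late to call" => "It was late, too late to call"
```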
- Ensure abbreviations like "Ph.D" or similar names are segmented and read correctly (does `cldr` treat it as a word? Maybe eSpeak doesn't recognize it as a word). "C#" and ".NET" as well
- Find a way to manually reset the voice list cache
- When synthesized text isn't pre-split to sentences, apply sentence splits by using the existing method to convert the output of word timelines to sentence/segment timelines
- Some `sapi` voices and `msspeech` languages output phones that are converted to the Microsoft alphabet, not IPA symbols. Try to see if these can be translated to IPA
- Decide whether an asterisk `*` should be spoken when using `speak-url` or `speak-wikipedia`
- Add partial SSML support for all engines. In particular, allow changing language or voice using the `<voice>` and `<lang>` tags, and `<say-as>` and `<phoneme>` where possible
- Try to remove reliance on the `()` after `.` character hack in `EspeakTTS.synthesizeFragments`
- eSpeak IPA output applies stress marks to vowels rather than syllables (marking syllables is usually the standard for IPA). Consider how to make a conversion to and from these two approaches (possibly detecting which one is used automatically), to provide users with more useful phonemizations
- Decide if the `msspeech` engine should be selected if available. This would require attempting to load a matching voice, and falling back if it is not installed
- Speaker-specific voice option
- Use VAD on the synthesized audio file to get more accurate sentence or word segmentation
- When `splitToSentences` is set to `false`, the timeline doesn't include proper sentences. Find a way to pass larger sections to the TTS, but still have proper sentences in the timeline
- Extend the heteronyms JSON document with additional words like "conducts", "survey", "protest", "transport", "abuse", "combat", "combats", "affect", "contest", "detail", "marked", "contrast", "construct", "constructs", "console", "recall", "permit", "permits", "prospect", "prospects", "proceed", "proceeds", "invite", "reject", "deserts", "transcript", "transcripts", "compact", "impact", "impacts"
- Full date normalization (e.g. `21 August 2023`, `21 Aug 2023`, `August 21, 2023`) - see the sketch below
- Add support for capitalized-only rules, and possibly also all uppercase / all lowercase rules
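A sketch of how the date formats mentioned above could be recognized before being converted to spoken words (the regex patterns and helper names are illustrative only, and number-to-words conversion is a separate step):

```ts
const monthNames = [
	'January', 'February', 'March', 'April', 'May', 'June',
	'July', 'August', 'September', 'October', 'November', 'December']

interface ParsedDate { day: number, month: number, year: number }

// Recognize '21 August 2023', '21 Aug 2023' and 'August 21, 2023' style dates
// and return a canonical structure that a later step could read out loud.
function parseWrittenDate(text: string): ParsedDate | undefined {
	const monthPattern = monthNames.map(name => `${name}|${name.slice(0, 3)}`).join('|')

	const dayFirst = new RegExp(`^(\\d{1,2})\\s+(${monthPattern})\\.?\\s+(\\d{4})$`, 'i')
	const monthFirst = new RegExp(`^(${monthPattern})\\.?\\s+(\\d{1,2}),?\\s+(\\d{4})$`, 'i')

	let match = text.trim().match(dayFirst)
	if (match) {
		return { day: parseInt(match[1]), month: monthIndex(match[2]), year: parseInt(match[3]) }
	}

	match = text.trim().match(monthFirst)
	if (match) {
		return { day: parseInt(match[2]), month: monthIndex(match[1]), year: parseInt(match[3]) }
	}

	return undefined
}

function monthIndex(name: string): number {
	return monthNames.findIndex(m => m.toLowerCase().startsWith(name.slice(0, 3).toLowerCase())) + 1
}
```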
- Add support for multiple words in `precededBy` and `succeededBy`
- Support substituting to graphemes in lexicons, not only phonemes
- Cache lexicons to avoid parsing the JSON each time it is loaded (this may not be needed if the file is relatively small)
- Is it possible to pre-phonemize common words like "the" or is it a bad idea / not necessary?
- Add support for text preprocessing for all engines that can benefit from it (possibly including cloud engines)
- Add SAPI pronunciations to lexicons (you already have the pronunciations for `en_US` and `en_GB`)
- Try to use entity recognition to detect years, dates, currencies etc., which would disambiguate cases where it is not clear, like "in 1993" in "She was born in 1993" and "It searched in 1993 websites"
- Option to add POS tags to timeline, if available
- Allow limiting how many models are cached in memory
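A minimal LRU cache along these lines could cap the number of models kept in memory (a generic sketch, not tied to any existing model loader in the codebase):

```ts
// A minimal least-recently-used cache backed by a Map, whose iteration order
// is insertion order. The least recently used entry is evicted when the
// capacity is exceeded.
class LruCache<K, V> {
	private readonly entries = new Map<K, V>()

	constructor(private readonly maxEntries: number) {}

	get(key: K): V | undefined {
		const value = this.entries.get(key)

		if (value !== undefined) {
			// Re-insert to mark this entry as most recently used
			this.entries.delete(key)
			this.entries.set(key, value)
		}

		return value
	}

	set(key: K, value: V) {
		this.entries.delete(key)
		this.entries.set(key, value)

		if (this.entries.size > this.maxEntries) {
			// The first key in iteration order is the least recently used
			const oldestKey = this.entries.keys().next().value!
			this.entries.delete(oldestKey)
		}
	}
}
```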
- Custom model paths (decide how to implement)
- Pull voice list from JSON file, or based on URL? Is that a good idea?
- Add speaker names to voice list somehow
- Currently, when input is set to be SSML, it is wrapped in a `<speak>` tag. Handle the case where the user made their own SSML document wrapped with a `<speak>` tag as well. Currently, it may send invalid input to Azure
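A sketch of the wrapping check, which only adds a `<speak>` element when the input doesn't already start with one (attribute handling, such as the `version` and `xmlns` attributes Azure expects, is left out):

```ts
// Wrap SSML input in a <speak> element only if the user hasn't already
// provided their own document-level <speak> wrapper.
function ensureSpeakWrapper(ssml: string): string {
	const trimmed = ssml.trim()

	if (/^<speak[\s>]/i.test(trimmed)) {
		return trimmed
	}

	return `<speak>${trimmed}</speak>`
}
```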
- Show alternatives when playing in the CLI. Clear current line and rewrite already printed text for alternatives during the speech recognition process
- Whisper's Chinese and Japanese output can be split into words in a more accurate way. Consider using a dedicated segmentation library to segment character sequences that have no punctuation, to aid in guessing word boundaries
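One option is the built-in `Intl.Segmenter` (available in Node.js 16+ with full ICU), which performs dictionary-based word segmentation for Chinese and Japanese:

```ts
// Split a run of Chinese or Japanese text (with no internal punctuation)
// into dictionary-based word candidates using the built-in Intl.Segmenter.
function segmentCjkText(text: string, language: string): string[] {
	const segmenter = new Intl.Segmenter(language, { granularity: 'word' })

	return [...segmenter.segment(text)]
		.filter(segment => segment.isWordLike)
		.map(segment => segment.segment)
}

// Example (actual boundaries depend on the ICU data):
// segmentCjkText('今日は良い天気です', 'ja') => ['今日', 'は', '良い', '天気', 'です']
```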
- Cache last model (if enough memory is available)
- The segment output can be used to split into segments; otherwise, it is possible to try to guess segment boundaries using pause lengths or voice activity detection
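A sketch of pause-based splitting over a word timeline (the entry shape and the 0.8-second threshold are assumptions for illustration):

```ts
interface WordTiming {
	text: string
	startTime: number
	endTime: number
}

// Group recognized words into segments by splitting wherever the pause
// between consecutive words exceeds a threshold.
function splitByPauses(words: WordTiming[], maxPauseSeconds = 0.8): WordTiming[][] {
	const segments: WordTiming[][] = []
	let currentSegment: WordTiming[] = []

	for (const word of words) {
		const previousWord = currentSegment[currentSegment.length - 1]

		if (previousWord && word.startTime - previousWord.endTime > maxPauseSeconds) {
			segments.push(currentSegment)
			currentSegment = []
		}

		currentSegment.push(word)
	}

	if (currentSegment.length > 0) {
		segments.push(currentSegment)
	}

	return segments
}
```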
- Bring back the option to use eSpeak DTW based alignment on segments, as an alternative approach
- For the `granularity` option, add more granularities like `xxx-low` and `xxxx-low` (should the naming be changed? Maybe transition to a new naming scheme?)
- Add and test official support for more than 6 hours of audio
- Option to customize overlap
- Add more models
- Option to allow or disallow local file paths as arguments to API methods (as a security safeguard)
- Add cancellation checks in more operations
- Support more operations
- Options UI
- Add supported engines and voices to WebSpeech voice list
- Pause and resume support
- Autoscroll should work even if the scrollbar relevant to the target element is not the viewport's scrollbar
- Find a way to show handles even for elements that start with a link
- Add detection for line breaks in `pre` blocks
- Support the custom tags used in YouTube comments
- Show handles based on `<br>` tags and possibly line breaks internal to the element
- Show handles based on sentence start positions
- UI or gesture to stop speech (other than the `esc` key)
- Hide handles when mouse leaves the viewport
- Don't show handles when mouse is over a large container element
- Button or keyboard shortcut to show and hide handles
- Show blinking placeholder when synthesis is loading for a particular text node
- Navigate paragraphs or sentences with keyboard shortcuts
- Minimum size when iterating text nodes to get handle
- Find a way to reset voice list cache on update
- CLI code has a lot of repetition. See how it can be refactored
- See if the installation of `winax` can be automated and only initiated if it is in a Windows environment
- Ensure that all modules have no internal state other than caching
- Start thinking about some modules being available in the browser. Which Node core APIs do they use? Which of them can be polyfilled, and which cannot?
- Remove built-in voices from `flite` to reduce size?
- Slim down the `kuromoji` package to reduce base installation size
- Test that SSML works where it should
- Test that alignment works correctly when the input is SSML
- Test synthesis, recognition and alignment with empty input. Do they still work?
- Test everything's fine on macOS
- Test that cloud services all still work correctly, especially with SSML inputs
- Auto-generate options file, with comments, based on default options of the API
- Have the CLI launch a background worker (in a thread) to enable better parallelism
- Playback result audio while synthesis or recognition is still processing in the background
- Auto-import and extract project Gutenberg texts (by URL or from a file)
- `stdin` input support
- `stdout` output support
- Markdown file as text input?
- Web based frontend UI to the server
- Adapt some WASM modules to also run on the web
- Investigate running in WebContainer
- Auto-install npm modules when needed, using an approach similar to `npm-programmatic`
- Add capitalization and punctuation to recognized outputs if needed (Silero has a model for it for `en`, `de`, `ru`, `es`, but in `.pt` format only)
- Synthesize the given subtitle file and try to preserve the existing timing of cues, or even align to existing speech
- Low latency, streaming recognition mode. Make the partial transcription available as fast as possible
- Live input / microphone recognition
- Implement beam search for Whisper decoder
- Implement beam search for Silero decoder
- Live Vosk alternatives events
- Investigate exporting Whisper models to 16-bit quantized ONNX or a mix of 16-bit and 32-bit
- Method to align audio file to audio file
- Allow the `dtw` mode to work with more speech synthesizers to produce its reference
- Predict timing for individual letters (graphemes) based on phoneme timestamps (especially useful for Chinese and Japanese)
- PlayHT speech synthesis cloud service
- Deepgram cloud text-to-speech API
- Assembly AI cloud speech recognition API
- Picovoice Orca local text-to-speech
- Coqui STT server connection
- MarbleNet VAD, included in the NVIDIA NeMo framework, can be exported to ONNX
- Silero text enhancement engine can be ported to ONNX
- Figure out how to support `julius` speech recognition via WASM
- Any way to support RHVoice?
- Using a machine translation model to provide speech translation to languages other than English
- Bring back interleaved playback
- Bring back debugging file output
- Speaker diarization
- Support alignment of EPUB 3 eBooks with a corresponding audiobook
- Voice cloning
- Speech-to-speech voice conversion
- Speech-to-speech translation
- HTML generator, that includes text and audio, with playback and word highlighting
- Video generator
- Desktop app that uses the tool to transcribe the PC audio output
- Special method to use time stretching to project between different aligned utterances of the same text
- Is it possible to combine the Silero speech recognizer and a language model and try to perform Viterbi decoding to find alignments?