v1.3.0

@rotemdan released this 02 May 11:19

Enhancements

  • Accept language codes in multiple formats. Currently supported: ISO 639-1 (examples: es, es-MX), ISO 639-2 (example: spa), and full English language names (example: spanish)
  • Whisper: when no matching language is found, include the exact language identifier that was provided, to reduce confusion about language support
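
To illustrate the kind of lookup the two items above describe, here is a minimal sketch, not the project's actual code. The `LanguageEntry` type, the `knownLanguages` table (three entries only), and the `normalizeLanguageIdentifier` function are all hypothetical names introduced for this example. It assumes that a subtag after `-` or `_` is a region and should be kept.

```typescript
// Hypothetical, illustrative table. A real implementation would cover far more languages.
interface LanguageEntry {
  iso6391: string
  iso6392: string
  englishName: string
}

const knownLanguages: LanguageEntry[] = [
  { iso6391: 'en', iso6392: 'eng', englishName: 'english' },
  { iso6391: 'es', iso6392: 'spa', englishName: 'spanish' },
  { iso6391: 'fr', iso6392: 'fra', englishName: 'french' },
]

// Accepts 'es', 'es-MX', 'spa' or 'Spanish' and returns an ISO 639-1 based code.
function normalizeLanguageIdentifier(identifier: string): string {
  const parts = identifier.trim().split(/[-_]/)
  const primary = (parts[0] ?? '').toLowerCase()
  const region = parts[1]

  const entry = knownLanguages.find(candidate =>
    candidate.iso6391 === primary ||
    candidate.iso6392 === primary ||
    candidate.englishName === primary)

  if (entry === undefined) {
    // Echo the identifier exactly as the caller wrote it, so the message is unambiguous.
    throw new Error(`No matching language found for '${identifier}'`)
  }

  // Assumption: the second subtag, if present, is a region (for example 'MX').
  return region ? `${entry.iso6391}-${region.toUpperCase()}` : entry.iso6391
}
```

Under these assumptions, `normalizeLanguageIdentifier('spa')` and `normalizeLanguageIdentifier('Spanish')` both resolve to `es`, while an unknown identifier raises an error that quotes the original input.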

Fixes

  • Recognition / alignment / translation: timing for words that overlap with non-speech regions is now truncated to the voice-activity region where the overlap is greatest. When processing is done over audio that has been cropped using VAD, an upcoming word can appear too early, or extend too far, before or after a non-speech region, making the timing inaccurate near region boundaries. The fix ensures that, during uncropping of the timeline, each word can only span a single active voice region (selected by maximum overlap), which prevents time ranges from being over-extended
  • Fix whisper.cpp speech-to-text translation not including word offsets
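
The maximum-overlap rule in the first fix can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: `TimeRange` and `uncropWord` are hypothetical names, the cropped audio is assumed to be the voice regions concatenated back to back in order, and word times are given in seconds on the cropped timeline.

```typescript
interface TimeRange {
  startTime: number
  endTime: number
}

// Maps a word from the cropped (VAD-concatenated) timeline back to the original
// timeline, restricting it to the single voice region it overlaps the most.
function uncropWord(word: TimeRange, voiceRegions: TimeRange[]): TimeRange {
  let croppedOffset = 0
  let bestRegion: TimeRange | undefined
  let bestCroppedStart = 0
  let bestOverlap = -Infinity

  for (const region of voiceRegions) {
    const croppedStart = croppedOffset
    const croppedEnd = croppedOffset + (region.endTime - region.startTime)

    // Overlap between the word and this region, both in cropped time.
    // A negative value means the two do not overlap at all.
    const overlap =
      Math.min(word.endTime, croppedEnd) - Math.max(word.startTime, croppedStart)

    if (overlap > bestOverlap) {
      bestOverlap = overlap
      bestRegion = region
      bestCroppedStart = croppedStart
    }

    croppedOffset = croppedEnd
  }

  if (bestRegion === undefined) {
    // No voice regions given: nothing to map, return the word unchanged.
    return { ...word }
  }

  const bestCroppedEnd = bestCroppedStart + (bestRegion.endTime - bestRegion.startTime)

  // Truncate the word to the selected region, then shift to original time.
  const clampedStart = Math.min(Math.max(word.startTime, bestCroppedStart), bestCroppedEnd)
  const clampedEnd = Math.min(Math.max(word.endTime, bestCroppedStart), bestCroppedEnd)

  return {
    startTime: bestRegion.startTime + (clampedStart - bestCroppedStart),
    endTime: bestRegion.startTime + (clampedEnd - bestCroppedStart),
  }
}
```

For example, with voice regions at 1.0 to 2.0 and 5.0 to 6.0 seconds, a word at 0.8 to 1.5 on the cropped timeline overlaps the second region more, so it is truncated to that region and mapped to 5.0 to 5.5 rather than stretching across the silence in between.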

Behavioral changes

  • Remove pico and flite as default synthesis engines for some languages (pico is never actually selected, and flite uses WASI, which appears to have segmentation fault issues in Node versions 20, 21, and 22)

Full Changelog: v1.2.1...v1.3.0