Releases: echogarden-project/echogarden
v1.8.2
Features
- `whisper`: new option `timestampAccuracy`, with possible values `medium` or `high`. `medium` uses a reduced subset of attention heads for alignment, which makes it fast to compute. `high` uses all attention heads for alignment, and is thus more accurate at the word level, but slower for larger models. Defaults to `medium`
- `whisper.cpp`: new options `temperature`, `temperatureIncrement` and `enableFlashAttention`. Using flash attention can significantly improve performance in some cases. Note: enabling flash attention will automatically disable the `enableDTW` option, since the two don't seem to work together
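As a rough sketch of how these options might be passed, here is a hypothetical options fragment (the nesting of engine-specific settings under a `whisper` key is an assumption about Echogarden's options layout, not confirmed API; the option name and values come from this release):

```json
{
	"engine": "whisper",
	"whisper": {
		"timestampAccuracy": "high"
	}
}
```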
Fixes
- `whisper.cpp`: derive correct model name for `large-v3-turbo`
- `whisper` and `whisper.cpp`: error when model is set to `large-v3-turbo` and a translation task is requested (`large-v3-turbo` doesn't support translation tasks)
Full Changelog: v1.8.1...v1.8.2
v1.8.1
Fixes
- `whisper` alignment: ensure the resulting timeline always includes all words, even if not all transcript tokens were decoded.
Full Changelog: v1.8.0...v1.8.1
v1.8.0
Enhancements
- Use PFFFT library (WASM port) with SIMD support, instead of KissFFT, for FFT operations
- MDX-NET source separation: further improvements in speed, mostly due to faster FFT operations. With DirectML (Windows) or CUDA (Linux) GPU acceleration, it now reaches up to 30x real-time on an NVIDIA RTX 2060 and 13th Gen Core i3 for the default model, and about 17x real-time for the 3 higher-quality models.
- `whisper.cpp`: use updated packages
- `whisper`: by default, use optimized alignment heads for all model sizes (increases recognition speed by reducing the alignment time for each part). Can be enabled or disabled using a new option `useOptimizedAlignmentHeads`
- Source separation: ensure output audio never clips
- Optimizations in various audio processing operations, with less in-memory copying
Fixes
- `whisper` alignment engine: fix issue where the model would decode too many tokens for a single part, eventually leading to a crash due to an ONNX runtime error. The maximum number of decoded tokens per part is now configurable using `maxTokensPerPart`, and defaults to 250
- MDX-NET source separation: reduce default logging verbosity. Can be made more verbose by setting `logLevel` to `trace`
Documentation
- Officially document `cuda` ONNX provider support for all engines that depend on ONNX models. Supported on Linux only, and can often be faster than DirectML, even when used within Windows WSL (Ubuntu). Requires manual installation of CUDA Toolkit 12.x and cuDNN 9.x
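As an illustrative sketch only, selecting the `cuda` provider might look like the following options fragment (the `encoderProvider`/`decoderProvider` key names and the nesting under a `whisper` key are assumptions for illustration, not confirmed API):

```json
{
	"engine": "whisper",
	"whisper": {
		"encoderProvider": "cuda",
		"decoderProvider": "cuda"
	}
}
```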
Full Changelog: v1.7.0...v1.8.0
v1.7.0
New features
- MDX-NET now includes 3 new, higher-quality models: `UVR_MDXNET_Main`, `Kim_Vocal_1` and `Kim_Vocal_2`. These models produce cleaner sound with fewer artifacts, and are about 3x slower on CPU than the existing ones, but still fast on GPU
- Google Translate text-to-text translation: add 2 customization options: `tld` (set top-level domain, like `com` for `google.com` or `co.uk` for `google.co.uk`) and `maxCharactersPerPart` (maximum number of characters in each text part sent to the server)
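A hypothetical options fragment combining the two new settings (the `google-translate` engine identifier and the nesting under a `googleTranslate` key are assumptions for illustration; `2000` is an arbitrary example value, not a documented default):

```json
{
	"engine": "google-translate",
	"googleTranslate": {
		"tld": "co.uk",
		"maxCharactersPerPart": 2000
	}
}
```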
Enhancements
- MDX-NET source separation implementation has been partially rewritten, with substantially better performance, reduced memory usage, and GPU support. With an NVIDIA RTX 2060 GPU (via DirectML) and 13th Gen Core i3, it now achieves 20x real-time processing speed, which is closer to the performance of Python implementations like the ones in Ultimate Vocal Remover and Python Audio Separator
Behavioral changes
- MDX-NET will now use the `dml` ONNX execution provider (DirectML-based GPU acceleration) on Windows by default, if available
Fixes
- Text-to-text translation: fix several issues
- Google Translate text-to-text translation: Improve and fix several issues. Ensure translated output preserves the line break structure of the original
Full Changelog: v1.6.2...v1.7.0
v1.6.2
Fixes
- `dtw-ra`: split fragments into chunks based on total character count, rather than fragment count (currently set to a maximum of 1000 characters per chunk).
Full Changelog: v1.6.1...v1.6.2
v1.6.1
Enhancements
- Log ONNX provider used in Whisper session
Fixes
- Preserve paragraphs in Google Translate output text
Full Changelog: v1.6.0...v1.6.1
v1.6.0
New features
- Initial support for text-to-text translation (Google Translate engine)
- `openai-cloud` STT engine: support for custom OpenAI API-compatible speech-to-text providers, like Groq
- Support for the new `large-v3-turbo` Whisper model in both the integrated Whisper engine and the `whisper.cpp` engine
- Add 6 new VITS voices
Enhancements
- Whisper (integrated engine): hash seed before using it (ensures seeds like 0, 1, 2, 3, 4 would produce more distinct results)
- `whisper.cpp`: use updated builds
Behavioral changes
- Whisper (integrated engine): on Windows x64, will possibly use GPU-accelerated decoding (`decoderProvider=dml`) for larger models (`small*`, `medium*` and `large*`)
- `alignTimelineTranslation` / `e5` engine: reduce default DTW window's token count to 20,000 tokens
Removed features
- `whisper.cpp`: removed internal package support for `cublas-1.8.0`, due to build issues with the latest VS2022, and very long build times
- Removed optional dependency on unused package `speaker`, due to security vulnerabilities and its native module requirements
- `whisper`: removed option for `large` model keyword (`large-v3-turbo` is currently the only one supported)
Fixes
- When deriving sentence / segment timeline from word timeline, ensure sentences never break within words by temporarily masking potential sentence ending characters in the body of the word. Attempts to resolve issues #67 and #58
- `dtw-ra`: when producing an alignment reference for a set of fragments, process the fragments in chunks, rather than all at once (currently uses a maximum of 1000 fragments per chunk). Should resolve issue #64
- `whisper.cpp`: add workaround for a rare `whisper.cpp` issue with missing time offsets by falling back to the last known end offset when they are not included. Should resolve issue #65
- Don't error when DTW length is less than 2 (fixes rare issue with Whisper's internal alignment)
- Fix logging in timeline translation alignment
Full Changelog: v1.5.0...v1.6.0
v1.5.0
New features
- Speech-to-transcript-and-translation alignment aligns a translated transcript to the spoken audio with the assistance of the transcript in the original language. Supports 100 source and target languages. It uses a two-stage approach: first, conventional alignment is performed between the spoken audio and its native-language transcript. Then, the resulting timeline is aligned to the translated text using cross-language semantic text-to-text alignment
- Timeline-to-translation alignment accepts a timeline and a translated transcript, and performs the second stage independently. This allows reusing a previously aligned transcript with multiple translations, or applying the alignment to the timeline output of speech synthesis or recognition
Enhancements
- Add support for passing `cuda` as ONNX provider. The latest `onnxruntime-node` now supports it, but only on Linux (for Windows, use `dml` - DirectML)
Behavioral changes
- Passing a subtitle file to synthesis operations now ignores the cues and splits to sentences based on punctuation alone
- API operations for speech-to-translation now include separate properties for source and target languages
Fixes
- Timeline uncropping now correctly handles the edge case where a timestamp is higher than the audio duration (this can occur due to rounding or numerical stability)
- Mel spectrogram conversion now handles the case where a filterbank is wider than the maximum frequency
Full Changelog: v1.4.4...v1.5.0
v1.4.4
Enhancements
- DTW speech alignment: use an optimized Euclidean distance computation function with a fully unrolled loop when the vector size is exactly 13 (the typical MFCC vector size)
Fixes
- eSpeak: prevent using vertical bar separators (`|`) in the exact set of voices that (incorrectly) pronounce them: `roa/an` (Aragonese), `art/eo` (Esperanto), `trk/ky` (Kirghiz), `zlw/pl` (Polish), `zle/uk` (Ukrainian)
- Add missing entry for Latin (`la`) in language code parser
Full Changelog: v1.4.3...v1.4.4
v1.4.3
Fixes
- eSpeak: bring back the `|` workaround, but only when the language isn't Polish
Full Changelog: v1.4.2...v1.4.3 (note: release `v1.4.2` was unintentionally not committed to GitHub, so this release includes its changes as well)