Releases: echogarden-project/echogarden
v1.8.2
Features
- `whisper`: new option `timestampAccuracy`, with possible values `medium` or `high`. `medium` uses a reduced subset of attention heads for alignment, which makes it fast to compute. `high` uses all attention heads for alignment, and is thus more accurate at the word level, but slower for larger models. Defaults to `medium`
- `whisper.cpp`: new options `temperature`, `temperatureIncrement` and `enableFlashAttention`. Using flash attention can significantly improve performance in some cases. Note: enabling flash attention will automatically disable the `enableDTW` option, since the two don't seem to work together
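As a rough sketch of how these options might be passed, here is a hypothetical options fragment (the nesting of engine-specific settings under a `whisper` key is an assumption about Echogarden's options layout, not confirmed API; the option name and values come from this release):

```json
{
	"engine": "whisper",
	"whisper": {
		"timestampAccuracy": "high"
	}
}
```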
Fixes
- `whisper.cpp`: derive correct model name for `large-v3-turbo`
- `whisper` and `whisper.cpp`: error when model is set to `large-v3-turbo` and a translation task is requested (`large-v3-turbo` doesn't support translation tasks)
Full Changelog: v1.8.1...v1.8.2
v1.8.1
Fixes
- `whisper` alignment: ensure the resulting timeline always includes all words, even if not all transcript tokens were decoded.
Full Changelog: v1.8.0...v1.8.1
v1.8.0
Enhancements
- Use PFFFT library (WASM port) with SIMD support, instead of KissFFT, for FFT operations
- MDX-NET source separation: further improvements in speed, mostly due to faster FFT operations. With DirectML (Windows) or CUDA (Linux) GPU acceleration, it now reaches up to 30x real-time on an NVIDIA RTX 2060 and 13th Gen Core i3 for the default model, and about 17x real-time for the 3 higher-quality models.
- `whisper.cpp`: use updated packages
- `whisper`: by default, use optimized alignment heads for all model sizes (increases recognition speed by reducing the alignment time for each part). Can be enabled or disabled using a new option `useOptimizedAlignmentHeads`
- Source separation: ensure output audio never clips
- Optimizations in various audio processing operations, with less in-memory copying
Fixes
- `whisper` alignment engine: fix issue where the model would decode too many tokens for a single part, eventually leading to a crash due to an ONNX runtime error. The maximum number of decoded tokens per part is now configurable using `maxTokensPerPart`, and defaults to 250
- MDX-NET source separation: reduce default logging verbosity. Can be made more verbose by setting `logLevel` to `trace`
Documentation
- Officially document `cuda` ONNX provider support for all engines that depend on ONNX models. Supported on Linux only, and can often be faster than DirectML, even when used within Windows WSL (Ubuntu). Requires manual installation of CUDA Toolkit 12.x and cuDNN 9.x
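As an illustrative sketch only, selecting the `cuda` provider might look like the following options fragment (the `encoderProvider`/`decoderProvider` key names and the nesting under a `whisper` key are assumptions for illustration, not confirmed API):

```json
{
	"engine": "whisper",
	"whisper": {
		"encoderProvider": "cuda",
		"decoderProvider": "cuda"
	}
}
```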
Full Changelog: v1.7.0...v1.8.0
v1.7.0
New features
- MDX-NET now includes 3 new, higher-quality models: `UVR_MDXNET_Main`, `Kim_Vocal_1` and `Kim_Vocal_2`. These models produce cleaner sound with fewer artifacts, and are about 3x slower on CPU than the existing ones, but still fast on GPU
- Google Translate text-to-text translation: add 2 customization options: `tld` (set top-level domain, like `com` for `google.com` or `co.uk` for `google.co.uk`) and `maxCharactersPerPart` (maximum number of characters in each text part sent to the server)
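A hypothetical options fragment combining the two new settings (the `google-translate` engine identifier and the nesting under a `googleTranslate` key are assumptions for illustration; `2000` is an arbitrary example value, not a documented default):

```json
{
	"engine": "google-translate",
	"googleTranslate": {
		"tld": "co.uk",
		"maxCharactersPerPart": 2000
	}
}
```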
Enhancements
- MDX-NET source separation implementation has been partially rewritten, with substantially better performance, reduced memory usage, and GPU support. With an NVIDIA RTX 2060 GPU (via DirectML) and 13th Gen Core i3, it now achieves 20x real-time processing speed, which is closer to the performance of Python implementations like the ones in Ultimate Vocal Remover and Python Audio Separator
Behavioral changes
- MDX-NET will now use the `dml` ONNX execution provider (DirectML-based GPU acceleration) on Windows by default, if available
Fixes
- Text-to-text translation: fix several issues
- Google Translate text-to-text translation: Improve and fix several issues. Ensure translated output preserves the line break structure of the original
Full Changelog: v1.6.2...v1.7.0
v1.6.2
Fixes
- `dtw-ra`: split fragments into chunks based on total character count, rather than fragment count (currently set to a maximum of 1000 characters per chunk).
Full Changelog: v1.6.1...v1.6.2
v1.6.1
Enhancements
- Log ONNX provider used in Whisper session
Fixes
- Preserve paragraphs in Google Translate output text
Full Changelog: v1.6.0...v1.6.1
v1.6.0
New features
- Initial support for text-to-text translation (Google Translate engine)
- `openai-cloud` STT engine: support for custom OpenAI API-compatible speech-to-text providers, like Groq
- Support for the new `large-v3-turbo` Whisper model in both the integrated Whisper engine and the `whisper.cpp` engine
- Add 6 new VITS voices
Enhancements
- Whisper (integrated engine): hash seed before using it (ensures seeds like 0, 1, 2, 3, 4 would produce more distinct results)
- `whisper.cpp`: use updated builds
Behavioral changes
- Whisper (integrated engine): on Windows x64, will possibly use GPU-accelerated decoding (`decoderProvider=dml`) for larger models (`small*`, `medium*` and `large*`)
- `alignTimelineTranslation` / `e5` engine: reduce default DTW window's token count to 20,000 tokens
Removed features
- `whisper.cpp`: removed internal package support for `cublas-1.8.0`, due to build issues with the latest VS2022, and very long build times
- Removed optional dependency on unused package `speaker`, due to security vulnerabilities and its native module requirements
- `whisper`: removed option for `large` model keyword (`large-v3-turbo` is currently the only one supported)
Fixes
- When deriving sentence / segment timeline from word timeline, ensure sentences never break within words by temporarily masking potential sentence ending characters in the body of the word. Attempts to resolve issues #67 and #58
- `dtw-ra`: when producing an alignment reference for a set of fragments, process the fragments in chunks, rather than all at once (currently uses a maximum of 1000 fragments per chunk). Should resolve issue #64
- `whisper.cpp`: add workaround for a rare `whisper.cpp` issue with missing time offsets by falling back to the last known end offset when they are not included. Should resolve issue #65
- Don't error when DTW length is less than 2 (fixes rare issue with Whisper's internal alignment)
- Fix logging in timeline translation alignment
Full Changelog: v1.5.0...v1.6.0
v1.5.0
New features
- Speech-to-transcript-and-translation alignment aligns a translated transcript to the spoken audio with the assistance of the transcript in the original language. Supports 100 source and target languages. It uses a two-stage approach: first, conventional alignment is performed between the spoken audio and its native-language transcript. Then, the resulting timeline is aligned to the translated text using cross-language semantic text-to-text alignment
- Timeline-to-translation alignment accepts a timeline and a translated transcript, and performs the second stage independently. This allows reusing a previously aligned transcript with multiple translations, or applying the alignment to the timeline output of speech synthesis or recognition
Enhancements
- Add support for passing `cuda` as ONNX provider. The latest `onnxruntime-node` now supports it, but only on Linux (for Windows, use `dml` - DirectML)
Behavioral changes
- Passing a subtitle file to synthesis operations now ignores the cues and splits to sentences based on punctuation alone
- API operations for speech-to-translation now include separate properties for source and target languages
Fixes
- Timeline uncropping now correctly handles the edge case where a timestamp is higher than the audio duration (this can occur due to rounding or numerical stability)
- Mel spectrogram conversion now handles the case where a filterbank is wider than the maximum frequency
Full Changelog: v1.4.4...v1.5.0
v1.4.4
Enhancements
- DTW speech alignment: use an optimized Euclidean distance computation function with a fully unrolled loop when the vector size is exactly 13 (the typical MFCC vector size)
Fixes
- eSpeak: prevent using vertical bar separators (`|`) in the exact set of voices that (incorrectly) pronounce them: `roa/an` (Aragonese), `art/eo` (Esperanto), `trk/ky` (Kirghiz), `zlw/pl` (Polish), `zle/uk` (Ukrainian)
- Add missing entry for Latin (`la`) in language code parser
Full Changelog: v1.4.3...v1.4.4
v1.4.3
Fixes
- eSpeak: bring back the `|` workaround, but only when the language isn't Polish
Full Changelog: v1.4.2...v1.4.3 (note: release `v1.4.2` was unintentionally not committed to GitHub, so this release includes its changes as well)