v1.5.0
New features
- Speech-to-transcript-and-translation alignment aligns a translated transcript to the spoken audio, with the assistance of the transcript in the original language. Supports 100 source and target languages. It uses a two-stage approach: first, conventional alignment is performed between the spoken audio and its native-language transcript. Then, the resulting timeline is aligned to the translated text using cross-language semantic text-to-text alignment (see the sketch after this list)
- Timeline-to-translation alignment accepts a timeline and a translated transcript, and performs the second stage independently. This allows reusing a previously aligned transcript with multiple translations, or applying the operation to the timeline output of speech synthesis or recognition
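For illustration, here is a minimal TypeScript sketch of the two-stage flow. The function names, option keys, and result shape are assumptions made for the example, not the confirmed API surface:

```ts
// A minimal sketch of the two-stage flow, assuming hypothetical
// function names and option shapes (not the confirmed API surface).
import * as Echogarden from 'echogarden'

async function alignAudioToTranslation(
	audioFile: string,            // path to the spoken audio
	nativeTranscript: string,     // transcript in the original language
	translatedTranscript: string, // transcript in the target language
	sourceLanguage: string,       // e.g. 'en'
	targetLanguage: string        // e.g. 'de'
) {
	// Stage 1: conventional alignment between the spoken audio and its
	// native-language transcript, producing a timeline.
	const { timeline } = await Echogarden.align(audioFile, nativeTranscript, {
		language: sourceLanguage
	})

	// Stage 2: align the resulting timeline to the translated text using
	// cross-language semantic text-to-text alignment.
	// 'alignTimelineTranslation' is an assumed name for this operation.
	const result = await Echogarden.alignTimelineTranslation(
		timeline, translatedTranscript, { sourceLanguage, targetLanguage })

	return result.timeline
}
```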
Enhancements
- Add support for passing `cuda` as ONNX provider. Latest `onnxruntime-node` now supports it, but only on Linux (for Windows, use `dml` - DirectML)
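As a hedged example, the provider could be selected per platform like this; the exact option key used to pass the provider is an assumption here, so check the options documentation for the real name:

```ts
// A hedged sketch of per-platform ONNX provider selection. The option
// shape ('whisper: { provider }') is an assumption for illustration.
import * as Echogarden from 'echogarden'

const provider = process.platform === 'win32'
	? 'dml'  // DirectML on Windows
	: 'cuda' // supported by the latest onnxruntime-node, Linux only

const { transcript } = await Echogarden.recognize('speech.wav', {
	engine: 'whisper',
	whisper: { provider } // assumed option key
})

console.log(transcript)
```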
Behavioral changes
- Passing a subtitle file to synthesis operations now ignores the cues and splits the text into sentences based on punctuation alone
- API operations for speech-to-translation now include separate properties for source and target languages
Fixes
- Timeline uncropping now correctly handles the edge case where a timestamp is greater than the audio duration (this can occur due to rounding or numerical-stability errors); a minimal clamping sketch follows this list
- Mel spectrogram conversion now handles the case where a filterbank is wider than the maximum frequency
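The uncropping fix implies a guard of roughly this shape, sketched below with a simplified stand-in type rather than the library's own timeline definition:

```ts
// A minimal sketch of the guard the uncropping fix implies: clamp
// timestamps to the audio duration so rounding or numerical-stability
// overshoots can't produce out-of-range times. 'TimelineEntry' is a
// simplified stand-in type, not the library's own definition.
interface TimelineEntry {
	startTime: number
	endTime: number
}

function clampTimelineToDuration(timeline: TimelineEntry[], audioDuration: number) {
	return timeline.map(entry => ({
		...entry,
		startTime: Math.min(entry.startTime, audioDuration),
		endTime: Math.min(entry.endTime, audioDuration)
	}))
}
```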
Full Changelog: v1.4.4...v1.5.0