Update docs
rotemdan committed Oct 4, 2024
1 parent 674819d commit f0fc3d9
Showing 4 changed files with 53 additions and 27 deletions.
24 changes: 24 additions & 0 deletions docs/API.md
@@ -175,6 +175,30 @@ Translates speech audio directly to a transcript in a different language (only E
}
```

## Text-to-text translation

### `translateText(input, options)`

Translates text from one language to another.

* `input`: string
* `options`: text translation options object

#### Returns (via promise):
```ts
{
text: string
translatedText: string

translationPairs: TranslationPair[]

sourceLanguage: string
targetLanguage: string
}
```

`translationPairs` is an array of objects corresponding to individual segments of the text and their translations.
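
A minimal usage sketch (assuming the usual `import * as Echogarden from 'echogarden'` namespace import, and the options documented in the options reference, where `targetLanguage` is the only required option):

```ts
import * as Echogarden from 'echogarden'

// Translate an English sentence to German.
// `targetLanguage` is required; `sourceLanguage` is auto-detected when omitted.
const { translatedText, translationPairs } = await Echogarden.translateText('How are you today?', {
	targetLanguage: 'de',
})

console.log(translatedText)

// Inspect the per-segment translation pairs
for (const pair of translationPairs) {
	console.log(pair)
}
```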

## Speech-to-translated-transcript alignment

### `alignTranslation(input, translatedTranscript, options)`
11 changes: 8 additions & 3 deletions docs/Engines.md
@@ -5,7 +5,7 @@

**Offline**:

* [VITS](https://github.com/jaywalnut310/vits) (`vits`): end-to-end neural speech synthesis architecture. Available models were trained by Michael Hansen as part of his [Piper speech synthesis system](https://github.com/rhasspy/piper). Currently, there are 117 voices, in a range of languages, including English (US, UK), Spanish (ES, MX), Portuguese (PT, BR), Italian, French, German, Dutch (NL, BE), Swedish, Norwegian, Danish, Finnish, Polish, Greek, Romanian, Serbian, Czech, Hungarian, Slovak, Slovenian, Turkish, Arabic, Farsi, Russian, Ukrainian, Catalan, Luxembourgish, Icelandic, Swahili, Kazakh, Georgian, Nepali, Vietnamese and Chinese. You can listen to audio samples of all voices and languages in [Piper's samples page](https://rhasspy.github.io/piper-samples/)
* [VITS](https://github.com/jaywalnut310/vits) (`vits`): end-to-end neural speech synthesis architecture. Available models were trained by Michael Hansen as part of his [Piper speech synthesis system](https://github.com/rhasspy/piper). Currently, there are 123 voices, in a range of languages, including English (US, UK), Spanish (ES, MX), Portuguese (PT, BR), Italian, French, German, Dutch (NL, BE), Swedish, Norwegian, Danish, Finnish, Polish, Greek, Romanian, Serbian, Czech, Hungarian, Slovak, Slovenian, Turkish, Arabic, Farsi, Russian, Ukrainian, Catalan, Luxembourgish, Icelandic, Swahili, Kazakh, Georgian, Nepali, Vietnamese and Chinese. You can listen to audio samples of all voices and languages in [Piper's samples page](https://rhasspy.github.io/piper-samples/)
* [SVOX Pico](https://github.com/naggety/picotts) (`pico`): a legacy diphone-based synthesis engine. Supports English (US, UK), Spanish, Italian, French, and German
* [Flite](https://github.com/festvox/flite) (`flite`): a legacy diphone-based synthesis engine. Supports English (US, Scottish), and several Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada and Punjabi
* [eSpeak-NG](https://github.com/espeak-ng/espeak-ng/) (`espeak`): a lightweight "robot" sounding formant-based synthesizer. Supports 100+ languages. Used internally for speech alignment, phonemization, and other internal tasks
@@ -58,7 +58,7 @@ These are commercial services that require a subscription and an API key to use:
* [Google Cloud](https://cloud.google.com/speech-to-text) (`google-cloud`)
* [Azure Cognitive Services](https://azure.microsoft.com/en-us/products/ai-services/speech-to-text/) (`microsoft-azure`)
* [Amazon Transcribe](https://aws.amazon.com/transcribe/) (`amazon-transcribe`)
* [OpenAI Cloud Platform](https://platform.openai.com/) (`openai-cloud`): runs the `large-v2` Whisper model on the cloud
* [OpenAI Cloud Platform](https://platform.openai.com/) (`openai-cloud`): runs the `large-v2` Whisper model on the cloud. A custom OpenAI-compatible provider, like [Groq](https://console.groq.com/docs/api-reference#audio), can also be used by setting a custom base URL

## Speech-to-transcript alignment

@@ -68,7 +68,6 @@ These engines' goal is to match (or "align") a given spoken recording with a giv
* Dynamic Time Warping with Recognition Assist (`dtw-ra`): recognition is applied to the audio (any recognition engine can be used), then both the ground-truth transcript and the recognized transcript are synthesized using eSpeak. Then, the best mapping is found between the two synthesized waveforms, using the DTW algorithm (see the sketch after this list), and the result is remapped back to the original audio using the timing information produced by the recognizer
* Whisper-based alignment (`whisper`): the transcript is first tokenized, then its tokens are decoded, in order, with a guided approach, using the Whisper model. The resulting token timestamps are then used to derive the timing for each word
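
To make the DTW step referenced above concrete, here is a minimal, illustrative sketch of dynamic time warping over two one-dimensional feature sequences. It is not Echogarden's implementation (the real engines operate on audio-derived feature frames and add windowing and other refinements), but it shows the core cost-matrix and backtracking idea:

```ts
// Illustrative only: align two 1-D feature sequences with dynamic time warping.
// Returns the accumulated alignment cost and the warping path as index pairs.
function dtw(a: number[], b: number[]) {
	const n = a.length
	const m = b.length

	// cost[i][j] = minimal accumulated cost of aligning the first i frames of a with the first j frames of b
	const cost = Array.from({ length: n + 1 }, () => new Array<number>(m + 1).fill(Infinity))
	cost[0][0] = 0

	for (let i = 1; i <= n; i++) {
		for (let j = 1; j <= m; j++) {
			const distance = Math.abs(a[i - 1] - b[j - 1])

			cost[i][j] = distance + Math.min(
				cost[i - 1][j],     // a[i - 1] extends a region of b already covered
				cost[i][j - 1],     // b[j - 1] extends a region of a already covered
				cost[i - 1][j - 1], // the two frames are matched directly
			)
		}
	}

	// Backtrack from the end to recover the warping path
	const path: [number, number][] = []
	let i = n
	let j = m

	while (i > 0 && j > 0) {
		path.push([i - 1, j - 1])

		const diagonal = cost[i - 1][j - 1]
		const up = cost[i - 1][j]
		const left = cost[i][j - 1]

		if (diagonal <= up && diagonal <= left) { i -= 1; j -= 1 }
		else if (up <= left) { i -= 1 }
		else { j -= 1 }
	}

	return { totalCost: cost[n][m], path: path.reverse() }
}

// Example: the path pairs up indexes of the two sequences in monotonic order
console.log(dtw([0, 1, 2, 3], [0, 0, 1, 2, 2, 3]).path)
```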


## Speech-to-text translation

**Offline**:
@@ -78,6 +77,12 @@ These engines' goal is to match (or "align") a given spoken recording with a giv
**Cloud services**:
* [OpenAI Cloud Platform](https://platform.openai.com/) (`openai-cloud`): runs the `large-v2` Whisper model on the cloud. Only supports English as target

## Text-to-text translation

**Cloud services (unofficial)**:

* Google Translate (`google-translate`): uses the [Google Translate mobile web UI](https://translate.google.com/m) to translate text between any of its supported languages.

## Speech-to-translated-transcript alignment

The goal here is to match (or "align") a given spoken recording in one language, with a given translated transcript in a different language, as closely as possible.
Expand Down
31 changes: 20 additions & 11 deletions docs/Options.md
@@ -141,32 +141,32 @@ Applies to CLI operation: `transcribe`, API method: `recognize`
* `sourceSeparation`: prefix to provide options for source separation when `isolate` is set to `true`. Options detailed in section for source separation

**Whisper**:
* `whisper.model`: selects which Whisper model to use. Can be `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en`, `large` (same as `large-v2`), `large-v1`, `large-v2`, `large-v3` (**Note**: large models aren't yet supported by `onnxruntime-node` due to their size). Defaults to `tiny` or `tiny.en`
* `whisper.model`: selects which Whisper model to use. Can be `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en` or `large-v3-turbo`. Defaults to `tiny` or `tiny.en`
* `whisper.temperature`: temperature setting for the text decoder. Impacts the amount of randomization for token selection. It is recommended to leave at `0.1` (close to no randomization - almost always chooses the top ranked token) or choose a relatively low value (`0.25` or lower) for best results. Defaults to `0.1`
* `whisper.prompt`: initial text to give the Whisper model. Can be a vocabulary, or example text of some sort. Note that if the prompt is very similar to the transcript, the model may intentionally avoid producing the transcript tokens as it may assume that they have already been transcribed. Optional
* `whisper.topCandidateCount`: the number of top candidate tokens to consider. Defaults to `5`
* `whisper.punctuationThreshold`: the minimal probability for a punctuation token, included in the top candidates, to be chosen unconditionally. A lower threshold encourages the model to output more punctuation symbols. Defaults to `0.2`
* `whisper.autoPromptParts`: use previous part's recognized text as the prompt for the next part. Disabling this may help to prevent repetition carrying over between parts, in some cases. Defaults to `true`
* `whisper.punctuationThreshold`: the minimal probability for a punctuation token, included in the top candidates, to be chosen unconditionally. A lower threshold encourages the model to output more punctuation characters. Defaults to `0.2`
* `whisper.autoPromptParts`: use the previous part's recognized text as the prompt for the next part. Disabling this may help prevent repetition from carrying over between parts, in some cases. Defaults to `true` (**Note**: currently always disabled for the `large-v3-turbo` model due to an apparent issue with corrupt output when prompted)
* `whisper.maxTokensPerPart`: maximum number of tokens to decode for each audio part. Defaults to `250`
* `whisper.suppressRepetition`: attempt to suppress decoding of repeating token patterns. Defaults to `true`
* `whisper.repetitionThreshold`: minimal repetition / compressibility score to cause a part not to be auto-prompted to the next part. Defaults to `2.4`
* `whisper.decodeTimestampTokens`: enable/disable decoding of timestamp tokens. Setting to `false` can reduce the occurrence of hallucinations and token repetition loops, possibly due to the overall reduction in the number of tokens decoded. This has no impact on the accuracy of timestamps, since they are derived independently using cross-attention weights. However, there are cases where this can cause the model to end a part prematurely, especially in singing and less speech-like voice segments, or when there are multiple speakers. Defaults to `true`
* `whisper.encoderProvider`: identifier for the ONNX execution provider to use with the encoder model. Can be `cpu` or `dml` ([DirectML](https://microsoft.github.io/DirectML/)-based GPU acceleration - Windows only). In general, GPU-based encoding should be significantly faster. Defaults to `cpu`, or `dml` if available
* `whisper.decoderProvider`: identifier for the ONNX execution provider to use with the decoder model. Can be `cpu` or `dml` (Windows only). Using GPU acceleration for the decoder may be faster than CPU, especially for larger models, but that depends on your particular combination of CPU and GPU. Defaults to `cpu`
* `whisper.encoderProvider`: identifier for the ONNX execution provider to use with the encoder model. Can be `cpu`, `dml` ([DirectML](https://microsoft.github.io/DirectML/)-based GPU acceleration - Windows only) or `cuda` (Linux only). In general, GPU-based encoding should be significantly faster. Defaults to `cpu`, or `dml` if available
* `whisper.decoderProvider`: identifier for the ONNX execution provider to use with the decoder model. Can be `cpu`, `dml` (Windows only) or `cuda` (Linux only). Using GPU acceleration for the decoder may be faster than CPU, especially for larger models, but that depends on your particular combination of CPU and GPU. Defaults to `cpu`, and on Windows, `dml` if available for larger models (`small`, `medium`, `large`)
* `whisper.seed`: provide a custom random seed for token selection when temperature is greater than 0. Uses a constant seed by default to ensure reproducibility
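
As a rough illustration of how the `whisper` option prefix is passed through the API (a sketch; the option values are arbitrary examples rather than recommendations, and the namespace import is assumed):

```ts
import * as Echogarden from 'echogarden'

// Recognize with the Whisper engine, overriding a few of the options above
const result = await Echogarden.recognize('speech.mp3', {
	engine: 'whisper',

	whisper: {
		model: 'small.en',
		temperature: 0.1,
		autoPromptParts: false,
		encoderProvider: 'cpu',
	},
})

// The recognition result includes the transcript text (see API.md)
console.log(result.transcript)
```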

**Whisper.cpp**:
* `whisperCpp.model`: selects which `whisper.cpp` model to use. Can be `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en`, `large` (same as `large-v2`), `large-v1`, `large-v2`, `large-v3`. These quantized models are also supported: `tiny-q5_1`, `tiny.en-q5_1`, `tiny.en-q8_0`,`base-q5_1`, `base.en-q5_1`, `small-q5_1`, `small.en-q5_1`, `medium-q5_0`, `medium.en-q5_0`, `large-v2-q5_0`, `large-v3-q5_0`. Defaults to `base` or `base.en`
* `whisperCpp.executablePath`: custom `whisper.cpp` executable path (currently required for macOS)
* `whisperCpp.build`: type of `whisper.cpp` build to use. Can be set `cpu`, `cublas-11.8.0`, `cublas-12.4.0`. By default, builds are auto-selected and downloaded for Windows x64 (`cpu`, `cublas-11.8.0`, `cublas-12.4.0`) and Linux x64 (`cpu`). Using other builds requires providing a custom `executablePath`
* `whisperCpp.model`: selects which `whisper.cpp` model to use. Can be `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en`, `large` (same as `large-v2`), `large-v1`, `large-v2`, `large-v3`. These quantized models are also supported: `tiny-q5_1`, `tiny.en-q5_1`, `tiny.en-q8_0`, `base-q5_1`, `base.en-q5_1`, `small-q5_1`, `small.en-q5_1`, `medium-q5_0`, `medium.en-q5_0`, `large-v2-q5_0`, `large-v3-q5_0`, `large-v3-turbo`, `large-v3-turbo-q5_0`. Defaults to `base` or `base.en`
* `whisperCpp.executablePath`: a path to a custom `whisper.cpp` `main` executable (currently required for macOS)
* `whisperCpp.build`: type of `whisper.cpp` build to use. Can be set to `cpu` or `cublas-12.4.0`. By default, builds are auto-selected and downloaded for Windows x64 (`cpu`, `cublas-12.4.0`) and Linux x64 (`cpu`). Using other builds requires providing a custom `executablePath`
* `whisperCpp.threadCount`: number of threads to use, defaults to `4`
* `whisperCpp.splitCount`: number of splits of the audio data to process in parallel (called `--processors` in the `whisper.cpp` CLI). A value greater than `1` can increase memory use significantly, reduce timing accuracy, and slow down execution in some cases. Defaults to `1` (highly recommended)
* `whisperCpp.enableGPU`: enable GPU processing. Setting to `true` will try to use a CUDA build, if available for your system. Defaults to `true` when a CUDA-enabled build is selected via `whisperCpp.build`, otherwise `false`
* `whisperCpp.topCandidateCount`: the number of top candidate tokens to consider. Defaults to `5`
* `whisperCpp.beamCount`: the number of decoding paths to use during beam search. Defaults to `5`
* `whisperCpp.repetitionThreshold`: minimal repetition / compressibility score to cause a decoded segment to be discarded. Defaults to `2.4`
* `whisperCpp.prompt`: initial text to give the Whisper model. Can be a vocabulary, or example text of some sort. Note that if the prompt is very similar to the transcript, the model may intentionally avoid producing the transcript tokens as it may assume that they have already been transcribed. Optional
* `whisperCpp.enableDTW`: enable experimental `whisper.cpp` internal DTW-based token alignment to be used to derive timestamps. Defaults to `false` (recommended for now)
* `whisperCpp.enableDTW`: enable `whisper.cpp`'s own experimental DTW-based token alignment to be used to derive timestamps. Defaults to `false` (keeping it disabled is highly recommended for now)
* `whisperCpp.verbose`: show all CLI messages during execution. Defaults to `false`
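
A hedged configuration sketch for a GPU build with a quantized model (assuming the engine identifier is `whisper.cpp`; the values are illustrative):

```ts
import * as Echogarden from 'echogarden'

// Use a CUDA-enabled whisper.cpp build with a quantized model.
// `enableGPU` defaults to true when a CUDA build is selected.
const result = await Echogarden.recognize('speech.mp3', {
	engine: 'whisper.cpp',

	whisperCpp: {
		model: 'large-v3-turbo-q5_0',
		build: 'cublas-12.4.0',
		threadCount: 4,
	},
})

console.log(result.transcript)
```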

**Vosk**:
Expand Down Expand Up @@ -194,9 +194,9 @@ Applies to CLI operation: `transcribe`, API method: `recognize`

**OpenAI Cloud**:
* `openAICloud.apiKey`: API key (required)
* `openAICloud.model`: model to use. Can only be `whisper-1`
* `openAICloud.model`: model to use. When using the default provider (OpenAI), can only be `whisper-1`. For a custom provider, like Groq, see that provider's documentation for available model names
* `openAICloud.organization`: organization identifier. Optional
* `openAICloud.baseURL`: override the default base URL used by the API. Optional
* `openAICloud.baseURL`: override the default base URL used by the API. For example, set it to `https://api.groq.com/openai/v1` to use Groq's OpenAI-compatible Whisper API instead (see the sketch after this list). Optional
* `openAICloud.temperature`: temperature. Choosing `0` uses a dynamic temperature approach. Defaults to `0`
* `openAICloud.prompt`: initial prompt for the model. Optional
* `openAICloud.timeout`: request timeout. Optional
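
For example, a sketch of routing requests to Groq's OpenAI-compatible endpoint via `baseURL` (the model identifier below is an assumption; check the provider's documentation for current model names):

```ts
import * as Echogarden from 'echogarden'

// Send the request to Groq instead of OpenAI by overriding the base URL.
// The model name is provider-specific and may differ.
const result = await Echogarden.recognize('speech.mp3', {
	engine: 'openai-cloud',

	openAICloud: {
		apiKey: 'your-groq-api-key',
		baseURL: 'https://api.groq.com/openai/v1',
		model: 'whisper-large-v3',
	},
})

console.log(result.transcript)
```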
@@ -265,6 +265,15 @@ Applies to CLI operation: `translate-speech`, API method: `translateSpeech`

* `openAICloud`: prefix to provide options for OpenAI cloud. Same options as detailed in the recognition section above
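
A hedged sketch of the corresponding API call, assuming the offline `whisper` engine accepts the same `whisper` option prefix as in the recognition section above:

```ts
import * as Echogarden from 'echogarden'

// Translate non-English speech directly to an English transcript.
// The offline Whisper engine only supports English as the target language.
const result = await Echogarden.translateSpeech('speech-in-french.mp3', {
	engine: 'whisper',

	whisper: {
		model: 'small',
	},
})

// The result is assumed to expose the translated transcript, like the recognition result
console.log(result.transcript)
```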

## Text-to-text translation

Applies to CLI operation: `translate-text`, API method: `translateText`

* `engine`: only `google-translate` is currently supported
* `sourceLanguage`: the source language code for the input text. Auto-detected if not set
* `targetLanguage`: the target language code for the output text. Required
* `languageDetection`: language detection options. Optional
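
A short sketch of passing these options through the API (the same option names apply to the `translate-text` CLI operation):

```ts
import * as Echogarden from 'echogarden'

// Explicitly set both languages and the only supported engine.
// Omitting `sourceLanguage` falls back to automatic language detection.
const result = await Echogarden.translateText('Bonjour tout le monde', {
	engine: 'google-translate',
	sourceLanguage: 'fr',
	targetLanguage: 'en',
})

console.log(result.translatedText)
```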

## Speech-to-translated-transcript alignment

Applies to CLI operation: `align-translation`, API method: `alignTranslation`
14 changes: 1 addition & 13 deletions docs/Tasklist.md
@@ -2,11 +2,6 @@

## Bugs

### Alignment / DTW-RA

* In DTW-RA, a recognized transcript including something like "Question 2.What does Juan", where "2.What" has a point in the middle, is breaking playback of the timeline

### Synthesis

### eSpeak

@@ -24,6 +19,7 @@
* `espeak-ng`: 'Oh dear!”' is read as "oh dear exclamation mark", because of the special quote character following the exclamation mark
* `espeak-ng`: [Marker right after sentence end is not reported as an event](https://github.com/espeak-ng/espeak-ng/issues/920)
* `espeak-ng`: On Japanese text, it says "Chinese character" or "Japanese character" for characters it doesn't know
* `espeak-ng`: Broken markers on the Korean voice
* `wtf_wikipedia`: sometimes fails on `getResult.js` without throwing a human-readable error
* `wtf_wikipedia`: sometimes captures markup like `.svg`, etc.
* `msspeech`: Initialization fails on Chinese and Japanese voices (but not Korean)
@@ -255,15 +251,7 @@
* Allow `dtw` mode work with more speech synthesizers to produce its reference
* Predict timing for individual letters (graphemes) based on phoneme timestamps (especially useful for Chinese and Japanese)

### Translation
* Add text-to-text translation with cloud translation APIs like Google Translate or DeepL, or offline models like OpenNMT or NLLB-200

### Translation alignment
* Perform word-level alignment of text-to-text translations from and to any language (not just English) using methods like multilingual embeddings, or specialized models, and then use the text-based alignment to align speech in any recognized language, to its translated transcript in any language supported by the text-to-text alignment approach

### Speech-to-text translation

* Hybrid approach: recognize speech in its native language using any recognition model, then translate the resulting transcript using a text-to-text translation engine, and then align the translated transcript to the original one using text-to-text alignment, and map back to the original speech using the recognition timestamps, to get word-level alignment for the translated transcript

## Possible new engines or platforms

