- Vimeo
- 淡江教會
- 1,242 audio tracks, 47.5 GB in total
- File names begin with the video ID, followed by an equals sign and then the title; path characters in titles are converted to their full-width forms.
- 淡江教會
- YouTube
Note: the following notes are for reference only and do not record the complete procedure.
git clone --depth 1 https://github.com/TimmyWong-TW/Maize-KB.git
cd Maize-KB/tools
docker compose build
- Bulk download audio and captions.
docker compose run --rm -d -w /data/source/vimeo/淡江教會 -e USER_ID=2178983 vimeo-downloader
……
docker compose run --rm -d -e PLAYLIST_ID=PLe-YK1dmFUsLnnUV54cwFLCc8nXP2TUvz youtube-dl
docker compose run --rm -d -e PLAYLIST_ID=PLe-YK1dmFUsL0j_THqhwbZFpj0BuXdpJj youtube-dl
# ……
docker compose run --rm -d -e PLAYLIST_ID=PLV4YZBS1Bq3dQXmB8dx0Ex6w6dMfs0SW4 youtube-dl
docker compose run --rm -d -e PLAYLIST_ID=PLV4YZBS1Bq3dmq9ApN3-b2uTICgKJCnVj youtube-dl
docker compose run --rm -d -e PLAYLIST_ID=PLV4YZBS1Bq3dMGy_e4c-H0K0M9n9uCl3e youtube-dl
docker compose run --rm -d -e PLAYLIST_ID=PLV4YZBS1Bq3ep6HtFzrNjmFPupiZbnL_P youtube-dl
docker compose run --rm -d -e PLAYLIST_ID=PLV4YZBS1Bq3eJ7LOUUEh_0dAXnKePUWd2 youtube-dl
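Since the YouTube step repeats one command per playlist, it could be wrapped in a small loop. The sketch below is an assumption about how one might do that, not part of the repository: the function name `download_playlists` and the `DRY_RUN` variable are hypothetical, and by default the function only prints the commands instead of running them.

```shell
# Sketch: one detached youtube-dl run per playlist ID given as an argument.
# DRY_RUN is a hypothetical switch; it defaults to 1 (print only).
download_playlists() (
  for id in "$@"; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      # Show the command that would be run.
      printf 'docker compose run --rm -d -e PLAYLIST_ID=%s youtube-dl\n' "$id"
    else
      docker compose run --rm -d -e "PLAYLIST_ID=$id" youtube-dl
    fi
  done
)

# Dry run with two of the playlist IDs listed above.
download_playlists \
  PLe-YK1dmFUsLnnUV54cwFLCc8nXP2TUvz \
  PLe-YK1dmFUsL0j_THqhwbZFpj0BuXdpJj
```

Setting `DRY_RUN=0` before the call would start the containers for real; reviewing the printed commands first is a cheap way to catch a mistyped ID.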
- When the source lacks captions, transcribe the Mandarin speech into Chinese text and align it with the audio timing.
Consider a stronger model, such as Gemini 1.5 Pro with audio input instead of Whisper Large v3, when budget allows.
docker compose run --rm -d transcriber
docker compose run --rm whisperx-json-trimmer
docker compose run --rm resegmenter
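To decide which recordings need transcription, one could first list the audio files that have no caption file next to them. The sketch below assumes a layout that the notes do not confirm: captions sit in the same directory, share the audio file's base name exactly, and use the hypothetical default extensions `m4a` and `vtt`. The function name `list_uncaptioned` is likewise made up for illustration.

```shell
# Sketch: print audio files that have no caption file with the same base name.
# Assumptions: same directory, identical base name, hypothetical extensions.
list_uncaptioned() (
  dir=$1 audio_ext=${2:-m4a} caption_ext=${3:-vtt}
  for audio in "$dir"/*."$audio_ext"; do
    [ -e "$audio" ] || continue   # no audio files: the glob stays literal
    base=${audio%.*}              # strip the final extension
    [ -e "$base.$caption_ext" ] || printf '%s\n' "$audio"
  done
)
```

Usage would look like `list_uncaptioned /data/source/vimeo/淡江教會`. Caption files that carry a language code in their name would need a looser match than this sketch performs.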
- Convert the transcripts into Chinese (Taiwan), then proofread them manually.
docker compose run --rm chinese-converter
- Proofread transcripts with a subtitle editor to identify misrecognitions for batch correction.
- Optionally, arrange the text into paragraphs by pauses in speech, then unify the punctuation, and then re-align clauses with the audio timing for captioning.
- Classify by category, and label the speakers.
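Arranging text into paragraphs by pauses in speech can be illustrated with a short shell and awk sketch. It is not the repository's `resegmenter`: the input format (tab-separated start time, end time, and text, in seconds), the function name `paragraphs_by_pause`, and the 1.5-second default threshold are all assumptions made for the example.

```shell
# Sketch: join timed segments into paragraphs, starting a new paragraph
# whenever the silence before a segment reaches the threshold.
# Assumed input (hypothetical format): "start<TAB>end<TAB>text", in seconds.
paragraphs_by_pause() (
  min_pause=${1:-1.5}   # hypothetical default threshold, in seconds
  awk -F '\t' -v min_pause="$min_pause" '
    NR > 1 && ($1 - prev_end) >= min_pause { printf "\n" }  # long pause: new paragraph
    { printf "%s", $3; prev_end = $2 }                      # append text, remember end time
    END { if (NR > 0) printf "\n" }
  '
)

# Pauses of 0.2 s and 2.5 s: only the second one starts a new paragraph.
printf '0.0\t2.1\t第一句,\n2.3\t4.0\t第二句。\n6.5\t8.0\t第三句。\n' |
  paragraphs_by_pause 1.5
# prints:
# 第一句,第二句。
# 第三句。
```

Segments are concatenated without spaces, which suits Chinese text; a different joiner would be needed for languages that separate words with spaces.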