🌽小玉米知識庫 Maize Knowledge Base

資料來源 Data Source

Vimeo
- 淡江教會
  - 1,242 段音頻，共 47.5 GB
    1,242 audio recordings, 47.5 GB in total.
  - 檔案名稱以影片識別碼為首，以等號分隔標題，遇到路徑符號皆改為全形。
    File names begin with the video ID, followed by an equals sign and then the title; any path characters are converted to their full-width forms.
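
The naming convention above could be sketched in Python as follows. This is a minimal illustration rather than the repository's actual tooling: the helper names are hypothetical, the video ID in the usage note is made up, and the set of "path characters" is assumed to be just `/` and `\`, since the README does not list them.

```python
# Assumption: only "/" and "\" count as path characters; the README does not
# enumerate them. Each is mapped to its full-width counterpart.
FULLWIDTH_PATH_CHARS = str.maketrans({"/": "／", "\\": "＼"})


def make_file_name(video_id: str, title: str) -> str:
    """Build '<video ID>=<title>' with path characters made full-width."""
    return f"{video_id}={title.translate(FULLWIDTH_PATH_CHARS)}"


def split_file_name(stem: str) -> tuple[str, str]:
    """Recover (video ID, title) from a file name without its extension.

    Only the first '=' is treated as the separator, because a title may
    itself contain an equals sign.
    """
    video_id, _, title = stem.partition("=")
    return video_id, title
```

For example, `make_file_name("123456789", "主日信息 1/2")` keeps the slash from being read as a directory separator, and `split_file_name` reverses the composition for any ID that contains no equals sign.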
YouTube

擷取方法 Retrieval Methodology

注意：以下筆記僅供參考，並非實際記錄完整步驟。
Note: the notes below are for reference only and do not record the complete procedure.

```
git clone --depth 1 https://github.com/TimmyWong-TW/Maize-KB.git
cd Maize-KB/tools
docker compose build
```

大量下載音頻及字幕。
Bulk-download the audio and captions.

```
docker compose run --rm -d -w /data/source/vimeo/淡江教會 -e USER_ID=2178983 vimeo-downloader
```

```
docker compose run --rm -d -e PLAYLIST_ID=PLe-YK1dmFUsLnnUV54cwFLCc8nXP2TUvz youtube-dl
docker compose run --rm -d -e PLAYLIST_ID=PLe-YK1dmFUsL0j_THqhwbZFpj0BuXdpJj youtube-dl
# ……
docker compose run --rm -d -e PLAYLIST_ID=PLV4YZBS1Bq3dQXmB8dx0Ex6w6dMfs0SW4 youtube-dl
docker compose run --rm -d -e PLAYLIST_ID=PLV4YZBS1Bq3dmq9ApN3-b2uTICgKJCnVj youtube-dl
docker compose run --rm -d -e PLAYLIST_ID=PLV4YZBS1Bq3dMGy_e4c-H0K0M9n9uCl3e youtube-dl
docker compose run --rm -d -e PLAYLIST_ID=PLV4YZBS1Bq3ep6HtFzrNjmFPupiZbnL_P youtube-dl
docker compose run --rm -d -e PLAYLIST_ID=PLV4YZBS1Bq3eJ7LOUUEh_0dAXnKePUWd2 youtube-dl
```
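
Because the YouTube commands differ only in `PLAYLIST_ID`, they could be generated from a list instead of being typed out one by one. The sketch below is one possible way to do that and is not part of the repository; `youtube_dl_command` and `run_all` are hypothetical names, and the `runner` parameter exists only so the function can be exercised without Docker.

```python
import subprocess
from typing import Callable, Iterable


def youtube_dl_command(playlist_id: str) -> list[str]:
    """Return the argument list for one detached youtube-dl run."""
    return [
        "docker", "compose", "run", "--rm", "-d",
        "-e", f"PLAYLIST_ID={playlist_id}",
        "youtube-dl",
    ]


def run_all(playlist_ids: Iterable[str],
            runner: Callable[..., object] = subprocess.run) -> None:
    """Start one container per playlist, stopping at the first failure."""
    for playlist_id in playlist_ids:
        # check=True makes subprocess.run raise if docker exits non-zero.
        runner(youtube_dl_command(playlist_id), check=True)
```

Calling `run_all([...])` from the `tools` directory with the playlist IDs would issue the same commands as the listing above.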

……

若來源缺乏字幕，辨識漢語以生成中文文本並對齊音頻時間。
When the source lacks captions, transcribe the Mandarin speech into Chinese text and align it with the audio timing.
如預算許可則考慮使用更佳模型。
Consider Gemini 1.5 Pro with audio input over Whisper Large v3 when budget allows.
```
docker compose run --rm -d transcriber
docker compose run --rm whisperx-json-trimmer
docker compose run --rm resegmenter
```
將文本轉換成臺灣中文，然後自行校對。
Convert the transcripts into Chinese (Taiwan), then proofread them manually.
```
docker compose run --rm chinese-converter
```
使用字幕編輯工具校對文本以便大量更正。
Proofread the transcripts in a subtitle editor to identify misrecognitions for batch correction.
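
Once misrecognitions have been collected, applying them in bulk might look like the sketch below. It is an illustration only: the function name and the sample entry are hypothetical, and how the repository actually stores or applies its corrections is not described here.

```python
import re


def apply_corrections(text: str, corrections: dict[str, str]) -> str:
    """Replace every known misrecognition in a single pass.

    Longer entries are tried first so that a short entry cannot pre-empt a
    longer one that starts at the same position, and a single pass means a
    replacement is never itself re-corrected.
    """
    if not corrections:
        return text
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: corrections[match.group(0)], text)
```

For instance, with the made-up entry `{"以色烈": "以色列"}`, every occurrence of the wrong form in a transcript is corrected in one call.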
需要時，以語音停頓時間分段，然後統一標點符號，繼而重新對齊音頻時間用以供字幕使用。
Optionally, group the text into paragraphs by pauses in speech, then unify the punctuation, and finally re-align the clauses with the audio timing for use as captions.
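
Grouping by pauses can be illustrated with the following sketch. It assumes each transcript segment carries start and end times in seconds; the `Segment` type, the function name, and the 1.5-second threshold are all hypothetical choices, not values taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


def group_by_pause(segments: list[Segment],
                   min_pause: float = 1.5) -> list[list[Segment]]:
    """Start a new paragraph whenever the silence between two consecutive
    segments is at least `min_pause` seconds."""
    paragraphs: list[list[Segment]] = []
    for segment in segments:
        if paragraphs and segment.start - paragraphs[-1][-1].end < min_pause:
            paragraphs[-1].append(segment)  # short gap: same paragraph
        else:
            paragraphs.append([segment])    # first segment or long pause
    return paragraphs
```

A suitable threshold depends on the speaker's pace, so it would likely need tuning per recording.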
分門別類，標註講員。
Classify the content by category and label the speakers.
