STREAMLInED

Shared Tasks for Rapid, Efficient Analysis of Many Languages in Emerging Documentation.

Research @ University of Washington.

Contact Emily Ahn with questions: eahn [at] uw [dot] edu

Prerequisites

Python 3
dscore
Java JDK 1.7 or 1.8

Data

Our data originates from the Endangered Languages Archive (ELAR).

Selected languages for this task span a wide range of language families and typological groups.

Sakun
Cicipu
Effutu

Instructions to download from ELAR:

Create an online account profile (free here)
Log in and download your cookies as a txt file (browser extensions can handle this well). Note: If you are downloading across different days (or different sessions), you may need to re-download your cookies.
Run our script to curl (download) the data: scripts/download_elar.py
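For readers who want to see what a cookie-based download involves, here is a minimal sketch using only the Python standard library. It is not the code in scripts/download_elar.py. The function names are hypothetical, and the layout of the links file (URL in the last tab-separated column, file name taken from the end of the URL) is an assumption; check the actual elar_{lang}_links.tsv before relying on it.

```python
import csv
import http.cookiejar
import urllib.request
from pathlib import Path


def read_links(tsv_path):
    """Return (filename, url) pairs from a tab-separated links file.

    Assumption: the URL sits in the last column, and the local file name
    is the final path component of that URL.
    """
    pairs = []
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue  # blank line
            url = row[-1].strip()
            if not url.startswith("http"):
                continue  # header row or malformed entry
            pairs.append((url.rsplit("/", 1)[-1], url))
    return pairs


def make_opener(cookies_txt):
    """Build an opener that sends cookies from a Netscape-format cookies.txt."""
    jar = http.cookiejar.MozillaCookieJar(cookies_txt)
    jar.load(ignore_discard=True, ignore_expires=True)
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def download_all(tsv_path, cookies_txt, out_dir):
    """Fetch every linked file into out_dir, skipping files already present."""
    opener = make_opener(cookies_txt)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, url in read_links(tsv_path):
        target = out_dir / name
        if target.exists():
            continue  # allows resuming across sessions
        with opener.open(url) as response, open(target, "wb") as out:
            out.write(response.read())
```

Skipping files that already exist makes it cheap to rerun the download after refreshing expired cookies.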

Provided files per language include:

elar_{lang}_links.tsv (to be used by script when downloading files from ELAR)
.uem file (Un-partitioned Evaluation Map, which determines the regions to be analyzed in each recording; see description)
ref/ (rttm files)
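To make the two file formats concrete, the sketch below reads both and computes how much of a reference turn falls inside the scored regions. It assumes the field layouts described in the dscore documentation: RTTM SPEAKER lines with the file id, onset, duration, and speaker name in fields 2, 4, 5, and 8, and UEM lines with file id, channel, onset, and offset. The names `Turn`, `Region`, and `scored_duration` are illustrative and not part of this repository.

```python
from collections import namedtuple

Turn = namedtuple("Turn", "file_id onset duration speaker")
Region = namedtuple("Region", "file_id onset offset")


def parse_rttm(lines):
    """Parse SPEAKER records from RTTM lines (space-delimited fields)."""
    turns = []
    for line in lines:
        fields = line.split()
        # At least 8 fields are needed to reach the speaker name.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        turns.append(Turn(fields[1], float(fields[3]), float(fields[4]), fields[7]))
    return turns


def parse_uem(lines):
    """Parse UEM lines: file id, channel, onset, offset."""
    regions = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        regions.append(Region(fields[0], float(fields[2]), float(fields[3])))
    return regions


def scored_duration(turn, regions):
    """Seconds of a turn inside the scoring regions of the same recording.

    Assumes the regions for one recording do not overlap each other.
    """
    start, end = turn.onset, turn.onset + turn.duration
    total = 0.0
    for region in regions:
        if region.file_id != turn.file_id:
            continue
        overlap = min(end, region.offset) - max(start, region.onset)
        total += max(0.0, overlap)
    return total
```

For example, a turn from 0.5 s to 2.5 s in a recording whose only scoring region is 0.0 s to 1.5 s contributes 1.0 s to the evaluation.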

Track 1: Speaker Diarization

Who spoke when, and where else did they speak? This task takes raw audio as input, detects speech segments, and clusters segments from the same speaker under one label.

1.1 Baseline System

We use the lightweight system from LIUM, which uses ILP clustering techniques. Download the code from that repository and follow their installation instructions. If you are using JDK 1.8, replace their jar file in the LIUM/ folder with the jar found in this repository, baseline/diar/lium-diarization-200129.jar (compiled on Jan 29, 2020), then either rename it to match theirs or change the call in their ilp_diarization2.sh script. Instructions to compile this JDK 1.8 compatible version on your own machine are here.

We provide a script to convert LIUM output into rttm format: scripts/lium_to_rttm.py
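The conversion itself is small. The sketch below shows the general idea and is not the contents of scripts/lium_to_rttm.py. It assumes the usual LIUM segmentation layout: comment lines starting with `;;`, and data lines with show name, channel, start, length, gender, band, environment, and speaker label, where start and length are counted in 10 ms feature frames. Verify these assumptions against your own .seg output.

```python
def seg_to_rttm(seg_lines, frames_per_second=100):
    """Convert LIUM .seg lines to RTTM SPEAKER lines.

    Assumed .seg fields: show, channel, start, length, gender, band,
    environment, speaker. Start and length are in feature frames.
    """
    rttm = []
    for line in seg_lines:
        line = line.strip()
        if not line or line.startswith(";;"):
            continue  # skip blank lines and cluster comment lines
        fields = line.split()
        if len(fields) < 8:
            continue  # not a segment line under the assumed layout
        show, channel, speaker = fields[0], fields[1], fields[7]
        onset = int(fields[2]) / frames_per_second
        duration = int(fields[3]) / frames_per_second
        rttm.append(
            f"SPEAKER {show} {channel} {onset:.2f} {duration:.2f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return rttm
```

Under these assumptions, the segment line `rec1 1 150 320 M S U S0` becomes `SPEAKER rec1 1 1.50 3.20 <NA> <NA> S0 <NA> <NA>`.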

1.2 Evaluation

Assuming you have your system output as .rttm files in the folder data/{lang}/sys/, run dscore on this folder against the reference folder data/{lang}/ref/, and redirect the output to data/{lang}/{lang}_dev.stdout.

dscore/score.py -u data/{lang}/{lang}.uem -r data/{lang}/ref/*.rttm -s data/{lang}/sys/*.rttm > data/{lang}/{lang}_dev.stdout 2> data/{lang}/{lang}_dev.stderr
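If you score several languages, it can be convenient to build the same command from Python. The following is a sketch only: the helper names are hypothetical, it assumes dscore is checked out at dscore/ relative to the working directory, and it writes both output streams into data/{lang}/, which is a choice made here rather than something the command above requires.

```python
import glob
import os
import subprocess
import sys


def build_score_command(lang, data_root="data", score_script="dscore/score.py"):
    """Assemble the score.py call using the -u, -r, and -s options shown above."""
    lang_dir = os.path.join(data_root, lang)
    ref = sorted(glob.glob(os.path.join(lang_dir, "ref", "*.rttm")))
    hyp = sorted(glob.glob(os.path.join(lang_dir, "sys", "*.rttm")))
    if not ref or not hyp:
        raise FileNotFoundError(f"no .rttm files found under {lang_dir}")
    uem = os.path.join(lang_dir, f"{lang}.uem")
    return [sys.executable, score_script, "-u", uem, "-r", *ref, "-s", *hyp]


def run_scoring(lang, data_root="data"):
    """Run dscore for one language and save stdout and stderr to files."""
    cmd = build_score_command(lang, data_root)
    out_path = os.path.join(data_root, lang, f"{lang}_dev.stdout")
    err_path = os.path.join(data_root, lang, f"{lang}_dev.stderr")
    with open(out_path, "w") as out, open(err_path, "w") as err:
        return subprocess.run(cmd, stdout=out, stderr=err).returncode
```

Expanding the file lists in Python avoids depending on shell globbing, and the early check fails loudly when a sys/ or ref/ folder is empty.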

Expected Results

Language	DER (%)
Cicipu	44.54
Effutu	34.65
Sakun	62.55

TODO: update numbers for only DEV set
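As a reminder of what the DER column measures, diarization error rate is conventionally the sum of missed speech, false alarm speech, and speaker confusion time, divided by the total reference speech time. The sketch below expresses that definition as a percentage. The function name and the durations in the example are made up for illustration and do not come from our data.

```python
def diarization_error_rate(missed, false_alarm, confusion, reference_speech):
    """Return DER as a percentage.

    All arguments are durations in seconds, measured inside the scored regions.
    """
    if reference_speech <= 0:
        raise ValueError("reference_speech must be positive")
    return 100.0 * (missed + false_alarm + confusion) / reference_speech


# Hypothetical durations: 30 s missed, 20 s false alarm, 50 s confusion,
# against 400 s of reference speech.
print(diarization_error_rate(30.0, 20.0, 50.0, 400.0))  # 25.0
```

Because false alarms are counted in the numerator but not in the denominator, DER can exceed 100 on difficult recordings.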

1.3 Similar Work

The DIHARD Challenge I (2018, site) and Challenge II (2019, site, paper) have focused on robust speaker diarization. Their second challenge baseline involves the Kaldi toolkit.

Next Tracks TBD...

Possibly: Speaker Identification, Genre Identification
