siyuliang/streamlined

 
 

STREAMLInED

Shared Tasks for Rapid, Efficient Analysis of Many Languages in Emerging Documentation.

Research @ University of Washington.

Contact Emily Ahn with questions: eahn [at] uw [dot] edu

Prerequisites

  • Python 3
  • dscore
  • Java JDK 1.7 or 1.8

Data

Our data originates from the Endangered Languages Archive (ELAR).

Selected languages for this task span a wide range of language families and typological groups.

  • Sakun
  • Cicipu
  • Effutu

Instructions to download from ELAR:

  1. Create an online account profile (free here)
  2. Log in and download your cookies as a txt file (browser extensions handle this well). Note: If you download across different days or sessions, you may need to re-download your cookies.
  3. Run our script to curl (download) the data: scripts/download_elar.py
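
The cookie-based download in step 3 can be sketched roughly as follows. This is not the repository's scripts/download_elar.py; it is a minimal illustration that assumes a Netscape-format cookies.txt export and a two-column links file (output filename, then URL). The column layout and the names read_links and download_all are hypothetical.

```python
import csv
import http.cookiejar
import os
import shutil
import urllib.request


def read_links(tsv_path):
    """Return (filename, url) pairs from a tab-separated links file.

    ASSUMPTION: two columns per row, filename then URL. The real
    elar_{lang}_links.tsv layout may differ.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        rows = csv.reader(handle, delimiter="\t")
        # Blank or short rows are skipped rather than treated as errors.
        return [(row[0], row[1]) for row in rows if len(row) >= 2]


def download_all(tsv_path, cookies_path, out_dir):
    """Fetch every linked file, sending the saved browser cookies."""
    jar = http.cookiejar.MozillaCookieJar(cookies_path)
    # Session cookies are often marked "discard"; keep them anyway.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    os.makedirs(out_dir, exist_ok=True)
    for filename, url in read_links(tsv_path):
        target = os.path.join(out_dir, filename)
        with opener.open(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
```

If downloads start failing with login pages instead of audio, the cookies have probably expired and need to be exported again, as noted in step 2.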

Provided files per language include:

  • elar_{lang}_links.tsv (to be used by script when downloading files from ELAR)
  • .uem file (Un-partitioned Evaluation Map, which determines the regions to be analyzed in each recording; see description)
  • ref/ (rttm files)
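
For readers unfamiliar with these two file types, here is a minimal sketch of reading them. It assumes the conventional layouts: an RTTM SPEAKER line with ten space-separated fields (type, file ID, channel, onset, duration, two placeholders, speaker name, two placeholders) and a UEM line with four fields (file ID, channel, onset, offset), all times in seconds. The names Turn, Region, parse_rttm, and parse_uem are illustrative, not part of this repository.

```python
from collections import namedtuple

# One speaker turn from an RTTM file, and one scored region from a UEM file.
Turn = namedtuple("Turn", "file_id onset duration speaker")
Region = namedtuple("Region", "file_id onset offset")


def parse_rttm(lines):
    """Collect speaker turns from RTTM lines, ignoring non-SPEAKER records."""
    turns = []
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        turns.append(
            Turn(fields[1], float(fields[3]), float(fields[4]), fields[7])
        )
    return turns


def parse_uem(lines):
    """Collect scored regions (file, onset, offset) from UEM lines."""
    regions = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        regions.append(Region(fields[0], float(fields[2]), float(fields[3])))
    return regions
```

Usage would look like `parse_rttm(open(path))` on any file under ref/, since both functions accept any iterable of lines.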

Track 1: Speaker Diarization

Who spoke when, and where else did they speak? This task takes raw audio as input, detects speech, and clusters speech segments from the same speaker under one label.

1.1 Baseline System

We use the lightweight system from LIUM, which uses integer linear programming (ILP) clustering. Download the code from that repository and follow its installation instructions. If you are using JDK 1.8, replace the jar file in the LIUM/ folder with the jar found in this repository, baseline/diar/lium-diarization-200129.jar (compiled on Jan 29, 2020), then either rename it to match the original or update the call in their ilp_diarization2.sh script. Instructions to compile this JDK 1.8 compatible version on your own machine are here.

We provide a script to convert LIUM output into rttm format: scripts/lium_to_rttm.py
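
To illustrate what such a conversion involves, here is a minimal sketch; it is not the contents of scripts/lium_to_rttm.py. It assumes the usual LIUM .seg layout of eight space-separated fields per segment (show, channel, start, length, gender, band, environment, speaker label), with start and length counted in 10 ms feature frames and comment lines beginning with `;;`. The function name seg_to_rttm and the frames_per_second parameter are hypothetical.

```python
def seg_to_rttm(seg_lines, frames_per_second=100):
    """Convert LIUM .seg lines to RTTM SPEAKER lines.

    ASSUMPTION: start and length are in feature frames at
    `frames_per_second` (100 frames/s, i.e. 10 ms per frame).
    """
    rttm = []
    for line in seg_lines:
        line = line.strip()
        # Skip blanks and the ";;" cluster-header comment lines.
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        show, channel = fields[0], fields[1]
        onset = int(fields[2]) / frames_per_second
        duration = int(fields[3]) / frames_per_second
        speaker = fields[7]
        rttm.append(
            f"SPEAKER {show} {channel} {onset:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return rttm
```

The gender, band, and environment fields are dropped because RTTM has no slot for them.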

1.2 Evaluation

Assuming your system output is stored as .rttm files in the folder data/{lang}/sys/, run dscore on this folder against the reference files in data/{lang}/ref/, restricting scoring to the regions in data/{lang}/{lang}.uem, and write the results to data/{lang}/{lang}_dev.stdout.

dscore/score.py -u data/{lang}/{lang}.uem -r data/{lang}/ref/*.rttm -s data/{lang}/sys/*.rttm > data/{lang}/{lang}_dev.stdout 2> ccp/{lang}_dev.stderr
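
When scoring several languages, the same call can be assembled programmatically. The sketch below is one possible wrapper, assuming dscore is checked out in a dscore/ folder next to data/ and that score.py can be run with the current Python interpreter. The helper names build_score_command and score_language are hypothetical, and only standard output is captured.

```python
import glob
import os
import subprocess
import sys


def build_score_command(lang, data_dir="data"):
    """Build the dscore argument list for one language."""
    lang_dir = os.path.join(data_dir, lang)
    refs = sorted(glob.glob(os.path.join(lang_dir, "ref", "*.rttm")))
    systems = sorted(glob.glob(os.path.join(lang_dir, "sys", "*.rttm")))
    uem = os.path.join(lang_dir, f"{lang}.uem")
    # ASSUMPTION: score.py lives at dscore/score.py relative to the
    # working directory.
    return [sys.executable, "dscore/score.py", "-u", uem,
            "-r", *refs, "-s", *systems]


def score_language(lang, data_dir="data"):
    """Run dscore and save its report to {lang}_dev.stdout."""
    report = os.path.join(data_dir, lang, f"{lang}_dev.stdout")
    with open(report, "w") as handle:
        subprocess.run(build_score_command(lang, data_dir),
                       stdout=handle, check=True)
```

Passing the files as a list avoids relying on shell glob expansion, which matters if any filename contains spaces.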

Expected Results

Language DER (%)
Cicipu 44.54
Effutu 34.65
Sakun 62.55
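
As a reminder of what these numbers mean, diarization error rate (DER) is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below shows only that arithmetic; the function name and arguments are illustrative, and dscore's actual computation involves more steps (such as mapping system speakers to reference speakers).

```python
def diarization_error_rate(missed, false_alarm, confusion, total_reference):
    """Return DER as a percentage; all arguments are durations in seconds."""
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_reference
```

Because false alarms count in the numerator but not the denominator, DER can exceed 100 on recordings with little reference speech.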

TODO: update numbers for only DEV set

1.3 Similar Work

The DIHARD Challenge I (2018, site) and Challenge II (2019, site, paper) focused on robust speaker diarization. The baseline for the second challenge involves the Kaldi toolkit.

Next Tracks TBD...

Possibly: Speaker Identification, Genre Identification
