Transform-average-concatenate (TAC) for end-to-end microphone permutation and number invariant multi-channel speech separation

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License.

This repository provides the model implementation and dataset generation scripts for the paper "End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation" by Yi Luo, Zhuo Chen, Nima Mesgarani and Takuya Yoshioka. The paper introduces transform-average-concatenate (TAC), a simple module that makes end-to-end multi-channel separation systems invariant to microphone permutation (indexing) and number. Although designed for ad-hoc array configurations, TAC also provides a significant performance improvement in fixed-geometry microphone configurations, showing that it can serve as a general design paradigm for end-to-end multi-channel processing systems.

Model

We implement TAC in the framework of the filter-and-sum network (FaSNet), a recently proposed multi-channel speech separation model that operates in the time domain. FaSNet is a neural beamformer that performs standard filter-and-sum beamforming in the time domain, with the beamforming coefficients estimated by a neural network in an end-to-end fashion. For details please refer to the original paper: "FaSNet: Low-latency Adaptive Beamforming for Multi-microphone Audio Processing".
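As a rough illustration of the filter-and-sum operation mentioned above, the sketch below filters each microphone signal with its own FIR filter and sums the results. It is a NumPy toy, not the code in FaSNet.py: the function name `filter_and_sum` and the hand-picked filters are assumptions made for illustration, whereas in FaSNet the filters are estimated by the network.

```python
import numpy as np

def filter_and_sum(signals, filters):
    """Time-domain filter-and-sum (illustrative helper, not from the repository).

    signals: array of shape (C, T), one row per microphone.
    filters: array of shape (C, K), one FIR filter per microphone.
    Returns the beamformed signal of shape (T,).
    """
    num_channels, num_samples = signals.shape
    output = np.zeros(num_samples)
    for c in range(num_channels):
        # Causal FIR filtering: full convolution truncated to the input length.
        output += np.convolve(signals[c], filters[c])[:num_samples]
    return output

# Toy setup: the second microphone hears the source one sample later.
rng = np.random.default_rng(0)
source = rng.standard_normal(16)
mic0 = source.copy()
mic1 = np.concatenate([[0.0], source[:-1]])
signals = np.stack([mic0, mic1])

# Hand-picked filters (assumed here): delay mic0 by one sample so that it
# lines up with mic1, then average the two aligned channels.
filters = np.array([[0.0, 0.5],
                    [0.5, 0.0]])
output = filter_and_sum(signals, filters)
```

With these filters the two channels add coherently, so `output` matches the delayed source seen at the second microphone.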

In this paper we make two main modifications to the original FaSNet:

The original two-stage architecture is replaced with a single-stage architecture.
TAC is applied throughout the filter estimation module to synchronize information across microphones, allowing the model to make global decisions while estimating the filter coefficients.
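The three steps named by TAC can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the implementation in utility/models: the random weight matrices, the ReLU nonlinearity, the residual connection, and the function name `tac` are all chosen here for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tac(features, w_transform, w_average, w_concat):
    """Illustrative transform-average-concatenate step.

    features: array of shape (C, N), one N-dimensional feature per microphone.
    """
    # Transform: the same weights are applied to every channel.
    transformed = relu(features @ w_transform)                    # (C, H)
    # Average: pool over the channel axis, then transform the mean.
    pooled = relu(transformed.mean(axis=0) @ w_average)           # (H,)
    # Concatenate: attach the pooled vector to each channel's feature.
    tiled = np.broadcast_to(pooled, transformed.shape)            # (C, H)
    combined = np.concatenate([transformed, tiled], axis=1)       # (C, 2H)
    # Project back to the input size; the residual sum is an assumption.
    return features + relu(combined @ w_concat)                   # (C, N)

rng = np.random.default_rng(0)
feat_dim, hidden_dim = 8, 16
w_transform = rng.standard_normal((feat_dim, hidden_dim))
w_average = rng.standard_normal((hidden_dim, hidden_dim))
w_concat = rng.standard_normal((2 * hidden_dim, feat_dim))

x = rng.standard_normal((4, feat_dim))      # four microphones
perm = [2, 0, 3, 1]
out = tac(x, w_transform, w_average, w_concat)
out_perm = tac(x[perm], w_transform, w_average, w_concat)
out_two = tac(x[:2], w_transform, w_average, w_concat)  # same weights, two microphones
```

Because the weights are shared across channels and the pooling is a mean, reordering the microphones only reorders the outputs, and the same weights accept any number of microphones.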

The figure below shows different designs of FaSNet models.

The building blocks for the filter estimation modules are based on dual-path RNNs (DPRNNs), a simple yet effective method for organizing RNN layers to allow successful modeling of extremely long sequential data. For details about DPRNN please refer to "Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation". The implementation of DPRNN, as well as the combination of DPRNN and TAC, can be found in utility/models.
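To make the dual-path idea concrete, the sketch below shows only the chunking step: a long sequence is split into half-overlapping chunks, so that one RNN can run within each chunk and another across chunks. It is an illustrative NumPy version, not the code in utility/models; the helper names `segment` and `overlap_add`, the 50% overlap, and the even chunk size are assumptions of this sketch.

```python
import numpy as np

def segment(seq, chunk_size):
    """Split a (T, N) sequence into 50%-overlapping chunks.

    Returns chunks of shape (S, chunk_size, N) and the number of padded
    frames needed to undo the split. chunk_size is assumed to be even.
    """
    hop = chunk_size // 2
    rest = (-seq.shape[0]) % hop
    # Pad one hop on each side so every original frame lies in two chunks.
    padded = np.pad(seq, ((hop, hop + rest), (0, 0)))
    num_chunks = padded.shape[0] // hop - 1
    chunks = np.stack([padded[i * hop:i * hop + chunk_size]
                       for i in range(num_chunks)])
    return chunks, rest

def overlap_add(chunks, rest):
    """Invert segment(): overlap-add the chunks and strip the padding."""
    num_chunks, chunk_size, dim = chunks.shape
    hop = chunk_size // 2
    out = np.zeros(((num_chunks + 1) * hop, dim))
    for i in range(num_chunks):
        out[i * hop:i * hop + chunk_size] += chunks[i]
    # Each original frame was summed twice, hence the division by 2.
    return out[hop:out.shape[0] - hop - rest] / 2.0

rng = np.random.default_rng(0)
seq = rng.standard_normal((50, 3))          # 50 frames, 3 features
chunks, rest = segment(seq, chunk_size=8)
# An intra-chunk RNN would run along axis 1 and an inter-chunk RNN along
# axis 0; both are omitted here.
restored = overlap_add(chunks, rest)
```

Each RNN then sees a sequence of length `chunk_size` or `S` rather than the full length `T`, which is what keeps very long inputs tractable.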

Dataset

The model is evaluated on both ad-hoc array and fixed-geometry array configurations. We simulate two datasets from the publicly available Librispeech corpus. For data generation please refer to the data folder.
