DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio

Wataru Nakata1,2, Yuki Saito1,2, Kazuki Yamauchi1, Emiru Tsunoo1, Hiroshi Saruwatari1

1The University of Tokyo, Tokyo, Japan   2National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan

Abstract

Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, which are unsuitable for systems requiring clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded two-speaker dialogue audio. DialogueSidon combines an SSL-VAE, which compresses self-supervised speech features into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.

Audio Samples

Each row presents one utterance in three versions: the noisy mixture, which is the raw monaural input given to both models; the output of GENESES, the baseline; and the output of DialogueSidon, our model (D = 32). Separated outputs are encoded as stereo, with speaker 1 on the left channel and speaker 2 on the right.
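To listen to or analyze one speaker at a time, the stereo outputs can be split back into two mono tracks. The following is a minimal sketch using only the Python standard library. It assumes the clips have been saved as 16-bit PCM WAV files, which this page does not state; the function names `split_stereo_pcm16` and `load_speaker_tracks` are hypothetical and not part of any released code.

```python
import array
import struct
import sys
import wave


def split_stereo_pcm16(interleaved: bytes) -> tuple[array.array, array.array]:
    """Deinterleave 16-bit little-endian stereo PCM into (left, right) tracks.

    Assumption: samples are signed 16-bit, as in a typical PCM WAV file.
    """
    samples = array.array("h")
    samples.frombytes(interleaved)
    # WAV data is little-endian; swap bytes on big-endian hosts.
    if sys.byteorder == "big":
        samples.byteswap()
    if len(samples) % 2 != 0:
        raise ValueError("stereo data must contain an even number of samples")
    # Left channel holds speaker 1, right channel holds speaker 2.
    return samples[0::2], samples[1::2]


def load_speaker_tracks(path: str) -> tuple[int, array.array, array.array]:
    """Read a stereo WAV file and return (sample_rate, speaker1, speaker2)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo WAV file")
        frames = wav.readframes(wav.getnframes())
        sample_rate = wav.getframerate()
    speaker1, speaker2 = split_stereo_pcm16(frames)
    return sample_rate, speaker1, speaker2


# Tiny synthetic check: three stereo frames, interleaved as L, R, L, R, ...
demo = struct.pack("<6h", 1, -1, 2, -2, 3, -3)
speaker1, speaker2 = split_stereo_pcm16(demo)
print(list(speaker1), list(speaker2))  # [1, 2, 3] [-1, -2, -3]
```

For a downloaded clip, `load_speaker_tracks("clip.wav")` (with a placeholder file name) would return the sample rate together with one sample array per speaker.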

English — Switchboard

Example Noisy mixture GENESES DialogueSidon (ours)
sw02007
sw02093
sw02157

Multilingual — CallFriend

Language Noisy mixture GENESES DialogueSidon (ours)
German
English
French
Japanese
Spanish
Mandarin

In-the-Wild — OpenDialog

Real internet dialogue recordings with realistic, unknown degradations. No clean reference exists for these clips.

Example Noisy mixture GENESES DialogueSidon (ours)
Example 1
Example 2
Example 3

Citation

[BibTeX entry will be provided upon publication.]