Abstract
Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but it is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, which are unsuitable for systems requiring clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded two-speaker dialogue audio. DialogueSidon combines an SSL-VAE, which compresses self-supervised speech features into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.
Audio Samples
Each row presents the same utterance in three versions. The noisy mixture is the raw monaural input given to every model. GENESES is the baseline. DialogueSidon is ours (D = 32). Separated outputs are encoded as stereo: speaker 1 on the left channel, speaker 2 on the right.
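To listen to or analyze one speaker at a time, the stereo outputs can be split back into two mono tracks. The following is a minimal sketch using only the Python standard library. It assumes a sample has been saved locally as a 16-bit PCM WAV file, which may not match the format actually served on this page; the function names `split_stereo` and `read_speakers` are hypothetical helpers, not part of any DialogueSidon release.

```python
import array
import sys
import wave


def split_stereo(interleaved):
    """Split interleaved stereo samples [L0, R0, L1, R1, ...] into two mono tracks.

    Returns (speaker1, speaker2), following the convention stated above:
    speaker 1 on the left channel, speaker 2 on the right.
    """
    if len(interleaved) % 2 != 0:
        raise ValueError("interleaved stereo data must hold an even number of samples")
    speaker1 = interleaved[0::2]  # left channel
    speaker2 = interleaved[1::2]  # right channel
    return speaker1, speaker2


def read_speakers(source):
    """Read a stereo WAV (path or file object) and return one sample array per speaker.

    Assumption: the file is 16-bit PCM; other encodings are rejected.
    """
    with wave.open(source, "rb") as wf:
        if wf.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        frames = wf.readframes(wf.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return split_stereo(samples)
```

Calling `read_speakers("sample.wav")` on a hypothetical local file returns two arrays of equal length, one per speaker, which can then be written out as separate mono files or passed to an evaluation script.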
English — Switchboard
| Example | Noisy mixture | GENESES | DialogueSidon (ours) |
|---|---|---|---|
| sw02007 | |||
| sw02093 | |||
| sw02157 | |||
Multilingual — CallFriend
| Language | Noisy mixture | GENESES | DialogueSidon (ours) |
|---|---|---|---|
| German | |||
| English | |||
| French | |||
| Japanese | |||
| Spanish | |||
| Mandarin | |||
In-the-Wild — OpenDialog
Real dialogue recordings collected from the internet, with unknown degradations. No clean reference exists for these clips.
| Example | Noisy mixture | GENESES | DialogueSidon (ours) |
|---|---|---|---|
| Example 1 | |||
| Example 2 | |||
| Example 3 | |||
Citation
[BibTeX entry will be provided upon publication.]