Audio Samples

Audio excerpts of personalized speech enhancement systems


Introduction

This page gathers audio examples of personalized speech enhancement. More precisely, it combines two works, presented in two parts as follows:


The chronological order of the two parts is reversed for the ICASSP25 review: Part II appears first, followed by Part I. For ICASSP25 reviewers, the relevant excerpts are those provided in Part II.




Part II. Contrastive Knowledge Distillation for Embedding Refinement in Personalized Speech Enhancement

Thomas Serre*† , Mathieu Fontaine* , Éric Benhaim† , Slim Essid*

†Orosound, Signal Processing Lab   *LTCI, Télécom Paris, Institut Polytechnique de Paris


Abstract

Personalized speech enhancement (PSE) has shown convincing results when it comes to extracting a known target voice among interfering ones. The corresponding systems usually incorporate a representation of the target voice within the enhancement system, extracted from an enrollment clip of the target voice with upstream models. Those models are generally heavy, as the quality of the speaker embedding directly affects PSE performance. Yet, embeddings generated beforehand cannot account for variations of the target voice at inference time. In this paper, we propose to perform on-the-fly refinement of the speaker embedding using a tiny speaker encoder. We first introduce a novel contrastive knowledge distillation methodology to train a 150k-parameter encoder from complex embeddings. We then use this encoder within the enhancement system during inference and show that the proposed method greatly improves PSE performance while maintaining a low computational load.
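For readers unfamiliar with the idea, the sketch below illustrates one way a contrastive distillation objective can be set up to train a tiny student speaker encoder from the embeddings of a large pretrained teacher. All names, dimensions, and the exact loss here are illustrative assumptions, not the implementation described in the paper.

```python
# Minimal sketch of contrastive (InfoNCE-style) knowledge distillation for a
# tiny speaker encoder. Illustrative only; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Hypothetical small student speaker encoder (on the order of 100k parameters)."""
    def __init__(self, n_mels=64, emb_dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 96, batch_first=True)
        self.proj = nn.Linear(96, emb_dim)

    def forward(self, mel):                  # mel: (batch, frames, n_mels)
        out, _ = self.rnn(mel)
        return self.proj(out.mean(dim=1))    # temporal average pooling -> (batch, emb_dim)

def contrastive_distillation_loss(student_emb, teacher_emb, temperature=0.07):
    """Pull each student embedding toward the teacher embedding of the same speaker,
    push it away from teacher embeddings of the other speakers in the batch."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature                          # (batch, batch) similarities
    targets = torch.arange(s.size(0), device=s.device)      # matching index = positive pair
    return F.cross_entropy(logits, targets)
```

In this sketch, the student sees short mel-spectrogram chunks while the teacher embedding is precomputed from matched audio of the same speaker by a large pretrained speaker encoder, so the distilled encoder can cheaply refine the speaker representation on the fly during enhancement.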


Samples

In this work, we evaluated our system and the baselines on the DNS5 blind test set, which consists of real recordings. It is separated into "Headset" and "Speakerphone" tracks. The latter is harder, as the target voice is further from the microphone than in the Headset samples. We provide samples from pDeepFilterNet2 (pDFNet2), pDeepFilterNet2 + similarity (pDFNet2+), which corresponds to the lightweight similarity-based variant, and our trained implementations of pDCCRN and E3Net. No clean reference is available for this test set, so only the unprocessed sample can be played in addition to the processed samples.


Headset Track

Sample 1 Sample 2 Sample 3 Sample 4
Noisy
pDFNet2
pDFNet2+ (proposed)
pDCCRN
E3Net


SpeakerPhone Track

Sample 1 Sample 2 Sample 3 Sample 4
Noisy
pDFNet2
pDFNet2+ (proposed)
pDCCRN
E3Net




Part I. A Lightweight Dual-Stage Framework for Personalized Speech Enhancement Based on DeepFilterNet2

Thomas Serre*† , Mathieu Fontaine* , Éric Benhaim† , Geoffroy Dutour† , Slim Essid*

†Orosound, Signal Processing Lab   *LTCI, Télécom Paris, Institut Polytechnique de Paris


Abstract

Isolating the desired speaker’s voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavors to achieve this by leveraging prior knowledge of the speaker’s voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage speech enhancement (SE) model and implement it within DeepFilterNet2, an SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for injecting the speaker embeddings into the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performance of DeepFilterNet2 while incurring minimal computational overhead.
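As a rough illustration of the kind of mechanism explored here, the sketch below shows a FiLM-style (scale-and-shift) block that injects a speaker embedding into intermediate features of an enhancement network. The dimensions, naming, and placement are assumptions made for illustration and do not reflect the paper's exact design.

```python
# Illustrative FiLM-style speaker conditioning block; not the paper's exact integration.
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Modulate intermediate enhancement features with an enrollment-derived speaker embedding."""
    def __init__(self, emb_dim=256, feat_dim=64):
        super().__init__()
        self.scale = nn.Linear(emb_dim, feat_dim)
        self.shift = nn.Linear(emb_dim, feat_dim)

    def forward(self, feats, spk_emb):
        # feats:   (batch, frames, feat_dim) intermediate features of the SE model
        # spk_emb: (batch, emb_dim) speaker embedding from the enrollment clip
        g = self.scale(spk_emb).unsqueeze(1)   # (batch, 1, feat_dim), broadcast over frames
        b = self.shift(spk_emb).unsqueeze(1)
        return feats * g + b
```

Such a block can in principle be placed at several points of a dual-stage architecture, which is precisely the design space the paper investigates.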


Samples

The samples below are extracted from a test set generated with the VCTK corpus for the speech and DNS5 Challenge data for the noise. There are three types of excerpts: target voice + interfering voice, target voice + interfering voice + noise, and target voice + noise. SNR denotes the Signal-to-Noise Ratio and SIR the Signal-to-Interference Ratio. In each table we provide the clean excerpt, the noisy excerpt, the output of DeepFilterNet2 (without personalization), and the output of pDeepFilterNet2 (with personalization). The latter is the model that we propose, which achieves personalization without increasing the computational complexity of the original model.
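The SIR and SNR values listed in the tables below characterize the relative levels of target, interference, and noise in each mixture. For reference, here is a minimal sketch of how a signal can be rescaled so that the mixture reaches a given ratio with respect to the target; it is only illustrative and not necessarily the exact procedure used to build the test set.

```python
import numpy as np

def rescale_to_ratio(target, other, ratio_db):
    """Scale `other` so that 10*log10(P_target / P_other) equals `ratio_db`.
    Shown only to illustrate what the SIR/SNR values mean for the mixtures."""
    p_target = np.mean(target ** 2)
    p_other = np.mean(other ** 2) + 1e-12
    gain = np.sqrt(p_target / (p_other * 10 ** (ratio_db / 10)))
    return other * gain

# Example: a "target + interfering voice + noise" excerpt at SIR = 3 dB and SNR = 11 dB
# mixture = target + rescale_to_ratio(target, interference, 3.0) \
#                  + rescale_to_ratio(target, noise, 11.0)
```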


Target voice + Interfering voice



Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
SIR (dB) 12 6 5 1 -1
Clean
Noisy
DeepFilterNet2
pDeepFilterNet2 (proposed)


Target voice + Interfering voice + Noise



Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
SIR (dB) 9 6 3 0 -1
SNR (dB) 5 6 11 29 12
Clean
Noisy
DeepFilterNet2
pDeepFilterNet2 (proposed)


Target voice + Noise



Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
SNR (dB) 22 7 6 3 0
Clean
Noisy
DeepFilterNet2
pDeepFilterNet2 (proposed)