Deep Noise Suppression Challenge – ICASSP 2022

Region: Global

Program dates: December 2021–February 2022

Noise suppression has become more important than ever due to the increasing use of voice interfaces across applications. Given the millions of internet-connected devices used for audio/video calls, noise suppression is expected to be effective for all noise types encountered in daily-life scenarios. The IEEE ICASSP 2022 Grand Challenge is the 4th DNS Challenge, intended to promote industry-academia collaboration on research in real-time noise suppression aimed at maximizing the subjective (perceptual) quality of enhanced speech. This challenge extends DNS efforts to full band speech with a special focus on personalized denoising. In the era of hybrid work, personalized denoising is especially important for suppressing neighboring speakers and/or background noise. DNS research has been moving fast, building on state-of-the-art advances in deep neural networks (DNNs); current deep noise suppression methods leverage convolutional, recurrent, or hybrid neural networks to estimate enhanced speech from noisy recordings.

Previous editions of the DNS Challenge provided researchers with a massive training dataset and a real test set, along with a P.808/P.835 framework for subjective evaluation of enhanced speech. For the current challenge, we improved the training dataset by cleaning it further and added more data to capture relevant DNS scenarios. We collected a new test set for full band speech, ensuring high energy content in the higher frequency bands so as to eliminate bandlimited clips from some devices. We included new noise types in the test set covering contemporary scenarios and device variety, especially mobile scenarios. Our training data synthesizer script is flexible, allowing challenge participants to exclude any subset or add new data. This provides an opportunity to leverage the challenge data along with other corpora for improving DNS performance. Our test set consists of real-world test clips recorded by crowd-sourced workers and/or Microsoft employees. We have two dev-test sets, one for real-time denoising and one for personalized real-time denoising. Similarly, we have two blind test sets, one for each challenge track.

The challenge paper can be found at ICASSP_2022_4th_Deep_Noise_Suppression_Challenge

The tracks in this challenge are:

Track 1: Real-Time Non-Personalized DNS for full band speech

  • The noise suppressor must take less than the stride time Ts (in ms) to process a frame of size T (in ms) on an Intel Core i5 quad-core machine clocked at 2.4 GHz or equivalent processors; e.g., Ts = T/2 for 50% overlap between frames. The total algorithmic latency allowed, including the frame size T, stride time Ts, and any lookahead, must be <= 40 ms. For example, a real-time system with a frame length of 20 ms and a stride of 10 ms has an algorithmic latency of 30 ms and thus satisfies the latency requirement. A frame size of 32 ms with a stride of 16 ms results in an algorithmic latency of 48 ms and does not meet the requirement, as the total algorithmic latency exceeds 40 ms. If the frame size (T) plus stride (Ts), represented as T1 = T + Ts, is less than 40 ms, then up to (40 - T1) ms of future information (lookahead) can be used.
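The latency rule above can be sketched as a couple of helper functions. This is an illustrative check only (the function names are ours, not part of the challenge tooling), encoding frame + stride + lookahead <= 40 ms:

```python
def algorithmic_latency_ok(frame_ms: float, stride_ms: float,
                           lookahead_ms: float = 0.0,
                           budget_ms: float = 40.0) -> bool:
    """True if frame size + stride + lookahead fits the 40 ms budget."""
    return frame_ms + stride_ms + lookahead_ms <= budget_ms


def max_lookahead_ms(frame_ms: float, stride_ms: float,
                     budget_ms: float = 40.0) -> float:
    """Future information still usable: (40 - T1) ms, where T1 = T + Ts."""
    return max(0.0, budget_ms - (frame_ms + stride_ms))


# Examples from the rule above:
print(algorithmic_latency_ok(20, 10))   # 30 ms total -> True
print(algorithmic_latency_ok(32, 16))   # 48 ms total -> False
print(max_lookahead_ms(20, 10))         # up to 10.0 ms of lookahead remains
```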

Track 2: Real-Time Personalized DNS for full band speech

  • Satisfy Track 1 requirements.
  • 2 minutes of clean speech for each unique target speaker in the test set is provided for adapting the DNS model or speaker embedding extractor to the target speaker for personalized denoising. This track has a separate dev-test set and blind test set.

Participants are forbidden from using the blind test set to retrain or tweak their models. Participants must submit results only if they intend to submit a paper to the ICASSP 2022 Deep Noise Suppression Challenge. Failing to adhere to these rules will lead to disqualification from the challenge.

Evaluation criteria and methodology

This challenge adopts the ITU-T P.835 subjective test framework to measure speech quality (SIG), background noise quality (BAK), and overall audio quality (OVRL). We are also releasing DNSMOS P.835, a machine learning based model for predicting SIG, BAK, and OVRL. Participants can use DNSMOS P.835 to evaluate their intermediate models. In this challenge, we introduce Word Accuracy (WAcc) as an additional metric to compare the performance of DNS models. Challenge winners will be decided based on OVRL and WAcc as follows:

M = ((OVRL − 1)/4 + WAcc) / 2

WAcc will be obtained using the Microsoft Azure Speech Recognition API. This challenge metric gives equal weighting to subjective quality and speech recognition performance. The dev-test set and DNSMOS P.835 are provided to participants to accelerate model development, and a script to evaluate WAcc is also provided. We use neither the dev-test set nor DNSMOS P.835 for deciding the final winners. DNSMOS P.835 has a high correlation with human perception and hence can serve as a robust measure of audio quality. The challenge winners will be decided based on M computed on the enhanced clips from the blind test set.
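The metric M can be computed as below. This is a minimal sketch (the function name is ours): OVRL is a P.835 mean opinion score on a 1-5 scale, so (OVRL − 1)/4 rescales it to [0, 1] before averaging with WAcc, which is already in [0, 1]:

```python
def challenge_metric(ovrl: float, wacc: float) -> float:
    """Challenge score M = ((OVRL - 1)/4 + WAcc) / 2.

    ovrl: P.835 overall quality MOS, in [1, 5].
    wacc: word accuracy from the speech recognizer, in [0, 1].
    Returns M in [0, 1]; higher is better.
    """
    return ((ovrl - 1.0) / 4.0 + wacc) / 2.0


print(challenge_metric(5.0, 1.0))  # perfect quality and recognition -> 1.0
print(challenge_metric(3.0, 0.5))  # mid-scale on both axes -> 0.5
```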

Registration procedure

  • To register for the challenge, participants are required to email Deep Noise Suppression Challenge [email protected] with the names of their team members, their emails, affiliations, team name, and a tentative paper title.
  • Participants also need to register on the Challenge CMT site, where they can submit the enhanced clips.
  • Challenge organizers announce the availability of data, the baseline model, evaluation results, etc. by emailing registered participants via CMT.
  • Challenge organizers respond to queries emailed to Deep Noise Suppression Challenge [email protected].

Contact us: If you have questions about this program, email us at [email protected].