Adapting Frechet Audio Distance for Generative Music Evaluation

Azalea Gui; Hannes Gamper; Sebastian Braun; Dimitra Emmanouilidou

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | April 2024

Published by IEEE

Best student paper award at IEEE ICASSP 2024

The growing popularity of generative music models underlines the need for perceptually relevant, objective music quality metrics. The Frechet Audio Distance (FAD) is commonly used for this purpose even though its correlation with perceptual quality is understudied. We show that FAD performance may be hampered by sample size bias, poor choice of audio embeddings, or the use of biased or low-quality reference sets. We propose reducing sample size bias by extrapolating scores towards an infinite sample size. Through comparisons with MusicCaps labels and a listening test we identify audio embeddings and music reference sets that yield FAD scores well-correlated with acoustic and musical quality. Our results suggest that per-song FAD can be useful to identify outlier samples and predict perceptual quality for a range of music sets and generative models. Finally, we release a toolkit that allows adapting FAD for generative music evaluation.
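The sample-size extrapolation mentioned in the abstract can be illustrated with a minimal sketch. It is not the fadtk API. The function names `frechet_distance` and `extrapolated_fad`, the synthetic Gaussian "embeddings", and the choice of a straight-line fit against 1/N (taking the intercept as the infinite-sample score) are all assumptions made for illustration. The abstract only states that scores are extrapolated towards an infinite sample size.

```python
import numpy as np


def frechet_distance(x, y):
    """Frechet distance between Gaussians fitted to two embedding sets.

    x, y: arrays of shape (n_samples, dim) with dim >= 2 (hypothetical inputs).
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr(sqrt(cov_x @ cov_y)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_x @ cov_y).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


def extrapolated_fad(reference, evaluated, sizes, rng, repeats=5):
    """Fit scores against 1/N on random subsets and return (intercept, slope).

    Assumption: the score is roughly linear in 1/N, so the intercept
    estimates the score for an infinitely large evaluated set.
    """
    inv_n, scores = [], []
    for n in sizes:
        for _ in range(repeats):
            idx = rng.choice(len(evaluated), size=n, replace=False)
            inv_n.append(1.0 / n)
            scores.append(frechet_distance(reference, evaluated[idx]))
    slope, intercept = np.polyfit(inv_n, scores, deg=1)
    return float(intercept), float(slope)


# Synthetic stand-ins for audio embeddings (not real data).
rng = np.random.default_rng(0)
reference = rng.normal(size=(2000, 8))
evaluated = rng.normal(loc=0.5, size=(2000, 8))

fad_inf, slope = extrapolated_fad(
    reference, evaluated, sizes=[200, 500, 1000, 2000], rng=rng
)
print(f"extrapolated score: {fad_inf:.3f}, slope vs 1/N: {slope:.3f}")
```

For these synthetic distributions the covariances match and the means differ by 0.5 in each of 8 dimensions, so the underlying distance is 8 × 0.25 = 2.0; the intercept should land near that value. In practice the embeddings would come from an audio model rather than a random generator.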

Link to the FAD toolkit: fadtk on GitHub

Pearson correlation coefficient between Frechet Audio Distance (FAD) and listening test scores for all tested embeddings and reference datasets, for acoustic and musical quality.

Final intern talk: Improving Frechet Audio Distance for Generative Music Evaluation

As generative music models become more powerful and popular, there is a growing need for robust objective metrics of music quality that correlate with human perception. The Frechet Audio Distance (FAD) is a commonly used metric for this purpose. However, its performance may be hampered by issues including sample size bias, limitations of the underlying audio embeddings, and the use of low-quality reference sets. We propose reducing sample size bias by extrapolating unbiased scores as the sample size approaches infinity. A comparison of various audio embeddings reveals that some are better suited for deriving FAD scores that capture aspects of musical or acoustic quality. Finally, our experiments underscore the importance of choosing a diverse and high-quality reference dataset for FAD calculation. Listening test results indicate…