Supervised Deep Hashing for Efficient Audio Retrieval [Video]

Audio Event Classification (AEC) is the task of automatically assigning a semantic label to a given audio segment. Despite multiple efforts toward learning better and more robust audio representations (or embeddings), there has been comparatively little research on efficient retrieval of audio events. Fast retrieval can facilitate near-real-time similarity search between a query sound and a database containing millions of audio events.

This work, the first of its kind, investigates the effectiveness of different hashing techniques for efficient audio event retrieval. We employ state-of-the-art audio embeddings as features and analyze the performance of several classical unsupervised hashing algorithms. We then show that using a small portion of the annotated database for supervised hashing via a Deep Quantization Network (DQN) can significantly boost retrieval performance. The detailed experimental results, extensive analysis, and comparison between supervised and unsupervised hashing methods provide insights into the quantizability of the employed audio embeddings, and further allow performance evaluation of such an audio retrieval system.
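To make the unsupervised baselines concrete, the sketch below (not code from this work) shows one classical unsupervised hashing scheme: random-projection (LSH-style) binarization of pre-computed audio embeddings, followed by Hamming-distance retrieval. The 128-dimensional embeddings, 64-bit code length, and function names are illustrative assumptions; the supervised DQN approach additionally learns the codes from a labeled subset of the database.

```python
# Minimal sketch (assumed setup, not the authors' implementation):
# unsupervised random-projection hashing of audio embeddings + Hamming retrieval.
import numpy as np

rng = np.random.default_rng(0)

def fit_random_projections(dim, n_bits, rng):
    """Sample a random projection matrix defining the hash functions."""
    return rng.standard_normal((dim, n_bits))

def hash_embeddings(embeddings, projections):
    """Binarize embeddings: each bit is the sign of one random projection."""
    return (embeddings @ projections > 0).astype(np.uint8)

def hamming_retrieve(query_code, db_codes, k=10):
    """Return indices of the k database items closest in Hamming distance."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)[:k]

# Toy usage with random vectors standing in for real audio embeddings.
dim, n_bits = 128, 64                                    # assumed sizes
db_embeddings = rng.standard_normal((10_000, dim))       # database of audio events
query_embedding = rng.standard_normal((1, dim))          # query sound

proj = fit_random_projections(dim, n_bits, rng)
db_codes = hash_embeddings(db_embeddings, proj)
query_code = hash_embeddings(query_embedding, proj)[0]

top_k = hamming_retrieve(query_code, db_codes, k=10)
print("Top-10 nearest audio events (by Hamming distance):", top_k)
```

Because the codes are binary, the database index is compact and the distance computation reduces to bit operations, which is what enables near-real-time search over millions of audio events.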

[Slides]

Date:
Speaker:
Arindam Jati
Affiliation:
University of Southern California