Robust and Efficient Multiple Alignment of Unsynchronized Meeting Recordings
- T. J. Tsai ,
- Andreas Stolcke
IEEE Transactions on Speech, Audio and Language Processing | , Vol 24: pp. 833-845
This paper proposes a way to generate a single high-quality audio recording of a meeting using no equipment other than participants’ personal devices. Each participant in the meeting uses their mobile device as a local recording node, and they begin recording whenever they arrive in an unsynchronized fashion. The main problem in generating a single summary recording is to temporally align the various audio recordings in a robust and efficient manner. We propose a way to do this using an adaptive audio fingerprint based on spectrotemporal eigenfilters, where the fingerprint design is learned on-the-fly in a totally unsupervised way to perform well on the data at hand. The adaptive fingerprints require only a few seconds of data to learn a robust design, and they require no tuning. Our method uses an iterative, greedy two-stage alignment algorithm which finds a rough alignment using indexing techniques, and then performs a more fine-grained alignment based on Hamming distance. Our proposed system achieves >99% alignment accuracy on challenging alignment scenarios extracted from the ICSI meeting corpus, and it outperforms five other well-known and state-ofthe-art fingerprint designs. We conduct extensive analyses of the factors that affect the robustness of the adaptive fingerprints, and we provide a simple heuristic that can be used to adjust the fingerprint’s robustness according to the amount of computation we are willing to perform.
© IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.