Introduction
Speech technology has come a long way since Alexander Graham Bell’s famous “Mr. Watson – come here – I want to see you” became the first speech heard over the telephone in 1876. Today, speech technology has moved into realms such as VoIP, teleconferencing systems, and home automation. Its importance has grown exponentially with the emergence of mobile and wearable devices, and many existing and upcoming Microsoft services, devices and algorithms depend on voice-based interfaces.
For all this progress, significant inefficiencies remain, and the need for high-performing speech-processing technologies has never been more apparent. Traditional signal processing algorithms that were once the state of the art – especially in speech recognition and computer vision – are facing performance plateaus. Meanwhile, a new class of algorithms has emerged that can learn directly from data and remain robust in diverse and adverse application environments. Fueled by these advances in machine learning and AI, the development of speech technologies has exploded, making voice interfaces more practical and useful and leading to easier, more efficient communication with the machines around us. Experts believe that speech applications are approaching a level of reliability at which everyday use will become second nature.
ICASSP
The 2018 International Conference on Acoustics, Speech and Signal Processing (ICASSP) in Calgary, Canada, is the world’s largest and most comprehensive technical conference focused on signal processing and its applications, and the global event for presenting important developments in speech technology. The conference is sponsored by the IEEE Signal Processing Society and has been held annually since 1976. It features world-class speakers, tutorials, exhibits, a show-and-tell event and over 120 presentation and poster sessions. Microsoft’s presence was significant, with researchers presenting over 25 papers on ground-breaking, novel machine-learning methods for speech processing – work that significantly improves the odds of advancing speech quality across many backend services and devices.
At ICASSP, Microsoft offered a glimpse of future speech services – a world of lightly supervised training, enhanced robustness and more intuitive interaction with machines. Far-field ASR and voice control have become far more practical, now working reliably in noisy environments, supporting interaction from across a room, and handling multiple speakers even when they talk simultaneously. Virtual assistants such as Microsoft Cortana offer a simpler way of accessing information, cueing up songs and building shopping lists, all using just your voice. As part of these applications, multimodal speech processing is gaining more attention, and several conference sessions were dedicated to such areas. Microsoft is well placed here, especially considering the impressive size of the team dedicated to advancing speech recognition accuracy and improving conversational interfaces overall.
It is also worth noting that more and more research teams are broadening their focus beyond core ASR to include areas such as multi-speaker ASR, language identification and diarization, all of which are required to build end-to-end applications.
Sounding the Future
Natural language understanding and dialogue systems are two of the next big challenges in AI. The use of speech and image recognition to analyze inflections and facial expressions as part of a dialogue system will make machines interact more naturally with their human users. Although many researchers expect voice interfaces to become more natural, language remains a major challenge for AI: responding well requires domain-specific intelligence combined with knowledge of effective human-machine interaction. A number of significant Microsoft papers presented at ICASSP advance the conversation in these areas, including “Improving End-of-Turn Detection in Spoken Dialogues by Detecting Speaker Intentions as a Secondary Task”, “The Microsoft 2017 Conversational Speech Recognition System”, “Domain and Speaker Adaptation for Cortana Speech Recognition”, “Sequence Modeling in Unsupervised Single-Channel Overlapped Speech Recognition” and “Towards Language-Universal End-to-End Speech Recognition”.
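To make the “secondary task” idea behind the first paper concrete, here is a minimal multi-task sketch – not the authors’ actual model; the architecture, names and dimensions are illustrative assumptions. A shared recurrent encoder over acoustic frames feeds two heads, a primary end-of-turn detector and a secondary speaker-intention classifier, trained with a weighted sum of the two losses:

```python
import torch
import torch.nn as nn

class EndOfTurnMultiTask(nn.Module):
    """Hypothetical multi-task model: shared encoder, two task heads."""
    def __init__(self, n_features=40, hidden=128, n_intents=8):
        super().__init__()
        # Shared acoustic encoder over per-frame features
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        # Primary task: is the current turn ending? (binary logit)
        self.eot_head = nn.Linear(hidden, 1)
        # Secondary task: what is the speaker's intention?
        self.intent_head = nn.Linear(hidden, n_intents)

    def forward(self, frames):                 # frames: (batch, time, n_features)
        _, h = self.encoder(frames)            # h: (1, batch, hidden)
        h = h.squeeze(0)
        return self.eot_head(h).squeeze(-1), self.intent_head(h)

model = EndOfTurnMultiTask()
frames = torch.randn(4, 100, 40)               # toy batch of 4 utterances
eot_logit, intent_logits = model(frames)
eot_target = torch.tensor([1., 0., 1., 0.])
intent_target = torch.tensor([2, 5, 0, 7])
# Weighted multi-task loss; the 0.3 secondary-task weight is arbitrary here
loss = (nn.functional.binary_cross_entropy_with_logits(eot_logit, eot_target)
        + 0.3 * nn.functional.cross_entropy(intent_logits, intent_target))
loss.backward()
```

The appeal of this setup is that the secondary intention labels act as extra supervision for the shared encoder, which can sharpen the primary end-of-turn decision without any extra inference cost.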
One of the hottest trends in machine learning is Generative Adversarial Networks (GANs). These systems pit one neural network that generates artificial data against another trained to distinguish fake data from real. Trained together, the two networks can produce synthetic data that is nearly indistinguishable from the real thing. Papers like “Adversarial Teacher-Student Learning for Unsupervised Domain Adaptation”, “Speaker-Invariant Training via Adversarial Learning”, and “Adversarial Advantage Actor-Critic Model for Task-Completion Dialogue Policy Learning” attest to Microsoft’s pioneering efforts in applying GANs to speech and dialogue.
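The adversarial recipe itself is compact enough to sketch. The toy example below (a generic GAN on 1-D synthetic data, not any of the Microsoft systems above) alternates between training a discriminator to separate real samples from generated ones and training a generator to fool it:

```python
import torch
import torch.nn as nn

# Toy "real" data: samples drawn from N(4, 1)
def real_batch(n=64):
    return 4 + torch.randn(n, 1)

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # 1) Train D: real samples -> label 1, generated samples -> label 0
    real = real_batch()
    fake = G(torch.randn(64, 8)).detach()       # detach: don't update G here
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train G: push D to label generated samples as real
    fake = G(torch.randn(64, 8))
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Roughly the same adversary-as-regularizer idea underlies the domain-adaptation and speaker-invariant training work: there the discriminator tries to predict a nuisance factor such as domain or speaker identity, and the acoustic model learns representations that hide it.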
As previously noted, ICASSP covers a wide range of technologies that reflect broader trends in machine learning. A large fraction of the ASR-related papers was dedicated to attention mechanisms, end-to-end modeling and sequence-to-sequence models. Microsoft has been using sequence-to-sequence systems for machine translation; in the case of ASR, there are still important problems to iron out. Nevertheless, Microsoft is advancing the field in these areas with papers like “Advancing Connectionist Temporal Classification with Attention Modeling”, “Advancing Acoustic-to-Word CTC Model”, and “Neural Sequential Malware Detection with Parameters”.
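As a point of reference for the CTC-based work, here is a minimal sketch of the standard CTC greedy (best-path) decoding rule, with a toy alphabet of our own choosing: pick the most likely symbol per frame, collapse consecutive repeats, then drop the blank token.

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """Best-path CTC decoding: argmax per frame, collapse repeated
    symbols, then remove the blank token."""
    best = np.argmax(log_probs, axis=-1)           # (time,)
    out, prev = [], blank
    for s in best:
        if s != prev and s != blank:               # new non-blank symbol
            out.append(int(s))
        prev = s
    return out

# Toy example: 6 frames over the alphabet {blank, 'a', 'b'}
probs = np.log(np.array([
    [0.1, 0.8, 0.1],   # 'a'
    [0.1, 0.8, 0.1],   # 'a' (repeat -> collapsed)
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.1, 0.8],   # 'b'
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.8, 0.1],   # 'a'
]))
print(ctc_greedy_decode(probs))   # -> [1, 2, 1], i.e. "a b a"
```

The blank-and-collapse rule is what lets a CTC model emit a short label sequence from a long frame sequence without any explicit frame-level alignment, which is exactly the property the attention-augmented CTC papers build on.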
What’s Next?
Clearly, core areas of speech technology like automatic speech recognition and text-to-speech synthesis have reached an impressive level of maturity. But significant open questions remain around how to use the voice modality to create more natural user interfaces. Much attention was devoted during the ICASSP sessions to far-field speech processing, diarization, speech separation and similar technical challenges. Microsoft’s strong interest in these areas is reflected in multiple papers, including “Developing Far-field Speaker System via Teacher-Student Learning”, “Exploring sequential characteristics in speaker bottleneck feature for text-dependent speaker verification”, and “Efficient Integration of Fixed Beamformers and Speech Separation Networks for Multi-channel Far-Field Speech Separation”.
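A fixed beamformer of the kind combined with separation networks in the last paper can be sketched in a few lines. This toy delay-and-sum version (the two-mic geometry, delays and sample rate are illustrative assumptions, not the paper’s setup) time-aligns the microphone signals for a chosen look direction and averages them, reinforcing the target while averaging down off-axis noise:

```python
import numpy as np

def delay_and_sum(mics, delays_samples):
    """Fixed delay-and-sum beamformer.

    mics: (n_channels, n_samples) time-domain signals
    delays_samples: per-channel integer delays (in samples) that
        time-align the target direction across the array.
    """
    aligned = np.stack([np.roll(x, -d) for x, d in zip(mics, delays_samples)])
    return aligned.mean(axis=0)

# Toy 2-mic example at 16 kHz: the target arrives 3 samples later at mic 1
rng = np.random.default_rng(0)
target = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
mic0 = target + 0.5 * rng.standard_normal(16000)
mic1 = np.roll(target, 3) + 0.5 * rng.standard_normal(16000)
out = delay_and_sum(np.stack([mic0, mic1]), delays_samples=[0, 3])
# The aligned target adds coherently; the uncorrelated mic noise does not
```

Because such beamformers are fixed and cheap, they make a natural front end: the neural separation network then only has to untangle whatever the spatial filtering could not remove.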
Challenges remain across the cognitive and behavioral sciences in how to design truly effective and efficient human-computer interaction. As part of meeting these challenges, affective computing (such as emotion processing) is very likely to continue gaining momentum, and many of its most prominent problems stand to be solved. The ultimate challenge will be to combine such increasingly accurate sensing capabilities to improve and elevate human-machine communication in both home and work environments.