Conversational Agents that See
The dream of a personal digital assistant has been with us for decades: an entity that knows our habits, activities and preferences, that can converse with us, anticipate our needs and carry out tasks for us. Recently, we have seen a resurgence of enthusiasm around this concept, particularly in the form of conversational systems on mobile phones such as Cortana, Siri and Google Now. While these are still in their infancy, they signal new ambitions to realise the vision of artificially intelligent, conversational agents.
In this project we explore what it might mean to augment such agents with the ability to see. By partnering with experts in computer vision, speech and machine learning, we ask whether an agent that can see a person's activities and context might be a more effective assistant. After all, what we see often qualifies what we say by providing a shared context for conversation: looking at our surroundings brings objects of interest, other people, aspects of the environment and ongoing activities into the conversation, and an agent with vision can begin to recognise and notice what is happening in the world around a person. Conversely, conversation qualifies what we see; it can help clarify and add meaning to people, places, objects and events. In both cases, we would expect adding vision to agents to give them a better understanding of a person and their relationship to the world around them.
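To make concrete how seeing and conversing might qualify one another, the sketch below shows one minimal way such a loop could be wired up. It is an illustration only, not a description of our prototypes, and every name in it (detect_objects, VisionDialogueAgent and the canned detections) is a hypothetical placeholder: a vision step tags objects in the scene, those tags become shared context for the next conversational turn, and the person's reply in turn confirms or corrects the agent's visual hypothesis.

from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical placeholder: a real system would call a computer-vision
# model here; this stub just returns canned (label, confidence) pairs.
def detect_objects(image_path: str) -> list[tuple[str, float]]:
    return [("coffee mug", 0.62), ("water bottle", 0.31)]

@dataclass
class VisionDialogueAgent:
    """Toy agent in which seeing and conversing qualify each other."""
    context: list[str] = field(default_factory=list)

    def look(self, image_path: str) -> str:
        # Vision qualifies conversation: the top detection becomes
        # shared context for what the agent says next.
        label, confidence = max(detect_objects(image_path), key=lambda d: d[1])
        self.context.append(label)
        if confidence < 0.7:
            return f"I think I can see a {label}, is that right?"
        return f"I can see a {label}."

    def hear(self, utterance: str) -> str:
        # Conversation qualifies vision: the person's reply confirms
        # or corrects the agent's current visual hypothesis.
        if self.context and "yes" in utterance.lower():
            return f"Noted, I'll remember the {self.context[-1]}."
        if self.context:
            self.context.pop()
        return "Thanks, I'll revise what I thought I saw."

if __name__ == "__main__":
    agent = VisionDialogueAgent()
    print(agent.look("desk.jpg"))             # "I think I can see a coffee mug, is that right?"
    print(agent.hear("Yes, that's my mug."))  # "Noted, I'll remember the coffee mug."

In a real system the placeholder detector would be replaced by a trained computer-vision model and the canned replies by a dialogue manager; the point of the sketch is only the two-way flow of context between seeing and saying.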
The project is exploring a number of use scenarios where computer vision and speech converge to better serve the user. Our approach involves user-centred design, which means we use a mix of literature reviews, interview studies, focus groups and ethnographic techniques to generate and refine our ideas. We are also building prototype applications based on those scenarios to test and evolve our concepts, ranging from helping people capture and organise objects from the physical world to supporting navigational tasks in both indoor and outdoor environments.
This work was carried out in collaboration with the Machine Learning and Perception group at Microsoft Research in Redmond and Cambridge.