Gaming Interaction Infrastructure:
We are excited to share that our project “MindAgent: Emergent Gaming Interaction” was recently made public. We seek to develop a unified interaction infrastructure and architecture that can jointly understand large language corpora and visual (image and video) inputs, and produce meaningful action-based outputs. We evaluate our model on a broad range of gaming video tasks and show the efficacy of the agent action stream across tasks including interactive agents and visual and natural language understanding.

In this work, we propose a novel infrastructure, MindAgent, to evaluate emergent planning and coordination capabilities for gaming interaction. In particular, our infrastructure leverages existing gaming frameworks to i) require understanding of the coordinator for a multi-agent system, ii) collaborate with human players via un-finetuned proper instructions, and iii) establish in-context learning on few-shot prompts with feedback. Furthermore, we introduce CuisineWorld, a new gaming scenario and related benchmark that tests multi-agent collaboration efficiency and supervises multiple agents playing the game simultaneously. We conduct comprehensive evaluations with a new automatic metric, CoS, for calculating collaboration efficiency.

Finally, our infrastructure can be deployed in real-world gaming scenarios through a customized VR version of CuisineWorld and adapted to the broader existing Minecraft gaming domain. By creating a powerful, general-purpose foundation model with visual, language, and action capabilities, we can have a significant impact across many industries, both within Microsoft and externally.
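The text above names CoS as an automatic metric for collaboration efficiency but does not define it. Purely as an illustration, the sketch below assumes a score of one plausible shape: for each evaluation setting, take the fraction of dispatched tasks that were completed rather than failed, then average those fractions. The names `IntervalResult` and `collaboration_score` are hypothetical, and the actual CoS definition may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class IntervalResult:
    """Outcome of one evaluation setting (hypothetical structure)."""
    completed: int  # tasks the agents finished in time
    failed: int     # tasks that expired or were abandoned


def collaboration_score(results: Sequence[IntervalResult]) -> float:
    """Average completion ratio across settings.

    Assumption: this is an illustrative efficiency score, not a
    confirmed definition of the CoS metric named in the text.
    """
    if not results:
        raise ValueError("at least one evaluation setting is required")
    ratios = []
    for r in results:
        total = r.completed + r.failed
        # A setting with no dispatched tasks contributes a ratio of 0.
        ratios.append(r.completed / total if total else 0.0)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    runs = [IntervalResult(completed=3, failed=1),
            IntervalResult(completed=1, failed=1)]
    print(collaboration_score(runs))
```

Averaging per-setting ratios, rather than pooling all tasks into one ratio, gives each setting equal weight regardless of how many tasks it dispatched. That is a design choice of this sketch, not something the text specifies.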