By Douglas Gantenbein, Senior Writer, Microsoft News Center
To use a Kinect for Xbox 360 gaming device is to see something akin to magic. Different people move in and out of its view, and Kinect recognizes the change in a player and responds accordingly.
It accomplishes this task despite the enormous variation in what it sees. Lighting can change within a room. A player might appear close to the Kinect one minute, farther away the next. And faces change second to second as players react to the action.
Kinect Identity, as the device’s player-recognition tool set is called, recognizes people by looking for three visual cues:
- The heights of the players.
- The color of their clothing.
- Their faces.
That last element is the key. Players might be close in height. They might be wearing similarly colored clothing. But faces are as individual as, well, individuals.
That’s where Microsoft Research’s work played a significant role in helping Kinect learn who is who. Jian Sun—a senior researcher with Microsoft Research Asia’s Visual Computing Group—worked with colleagues outside Microsoft on solving the complicated task of teaching a machine how to recognize people when they change poses, frown or smile, have shadows across their faces, or are brightly lit.
Identifying a face is not an easy task for a machine.
“The fundamental difficulty comes from intrapersonal variation,” Sun says. “The face of a single person can appear very different under different conditions. Due to lighting, expressions, or poses, there can be even bigger differences than between two people.”
Prior Work
Sun began work on face recognition three years ago. His work contributed to a face-recognition feature in Windows Live Photo Gallery that enables users to tag and search for photos of friends or family members.
Sun acknowledges that, at least for now, a machine never can be 100 percent successful at detecting all the variations a single face can exhibit. The trick, he says, is giving Kinect the ability to make extremely educated guesses.
Much of the face-recognition technology in Kinect is based on a paper called Face Recognition with Learning-based Descriptor, co-authored by Sun along with Zhimin Cao, from The Chinese University of Hong Kong; Qi Yin, from Tsinghua University; and professor Xiaoou Tang, from The Chinese University of Hong Kong.
Most face-recognition tools take what seems like the obvious route: They compare any faces they see with a stored database of faces. While simple, this approach stumbles when confronted by faces under different lighting, or when a subject is scowling while the stored reference face is smiling.
Sun and his co-authors devised a method for teaching a device to recognize faces based on what facial features are most prominent under different poses or lighting. That is, a nose or the left or right eye might be more critical to recognition than other features, depending on the pose.
The technique uses two steps. First, it extracts nine key landmarks from a face: nose, mouth, eyes, and so on. The images are filtered to remove illumination variations, then each landmark is encoded as a compact descriptor.
Next, the system determines the facial pose—whether the subject is looking straight at the camera or looking left or right. Poses can vary widely, of course, so Kinect uses an algorithm that determines what seems to be the most likely candidate. The system then matches the subject’s eyes, mouth, or nose to images in its database and finds the best match.
The facial-recognition tool also determines where the face is appearing in its field of view and “normalizes” the facial size to compensate for whether the player is near the Kinect or far away.
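The pipeline described above—normalize for distance, extract landmarks, estimate pose, then match against stored faces of the same pose—can be sketched as follows. This is an illustrative outline only, not Microsoft's implementation; the landmark names, the use of raw coordinates as stand-ins for the paper's learned descriptors, and the face dictionaries are all assumptions made for the example.

```python
import math

def normalize_face(face, target_size=100):
    """Scale landmark coordinates to a fixed face size, compensating for
    whether the player is near the camera or far away (hypothetical scheme)."""
    scale = target_size / face["size"]
    return {name: (x * scale, y * scale)
            for name, (x, y) in face["landmarks"].items()}

def descriptor_distance(a, b):
    """Total distance between corresponding landmark descriptors.
    Plain coordinates stand in for the paper's learned,
    illumination-filtered descriptors."""
    return sum(math.dist(a[name], b[name]) for name in a)

def best_match(subject, database):
    """Estimate the subject's pose, restrict the comparison to stored faces
    with the same pose, and return the closest identity."""
    pose = subject["pose"]  # e.g. "frontal", "left", "right"
    candidates = [f for f in database if f["pose"] == pose] or database
    normalized = normalize_face(subject)
    return min(candidates,
               key=lambda f: descriptor_distance(normalized, normalize_face(f)))
```

Because both the subject and the stored faces are normalized to the same size, a player standing close to the sensor still matches their own enrolled face.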
Under most conditions, the face-recognition tool achieves a success rate of nearly 85 percent.
Additional Approaches
Sun’s team also contributed to Kinect’s two other player-recognition approaches: identifying a player based on clothing and on height. Working with the Kinect product team, Sun and his colleague Yichen Wei helped develop Kinect’s approach to avoiding mistakes. And those do occur—as impressive as the facial-recognition technology is, it’s not perfect.
For each new game session, Kinect gathers the players’ characteristics—face, height, and clothing color—and matches them against information it has stored about previous players. For Kinect to “identify” a player, it must have one positive response, such as a recognized height, and no negative responses, such as wrong clothing color.
The facial-recognition component acts as something of a tiebreaker. It’s part of the recognition process itself, of course, and in the case of a strong facial match, it will identify a player even if one of the other identifiers—height or clothing color—comes back as a negative match.
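The decision rule just described—at least one positive cue, no negative cues, with a strong facial match able to override a single negative—can be written down compactly. This is a sketch of the rule as the article states it, not Microsoft's code; the function name and the three-valued cue encoding are assumptions.

```python
def identify(face_match, height_match, clothing_match, strong_face=False):
    """Decide whether Kinect 'identifies' a returning player.

    Each cue is True (positive match), False (negative match), or
    None (inconclusive). Hypothetical encoding for illustration.
    """
    cues = [face_match, height_match, clothing_match]
    positives = sum(1 for c in cues if c is True)
    negatives = sum(1 for c in cues if c is False)
    # A strong facial match acts as a tiebreaker: it identifies the player
    # even if one other cue (height or clothing color) comes back negative.
    if strong_face and face_match is True and negatives <= 1:
        return True
    # Otherwise: at least one positive cue and no negative cues.
    return positives >= 1 and negatives == 0
```

Under this rule, a recognized height alone is enough when nothing contradicts it, but a wrong clothing color blocks identification unless the face match is strong.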
The adoption of Microsoft Research’s work by Kinect occurred in part through serendipity. Tommer Leyvand, now a principal development lead with the Kinect team, interned at Microsoft Research Asia in 2005, and became familiar with that facility’s facial-recognition work.
By early 2009, Leyvand was part of the Kinect team in Redmond.
“When Kinect came along, we knew we were going to need facial recognition as part of it, and I knew Microsoft Research Asia had a lot of papers out on that technology,” he says. “They have been part of our visual-features team ever since—and they came over to Redmond at crunch time to help get Kinect ready for shipping. It was a very close working relationship.”
Microsoft Research Asia scientist Yichen Wei worked with Sun and the Kinect team on assembling the final Kinect Identity tool set.
Kinect’s ability to recognize people serves two purposes. One is to identify players, automatically sign them in to their Xbox LIVE accounts, and deploy their avatars. The other is to keep track of players as they enter and exit a game, putting the right player back into the action each time.
The way it does so is what makes Kinect so amazing to people.
‘Part of the Experience’
“It becomes part of the experience,” Leyvand says. “The magic is when you don’t do anything. You just stand there, and it knows who you are.”
Sun is working on how the next generation of Kinect will handle identities. He also is pursuing a new approach to facial recognition, one that recognizes faces in the same way people do. In a second paper on face recognition, An Associate-Predict Model for Face Recognition, Sun and co-authors Yin and Tang conjecture that a person takes prior memories of other people and uses those to predict how a particular person will appear under different settings.
To recognize a face that has changed pose or is under different lighting, the associate-predict model begins by building a database of “generic” faces. Facial components are broken down by key facial landmarks—such as eye centers and mouth corners—and 12 other facial features. This serves as the recognition engine’s basic “memory” of how faces appear under different conditions or in different poses.
In the next step, the face of a specific subject—such as a Kinect player—is compared to the 28 different “memory” images: seven poses times four lighting variations. The recognition engine “associates” the subject’s face to the memory bank of stored faces, matching one or more key facial features, such as an eye that is looking to the left and is on the shadowed side of a face. Then it uses that information to make an educated guess as to what the subject’s face will look like with a different pose or under different lighting.
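The associate-predict idea—index 28 generic "memory" images by the seven poses and four lighting conditions, associate the subject with the memory image matching its own conditions, then use the memory bank to predict the subject's appearance under different conditions—can be sketched as below. The pose and lighting labels, the single-number features, and the additive prediction are all simplifying assumptions for illustration; the paper's actual association and prediction steps operate on richer facial-component features.

```python
# Hypothetical labels for the paper's 7 poses x 4 lighting conditions.
POSES = ["frontal", "left_30", "left_60", "right_30", "right_60", "up", "down"]
LIGHTINGS = ["even", "left_lit", "right_lit", "dim"]

def build_memory(generic_faces):
    """Index generic faces by (pose, lighting): the 7 x 4 = 28 memory images."""
    return {(f["pose"], f["lighting"]): f["features"] for f in generic_faces}

def predict_appearance(subject_features, subject_pose, subject_lighting,
                       target_pose, target_lighting, memory):
    """Associate the subject with the memory image under its own conditions,
    then apply the memory bank's pose/lighting shift to predict how the
    subject will look under the target conditions (simplified additive model)."""
    seen = memory[(subject_pose, subject_lighting)]
    target = memory[(target_pose, target_lighting)]
    return {name: value + (target[name] - seen[name])
            for name, value in subject_features.items()}
```

The engine can then compare the *predicted* face to a new observation, rather than comparing two faces captured under incompatible poses or lighting.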
The current Kinect’s player-recognition ability seems uncanny. Future generations of the device could appear to be downright supernatural.