Designer-centered reinforcement learning

已发布 2021年2月17日

作者 Batu Aytemiz , PhD Student Mikhail Jacob , Researcher Sam Devlin , Principal Researcher Katja Hofmann , Senior Principal Researcher

分享这个页面

In video games, nonplayer characters, bots, and other game agents help bring a digital world and its story to life. They can help make the mission of saving humanity feel urgent, transform every turn of a corner into a gamer’s potential demise, and intensify the rush of driving behind the wheel of a super-fast race car. These agents are meticulously designed and preprogrammed to contribute to an immersive player experience.

Join Us

This work was undertaken during an internship at Microsoft Research Cambridge. If you’re interested in exploring similar real-world challenges and developing actionable user-focused solutions, visit the lab’s internship page for details on internships in deep RL for games and other research areas.

For insights into AI and gaming research, register for the Microsoft AI and Gaming Research Summit 2021 (February 23–24), and for career opportunities in RL, check out the open positions with the Machine Intelligence theme at Microsoft Research Cambridge and other opportunities across Microsoft Research.

Now, what if these same agents could learn to behave in lifelike and interesting ways without a developer having to hardcode every possible natural behavior in each scenario? Imagine agents in an action game learning a variety of offensive strategies to challenge a protagonist or agents in an adventure game learning how to support the player in unlocking information about an unfamiliar environment. Reinforcement learning (RL), in which agents learn how to act when they must sequentially take actions over time, provides a framework for achieving that. Through RL, agents can be trained to devise their own solutions to tasks, transforming the role of game designers from defining behavior to defining tasks and letting the agents learn. Such a shift has the potential to lead to surprising responses, possibly ones a game designer may not have even imagined, helping to create more engaging characters and worlds.

Project Project Paidia: a Microsoft Research & Ninja Theory Collaboration

Reinforcement learning is already showing promising results. For example, we’ve demonstrated agents’ ability to effectively collaborate with each other in the Ninja Theory game Bleeding Edge as part of the Project Paidia research collaboration, which ultimately seeks to enable teamwork between agents and human players (for an RL overview, visit our Project Paidia website and interactive experience). At the same time, many experts feel the use of RL in the commercial game industry is still far below its ultimate potential. The reasons why are numerous, including the need for a certain level of expertise to execute the technology. From our previous research into the experiences of game agent creators, we’ve come to realize that for RL techniques to be used in the game industry, we need to design them with potential users and their existing workflows and requirements in mind. In recent work, we focus on three specific challenges:

exercising authorial control when it comes to specifying the aesthetic style of game agents
balancing multiple design constraints, specifically task completion and behaving in a desired style
developing RL tools and infrastructure that are more meaningful from a designer perspective, allowing designers to make desired changes without formal engineering training

In this work, we establish the first steps toward a designer-centered approach to RL, making it easier for designers to specify an agent’s style through preference learning, automatically, robustly combined reward signals that satisfy varied design constraints, and a contextually meaningful workflow.

We show our results in a navigation task, as navigation is one of the most fundamental agent capabilities. In our experiments, we start with an agent that’s rewarded for getting closer to a goal as quickly as possible—in our case, a blue circle behind two “walls.” This results in the agent learning to take the shortest path to the blue circle, leading the agent to bump into and drag along the walls on its way. In this exploration, we assume the role of a designer aiming for movements more reflective of how a human player might approach the challenge, by taking a more central path.

A video in which an agent, represented by a red circle, moves around two rectangular walls to reach its goal, represented by a smaller blue circle. A black dotted line shows the agent’s path, which runs closely along the walls. — Video 1: In a navigation task like the above, it’s common for agents to drag along the walls, especially if they’re rewarded for getting closer to the goal (the blue circle). We want to tune this agent so that it doesn’t hug the walls as much and behaves in a way that’s more reflective of how a human player might achieve the task, via a more central path.

Preference learning as a method to specify style rewards

RL algorithms learn through a reward function. Unfortunately, it’s very difficult to computationally specify aesthetic style. If we’re building a stealth game, we might want our agents to creep near building edges, but if we’re making a game about cyborg warriors, we would much prefer them charging through a scene. It’s unclear, though, how one might go about writing an RL style reward for being “stealthy” or “boisterous.” Even if it were, the designers who decide and tweak a game’s aesthetic aspects are often separate from the engineers who implement the underlying AI behavior, requiring that designers become proficient enough in RL to adjust the AI codebase to achieve a desired style. Such an expectation is unrealistic in larger teams and impractical with most designer workflows used today.

Video 2: In this human-controlled demonstration (exaggerated to illustrate how an RL agent would behave), the character grinding along the wall while moving to its target in Bleeding Edge looks jarring. It’s not representative of how a human player would likely move, nor does this type of behavior make sense for the aesthetic style of the game. From the perspective of an RL agent, however, there wouldn’t be any problem! The agent would take the quickest path to a given goal. (This video is for demonstration purposes only. Does not represent the agents used in Bleeding Edge or Project Paidia. Not representative of final game gameplay or visuals.)

The traditional method to achieve our goal of reaching the blue circle as quickly as possible while taking a more central path would be to write an extension to the reward function, also known as reward shaping. We can punish the agent for being too close to the walls while still rewarding it for reaching the goal. However, even with extensive experimentation, it’s difficult to attain a behavior that’s exactly what we want, as what we really want isn’t simply “staying away from the walls” but a more nuanced style of movement that’s hard to capture mathematically.

It would be much easier to recognize a desired style as opposed to describing it mathematically. Because of this, we implemented a preference-based learning method (opens in new tab) to allow designers to specify their desired style through a simple user interface—no coding required!

Our proposed method works as follows:

The policy, pretrained on the task reward, interacts with the environment and produces a set of trajectories.
The designer is shown segments of these trajectories, and they pick which segment is closer to their desired style.
A reward network tasked with capturing the style is updated according to the designer’s preferences.
The reward network predicts how much a state exhibits the learned style. This predicted style reward plus the original task reward is used to optimize the original policy.
A new set of trajectories is collected with the updated policy to get new preferences.

Because RL training takes time, we fine-tune an already competent agent, cutting the amount of iteration down drastically compared to training an agent from scratch; all we need to do is add style to it. Further, this makes it easier to apply different styles to the same base agent, allowing AI engineers to train a base model and then designers to fine-tune style preferences.

A video demonstration of the prototype user interface. Two agent trajectories appear side by side. In each, the agent, represented by a blue circle, displays different styles of movement and behavior in making its way to the target, represented by a smaller red circle. At the bottom right, a hypothetical designer, represented by the movement of a cursor, selects their preference from three choices: left, even, and right. — Video 4: Describing an aesthetic style mathematically is difficult. Our prototype user interface allows designers to choose the behavior that is closest to their desired style from a series of trajectories generated by the agent interacting with its environment.

In a feedback efficiency study, we show we can reliably train a successful agent in our given task. For our task, training is only considered successful if the agent completes the task with a high enough style reward (measured by the mean distance from the nearest wall) and an acceptable task reward (an approximate equivalent to the time it takes to reach the goal). We reached our desired behavior in 50 comparisons. We expect the number of required preferences to increase as the complexity of the environment and of the desired style goes up.

Side-by-side videos in which an agent, represented by a red circle, moves around two rectangular walls to reach its goal, represented by a smaller blue circle. On the left, a black dotted line shows a path that is mostly far from the wall. On the right, a black dotted line shows a style of behavior that starts far from the wall initially and becomes progressively more centered. — Video 5: In our navigation task, the agent on the left was fine-tuned to maximize distance to the walls, while the agent on the right was fine-tuned using preference learning. Through preference learning, we were able to achieve more nuanced behavior that better captured our intended style.

Potential-based shaping to combine style and task rewards

A screenshot with the equation total_reward = A * style_reward + B * task_reward. — The above represents the common workflow when combining multiple sources of reward. The RL agent is trying to optimize the total reward. It’s up to the designer to decide on the correct ratio of style reward to task reward by specifying the weight of each, A and B here. This workflow requires a lot of iteration and offers little control.

Specifying the reward that captures the desired style is only the first step; the style reward must then be integrated into the agent’s preexisting task reward. This is by no means a trivial undertaking. If the ratio of the style reward to the task reward is too high, the style reward overwhelms the task reward and navigation performance suffers. If the ratio is too low, then there’s no observable behavior change. The default approach to solving this problem is to iterate—tweak the ratio slightly and run another experiment. Since each RL training run can take hours, trying to tune the style and task reward ratios manually is laborious and mind-numbingly boring.

In the video, the agent demonstrates — Video 6: In the above video, the agent demonstrates “reward hacking” by moving to the point farthest away from the walls instead of reaching the goal. This happens when the combination of task and style rewards is misconfigured in a way that simply maximizing the style reward to avoid walls and ignoring the task reward to reach the goal yields the highest reward.

When we first tried to fine-tune our agent with our new style, it failed to be successful in the initial task, as it gave too much weight to exhibiting the style! We used potential-based reward shaping (PBRS) (opens in new tab) to solve this problem. PBRS ensures that when a shaping reward is introduced—that is, a reward encouraging behavior other than the initial task, like our style reward—the optimal policy for the initial task remains the same.

PBRS is a simple yet powerful technique where, at each step, we subtract the previous step’s style reward from the total reward (task reward plus style reward) of the current step. This means the agent is rewarded for being at a certain state and not moving into a certain state. The intuition behind PBRS can be expressed with the following example: imagine an agent is rewarded for crossing the finish line in a race. After crossing the finish line initially, the agent might be encouraged to step back over the finish line and forward again multiple times, effectively gaining infinite rewards. However, with PBRS, we only reward the agent for being at the finish line, not crossing it: whenever the agent steps back, we take away the reward we had given it, preventing it from accruing more reward by simply crossing back and forth.

While the distinction is subtle, this prevents the agent from “reward hacking” to maximize rewards without accomplishing the initial task. When we started using PBRS to integrate our style reward into the task reward, the training was more successful in both keeping the original task rewards high and displaying the desired style. This alternative to the time-consuming task of fine-tuning the ratio of style reward to task reward manually means designers can explore many more style variations instead of spending their resources getting a single style to work properly.

chart, bar chart, histogram — Figure 1: Potential-based reward shaping (PBRS) makes integrating the style reward into the task reward easier, preventing the style reward from degrading task performance even at high style-reward-to-task-reward ratios. The above graph shows the percentage of successful trials with a particular reward ratio. In our navigation task, a successful trial was defined as the agent completing the task with a high enough style reward, measured by the mean distance from the nearest wall, and an acceptable task reward, an approximate equivalent to the time it takes to reach the goal. As shown, when PBRS isn’t used, the agent is only successful with a 0.1 style-reward-to-task-reward ratio. When PBRS is used, the agent is successful at ratios between 0.5 and 100.

Automatic reward ratio adjustment to increase designer control

Even though using PBRS made finding an acceptable ratio of style reward to task reward much easier, we’re still asking designers to fine-tune a behavior by changing an arbitrary numeric ratio. There’s no designer-interpretable meaning to “combining one parts task reward with four parts style reward.”

Rewards, especially when designed intentionally, can be meaningful from a design perspective. For example, if the agent is penalized by 1 point every second, we can see how fast the agent reaches the goal by simply looking at the final reward. A –15 task reward means the agent took 15 seconds to reach the goal. In scenarios where it’s possible to provide similar types of meaningful rewards, it would be much more efficient for a designer to specify a minimum performance that’s acceptable—a lower bound reward threshold—versus tweaking arbitrary numeric ratios.

Toward this end, we implemented an automatic reward ratio schedule that tries to maximize the style reward while respecting the designer-specified threshold. The automated scheduler increases the ratio of style reward to task reward while the task reward is higher than the designer-specified threshold and reduces the ratio when the task performance starts to degrade. To be more specific, we linearly scale the style reward ratio between a maximum number and 0—between the starting performance and the threshold performance. In the above example, if a designer wanted their agent to reach the goal in at most 15 seconds, the automated scheduler would increase the ratio of style reward to task reward until the agent started taking longer than the specified 15 seconds. At that point, the scheduler would then reduce the style reward weight until a performance time of 15 seconds was achieved. This automated schedule would continue over the course of training.

The figures demonstrate how designer-specified minimum task reward thresholds affect training. Figure 2a (top) plots the task rewards over total timesteps (length of an experiment) under five different minimum reward thresholds, from 150 to 110. The graph shows that the training starts from the same initial task reward of 150 but goes down to the minimum task threshold for each of the experiments. The automatic reward ratio scheduler is effective in keeping the task reward above the specified task threshold. Figure 2b (bottom) plots the mean distance from the walls (our proxy for the style reward) over total timesteps under the same five reward thresholds. The mean distance from the wall increases as the threshold is reduced from 150 to 110. Most notably, the agent fails to move away from the walls with a reward threshold of 150 since it has no slack to sacrifice task reward. — Figures 2a and 2b: The above figures demonstrate how designer-specified minimum task reward thresholds affect training. Figure 2a (top) plots the task rewards over total timesteps (length of an experiment) under five different minimum reward thresholds, from 150 to 110. The graph shows that the training starts from the same initial task reward of 150 but goes down to the minimum task threshold for each of the experiments. The automatic reward ratio scheduler is effective in keeping the task reward above the specified task threshold. Figure 2b (bottom) plots the mean distance from the walls (our proxy for the style reward) over total timesteps under the same five reward thresholds. The mean distance from the wall increases as the threshold is reduced from 150 to 110. Most notably, the agent fails to move away from the walls with a reward threshold of 150 since it has no slack to sacrifice task reward.

Figure 2a shows the task reward with different designer-specified minimum task reward thresholds. The automated reward ratio adjustment is effective in keeping the task reward above the specified performance threshold.

Figure 2b shows the mean distance to the closest wall, a simple approximation of our target style, under different reward thresholds. When the minimum task reward threshold is very high (150), the change in behavior is rather small, as the agent is prioritizing the task reward over exhibiting the style. However, as the designer relaxes the constraints, more style behavior emerges.

This method of specifying a desired reward is much more meaningful than iteratively changing the numeric ratio to hit the desired target. We believe this workflow simplifies the job of the designer immensely.

Open questions and continued collaboration

While these results are encouraging, there are several open research questions. First, we need to validate our findings with a user study. While the high-level workflow is established, there’s more to learn regarding the specifics. In the context of our proposed solutions, do designers continuously monitor the training, or do they give feedback in batches? What information is shown to the designers for them to make accurate choices?

Another open question is exploring different methods of specifying a style. While preferences are useful, there are many other methods we can employ. Designers can demonstrate the desired style by taking control of the agent, or they can annotate individual states to guide the fine-tuning. It’s unclear which of these methods (or what combination) offers the most control to the designers.

Event Reinforcement learning in Minecraft: Challenges and opportunities in multiplayer games

The journey toward RL that can be easily and organically incorporated into commercial game design is a long one. We feel taking a designer-centered approach, as demonstrated by the prototypes above, offers a promising means to achieving that goal, and we look forward to continuing to work with professionals in the game industry to deliver practical and empowering solutions.