
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks


By Adam Fourney, Principal Researcher; Gagan Bansal, Senior Researcher; Hussein Mozannar, Senior Researcher; Victor Dibia, Principal Research Software Engineer; Saleema Amershi, Partner Research Manager

Contributors: Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang (Eric) Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, Saleema Amershi

An illustrated workflow of Magentic-One completing a complex task from the GAIA agentic benchmark. Given a task that asks it to extract a Python script from an attached image, run the script against an array of strings, fetch the C++ source code at the URL the script outputs, and compile, run, and sum integers from that program's output, the Orchestrator creates a dynamic, task-specific plan and hands off the steps in turn: the FileSurfer extracts the code from the image, the Coder analyzes the Python code, the ComputerTerminal executes it and outputs a URL, the WebSurfer navigates to the URL and extracts the C++ code, the Coder analyzes and the ComputerTerminal executes that code, and the Orchestrator determines the task is complete and outputs the final result.
We are introducing Magentic-One, our new generalist multi-agent system for solving open-ended web and file-based tasks across a variety of domains. Magentic-One represents a significant step towards developing agents that can complete tasks that people encounter in their work and personal lives. We are also releasing an open-source implementation of Magentic-One on Microsoft AutoGen, our popular open-source framework for developing multi-agent applications.

The future of AI is agentic. AI systems are evolving from having conversations to getting things done—this is where we expect much of AI’s value to shine. It’s the difference between generative AI recommending dinner options and agentic assistants that can autonomously place your order and arrange delivery. It’s the shift from summarizing research papers to actively searching for and organizing relevant studies in a comprehensive literature review.

Modern AI agents, capable of perceiving, reasoning, and acting on our behalf, are demonstrating remarkable performance in areas such as software engineering, data analysis, scientific research, and web navigation. Still, to fully realize the long-held vision of agentic systems that can enhance our productivity and transform our lives, we need advances in generalist agentic systems. These systems must reliably complete complex, multi-step tasks across a wide range of scenarios people encounter in their daily lives.

Introducing Magentic-One, a high-performing generalist agentic system designed to solve such tasks. Magentic-One employs a multi-agent architecture where a lead agent, the Orchestrator, directs four other agents to solve tasks. The Orchestrator plans, tracks progress, and re-plans to recover from errors, while directing specialized agents to perform tasks like operating a web browser, navigating local files, or writing and executing Python code.

Magentic-One achieves performance statistically comparable to the state of the art on multiple challenging agentic benchmarks, without requiring modifications to its core capabilities or architecture. Built on AutoGen, our popular open-source multi-agent framework, Magentic-One’s modular, multi-agent design offers numerous advantages over monolithic single-agent systems. By encapsulating distinct skills in separate agents, it simplifies development and reuse, similar to object-oriented programming. Magentic-One’s plug-and-play design further supports easy adaptation and extensibility by enabling agents to be added or removed without reworking the entire system—unlike single-agent systems, which often struggle with inflexible workflows.
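
To make the encapsulation idea concrete, here is a minimal sketch of distinct skills living behind a common agent interface; all names here are hypothetical illustrations, not the actual AutoGen API:

```python
from abc import ABC, abstractmethod

class Agent(ABC):
    """Hypothetical common interface: each distinct skill is one agent."""
    name: str

    @abstractmethod
    def act(self, request: str) -> str:
        """Carry out a subtask and report the resulting state."""

class WebSurfer(Agent):
    name = "WebSurfer"
    def act(self, request: str) -> str:
        return f"[web] handled: {request}"   # would browse, click, type

class Coder(Agent):
    name = "Coder"
    def act(self, request: str) -> str:
        return f"[code] handled: {request}"  # would write or analyze code

# Plug-and-play: the team is just a registry, so adding or removing an
# agent does not require reworking the rest of the system.
team = {agent.name: agent for agent in (WebSurfer(), Coder())}
```

Because the lead agent depends only on the shared `act` interface, agents can be swapped in and out without touching the orchestration logic.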

We’re making Magentic-One open-source for researchers and developers. While Magentic-One shows strong generalist capabilities, it’s still far from human-level performance and can make mistakes. Moreover, as agentic systems grow more powerful, their risks—like taking undesirable actions or enabling malicious use cases—can also increase. While we’re still in the early days of modern agentic AI, we’re inviting the community to help tackle these open challenges and ensure our future agentic systems are both helpful and safe. To this end, we’re also releasing AutoGenBench, an agentic evaluation tool with built-in controls for repetition and isolation to rigorously test agentic benchmarks and tasks while minimizing undesirable side effects.

How it works

A diagram of Magentic-One’s multi-agent architecture. Inside the Orchestrator, an outer loop maintains the task ledger (facts, guesses, and the current plan) and an inner loop maintains the progress ledger (current task progress and per-agent assignments), checking in turn whether the task is complete, whether progress is being made, and whether the stall count exceeds 2 (in which case control returns to the outer loop to update the task ledger before the agents try again). The Orchestrator hands off control to the other agents on the team: the Coder (writes code and reasons to solve tasks), the ComputerTerminal (executes code written by the Coder), the WebSurfer (browses the internet: navigating pages, filling forms, etc.), and the FileSurfer (navigates files, e.g., PDFs, pptx, WAV, etc.).
Magentic-One features an Orchestrator agent that implements two loops: an outer loop and an inner loop. The outer loop (lighter background with solid arrows) manages the task ledger (containing facts, guesses, and plan) and the inner loop (darker background with dotted arrows) manages the progress ledger (containing current progress, task assignment to agents).

Magentic-One is based on a multi-agent architecture where a lead Orchestrator agent is responsible for high-level planning, directing other agents, and tracking task progress. The Orchestrator begins by creating a plan to tackle the task, gathering needed facts and educated guesses in a Task Ledger that it maintains. At each step of its plan, the Orchestrator creates a Progress Ledger where it self-reflects on task progress and checks whether the task is completed. If the task is not yet completed, it assigns one of Magentic-One’s other agents a subtask to complete. After the assigned agent completes its subtask, the Orchestrator updates the Progress Ledger and continues in this way until the task is complete. If the Orchestrator finds that progress is not being made for enough steps, it can update the Task Ledger and create a new plan. As illustrated in the figure above, the Orchestrator’s work is thus divided into an outer loop, where it updates the Task Ledger, and an inner loop, where it updates the Progress Ledger.
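
As a rough illustration of this control flow, consider the minimal sketch below. All names are hypothetical stand-ins; in the actual system, the planning, reflection, and assignment steps are LLM calls rather than the stubs shown here:

```python
def run_task(task: str, agents: dict, llm) -> str:
    """Sketch of the Orchestrator's two nested loops (hypothetical API)."""
    task_ledger = llm.gather_facts_and_guesses(task)
    while True:                                    # outer loop
        plan = llm.make_plan(task, task_ledger)
        stall_count = 0
        while True:                                # inner loop
            progress = llm.reflect_on_progress(task, plan)
            if progress.task_complete:
                return progress.final_answer
            if not progress.making_progress:
                stall_count += 1
                if stall_count > 2:                # too many stalls:
                    task_ledger = llm.update_facts_and_guesses(task_ledger)
                    break                          # re-plan in the outer loop
            agent_name, subtask = progress.next_assignment
            agents[agent_name].act(subtask)        # hand off to an agent
```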

Magentic-One consists of the following agents:

  • Orchestrator: The lead agent responsible for task decomposition, planning, directing other agents in executing subtasks, tracking overall progress, and taking corrective actions as needed.
  • WebSurfer: An LLM-based agent proficient in commanding and managing the state of a Chromium-based web browser. For each request, the WebSurfer performs actions such as navigation (e.g., visiting URLs, performing searches), interacting with webpages (e.g., clicking, typing), and reading actions (e.g., summarizing, answering questions). It then reports on the new state of the webpage. The WebSurfer relies on the browser’s accessibility tree and set-of-marks prompting to perform its tasks (sketched after this list).
  • FileSurfer: An LLM-based agent that commands a markdown-based file preview application to read local files. It can also perform common navigation tasks such as listing directory contents and navigating through them.
  • Coder: An LLM-based agent specialized in writing code, analyzing information collected from the other agents, and creating new artifacts.
  • ComputerTerminal: Provides access to a console shell for executing programs and installing new libraries.
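
The set-of-marks prompting mentioned in the WebSurfer description can be pictured roughly as follows: interactive elements found in the accessibility tree are assigned numeric marks, and the model is asked to answer with a mark number. This is a simplified illustration with hypothetical names and a flattened tree, not the WebSurfer’s actual implementation:

```python
def build_set_of_marks_prompt(accessibility_tree: list[dict]) -> tuple[str, dict]:
    """Assign a numeric mark to each interactive element and build a prompt.

    `accessibility_tree` is assumed to be a flat list of dicts like
    {"role": "button", "name": "Search"}; real trees are nested and richer,
    and the marks are also drawn onto a screenshot for the multimodal model.
    """
    interactive = [n for n in accessibility_tree
                   if n["role"] in ("button", "link", "textbox")]
    marks = dict(enumerate(interactive))
    listing = "\n".join(f"[{i}] {n['role']}: {n['name']}"
                        for i, n in marks.items())
    prompt = (
        "The page contains these marked interactive elements:\n"
        f"{listing}\n"
        "Reply with the number of the element to act on."
    )
    return prompt, marks
```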

Together, Magentic-One’s agents equip the Orchestrator with the tools and capabilities it needs to solve a wide range of open-ended problems and autonomously adapt to, and act in, dynamic and ever-changing web and file-system environments.

While the default multimodal LLM used for all agents is GPT-4o, Magentic-One is model-agnostic, allowing the integration of heterogeneous models to support different capabilities or meet different cost requirements. For example, different agents can be powered by different LLMs, SLMs, or specialized fine-tuned versions. For the Orchestrator, we recommend a strong reasoning model, like GPT-4o. In a different configuration, we also experimented with using OpenAI o1-preview for the Orchestrator’s outer loop and for the Coder, while other agents continued to use GPT-4o.
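
That heterogeneous configuration can be summarized as a simple mapping, sketched below. The dictionary is illustrative, not the actual configuration API; note that the ComputerTerminal executes code and needs no model:

```python
# Hypothetical per-component model assignment mirroring the
# configuration described above.
model_for_component = {
    "Orchestrator (outer loop)": "o1-preview",  # strong reasoning for planning
    "Orchestrator (inner loop)": "gpt-4o",
    "Coder": "o1-preview",
    "WebSurfer": "gpt-4o",   # multimodal, for screenshots and set-of-marks
    "FileSurfer": "gpt-4o",
}
```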

Evaluation

To rigorously evaluate Magentic-One’s performance, we introduce AutoGenBench, an open-source standalone tool for running agentic benchmarks that allows repetition and isolation, e.g., to control for the variance of stochastic LLM calls and the side effects of agents taking actions in the world. AutoGenBench facilitates agentic evaluation and allows adding new benchmarks. Using AutoGenBench, we can evaluate Magentic-One on a variety of benchmarks. Our criterion for selecting benchmarks is that they should involve complex multi-step tasks, with at least some steps requiring planning and tool use, including using web browsers to act on real or simulated webpages. We consider three benchmarks in this work that satisfy this criterion: GAIA, AssistantBench, and WebArena.
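
Conceptually, the repetition and isolation controls look like the sketch below: each task is scored several times, with each run executed in a fresh sandbox so side effects cannot leak between runs. This illustrates the idea only; `run_task_isolated` is a hypothetical callable, not AutoGenBench’s actual interface:

```python
import statistics

def evaluate(run_task_isolated, task_id: str, repeats: int = 3):
    """Repetition: score one task several times to expose LLM variance.

    `run_task_isolated` executes a single run in a fresh sandbox (e.g., a
    throwaway Docker container) and returns True if the task succeeded.
    """
    scores = [1.0 if run_task_isolated(task_id) else 0.0
              for _ in range(repeats)]
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if repeats > 1 else 0.0
    return mean, spread  # aggregated over repeated, isolated runs
```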

In the figure below, we show the performance of Magentic-One on the three benchmarks, compared with GPT-4 operating on its own, with each benchmark’s highest-performing open-source baseline, and with its highest-performing non-open-source, benchmark-specific baseline, according to the public leaderboards as of October 21, 2024. Magentic-One (GPT-4o, o1) achieves statistically comparable performance to previous SOTA methods on both GAIA and AssistantBench, and competitive performance on WebArena. Note that GAIA and AssistantBench have hidden test sets while WebArena does not, and thus WebArena results are self-reported. Together, these results establish Magentic-One as a strong generalist agentic system for completing complex tasks.

A bar chart of accuracy (%) on the GAIA, AssistantBench, and WebArena benchmarks for, in order: GPT-4, the benchmark-specific non-open-source SOTA, the benchmark-specific open-source SOTA, Magentic-One (GPT-4o), Magentic-One (GPT-4o, o1-preview), and human performance. GPT-4 performs worst on all benchmarks (around 7%, 16%, and 15%, respectively), while human-level performance (available only for GAIA and WebArena) reaches around 92% and 78%, respectively. Magentic-One performs comparably to the SOTA solutions on all benchmarks, aside from the benchmark-specific non-open-source SOTA on WebArena, which is marked with an asterisk because those solutions provide no documentation or implementation for the community.
Evaluation results of Magentic-One on the GAIA, AssistantBench, and WebArena benchmarks. Error bars indicate 95% confidence intervals. Note that WebArena results are self-reported.

Risks and mitigations

Agentic systems like Magentic-One mark a significant shift in both the opportunities and risks associated with AI. Magentic-One interacts with a digital world designed for humans, taking actions that can change states and potentially lead to irreversible consequences. These inherent and undeniable risks were evident during our testing, where several emerging issues surfaced. For example, during development, a misconfiguration led agents to repeatedly attempt and fail to log into a WebArena website. This resulted in the account being temporarily suspended. The agents then tried to reset the account’s password. Even more concerning were cases in which agents, until explicitly stopped, attempted to recruit human assistance by posting on social media, emailing textbook authors, or even drafting a freedom of information request to a government entity. In each case, the agents were unsuccessful due to a lack of the required tools or accounts, or because human observers intervened.

Aligned with the Microsoft AI principles and Responsible AI practices, we worked to identify, measure, and mitigate potential risks before deploying Magentic-One. Specifically, we conducted red-teaming exercises to assess risks related to harmful content, jailbreaks, and prompt injection attacks, finding no increased risk from our design. Additionally, we provide cautionary notices and guidance for using Magentic-One safely, including examples and appropriate default settings. Users are advised to keep humans in the loop for monitoring, and ensure that all code execution examples, evaluations, and benchmarking tools are run in sandboxed Docker containers to minimize risks.
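
For example, agent-written code can be confined along the lines of the sketch below. The `sandbox-image` name is a hypothetical placeholder, while the Docker flags shown (`--network none`, `--memory`, `--read-only`) are standard isolation controls:

```python
import subprocess

def run_untrusted(code_path: str) -> str:
    """Execute agent-written code inside a locked-down Docker container."""
    result = subprocess.run(
        ["docker", "run", "--rm",
         "--network", "none",   # no network access from inside the sandbox
         "--memory", "512m",    # cap resource consumption
         "--read-only",         # immutable root filesystem
         "-v", f"{code_path}:/work/script.py:ro",
         "sandbox-image",       # hypothetical image with Python installed
         "python", "/work/script.py"],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout
```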

Recommendations and looking forward

We recommend using Magentic-One with models that are strongly aligned, with pre- and post-generation filtering, and with close monitoring of logs during and after execution. In our own use, we follow the principles of least privilege and maximum oversight. Minimizing the risks associated with agentic AI will require new ideas and extensive research, as much work is still needed to understand these emerging risks and develop effective mitigations. We are committed to sharing our learnings with the community and evolving Magentic-One in line with the latest safety research.

As we look ahead, there are valuable opportunities to improve agentic AI, particularly in safety and Responsible AI research. Agents acting on the public web may be vulnerable to phishing, social engineering, and misinformation threats, much like human users. To counter these risks, an important direction is to equip agents with the ability to assess the reversibility of their actions—distinguishing between those that are easily reversible, those that require effort, and those that are irreversible. Actions like deleting files, sending emails, or filing forms are often difficult or impossible to undo. Systems should therefore be designed to pause and seek human input before proceeding with such high-risk actions.
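
One way to picture that direction is a gate in front of every action, as in the minimal sketch below. The three categories come from the discussion above; everything else (the `classify` and `ask_human` callables, the action object) is a hypothetical placeholder:

```python
from enum import Enum

class Reversibility(Enum):
    EASY = "easily reversible"         # e.g., opening a webpage
    COSTLY = "reversible with effort"  # e.g., creating files to clean up later
    IRREVERSIBLE = "irreversible"      # e.g., sending email, deleting files

def execute_with_gate(action, classify, ask_human) -> bool:
    """Pause and seek human input before any hard-to-undo action.

    `classify` maps an action to a Reversibility level; `ask_human`
    prompts a human observer and returns True if they approve.
    """
    level = classify(action)
    if level is not Reversibility.EASY and not ask_human(action, level):
        return False   # human declined: skip the action entirely
    action.run()       # hypothetical: actually perform the action
    return True
```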

We invite the community to collaborate with us in ensuring that future agentic systems are both helpful and safe.

For further information, results, and discussion, please see our technical report.