Research Focus: Week of October 28, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

FLASH: A Workflow Automation Agent for Diagnosing Recurring Incidents

Cloud incidents such as unplanned interruptions or performance degradation can reduce customer satisfaction and revenue. Recurring incidents, typically raised by system monitors, allow for timely resolution, but also demand significant human effort for troubleshooting. Automating the diagnosis of recurring incidents would help minimize service downtime, reduce customer impact, and decrease manual labor.

In a recent paper: FLASH: A Workflow Automation Agent for Diagnosing Recurring Incidents, researchers from Microsoft present an approach that significantly improves diagnostic accuracy. LLM-based agent approaches have proven effective in handling complex tasks requiring multiple logical steps, but they still have reliability issues because they lack specific diagnostic knowledge. FLASH incorporates status supervision to break down complex instructions into manageable pieces aligned with the identified status. The researchers use LLMs to generate hindsight from past failures, progressively enhancing diagnostic reliability for subsequent incidents. An extensive study of over 250 production incidents from Microsoft in five different workflow automation scenarios shows that the FLASH agent approach outperforms state-of-the-art agent models by an average of 13.2% in terms of accuracy. This underscores the viability of automating the diagnostic process for recurring incidents.


METAREFLECTION: Learning Instructions for Language Agents using Past Reflections

Language agents are AI systems that can understand, reason and respond in natural language to complete various tasks. While the latest LLMs are capable enough to power reasonably good language agents, the closed-API model makes it hard to improve them when they perform sub-optimally. Recent studies have explored using techniques like self-reflection and prompt optimization to improve performance. Unfortunately, self-reflection can be used only during the agent’s current run, while contemporary prompt optimization techniques are designed and tested to work on simple single-step agents.

In a recent paper: METAREFLECTION: Learning Instructions for Language Agents using Past Reflections, researchers from Microsoft introduce a novel offline reinforcement learning technique that enhances the performance of language agents by augmenting a semantic memory based on experiential learnings from past trials. They demonstrate the efficacy of METAREFLECTION across multiple domains and different agent designs, including complex logical reasoning, biomedical semantic similarity, open world question answering, and vulnerability threat detection in Infrastructure-as-Code. METAREFLECTION boosts language agents’ performance by 4% to 16.82% over the baseline agent implementations and performs on par with existing state-of-the-art prompt optimization techniques while requiring fewer LLM calls.

Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

Generative AI applications rely on large foundation models, particularly LLMs. LLMs often have tens to hundreds of billions of parameters, making them too large for a single graphics processing unit (GPU) to handle in terms of both memory and computation. Because of their size, training these models requires distributing the workload across hundreds or even thousands of GPUs. This can lead to significant communication overhead, a challenge that arises when data needs to be shared between different GPUs.

In a recent paper: Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping, researchers from Microsoft introduce a system designed to enhance the efficiency of LLM training by reducing the time lost to communication between GPUs. 

Domino breaks down data dependencies in a single batch of training into smaller, independent pieces. These smaller pieces are processed in parallel, and communication between GPUs happens simultaneously with computation, minimizing delays. 
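
The overlap pattern described above can be illustrated with a toy sketch. This is not Domino's implementation: `compute`, `communicate`, and `run_overlapped` are hypothetical stand-ins, plain Python lists replace tensors, and a single background thread plays the role of a communication stream. It only shows the scheduling idea, in which the communication for one slice is launched asynchronously while the next slice is being computed.

```python
from concurrent.futures import ThreadPoolExecutor


def compute(chunk):
    # Hypothetical stand-in for the per-slice computation of a layer.
    return [x * 2 for x in chunk]


def communicate(partial):
    # Hypothetical stand-in for a collective such as all-reduce;
    # here it simply sums the values of one slice.
    return sum(partial)


def run_sequential(batch):
    # Baseline: compute the whole batch, then communicate once.
    return communicate(compute(batch))


def run_overlapped(batch, num_chunks):
    # Slice the batch into independent pieces.
    size = max(1, (len(batch) + num_chunks - 1) // num_chunks)
    chunks = [batch[i:i + size] for i in range(0, len(batch), size)]

    pending = []
    # One worker thread acts as the "communication stream".
    with ThreadPoolExecutor(max_workers=1) as comm_stream:
        for chunk in chunks:
            partial = compute(chunk)
            # Communication for this slice is submitted without waiting,
            # so the loop moves on to computing the next slice.
            pending.append(comm_stream.submit(communicate, partial))
        # Collect all communication results at the end.
        return sum(f.result() for f in pending)


batch = list(range(8))
assert run_overlapped(batch, num_chunks=4) == run_sequential(batch)
```

Because the slices are independent, the sliced schedule produces the same result as the unsliced one. In pure Python the threads do not give real speedups for CPU-bound work; on GPUs the equivalent overlap relies on running communication and computation kernels concurrently.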

Test results comparing Domino to Megatron-LM show that Domino speeds up the training process by up to 1.3x on Nvidia DGX-H100 GPUs. 


Improving Steering and Verification in AI-Assisted Data Analysis with Interactive Task Decomposition

Data science involves large datasets, source code, domain expertise, and unwritten assumptions. Data scientists describe the need to “have a conversation” with their data to extract information from it. The natural language processing and code generation capabilities of large language models (LLMs) could help tackle the challenging task of data analysis, which requires expertise in data processing, programming, and statistics. AI chat interfaces for data analysis have grown in popularity. However, in a recent paper: Improving Steering and Verification in AI-Assisted Data Analysis with Interactive Task Decomposition, researchers from Microsoft and the University of Toronto show serious challenges in verifying AI-generated results and guiding AI systems to produce the desired output.

The researchers developed two contrasting approaches to address these challenges. The first (Stepwise) decomposes the problem into step-by-step subgoals with pairs of editable assumptions and code until task completion. The second approach (Phasewise) decomposes the entire problem into three editable, logical phases: structured input/output assumptions, execution plan, and code. A controlled, within-subjects experiment compared these systems against a conversational baseline. Users reported significantly greater control with the Stepwise and Phasewise systems, and found intervention, correction, and verification easier, compared to the baseline. The results suggest design guidelines and trade-offs for AI-assisted data analysis tools. 
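
One way to picture the difference between the two decompositions is as data structures. The sketch below is an assumption made for illustration, not the paper's design: the class and field names (`Step`, `StepwisePlan`, `PhasewisePlan`, `io_assumptions`, and so on) are hypothetical, and the example task is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One Stepwise subgoal: editable assumptions paired with code."""
    assumptions: list[str]
    code: str


@dataclass
class StepwisePlan:
    """A task built up one step at a time until completion."""
    steps: list[Step] = field(default_factory=list)

    def script(self) -> str:
        # Concatenate the code of all steps accepted so far.
        return "\n".join(step.code for step in self.steps)


@dataclass
class PhasewisePlan:
    """The whole task as three editable phases."""
    io_assumptions: list[str]   # structured input/output assumptions
    execution_plan: list[str]   # ordered description of what will be done
    code: str                   # code generated from the two phases above


# Stepwise: the user can edit each step's assumptions before moving on.
stepwise = StepwisePlan()
stepwise.steps.append(Step(["column 'price' is numeric"], "df = df.dropna()"))
stepwise.steps.append(Step(["mean is the desired summary"], "result = df['price'].mean()"))

# Phasewise: the user reviews the full plan before any code runs.
phasewise = PhasewisePlan(
    io_assumptions=["input: table with a numeric 'price' column", "output: one number"],
    execution_plan=["drop rows with missing values", "average the 'price' column"],
    code=stepwise.script(),
)
```

In this reading, Stepwise interleaves verification with generation, while Phasewise places all assumptions and the plan up front so they can be corrected before the code is produced.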


OmniParser for pure vision-based GUI agent

Large vision-language models (VLMs) such as GPT-4V and GPT-4o show promise in driving intelligent agent systems that operate within user interfaces (UI). However, VLMs’ full potential remains underexplored in real-world applications, particularly when it comes to acting as general agents across diverse operating systems and applications with only vision input. One limiting factor is the absence of a robust screen parsing technique that can 1) reliably identify interactable icons within the user interface, and 2) understand the semantics of various elements in a screenshot and accurately associate the intended action with the corresponding region on the screen.

In a recent article: OmniParser for pure vision-based GUI agent, researchers from Microsoft present a compact screen parsing module that can convert UI screenshots into structured elements. OmniParser can be used with a variety of models to create agents capable of taking actions on UIs. When used with GPT-4V, OmniParser significantly improves the agent’s ability to generate precisely grounded actions for interface regions.
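
To make "structured elements" and "grounded actions" concrete, here is a minimal sketch under stated assumptions. It does not reflect OmniParser's actual output format: the `UIElement` fields, the `click_point` helper, and the sample elements are all hypothetical. The idea shown is that once a screenshot is parsed into labeled regions, a model only needs to name an element, and the action's screen coordinates follow from that element's bounding box.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    element_id: int
    label: str                        # hypothetical description of the element
    box: tuple[int, int, int, int]    # (left, top, right, bottom) in pixels


def click_point(elements, chosen_id):
    """Return the center of the chosen element's bounding box."""
    for el in elements:
        if el.element_id == chosen_id:
            left, top, right, bottom = el.box
            return ((left + right) // 2, (top + bottom) // 2)
    raise KeyError(f"no element with id {chosen_id}")


# Hypothetical parse result for one screenshot.
elements = [
    UIElement(0, "Search box", (100, 20, 500, 60)),
    UIElement(1, "Settings icon", (520, 20, 560, 60)),
]

# Suppose the model answers "element 1" for the instruction "open settings".
print(click_point(elements, 1))  # (540, 40)
```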

An agent combining OmniParser with GPT-4V achieved the best performance on the recently released WindowsAgentArena benchmark.

Microsoft Research in the news

AI Dreams: Microsoft @ 50, Chapter 1 

GeekWire | October 16, 2024

Since the early 1990s, the promise of AI has been a driving force at Microsoft Research, which has a track record of breakthroughs in speech recognition, computer vision, machine learning, and other research that continues to advance the state of the art in AI. 

Podcast: What's next for AI, with Peter Lee 

GeekWire | October 19, 2024

The weekly GeekWire Podcast features comments from Microsoft Research President Peter Lee on what’s next in AI, including the top three technical challenges. It’s a bonus feature that came from the AI Dreams: Microsoft @ 50 series. Peter’s comments begin at 29:20.

Microsoft’s Differential Transformer cancels attention noise in LLMs 

VentureBeat | October 16, 2024

Improving LLMs’ ability to retrieve in-prompt information can impact important applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). Microsoft Research and Tsinghua University researchers have introduced Differential Transformer (Diff Transformer), a new LLM architecture that amplifies attention to relevant context while filtering out noise, outperforming the classic Transformer architecture in various settings.

AI-powered productivity tools that can make life harder 

Financial Times | October 22, 2024

Technology used to summarize notes or generate transcripts does not always work for deaf employees. The problem is compounded by a historic lack of input from disabled people into AI products, even some that are marketed as assistive technologies.

Prompts are Programs 

ACM SIGPLAN Blog | October 22, 2024

The challenges and effective strategies for creating robust prompts are not well understood and will evolve as rapidly as the underlying LLM models and systems evolve. The programming languages and software engineering communities must be agile and eager to bring the decades of research and experience building languages and tools for robust software development to this new and important domain.

Edge 440: Interested in AI Evaluation? Meet Microsoft's EUREKA 

TheSequence | October 17, 2024

This podcast explores EUREKA, a reusable, open evaluation framework designed to standardize evaluations of large foundation models (LFMs). The framework goes beyond single-score reporting and rankings to offer a more comprehensive analysis of LFM capabilities.
