Research Focus: Week of April 15, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Appropriate reliance on Generative AI: Research synthesis

Appropriate reliance on AI happens when people accept correct AI outputs and reject incorrect ones. It requires users of AI systems to know when to trust the AI and when to trust themselves. But fostering appropriate reliance comes with new complexities when generative AI (genAI) systems are involved. Though their capabilities are advancing, genAI systems, which use generative models to produce content such as text, music, images, and videos, have limitations as well. Inappropriate reliance – either under-reliance or overreliance – on genAI can have negative consequences, such as poor task performance and even product abandonment.  

In a recent paper: Appropriate reliance on Generative AI: Research synthesis, researchers from Microsoft review 50 papers from various disciplines. They provide an overview of the factors that affect overreliance on genAI, the effectiveness of different strategies for mitigating it, and potential design strategies to facilitate appropriate reliance on genAI.
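To make the definition of appropriate reliance concrete, here is a minimal Python sketch that labels individual user decisions. It is an illustration of the definition given above, not a method from the paper; the function names and the sample decisions are hypothetical.

```python
from collections import Counter


def classify_reliance(ai_correct: bool, user_accepted: bool) -> str:
    """Label one decision using the definition of appropriate reliance.

    Hypothetical helper for illustration only.
    """
    if user_accepted and not ai_correct:
        return "overreliance"      # accepted an incorrect AI output
    if ai_correct and not user_accepted:
        return "under-reliance"    # rejected a correct AI output
    return "appropriate"           # accepted correct or rejected incorrect


def reliance_summary(decisions):
    """Return the share of each label over (ai_correct, user_accepted) pairs."""
    if not decisions:
        raise ValueError("decisions must not be empty")
    counts = Counter(classify_reliance(c, a) for c, a in decisions)
    total = sum(counts.values())
    labels = ("appropriate", "overreliance", "under-reliance")
    return {label: counts[label] / total for label in labels}


# Invented sample data: each pair is (ai_correct, user_accepted).
sample = [(True, True), (False, True), (True, False), (False, False)]
print(reliance_summary(sample))
```

In practice, whether an AI output counts as "correct" is often itself a judgment call for generative tasks, so a real study would need a more careful outcome measure than a single boolean.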


Characterizing Power Management Opportunities for LLMs in the Cloud

Cloud providers and datacenter operators are grappling with increased demand for graphics processing units (GPUs) due to expanding use of large language models (LLMs). To try to keep up, enterprises are exploring various means to address the challenge, such as power oversubscription and adding more servers. Proper power usage analysis and management could help providers meet demand safely and more efficiently. 

In a recent paper: Characterizing Power Management Opportunities for LLMs in the Cloud, researchers from Microsoft analyze power patterns for several popular, open-source LLMs across commonly used configurations and identify opportunities to improve power management for LLMs in the cloud. They present a new framework called POLCA, which enables power oversubscription in LLM inference clouds. POLCA is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, POLCA simulations demonstrate it could deploy 30% more servers in existing clusters while incurring minimal power throttling events. POLCA improves power efficiency, reduces the need for additional energy sources and datacenters, and helps to promptly meet demand for running additional LLM workloads. 
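The capacity arithmetic behind power oversubscription can be sketched in a few lines of Python. This is not POLCA itself, which also has to handle throttling when demand spikes. It only shows why provisioning against observed peak power rather than a worst-case rating frees room for more servers. All names and wattages below are hypothetical; the numbers are invented so that the arithmetic is easy to follow and lands on a 30% gain, and they are not measurements from the paper.

```python
def servers_supported(power_budget_watts: int, per_server_watts: int) -> int:
    """Number of whole servers that fit within a fixed power budget."""
    if per_server_watts <= 0:
        raise ValueError("per_server_watts must be positive")
    return power_budget_watts // per_server_watts


def oversubscription_gain(power_budget_watts: int,
                          provisioned_peak_watts: int,
                          observed_peak_watts: int):
    """Compare capacity under worst-case vs. observed per-server peak power.

    Returns (baseline servers, oversubscribed servers, fractional gain).
    """
    baseline = servers_supported(power_budget_watts, provisioned_peak_watts)
    oversubscribed = servers_supported(power_budget_watts, observed_peak_watts)
    gain = (oversubscribed - baseline) / baseline
    return baseline, oversubscribed, gain


# Hypothetical cluster: 650 kW budget, 6.5 kW rated peak per server,
# 5 kW peak actually observed under the inference workload.
baseline, oversubscribed, gain = oversubscription_gain(650_000, 6_500, 5_000)
print(f"{baseline} -> {oversubscribed} servers ({gain:.0%} more)")
```

The risk in this scheme is that many servers hit their true peak at once and exceed the budget, which is why the text above stresses that the added servers come with minimal power throttling events.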

GigaPath: Whole-Slide Foundation Model for Digital Pathology

Digital pathology helps decode tumor microenvironments for precision immunotherapy. In joint work with Providence and the University of Washington (UW), we’re sharing Prov-GigaPath, the first whole-slide pathology foundation model, for advancing clinical research.

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Various prompting techniques, such as chain-of-thought (CoT), in-context learning (ICL), and retrieval augmented generation (RAG), can empower large language models (LLMs) to handle complex and varied tasks through rich and informative prompts. However, these prompts are lengthy, sometimes exceeding tens of thousands of tokens, which increases computational and financial overhead and degrades the LLMs’ ability to perceive information. Recent efforts to compress prompts in a task-aware manner, without losing essential information, have resulted in shorter prompts tailored to a specific task or query. This typically enhances performance on downstream tasks, particularly in question answering. However, the task-specific features present challenges in efficiency and generalizability. 

In a recent paper: LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression, researchers from Microsoft and Tsinghua University propose a data distillation procedure to derive knowledge from an LLM (GPT-4) and compress the prompts without losing crucial information. They introduce an extractive text compression dataset, containing pairs of original texts from MeetingBank and their compressed versions. Although their model is small, it shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. The new model is 3x-6x faster than existing prompt compression methods, and it speeds up end-to-end latency by 1.6x-2.9x at compression ratios of 2x-5x.


AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

Despite recent progress in scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging. Evaluation is often performed using n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics like COMET have a higher correlation; however, challenges such as the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders, have hampered their applicability to African languages. 

In a recent paper: AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages, researchers from University College London, University of Maryland, Unbabel, Microsoft and the Masakhane Community address these challenges, creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. They also develop AfriCOMET: COMET evaluation metrics for African languages that leverage DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLMR). The resulting metrics are state of the art for MT evaluation of African languages with respect to Spearman-rank correlation with human judgments (0.441).
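For readers unfamiliar with the Spearman-rank correlation mentioned above, the following Python sketch computes it from scratch: rank both lists (averaging ranks for ties), then take the Pearson correlation of the ranks. It is a generic textbook implementation, not code from the paper, and the metric scores and human ratings in the usage example are invented.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman-rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: automatic metric scores vs. human DA ratings.
metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60]
human_ratings = [78, 55, 90, 60, 62]
print(round(spearman(metric_scores, human_ratings), 3))
```

Because it depends only on rank order, this measure rewards a metric for ordering translations the way human raters do, regardless of the scale of its raw scores.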


Comparing the Agency of Hybrid Meeting Remote Users in 2D and 3D Interfaces of the Hybridge System

Video communication often lacks the inclusiveness and simultaneity enabled by physical presence in a shared space. This is especially apparent during hybrid meetings, where some attendees meet physically in a room while others join remotely. Remote participants are at a disadvantage, unable to navigate the physical space like in-room participants. 

In a Late Breaking Work paper to be presented at CHI 2024: Comparing the Agency of Hybrid Meeting Remote Users in 2D and 3D Interfaces of the Hybridge System, Microsoft researchers present an experimental system for exploring designs for improving the inclusion of remote attendees in hybrid meetings. In-room users see remote participants on individual displays positioned around a table. Remote participants see video feeds from the room integrated into a digital twin of the meeting room, choosing where they appear in the meeting room and from where they view it. The researchers designed both a 2D and a 3D version of the interface. They found that 3D outperformed 2D in participants’ perceived sense of awareness, sense of agency, and physical presence. A majority of participants also subjectively preferred 3D over 2D. The next step in this research will test the inclusiveness of Hybridge 3D meetings against fully in-room meetings and traditional hybrid meetings.


FeatUp: A Model-Agnostic Framework for Features at Any Resolution

Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction. This is because models like transformers and convolutional networks aggressively pool information over large areas. 

In a paper that was published at ICLR 2024: FeatUp: A Model-Agnostic Framework for Features at Any Resolution, researchers from Microsoft and external colleagues introduce a task- and model-agnostic framework to restore lost spatial information in deep features. The paper introduces two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multiview consistency loss with deep analogies to neural radiance fields (NeRFs), a deep learning method of building 3D representations of a scene using sparse 2D images. In the new research, features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains, even without re-training. FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation. 
