Research Focus: Week of September 25, 2023

Published September 27, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Large language model (LLM) inference consists of two distinct phases: the prefill phase, which processes the input prompt, and the decode phase, which generates output tokens autoregressively. While the prefill phase effectively saturates graphics processing unit (GPU) compute even at small batch sizes, the decode phase results in low compute utilization because it generates only one token at a time per request. The varying prefill and decode times also cause imbalance across micro-batches when using pipeline parallelism, which creates pipeline bubbles and further inefficiency.

In a new paper, SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, researchers from Microsoft present a solution to these challenges that yields significant improvements in inference performance across models and hardware. SARATHI employs chunked-prefills, which split a prefill request into equal-sized chunks, and decode-maximal batching, which constructs a batch from a single prefill chunk and fills the remaining slots with decodes. Chunked-prefills allow multiple decode-maximal batches to be constructed from a single prefill request, maximizing the number of decodes that can piggyback. Furthermore, the uniform compute design of these batches reduces the imbalance between micro-batches, significantly reducing pipeline bubbles.
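The batching idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's implementation: the names `Batch`, `chunk_prefill`, `decode_maximal_batches`, and the `batch_slots` parameter are hypothetical, the last chunk is allowed to be shorter when the prompt length is not a multiple of the chunk size, and request completion and scheduling policy are ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Batch:
    prefill_tokens: int          # size of the single prefill chunk in this batch
    decode_requests: List[str]   # ids of decode requests piggybacking on the chunk


def chunk_prefill(prompt_len: int, chunk_size: int) -> List[int]:
    """Split a prompt of `prompt_len` tokens into chunks of `chunk_size` tokens.

    Assumption: if the prompt length is not a multiple of the chunk size,
    the final chunk is simply shorter.
    """
    if prompt_len <= 0 or chunk_size <= 0:
        raise ValueError("prompt_len and chunk_size must be positive")
    full_chunks, remainder = divmod(prompt_len, chunk_size)
    return [chunk_size] * full_chunks + ([remainder] if remainder else [])


def decode_maximal_batches(
    prompt_len: int,
    chunk_size: int,
    pending_decodes: List[str],
    batch_slots: int,
) -> List[Batch]:
    """Build one batch per prefill chunk and fill the other slots with decodes.

    Each batch holds exactly one prefill chunk plus up to `batch_slots - 1`
    decode requests. A decode request produces one token per batch it joins,
    so the same requests can ride along with every chunk of the prompt.
    """
    if batch_slots < 1:
        raise ValueError("batch_slots must be at least 1")
    decode_slots = batch_slots - 1
    return [
        Batch(prefill_tokens=chunk, decode_requests=pending_decodes[:decode_slots])
        for chunk in chunk_prefill(prompt_len, chunk_size)
    ]


if __name__ == "__main__":
    for batch in decode_maximal_batches(1024, 256, ["a", "b", "c", "d", "e"], 4):
        print(batch)
```

In this toy setting, a 1,024-token prompt with a chunk size of 256 yields four batches, each carrying one prefill chunk and up to three decodes. Without chunking, the same prompt would form a single prefill batch, giving the decodes only one opportunity to piggyback.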

NEW RESEARCH

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Controllable video generation has gained significant attention in recent years. However, two main limitations persist. First, most existing works focus on only one of text-, image-, or trajectory-based control, which prevents fine-grained control over the generated video. Second, trajectory control research is still in its early stages, with most experiments conducted on simple datasets like Human3.6M. This limits the models' ability to process open-domain images and to handle complex curved trajectories effectively.

In a new paper, DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory, researchers from Microsoft propose an open-domain diffusion-based video generation model. To address the insufficient control granularity of existing works, DragNUWA introduces text, image, and trajectory information simultaneously, providing fine-grained control over video content from semantic, spatial, and temporal perspectives. To address the limited open-domain trajectory control in current research, the researchers propose trajectory modeling with three components: a trajectory sampler (TS) to enable open-domain control of arbitrary trajectories, a multiscale fusion (MF) to control trajectories at different granularities, and an adaptive training (AT) strategy to generate consistent videos that follow the trajectories. Their experiments demonstrate DragNUWA's superior performance in fine-grained control of video generation.

DragNUWA is purely a research project and there are no current plans to incorporate DragNUWA into a product. Any further research will continue to follow Microsoft AI principles.

NEW RESEARCH

Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals

Understanding cortical responses to human visual perception has emerged as a research hotspot. Yet the underlying mechanism by which human visual perception is intertwined with cognition remains a mystery. Thanks to recent advances in both neuroscience and artificial intelligence, researchers have been able to record visually evoked brain activity and mimic visual perception ability through computational approaches.

In a new paper, Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals, researchers from Microsoft reconstruct observed images from portable and accessible brain signals, namely electroencephalography (EEG) data. Because EEG signals are dynamic time series and notoriously noisy, processing them and extracting useful information requires dedicated effort. The researchers propose a comprehensive pipeline, named NeuroImagen, which incorporates a novel multi-level perceptual information decoding step to draw multi-grained and heterogeneous outputs from the given EEG data. A pretrained latent diffusion model then uses the extracted semantic information to reconstruct high-resolution images of the visual stimuli. The experimental results illustrate the effectiveness of the image reconstruction and the superior quantitative performance of the proposed method.