
Microsoft Research Lab – Asia

OSDI 2022 highlights from MSR Asia: A peek at the latest research in computer systems


Operating Systems Design and Implementation (OSDI) is one of the top academic conferences in the field of computer systems. The 16th OSDI was held from July 11 to 13, 2022. A total of 253 papers were submitted and 49 were accepted, for an acceptance rate of 19.4%. This article highlights three outstanding papers from MSR Asia accepted at OSDI 2022, offering a look at cutting-edge research in computer systems.

ROLLER: Fast and Efficient Tensor Compilation for Deep Learning


Paper link: https://www.usenix.org/conference/osdi22/presentation/zhu

Code repo: https://github.com/microsoft/nnfusion/

Deep neural networks (DNNs) are used extensively in intelligent tasks such as computer vision and natural language understanding. Because DNN computation is computationally intensive, compute-heavy sub-tasks (e.g., matrix multiplication) in a DNN model are abstracted as operators and implemented as kernels that execute on modern accelerators (e.g., GPUs, TPUs) to speed up computation. DNN compilers play an important role in producing high-performance kernels for the development of DNN models. They reduce the burden of (often hand-crafted) library-based kernel development (e.g., cuDNN and cuBLAS) and provide a flexible way to cover the fast-growing number of custom operators that libraries struggle to keep up with and optimize, a growing pain especially for new hardware vendors.

Existing DNN compilers usually treat a DNN operator as a tensor computation, which is then translated into nested multi-level loops that iterate over the computation on each tensor element. Compiler optimization techniques such as loop partitioning, fusion, and reordering are applied to these nested loops. Because of the inherent complexity of loop rearrangement, finding a good solution is a combinatorial optimization problem over a large search space that often contains millions of choices. Advanced compilers therefore adopt machine learning algorithms to search for a good solution. However, this approach usually takes thousands of search steps, each evaluated on a real accelerator, to find a reasonable solution. As a result, tuning an end-to-end DNN model using state-of-the-art compilers like TVM or Ansor often requires days, if not weeks, to complete, and the tuning time may be even longer if the DNN model runs on less mature accelerators (e.g., AMD GPUs or Graphcore IPUs). The resulting inefficiency has become a critical barrier that prevents users from adopting these compilers in real-life scenarios.
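
To make the search-space problem concrete, the sketch below (plain Python pseudocode rather than an actual compiler IR, with made-up tile-size candidates) shows how a single matrix multiplication becomes nested loops and how quickly the tiling choices multiply once a compiler starts partitioning those loops:

```python
import itertools
import random

def matmul_naive(A, B, M, N, K):
    # One loop per tensor dimension: the canonical lowering of C = A x B.
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, M, N, K, ti, tj, tk):
    # The same computation after loop partitioning into (ti, tj, tk) tiles.
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, ti):
        for j0 in range(0, N, tj):
            for k0 in range(0, K, tk):
                for i in range(i0, min(i0 + ti, M)):
                    for j in range(j0, min(j0 + tj, N)):
                        for k in range(k0, min(k0 + tk, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

# The tiled loop nest computes the same result...
M = N = K = 8
A = [[random.random() for _ in range(K)] for _ in range(M)]
B = [[random.random() for _ in range(N)] for _ in range(K)]
ref = matmul_naive(A, B, M, N, K)
tiled = matmul_tiled(A, B, M, N, K, 4, 4, 2)
assert all(abs(ref[i][j] - tiled[i][j]) < 1e-9 for i in range(M) for j in range(N))

# ...but even a naive enumeration of power-of-two tile sizes for this one
# operator already yields hundreds of schedules, before loop reordering and
# fusion multiply the space into the millions that auto-tuners must search.
candidate_tiles = [2 ** p for p in range(1, 8)]   # 2, 4, ..., 128
print(len(list(itertools.product(candidate_tiles, repeat=3))), "tiling choices")
```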

To tackle this issue, the researchers studied the performance behavior of a large number of operators compiled by existing compilers and observed an interesting pattern: although there are thousands of optimization options for each operator, the optimal configuration usually matches the hardware configuration so that hardware resources can be fully utilized. Based on this observation, they proposed a new tensor compiler called Roller. To better match the hardware configuration, Roller treats the computation in a DNN operator as a data processing pipeline, where data tiles are moved and processed on an abstracted hardware model with a multi-layer memory hierarchy. The goal of generating efficient kernel programs then becomes improving the throughput of this pipeline. Also, for an accelerator to work efficiently, the shape of a data tile should align with hardware characteristics, including the memory bank, memory transaction length, and minimum schedulable unit. Because full alignment across multiple hardware features is required, only a limited number of tile shapes are available. With alignment as a constraint, one only needs to construct an aligned tile shape that saturates the execution units of the accelerator to maximize the throughput of the pipeline. This type of “white-box” construction process is therefore fundamentally more efficient than solving the original unconstrained combinatorial optimization problem and can significantly reduce tensor compilation time.
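
The sketch below illustrates the idea of constructing aligned tile shapes instead of searching; the hardware numbers and the throughput score are hypothetical placeholders, not Roller's actual hardware abstraction or cost model:

```python
# Abstracted hardware description; the numbers here are made-up examples.
HW = {
    "memory_transaction_elems": 32,    # tile edges should be multiples of this
    "min_schedulable_unit": 64,        # e.g., a warp/wavefront-sized work unit
    "smem_capacity_elems": 48 * 1024,  # on-chip memory budget for one tile
}

def aligned_tile_shapes(hw, max_dim=512):
    """Enumerate only the tile shapes that align with hardware characteristics."""
    step = hw["memory_transaction_elems"]
    for rows in range(step, max_dim + 1, step):
        for cols in range(step, max_dim + 1, step):
            if (rows * cols <= hw["smem_capacity_elems"]
                    and (rows * cols) % hw["min_schedulable_unit"] == 0):
                yield (rows, cols)

def throughput_score(tile):
    # Toy proxy for pipeline throughput: work performed per element moved.
    rows, cols = tile
    return (rows * cols) / (rows + cols)

# Alignment shrinks the candidate set to a small number of shapes, so a direct
# "white-box" construction can replace a machine-learning-guided search.
candidates = list(aligned_tile_shapes(HW))
best = max(candidates, key=throughput_score)
print(len(candidates), "aligned candidates; chosen tile shape:", best)
```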

Figure 1: System overview of Roller

More importantly, Roller’s entire design is built on top of an abstracted hardware layer, which helps it adapt to the ever-increasing number of DNN accelerators. In experimental evaluations on NVIDIA GPUs, AMD GPUs, and the Graphcore IPU, the researchers found that the performance of Roller-generated kernels is comparable to, and often better than, that of state-of-the-art tensor compilers and even vendor-provided DNN libraries. For example, after compiling more than 100 popular DNN operators from several mainstream DNN models, they found that 59.7% and 73.1% of the kernels generated by Roller were better than the NVIDIA and AMD libraries respectively, and 54.6% and 58.8% outperformed the programs generated by TVM and Ansor respectively. What’s more, Roller can generate highly optimized kernels in seconds, especially for large, expensive custom operators, a reduction in compilation time of three orders of magnitude.

The researchers strongly believe that Roller can provide a more efficient way to build software ecosystems for various hardware accelerators. For new hardware manufacturers that lack mature operator libraries and previously had to spend significant engineering effort on efficient kernels, Roller offers a potentially breakthrough opportunity in the computing accelerator market. Roller’s “white-box” compilation approach also opens up new opportunities for optimizing and deploying deep learning models on specific hardware. Microsoft Research Asia is currently building the next generation of its DNN compilation stack on top of Roller and is continuing to conduct fundamental research on model optimization, deployment, and new hardware support.

RESIN: A Holistic Service for Dealing with Memory Leaks in Production Cloud Infrastructure


Paper link: https://www.usenix.org/conference/osdi22/presentation/lou-chang-resin

The infrastructure for cloud computing contains many complex software components running on a massive number of machines with various workloads. Unsurprisingly, these components encounter memory leak issues from time to time. When a process develops a leak, it often affects other components running on the same machine, for example by causing excessive paging, getting innocent processes killed, and triggering machine reboots, which seriously affects customer experience and can even cause financial loss.

However, memory leaks are notoriously difficult to deal with, especially in a production cloud infrastructure. Existing solutions often incur high overhead and/or suffer from low accuracy. On the one hand, leaks are usually triggered only by rare conditions and develop slowly, so they can easily escape testing and failure detectors. On the other hand, unlike failures such as crashes that present obvious places to begin diagnosis, memory leaks are time-consuming and sometimes impossible to reproduce offline, and developers often struggle to find their root cause. To address these issues, researchers from Johns Hopkins University, Microsoft Azure, and MSR Asia jointly proposed RESIN, a holistic service for addressing memory leaks in production cloud infrastructure.

RESIN takes a centralized approach. It does not require access to a component’s source code, nor does it require extensive instrumentation or re-compilation. RESIN uses a monitoring agent on each host that leverages low-level OS features to collect memory telemetry data, so it automatically supports all components, including the kernel. Data analysis is offloaded to a remote service, which minimizes overhead on the hosts. By aggregating data from different hosts, RESIN can run more sophisticated analyses to catch complex leaks. In addition, RESIN decomposes the memory leak problem and tackles it in multi-level stages: it performs lightweight leak detection first and triggers more in-depth inspections on the fly when necessary for confirmation and diagnosis. This divide-and-conquer approach allows RESIN to achieve low overhead, high accuracy, and scalability.
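
The sketch below illustrates this staged, divide-and-conquer idea using hypothetical telemetry; the function names, thresholds, and data are illustrative assumptions and not RESIN's actual detection algorithm:

```python
# Stage 1 is deliberately cheap: it only looks at aggregated per-process memory
# trends. Only suspicious processes are escalated to a heavier inspection stage.
from statistics import mean

def lightweight_detect(samples, window=6, growth_threshold=0.05):
    """Stage 1: flag a process whose memory usage keeps growing across windows.

    samples: periodic memory-usage readings (MB) for one process on one host.
    """
    if len(samples) < 2 * window:
        return False
    older, recent = samples[:window], samples[-window:]
    growth = (mean(recent) - mean(older)) / max(mean(older), 1e-9)
    return growth > growth_threshold

def in_depth_inspect(process_name):
    """Stage 2 (placeholder): a real system would enable detailed allocation
    tracing or heap snapshot diffing on the live process to confirm and
    localize the leak."""
    print(f"escalating {process_name}: enabling detailed allocation tracing")

# Telemetry as it might look after aggregation by a remote service (fake data).
telemetry = {
    "storage-agent": [410, 412, 411, 413, 410, 414, 450, 462, 471, 480, 495, 510],
    "network-agent": [200, 201, 199, 202, 200, 201, 200, 202, 199, 201, 200, 201],
}
for proc, usage in telemetry.items():
    if lightweight_detect(usage):
        in_depth_inspect(proc)
```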

Figure 2: Workflow of the RESIN system

RESIN has been running in production in Microsoft Azure for three years. It provides effective diagnosis reports with high accuracy and low overhead, and the number of unexpected VM reboots caused by low memory in Azure has been reduced by 41x.

SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute


Paper link: https://www.usenix.org/conference/osdi22/presentation/zheng-ningxin

Artifact Evaluation: https://github.com/microsoft/nni/tree/sparta_artifact/sparta

Open-sourced code: https://github.com/microsoft/SparTA.git

With the rapid evolution of deep learning, the scale of deep learning models has grown exponentially. A single model can now have trillions of parameters, far exceeding the rate at which the computing power of hardware accelerators is growing. At the same time, because of limits on computing power and energy consumption, edge devices have strict requirements on the size and inference latency of deep learning models. Exploring the sparsity in deep learning models and effectively accelerating sparse models have therefore become key to developing and deploying deep learning models.

However, system support for sparse models has many deficiencies that hinder their exploration: (i) It is very difficult to program efficient kernels for various sparsity patterns, which usually requires great effort from system experts. Deep learning researchers therefore usually rely on proxy metrics (e.g., FLOPs, bit width) to estimate the acceleration effect while developing sparse models, but these proxy metrics cannot accurately reflect the real speedup, and sometimes the predicted effect is far from the actual one. (ii) Most current sparsity optimizations focus only on a single operator while ignoring the propagation of sparsity from a sparse operator to its neighbors in the model. (iii) It is currently difficult to reuse a sparsity optimization built for a specific model or to combine it with other optimizations. For example, an optimization for a pruned operator (or a quantized operator) can hardly be applied directly to an operator that is both pruned and quantized.
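
A small experiment, sketched below with NumPy (not taken from the paper), illustrates point (i): pruning about 90% of a weight matrix cuts the FLOP count roughly 10x, yet a dense kernel that cannot exploit the sparsity pattern runs at essentially the same speed, so the proxy metric overstates the benefit.

```python
# Compare a FLOP-count proxy against measured time for a pruned weight matrix
# multiplied with a dense kernel (NumPy's matmul stands in for any dense
# accelerator kernel that is unaware of the sparsity pattern).
import time
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 1024))
w_dense = rng.standard_normal((1024, 1024))
w_sparse = w_dense * (rng.random((1024, 1024)) < 0.1)   # ~90% of weights pruned

def bench(a, b, reps=10):
    start = time.perf_counter()
    for _ in range(reps):
        a @ b
    return (time.perf_counter() - start) / reps

flops_dense = 2 * 1024 ** 3
flops_sparse = 2 * 1024 * np.count_nonzero(w_sparse)    # only useful multiply-adds
print(f"proxy (FLOPs) predicts ~{flops_dense / flops_sparse:.1f}x speedup")
print(f"measured: dense {bench(x, w_dense) * 1e3:.2f} ms, "
      f"pruned-with-dense-kernel {bench(x, w_sparse) * 1e3:.2f} ms")
```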

Figure 3: Architecture of SparTA

To address these challenges, the researchers proposed SparTA, an end-to-end compilation framework for optimizing sparse deep learning models. SparTA takes the sparsity attribute of tensors (produced by pruning and quantization) as its core abstraction, called TeSA (Tensor-with-Sparsity-Attribute), and the compilation process and its optimizations are built on TeSA. Figure 3 shows the architecture of SparTA. Users first use TeSA to mark the sparsity patterns of certain tensors in the input deep learning model, and SparTA then applies three core techniques to optimize the model end to end. The first technique propagates the sparsity attributes of tensors along the data flow graph of the model; this is done automatically using tensor algebra and tensor scrambling. The second technique transforms a sparse operator into sub-operators, each with a simple sparsity pattern that is much easier to optimize; this transformation also allows different optimization techniques to be combined organically. The third technique performs code specialization for sparse operators, i.e., it specializes concrete sparsity patterns into kernel code, removes dead code, and applies accelerator-specific instructions (e.g., wmma for Tensor Cores).
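
The toy sketch below conveys the TeSA idea of attaching a sparsity attribute to tensors, propagating it through an operator, and specializing the computation accordingly; it uses NumPy with simplified block-level masks as stand-ins and is not the SparTA API or its actual propagation rules.

```python
import numpy as np

class TeSA:
    """Tensor with a per-block sparsity attribute (True = block may be nonzero)."""
    def __init__(self, data, block=4):
        self.data, self.block = data, block
        b = block
        self.mask = np.array([[bool(np.any(data[i:i + b, j:j + b]))
                               for j in range(0, data.shape[1], b)]
                              for i in range(0, data.shape[0], b)])

def propagate_matmul_mask(a, w):
    # If every contributing block pair is zero, the output block is provably zero.
    return a.mask.astype(int) @ w.mask.astype(int) > 0

def specialized_matmul(a, w):
    out_mask = propagate_matmul_mask(a, w)
    b = a.block
    out = np.zeros((a.data.shape[0], w.data.shape[1]))
    for i in range(out_mask.shape[0]):
        for j in range(out_mask.shape[1]):
            if not out_mask[i, j]:
                continue                      # dead output block: work elided
            for k in range(a.mask.shape[1]):
                if a.mask[i, k] and w.mask[k, j]:   # skip provably-zero inputs
                    out[i*b:(i+1)*b, j*b:(j+1)*b] += (
                        a.data[i*b:(i+1)*b, k*b:(k+1)*b] @
                        w.data[k*b:(k+1)*b, j*b:(j+1)*b])
    return out

# Example: two block-diagonal 8x8 tensors; the specialized kernel touches only
# the nonzero blocks yet matches the dense result.
rng = np.random.default_rng(0)
def block_diag8():
    return np.kron(np.eye(2), rng.standard_normal((4, 4)))
act, wgt = TeSA(block_diag8()), TeSA(block_diag8())
np.testing.assert_allclose(specialized_matmul(act, wgt), act.data @ wgt.data, atol=1e-12)
```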

In comprehensive testing, SparTA demonstrated an average speedup of 8.4x over existing tools. SparTA not only optimizes sparse models that are shrunk from dense models; it can also accelerate large pre-trained models that are designed from the start with a specific sparsity pattern. For example, SparTA has been used to optimize NUWA, a pre-trained video generation model developed by Microsoft Research, achieving more than a 2x speedup on the newly proposed 3DNA sparse attention in the NUWA model. The researchers are currently reorganizing and optimizing SparTA’s code to improve usability, and it will soon be officially open sourced to promote research on, and the practicality of, sparse deep learning models.