Today, no public cloud can host large latency-sensitive services, such as search engines, in a way that is economical for those services. Project LEAP (short for Lean, Efficient, And Predictable) addresses the research challenges in enabling cloud platforms to host these services in a rightsized and elastic manner, with predictable tail latency. The challenges include how to prevent interference from co-located workloads, how to prevent performance jitter due to I/O, how to make the services elastic despite their large storage footprints, and how to leverage spare capacity to run their batch workloads without affecting any co-located workloads. LEAP leverages emerging hardware and sophisticated software techniques to overcome these challenges. As a by-product of the project, we have also introduced a benchmark suite of representative Azure workloads, which is becoming increasingly popular.
As a concrete target, we aim to host Bing head services on Azure with tail latency similar to what they currently achieve on bare metal. We are pursuing this target in collaboration with several product groups, including Bing, Azure, and Windows/Hyper-V. The benefits are significant: new revenue for Azure, as it will host latency-sensitive services for first and third parties; new revenue for Bing, as it will monetize its internal services, such as IndexServe, by providing them as Azure services to external users; and the unification of AutoPilot and Azure into a single infrastructure.
Another major concrete target is to deploy aggressive resource (CPU, memory, etc.) oversubscription and harvesting in Azure without affecting workload performance. This will substantially lower costs for Azure.
Research challenges and our current progress:
- Reduce virtualization overhead (deployed in Bing and AZAP)
- Create accurate VM rightsizing algorithms (deployed in Azure)
- Eliminate cross-VM interference in last-level caches (SoCC’21)
- Eliminate performance jitter due to I/O processing, without wasting resources (ASPLOS’20)
- Enable elasticity with high storage performance, despite large storage footprints
- Enable services to leverage burstable VMs and recommend their use (deployed in Azure)
- Harvest spare resources for workloads without affecting co-located VMs (OSDI’20, SOSP’21, EuroSys’21, ASPLOS’22, deployed in Azure)
- Maximize resource oversubscription without performance impact on workloads (paper in preparation, deployed in Azure)