By MSRA Systems and Networking Team
When you surf the Internet with your favorite search engine, when you play with Xiaoice or Siri to see if they are smart or dumb, or when you experience the adventure and convenience of autonomous driving, that smooth user experience is made possible only by the seamless combination of underlying systems, applications, and algorithms. Behind every technology revolution, a key ingredient is a revolution in the underlying systems.
We live in one of the most exciting times for systems research and innovation. Over the past decade, we have witnessed an acceleration of technological revolutions in the Internet, online search, big data, social networks, cloud, blockchain, and artificial intelligence (AI), just to name a few. All of these revolutions have demanded unprecedented scale and complexity that are made possible only through significant advancements in systems, especially in distributed systems, and the increasingly important co-evolution of fields such as databases, programming languages and compilers, theory and formal methods, hardware and architecture, and security and privacy.
These systems innovations seldom come under the spotlight, because systems are typically hidden from users. In this article, we will discuss some of these innovations and share our thoughts on some important questions: What is a system? What is a good system? What are the technology challenges and trends in systems innovation and design for the future?
Invisible systems
Good systems are invisible.
Despite being a key driver of the recent major paradigm shifts in technology, systems innovations seldom come under the spotlight compared to breakthroughs in other technical areas, such as computer vision, speech, and multimedia, where the effects and results are often evident to the general public. The fundamental principles of a good system design also imply that the best systems are those that are mostly invisible, because systems research is all about bringing order, simplicity, and coherence to what is otherwise a chaotic, random, and unmanageable foundation. Consequently, the most important mission of system design is to come up with the right abstractions that hide all complexity from users in a way that does not introduce any impedance mismatch. The best system simply works naturally, consistently, and predictably. A system becomes visible and gets in the way only if it is poorly designed or when it fails.
Defining the future
Good systems must define the future. In the 1970s, the systems researchers at Xerox PARC invented the graphical user interface (GUI) built around the concept of WYSIWYG (What You See Is What You Get), the personal computer, Ethernet, the laser printer, and more, and in doing so defined computing for the following decades. Today, we are again at a point where there is an opportunity to redefine computing for the decades to come.
The speed of systems innovations will continue to accelerate. It will create a new future where our lives and the world around us will be dramatically redefined. The future we are envisioning can be described as one with omnipresent intelligence, enabled by computing and storage capabilities invisibly embedded and accessible in our environment and connected to the cloud. One hallmark of the future is the disappearing boundary between our physical and virtual worlds, thanks to the abundance of connected and embedded sensors and actuators in a wide variety of forms, powerful analytics and intelligent services, and emerging technologies such as mixed reality for natural and immersive user experiences. Our relationship with computing will therefore be redefined in new interaction models.
Technology trends for future systems
We believe that the following technology trends will drive us toward the future we envision.
The shift back towards decentralization
Technology trend: Evolving from centralized cloud computing to a new decentralized computing paradigm.
Computer systems have experienced two major paradigm shifts over the history of computing. The first was a shift from (centralized) mainframes to (decentralized) personal computers in the 1980s and 1990s. We are now in the second paradigm shift from personal computing to (centralized) cloud computing. Each paradigm shift has been driven by advancements in computing technology and the applications that have been thus enabled, as well as by economic and social dynamics.
We believe that the alternation between centralized and decentralized computing paradigms will continue. One significant emerging technology trend is to move beyond (mostly) centralized cloud computing back to a decentralized computing paradigm. The Internet of Things (IoT) and edge computing are early indicators of this trend. Architecturally and philosophically, decentralization is the common denominator of the whole spectrum of public, consortium, and private systems, and it enables applications that are infeasible in centralized systems. This is the focus of distributed-systems research; below, we highlight several key directions and challenges.
- Scalable strong consistency. At the core of distributed-systems research are strong-consistency protocols, on which many large-scale distributed systems, such as distributed control systems and transaction processing in digital banks, rely. With the emergence of the Internet, more people across a wider geographical range are connected than ever before, placing higher demands on system scale. Today, achieving scalable strong consistency across intercontinental datacenters has become a key focus in systems research and practice.
Strong consistency must constantly be balanced against the classic trade-off among consistency, availability, and tolerance of network partitions (the CAP theorem). With advances in Internet technologies, scalable strong-consistency protocols have made great strides both within datacenters and across the Internet, shifting from weak-consistency protocols to a new generation of protocols based on Paxos and Byzantine fault tolerance (BFT). The emergence of blockchain also makes Internet-scale protocols a possibility. In the foreseeable future, new techniques such as data partitioning, topological sorting, and asynchronous negotiation will further enhance the performance and applicability of strong-consistency protocols.
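To make the quorum intuition behind these protocols concrete, here is a minimal sketch of majority-quorum reads and writes in Python. It illustrates the general idea only, not any specific protocol such as Paxos or a BFT variant, and the class and method names are our own.
```python
# Minimal sketch of majority-quorum replication (illustrative only; real
# protocols such as Paxos also handle leader election, log recovery, and
# failure detection, which are omitted here).

class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

class QuorumCluster:
    def __init__(self, n_replicas=5):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.majority = n_replicas // 2 + 1

    def write(self, value, reachable):
        """Commit a write only if a majority of replicas acknowledge it."""
        if len(reachable) < self.majority:
            raise RuntimeError("not enough replicas reachable: write rejected")
        new_version = max(r.version for r in self.replicas) + 1
        for r in reachable[: self.majority]:
            r.version, r.value = new_version, value
        return new_version

    def read(self, reachable):
        """Read from a majority; any two majorities intersect, so the
        latest committed version is always observed."""
        if len(reachable) < self.majority:
            raise RuntimeError("not enough replicas reachable: read rejected")
        latest = max(reachable[: self.majority], key=lambda r: r.version)
        return latest.value, latest.version

cluster = QuorumCluster()
cluster.write("balance=100", reachable=cluster.replicas[:3])  # majority of 5
print(cluster.read(reachable=cluster.replicas[2:]))           # ('balance=100', 1)
```
The sketch also shows why strong consistency trades off against availability: during a network partition, a side that cannot reach a majority must reject operations rather than risk returning stale or divergent data.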
- Infinite data and computations. Efficiency, availability, and reliability have become the most critical requirements for computation and data platforms, and as a result, they are at the center of system innovations. In contrast to conventional database systems, computation and data storage are decoupled in cloud computing. Computation platforms, from MapReduce to Spark to Flink, have evolved from designs that mainly focus on efficiency and scalability to ones that focus on low latency. Data storage, from Bigtable to MongoDB to Spanner, while initially providing high scalability, now puts special emphasis on consistency. Database systems, on the other hand, have started supporting a variety of data models, from graphs to documents to streams, and borrowing key design philosophies from cloud computing platforms, largely rendering the two types of systems similar in architecture and design.
High efficiency, high availability, and high reliability have been and will continue to be the key requirements of computation and data platforms, and the main areas of innovation. Supporting OLAP and OLTP workloads in one system; processing multiple data models such as tables, graphs, and documents; and scaling elastically from a single machine to multiple datacenters while providing global consistency and failover will all be key challenges in building a new generation of systems.
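As a reminder of the programming model that this line of platforms popularized, here is a minimal, single-process sketch of the map/shuffle/reduce pattern; real platforms distribute these phases across many machines and add scheduling and fault tolerance, which are omitted here.
```python
# Minimal single-process sketch of the map/shuffle/reduce pattern (real
# platforms such as MapReduce, Spark, and Flink distribute these phases
# across many machines and add scheduling and fault tolerance).
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word in a document.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all values for a key into the final count.
    return key, sum(values)

documents = ["the cloud and the edge", "the edge complements the cloud"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'the': 4, 'cloud': 2, 'and': 1, 'edge': 2, 'complements': 1}
```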
- Intelligent edge computing. Edge computing is becoming increasingly intelligent, thanks to advances in both software and hardware, including deep-learning algorithms, systems, and deep-learning chips. There are many new intelligent devices emerging, such as smart speakers, cameras, home appliances, and self-driving cars, with each enabling various new applications and scenarios such as smart homes, intelligent transportation, connected factories, and smart cities. These new edge devices and applications lead to a new computing paradigm shift, from centralized intelligence in the cloud to distributed intelligence across the cloud and the edge. With intelligent edge computing, edge devices can process data and make decisions locally, without having to constantly rely on the cloud.
The rise of intelligent edge computing has several key driving forces. First, there has been a data explosion as more and more smart devices are being connected. For example, the now-prevalent surveillance cameras and the emerging self-driving cars both generate huge amounts of data. Transmitting and processing such large volumes of data goes beyond the capability of today's networks and cloud. Second, cloud computing itself cannot meet the requirements of new applications and scenarios, such as real-time data processing and user-privacy protection. Take self-driving cars as an example: the data must be processed in real time to ensure safety, and the extra latency of transmitting all the data to the cloud makes a cloud-based approach infeasible. In addition, many types of user data are sensitive and private, such as image and speech data acquired by smart cameras and speakers at home. To protect user privacy, these types of data should not be sent to the cloud, which makes on-device processing the only viable solution. More importantly, with the latest hardware advances, intelligent edge computing is now possible. With dedicated chips for deep learning, edge devices are now capable of processing huge amounts of vision and speech data, providing a solid foundation for many new edge-computing applications.
Intelligent edge computing still faces many challenges. Compared to cloud servers, edge devices have limited computation and memory resources. Many deep-learning models are large and must be compressed, for example through quantization, to run on resource-limited edge devices (a toy compression sketch appears at the end of this subsection). Even if a compressed model can run on an edge device, its performance, such as the number of images processed per second, may be low. For battery-powered mobile devices, energy consumption is also a big concern, as running a deep-learning model consumes considerable power. Optimizing system performance and power consumption is a significant research problem. Another critical problem is model protection. In the AI era, models are valuable assets: it usually takes a huge amount of resources and effort to train a world-class deep-learning model. In cloud computing, models are stored and run in the cloud, so users cannot directly access them. On the edge, models are stored and run on local devices, so malicious users may copy the models by hacking the devices. Preventing piracy and illegal use of models thus becomes a critical problem.
It is worth noting that intelligent edge computing is not meant to replace cloud computing, but to complement it. With a huge number of edge devices deployed, the cloud might be called upon to address the key challenge of managing those distributed devices. By connecting every edge device to a cloud service, one may remotely deploy, manage, monitor, and update all the connected edge devices to ensure that all the devices run smoothly and to reduce management cost. The edge and the cloud may also work together to complete certain tasks. For example, in street surveillance, a smart edge camera may pre-process the raw video data for event detection and image filtering. Then, it would send the processed results to the cloud for more advanced processing, such as facial recognition and vehicle license plate recognition. As a result, all the resources of the edge and the cloud would be fully utilized for fast event detection, low network transmission, and on-demand deep video analysis. We believe that the edge and the cloud will continue to co-evolve to provide the best support for new applications and scenarios in the AI era.
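The division of labor in the surveillance example might be sketched as follows; the motion score, threshold, and upload function below are hypothetical placeholders rather than a real device or cloud API.
```python
# Sketch of edge/cloud collaboration for video analytics (hypothetical
# function names and thresholds; not a real API).
import numpy as np

EVENT_THRESHOLD = 12.0   # hypothetical motion-score threshold

def detect_event(frame, prev_frame):
    # Cheap on-device pre-processing: flag frames with significant change.
    return float(np.abs(frame - prev_frame).mean()) > EVENT_THRESHOLD

def send_to_cloud(frame):
    # Placeholder for uploading the frame to a cloud service that runs
    # heavier models (e.g., face or license-plate recognition).
    print("uploading %s frame for deep analysis" % (frame.shape,))

def edge_loop(frames):
    prev = frames[0]
    for frame in frames[1:]:
        if detect_event(frame, prev):    # most frames are filtered out here
            send_to_cloud(frame)         # only event frames consume bandwidth
        prev = frame

# Simulated grayscale video: mostly static, with one large scene change.
frames = [np.zeros((120, 160), dtype=np.float32) for _ in range(5)]
frames[3] = np.full((120, 160), 50.0, dtype=np.float32)
edge_loop(frames)   # uploads frames[3] and frames[4] (scene changes back)
```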
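Returning to the model-compression challenge mentioned earlier in this subsection, the sketch below applies toy post-training int8 quantization to a weight matrix; production toolchains use per-channel scales, calibration data, or quantization-aware training, and the tensor shape here is an arbitrary assumption.
```python
# Toy post-training quantization sketch: store float32 weights as int8 plus a
# per-tensor scale (illustrative only; real toolchains are more sophisticated).
import numpy as np

def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0          # map the largest weight to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(weights)

print("size reduction: %dx" % (weights.nbytes // q.nbytes))           # 4x
print("max abs error:  %.4f" % np.abs(weights - dequantize(q, scale)).max())
```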
Security, privacy, and trust as the foundation
Technology trend: Hardware-based security, blockchain (distributed consensus and applied cryptography), and software verification form the new foundation for security, privacy, and trust.
The convergence of the physical and virtual worlds, together with the decentralization of computing, demands a new foundation for security, privacy, and trust, as our physical presence and interactions are now widely captured in digital forms, with analytics, intelligence, transactions, and actions enabled autonomously. Systems, with security, privacy, and trust as core design pillars, serve as the foundation to return data ownership back to users, to run businesses in a compliant way, and to lower technology adoption barriers for underserved populations.
We believe that hardware-based security, blockchain, and software verification will be an integral part of such a new foundation. Hardware-based security offers a root of trust; we are already seeing this manifested in Intel's SGX and ARM's TrustZone. The underlying technology for blockchain includes novel Byzantine fault-tolerant consensus algorithms, hash-linked data structures, cryptographic building blocks such as zero-knowledge proofs and secure multi-party computation, smart contracts, and decentralized protocols and applications built around the principles of cryptoeconomics. These technologies will play a central role in enabling trusted, privacy-preserving, censorship-resistant collaboration, communications, and transactions among untrusted parties. Finally, much progress has been made toward the holy grail of verified software. A combination of trusted hardware and verified software will likely serve as a trusted computing base. Verified software is also a necessity for the wide adoption of emerging technologies such as smart contracts for blockchain applications.
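To make the hash-linked data structure at the heart of blockchains concrete, here is a minimal sketch of a chain in which each block commits to its predecessor's hash; real systems add Merkle trees over transactions, consensus, and digital signatures, all omitted here.
```python
# Minimal sketch of a hash-linked chain of blocks (illustrative only).
import hashlib
import json

def block_hash(block):
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, transactions):
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev_hash, "transactions": transactions})

def verify(chain):
    # Recompute every link; tampering with any earlier block breaks a link.
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

chain = []
append_block(chain, ["alice -> bob: 5"])
append_block(chain, ["bob -> carol: 2"])
print(verify(chain))                                 # True
chain[0]["transactions"][0] = "alice -> bob: 500"    # tamper with history
print(verify(chain))                                 # False
```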
- Blockchain has been taking the world by storm, soon after the first working blockchain system, Bitcoin, came into existence and showcased the huge potential of a decentralized, trustless, and neutral system. Hundreds, if not thousands, of cryptocurrency systems (alt-coins) quickly followed, ranging from simple imitations of Bitcoin with a different brand name, to systems with significant improvements in terms of functionality and performance. Then came Ethereum, which first realized smart contracts on a global-scale blockchain, making blockchain fully programmable. Smart contracts drastically reduced the process of issuing a new cryptocurrency or, more generally, creating a new decentralized application, from building a whole system and bootstrapping a global network to writing and deploying several lines of smart-contract code. This resulted in an explosion of new cryptocurrencies, new tokens representing physical and digital assets, and new protocols enabling new kinds of collaboration and transactions. Rooted in the open-source community, blockchain technology has also been quickly adopted and adapted by startups, established companies, governments, and all kinds of other organizations to streamline their existing business processes and to break new ground. Many proof-of-concept and pilot projects are rapidly expanding the scope of scenarios in which blockchain can play a key role and are, at the same time, pushing the boundaries of blockchain technology. With all the exploration being carried out in the blockchain space, blockchain itself is becoming a fast-moving collection of concepts, techniques, design principles, and systems.
The development and innovation of blockchain systems has several noteworthy themes:
1) Blockchain-based systems are moving away from a monolithic design, where networking, storage, and computation are tightly coupled, toward a layered, hierarchical, and sharded design, where the system and its state are broken into loosely coupled parts and unnecessary dependencies are minimized. In such designs, each node in the blockchain network is relieved of the burden of processing, storing, and transmitting every transaction. Transactions can also be processed in parallel, asynchronously, and even across different blockchain systems. This not only allows blockchain systems to scale up to handle higher transaction throughput, thereby alleviating one of the most severe limitations of current blockchain systems, but also enables devices with constrained resources, such as mobile phones and IoT devices, to participate in a blockchain network independently.
2) To comply with privacy-protection laws and content regulations, the design of the append-only, immutable datastore underlying blockchain systems is being revisited. The state transitions of a blockchain system must be verifiable without requiring an immutable full copy of the transaction history, which might contain private, inappropriate, or sensitive data whose removal has been requested by data owners, courts, or other stakeholders.
3) To innovate and iterate at a fast pace, the community has mostly employed a trial-and-error approach to the design and implementation of blockchain systems and smart contracts. To grow out of experimental, use-at-your-own-risk platforms into mainstream systems, formal verification methods are being applied to detect and remove bugs and security vulnerabilities before deployment, across the design and implementation of every part of the blockchain system, and tools are being developed to make writing and using smart contracts much less error-prone.
4) In current blockchain systems, new features, bug fixes, parameter settings, and protocol upgrades are mostly implemented through hard forks, which require all nodes to download and run new versions of the software. These hard forks are coordinated out of band and governed in an ad hoc manner. To become a sustainable platform, in-protocol governance and upgrade mechanisms are being tried so that decision-making and code deployment can be conducted efficiently and transparently on-chain.
The fast-evolving landscape of blockchain and the vast number of scenarios that are being researched will drive and accelerate the development of the underlying technologies, which span from distributed systems to applied cryptography. These technologies will become the foundation of trust in the future of invisible and connected systems that we are envisioning as they provide the needed security, privacy protection, and fundamental trust. The concept of smart contracts will also play an instrumental role in the digital transformation of society to push productivity and efficiency to new levels.
The revolution of hardware and systems architecture
Technology trend: Defining the boundary between hardware and software and determining the best partitioning of responsibilities between them are becoming important aspects of system architecture design.
Over the past few decades, the focus has always been on general-purpose hardware. The emergence of cloud computing, as well as deep learning, has led to a revolution in hardware: for example, hardware optimized for accelerating deep learning or for supporting major cloud workloads at better cost and performance. As a result, software-hardware co-design is an emerging trend in systems.
The advances of computer architecture are tightly coupled with the advances of semiconductor integrated circuit (IC) design. Computers are the biggest driving force behind the advances in ICs and also benefit the most from those advances. For more than half a century, IC technology has advanced rapidly, following Moore's Law. High-density chips have brought multiple revolutions in computer architecture, from desktops to mobile phones, and from servers to cloud computing. At the same time, the high demand for computing has driven fast advances in CPU, DRAM, GPU, and high-speed network chips. However, CMOS-based IC technology is reaching a bottleneck due to physical limits. On one hand, it is very hard for the feature size of chips to continue shrinking below 5 nanometers. On the other hand, power and cooling have become very challenging problems for modern chips. In this situation, what will the future of hardware and architecture hold?
For computing, heterogeneous systems are becoming a hot research topic in computer architecture. Deep-learning accelerators, FPGA-based reconfigurable hardware, programmable accelerators, and new types of application processors are emerging. On the memory side, there are innovations such as high-speed Non-Volatile Memory (NVM), High Bandwidth Memory (HBM), and memory disaggregation, each of which will help further improve computation performance by removing the bottlenecks in memory-access bandwidth and latency. In terms of persistent storage, NVM may also be the foundation of new Storage-Class Memory (SCM) with extremely low latency. Following the revolution in datacenter storage brought by SSDs, next-generation open-channel SSDs will further improve the I/O bandwidth of datacenter storage systems and reduce cost. Furthermore, storage systems also need architecture innovations similar to what is needed in memory disaggregation. Finally, on the networking side, new technologies such as low-latency lossless scalable networks, programmable switches, and programmable NICs will provide smoother interconnection of computations.
With such new hardware resources emerging in datacenters, we see a trend calling for fine-grained hardware disaggregation and sharing of computation, memory, and I/O (such as networking and storage) so as to improve efficiency and performance, which is crucial for cloud decentralization. Universal hardware virtualization can make datacenters controllable at the sub-component level through dedicated hardware with software-specified policies: all sub-components would be transparently addressable across the datacenter network and could therefore be shared at a much larger scale, with access subject to hardware-enforced quality of service.
Such capabilities provided by universal hardware virtualization could drastically improve the efficiency of datacenters, as they would remove the granularity issues that persist with standard server/rack or even cluster/datacenter assembly and usage techniques. Doing so would remove the current requirement of physical isolation to guarantee performance. It would increase security by providing explicit hardware isolation and enable the transparent sharing of expensive components, such as flash, across many servers while still providing low latency and high throughput. Unlike existing disaggregation approaches that enable the sharing of remote components, universal hardware virtualization also exploits server-level locality to achieve much more predictable performance, which could enable domains such as finance, telco cloud, and even, potentially, hard real-time, mission-critical applications.
Systems and intelligence intertwined
Technology trend: A new foundation for systems will surely leverage AI heavily in terms of its ability to capture statistical properties at scale, cope with uncertainty, and model complex interactions.
We will see the co-evolution of systems and artificial intelligence.
- System for AI. Artificial intelligence, especially deep learning, has shown great promise in areas such as vision, speech, and natural language processing. Its fast growth, however, has fragmented the field: quite a few different and incompatible deep-learning frameworks have emerged, with isolated and often redundant efforts, leading to overall inefficiency. There is an urgent need to introduce systems thinking to bring about order, simplicity, structure, and coherence, much like what the relational model did for the database field. For example, a comprehensive stack, from programming languages and compilers all the way to heterogeneous hardware for deep learning and AI, is needed to address the fragmentation in frameworks.
In the era of artificial intelligence, large-scale systems should support diversified workloads, from MapReduce tasks and graph computation to machine-learning algorithms and cutting-edge deep-learning models. This introduces new challenges to system interfaces and mechanisms for the efficient execution of different workloads.
- Automatic compilation and optimization for deep-learning frameworks. The front end of AI systems is becoming increasingly flexible as it moves towards supporting generic computation, while the back-end computing resources are becoming more and more powerful. To deal with the fast evolution of both the front and the back end, deep-learning frameworks with automatic compilation and optimization capabilities are critical to connecting the two ends. Deep-learning compilers usually start from a higher-level abstraction layer, involving metaprogramming, high-dimensional data structures, and automatic differentiation. There are many complicated optimization problems to be solved, including unified languages on both ends for end-to-end code optimization, model sparsity and compression for AI accelerators, automatic operator fusion, and fusion with other computation logic.
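As a toy illustration of operator fusion, one of the optimizations named above: the unfused version below runs two separate elementwise "kernels" and materializes an intermediate tensor, while the fused version does the same work in a single pass. A deep-learning compiler performs this kind of rewrite automatically on its intermediate representation; the specific operators and shapes here are arbitrary.
```python
# Toy illustration of operator fusion on elementwise operators.
import numpy as np

def add_kernel(a, b):
    return a + b                 # kernel 1: writes an intermediate tensor

def relu_kernel(t):
    return np.maximum(t, 0.0)    # kernel 2: reads the intermediate back

def unfused(a, b):
    # Two kernel launches, one intermediate tensor in memory.
    return relu_kernel(add_kernel(a, b))

def fused(a, b):
    # One traversal of the inputs; output written directly, no intermediate.
    out = np.empty_like(a)
    for i in range(a.size):
        out.flat[i] = max(a.flat[i] + b.flat[i], 0.0)
    return out

a = np.random.randn(4, 4)
b = np.random.randn(4, 4)
print(np.allclose(unfused(a, b), fused(a, b)))   # True
```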
In the AI era, large-scale computing systems must not only process huge amounts of data on massive numbers of devices with high performance, but also support the mixed execution of different types of tasks. In the future, the boundaries between different tasks will blur; many data-analytics applications will, in practice, become combinations of different computation tasks. The complexity of such systems will impose new challenges, such as the abstraction of system interfaces, the design of execution mechanisms, and system-wide global optimization. At the same time, these systems will be able to manage and utilize computing resources and make it possible to optimize applications as end-to-end computation pipelines. This calls for more research on cross-task scheduling and optimization; cross-task, large-scale systems will become a new direction for systems research.
- AI for systems. Systems are becoming increasingly complicated and are growing dramatically in scale. The traditional methodologies used to determine the correctness, reliability, and performance of a system are becoming outdated and ineffective. A new foundation for systems will surely leverage AI heavily for its ability to capture statistical properties at scale, cope with uncertainty, and model complex interactions.
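As a small, self-contained illustration of this direction, the sketch below flags anomalous request latencies with a simple statistical model; real AI-for-systems work applies much richer learned models to logs, traces, and telemetry, and the window size and threshold here are arbitrary assumptions.
```python
# Minimal sketch of data-driven system monitoring: flag latency samples that
# deviate strongly from recent behavior (a simple z-score model).
from statistics import mean, stdev

def detect_anomalies(latencies_ms, window=20, z_threshold=3.0):
    anomalies = []
    for i in range(window, len(latencies_ms)):
        history = latencies_ms[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(latencies_ms[i] - mu) / sigma > z_threshold:
            anomalies.append((i, latencies_ms[i]))
    return anomalies

# Steady ~10 ms latencies with one spike injected at index 30.
latencies = [10.0 + 0.1 * (i % 5) for i in range(40)]
latencies[30] = 250.0
print(detect_anomalies(latencies))   # [(30, 250.0)]
```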
In conclusion, we envision that the co-evolution of the above technologies will push forward the innovation of future systems, acting as the driving force that makes computing ubiquitous and invisible. There is still a long road ahead for systems to evolve in order to fully support the ever-increasing demands of the diverse applications emerging on the horizon. Much research remains to be done, and both industry and academia are investing heavily in many of the aforementioned areas. We foresee an exciting and very productive future ahead of us in the general area of computer systems research.