Astra is a compilation and execution framework that optimizes the execution of a deep learning training job. Instead of treating the computation as a generic data-flow graph, Astra exploits domain knowledge about deep learning training to adopt a custom approach to compiler optimization.
A key insight in Astra is to exploit the unique repetitiveness and predictability of a deep learning training job to perform online exploration of the optimization state space in a work-conserving manner, i.e., while continuing to make progress on the training job itself. This dynamic state-space exploration and adaptation uses lightweight profiling and indexing of profile data, coupled with several techniques to prune the exploration space. Effectively, the execution layer custom-wires the infrastructure end-to-end for each job and hardware platform, while keeping the compiler simple: it obviates the need to build and maintain accurate performance models of fast-changing hardware architectures and deep learning workloads.
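To make the idea of work-conserving online exploration concrete, the sketch below illustrates one way such a loop could look; it is not Astra's actual implementation, and the `trainer` interface, candidate list, probe length, and pruning threshold are all assumptions made for illustration. Each candidate optimization configuration is profiled on real mini-batches, so every measurement also advances training, and candidates measurably slower than the best observed are pruned from the state space.

```python
import time

def explore_online(trainer, candidates, steps_per_probe=3, prune_factor=1.2):
    """Work-conserving exploration: every profiled step is a real training step.

    `trainer` is a hypothetical object exposing
        trainer.configure(cand)  # apply one optimization choice (e.g., a kernel/schedule)
        trainer.step()           # run one mini-batch of actual training
    """
    timings = []                      # lightweight profile index: (candidate, sec/step)
    best = float("inf")

    for cand in candidates:
        trainer.configure(cand)
        start = time.perf_counter()
        for _ in range(steps_per_probe):
            trainer.step()            # a real training step, so the job keeps progressing
        per_step = (time.perf_counter() - start) / steps_per_probe
        timings.append((cand, per_step))
        best = min(best, per_step)

    # Prune the state space: drop candidates clearly slower than the best observed.
    survivors = [(c, t) for c, t in timings if t <= prune_factor * best]

    # Commit to the fastest surviving configuration for the rest of the job.
    winner, _ = min(survivors, key=lambda ct: ct[1])
    trainer.configure(winner)
    return winner, timings
```

Because successive training iterations are nearly identical, a few profiled steps per candidate are assumed to be representative of steady-state performance, which is what makes this kind of online, measurement-driven scheme viable without an offline performance model.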