Instalytics (Intelligent Store-powered Analytics) is a vertically integrated infrastructure stack that enables efficient big data analytics in large-scale data centers through careful co-design of the storage layer (cluster file system) with the compute layer (query engine and job scheduler).
As an example of the benefits of such co-design, Instalytics amplifies the well-known benefits of data partitioning in analytics systems. Instead of traditional partitioning on one dimension, Instalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, so a larger fraction of queries can benefit from partition filtering and from joins without network shuffle. To achieve this, Instalytics uses compute-awareness to customize the 3-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables Instalytics to preserve the same recovery cost and availability as traditional replication. As another example of compute-awareness, the file system in Instalytics exposes a new sliced-read API that improves join performance by enabling multiple compute nodes to read slices of a data block efficiently, through coordinated request scheduling and selective caching at the storage nodes.
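The idea of partition filtering over differently partitioned replicas can be illustrated with a small sketch. This is a deliberately simplified model, not the Instalytics layout: it assumes one partitioning column per replica and modulo partitioning on integer keys, and it does not attempt to show how Instalytics reaches four dimensions at the cost of 3-way replication. The names `Replica` and `plan_scan` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Replica:
    """One copy of a dataset, partitioned on a single column (assumption)."""
    partition_column: str
    num_partitions: int

    def partition_of(self, value: int) -> int:
        # Modulo partitioning on integer keys, chosen only for simplicity.
        return value % self.num_partitions


def plan_scan(replicas: List[Replica], filter_column: str,
              filter_value: int) -> Tuple[Replica, List[int]]:
    """Pick a replica and the partitions to read for an equality filter.

    If some replica is partitioned on the filter column, only one of its
    partitions needs to be read. Otherwise fall back to a full scan of
    the first replica.
    """
    for replica in replicas:
        if replica.partition_column == filter_column:
            return replica, [replica.partition_of(filter_value)]
    fallback = replicas[0]
    return fallback, list(range(fallback.num_partitions))


# Three copies of the same data, each partitioned on a different column.
replicas = [
    Replica("user_id", 8),
    Replica("region_id", 8),
    Replica("day", 8),
]

chosen, partitions = plan_scan(replicas, "region_id", 13)
full_scan_replica, all_partitions = plan_scan(replicas, "price", 3)
```

In this toy model, a filter on `region_id` reads a single partition of the replica partitioned on that column, while a filter on a column that no replica is partitioned on falls back to scanning every partition. The more dimensions the copies cover, the fewer queries hit that fallback, which is the effect the description above attributes to partitioning on multiple dimensions.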
People
Kaushik Rajan
Principal Researcher