Adaptive Caching in Big SQL using the HDFS Cache

  • ,
  • Nimrod Megiddo ,
  • Navneet Potti ,
  • Fatma Özcan ,
  • Uday Kale ,
  • Jan Schmitz-Hermes

ACM Symposium on Cloud Computing (ACM SOCC) |

Published by ACM

Publication

The memory and storage hierarchy in database systems is currently undergoing a radical evolution in the context of Big Data systems. SQL-on-Hadoop systems share data with other applications in the Big Data ecosystem by storing their data in HDFS, using open file formats. However, they do not provide automatic caching mechanisms for storing data in memory. In this paper, we describe the architecture of IBM Big SQL and its use of the HDFS cache as an alternative to the traditional buffer pool, allowing in-memory data to be shared with other Big Data applications. We design novel adaptive caching algorithms for Big SQL tailored to the challenges of such an external cache scenario. Our experimental evaluation shows that only our adaptive algorithms perform well for diverse workload characteristics, and are able to adapt to evolving data access patterns. Finally, we discuss our experiences in addressing the new challenges imposed by external caching and summarize our insights about how to direct ongoing architectural evolution of external caching mechanisms.