Optimally leveraging density and locality for exploratory browsing and sampling

HILDA@SIGMOD |

Exploratory data analysis often involves repeatedly browsing a small sample of records that satisfy certain predicates. We propose a fast query evaluation engine, called NeedleTail, aimed at letting analysts browse a subset of the query result on large datasets as quickly as possible, independent of the overall size of the result. NeedleTail introduces DensityMaps, a lightweight in-memory indexing structure, and a set of efficient and theoretically sound algorithms to quickly locate promising blocks, trading off locality and density. In settings where the samples are used to compute aggregates, we extend techniques from survey sampling to mitigate the bias in our samples. Our experimental results demonstrate that NeedleTail returns results 7× faster on average on HDDs while occupying up to 23× less memory than existing techniques.