Cloud data warehouses are today’s standard for analytical query processing. Multiple cloud vendors offer state-of-the-art systems, such as Amazon Redshift. We have observed that customer workloads exhibit highly repetitive query patterns, i.e., users and systems frequently send the same queries. To speed up these repeated queries, most systems rely on techniques such as result caches or materialized views.
However, these caches are often stale due to inserts, deletes, or updates that occur between query repetitions. We propose a novel secondary index, predicate caching, to improve query latency for repeating scans and joins. Predicate caching stores ranges of qualifying tuples of base table scans. Such an index can be built on the fly, is lightweight, and can be kept online without recomputation.
We implemented a prototype of this idea in the cloud data warehouse Amazon Redshift. Our evaluation shows that predicate caching improves query runtimes by up to 10x on selected queries with negligible build overhead.
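To make the idea of caching ranges of qualifying tuples concrete, the following is a minimal Python sketch and not the Redshift implementation. The class name `PredicateCache`, its methods, and the string cache key are all hypothetical. It assumes an append-only, in-memory table in which new rows are only ever added at the end, so cached ranges stay valid and only the newly appended tail has to be scanned on a repeated query. Deletes and in-place updates are not modeled here.

```python
from typing import Callable, Dict, List, Tuple

Row = dict
Range = Tuple[int, int]  # half-open row-index range [start, end)


class PredicateCache:
    """Hypothetical sketch: maps a scan predicate to ranges of qualifying rows."""

    def __init__(self) -> None:
        # key -> (qualifying ranges, number of table rows already covered)
        self._entries: Dict[str, Tuple[List[Range], int]] = {}
        self.rows_examined = 0  # counter used only to illustrate the savings

    def ranges(self, key: str) -> List[Range]:
        """Return the cached ranges for a predicate key (empty if unknown)."""
        return list(self._entries.get(key, ([], 0))[0])

    def scan(self, table: List[Row], key: str,
             pred: Callable[[Row], bool]) -> List[Row]:
        """Scan `table` for rows matching `pred`, using and extending the cache."""
        ranges, covered = self._entries.get(key, ([], 0))
        result: List[Row] = []

        # 1. Revisit only the cached ranges. The predicate is re-checked,
        #    so a cached range may safely be a superset of the matches.
        for start, end in ranges:
            for i in range(start, end):
                self.rows_examined += 1
                if pred(table[i]):
                    result.append(table[i])

        # 2. Scan rows appended since the entry was last touched and
        #    extend the entry on the fly instead of rebuilding it.
        new_ranges = list(ranges)
        for i in range(covered, len(table)):
            self.rows_examined += 1
            if pred(table[i]):
                result.append(table[i])
                if new_ranges and new_ranges[-1][1] == i:
                    # Adjacent to the previous range: grow it by one row.
                    new_ranges[-1] = (new_ranges[-1][0], i + 1)
                else:
                    new_ranges.append((i, i + 1))

        self._entries[key] = (new_ranges, len(table))
        return result
```

In this sketch, the first `scan` for a given key reads the whole table and records the qualifying ranges as a side effect. Later scans with the same key touch only those ranges plus any rows appended in the meantime, which is one simple way an entry could stay usable after inserts without being recomputed.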
Predicate caching: Query-driven secondary indexing for cloud data warehouses
2024