Freeing Compute Caches from Serialization and Garbage Collection in Managed Big Data Analytics

Abstract

Managed analytics frameworks (e.g., Spark) cache intermediate results in memory (on-heap) or on storage devices (off-heap) to avoid costly recomputations, especially in graph processing. As datasets grow, on-heap caching requires more memory for long-lived objects, resulting in high garbage collection (GC) overhead. Off-heap caching, on the other hand, moves cached objects to the storage device, which reduces GC overhead at the cost of serialization and deserialization (S/D). In this work, we propose TeraHeap, a novel approach for providing large analytics caches. TeraHeap uses two heaps within the JVM: (1) a garbage-collected heap for ordinary Spark objects and (2) a large heap memory-mapped over fast storage devices for cached objects. TeraHeap eliminates both S/D and GC over cached data without imposing any language restrictions. We implement TeraHeap in Oracle’s Java runtime (OpenJDK-1.8). We use five popular, memory-intensive graph analytics workloads to understand S/D and GC overheads and to evaluate TeraHeap. TeraHeap improves total execution time compared to state-of-the-art Apache Spark configurations by up to 72% and 81% for NVMe SSD and non-volatile memory, respectively. Furthermore, TeraHeap requires 8x less DRAM capacity to provide performance comparable to or higher than native Spark. This paper opens up emerging memory and storage devices for practical use in scalable analytics caching.
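
To make the contrast between S/D-based off-heap caching and memory-mapped access concrete, the following is a minimal Java sketch. It is not TeraHeap's implementation, which lives inside the JVM and keeps ordinary objects in the mapped heap. The class `CacheSketch` and its methods are hypothetical names introduced only for illustration, and the sketch uses a temporary file as a stand-in for a fast storage device. It shows that a serialized cache must rebuild an on-heap copy of the data on every read, whereas a file-backed mapping can be read in place.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.LongBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical illustration; none of these names come from TeraHeap or Spark.
class CacheSketch {

    // S/D-style caching: the array is converted to bytes, then a fresh
    // on-heap copy is rebuilt from those bytes when the data is read back.
    static long[] serializeRoundTrip(long[] data) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(data);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (long[]) in.readObject();
        }
    }

    // Memory-mapped caching: values are stored once in a file-backed mapping
    // and then read in place, with no deserialization step.
    static long sumViaMappedFile(long[] data) throws IOException {
        Path file = Files.createTempFile("cache-sketch", ".bin");
        file.toFile().deleteOnExit();
        int sizeInBytes = data.length * Long.BYTES;
        Files.write(file, new byte[sizeInBytes]); // pre-size the backing file
        try (FileChannel channel = FileChannel.open(
                 file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mapped =
                channel.map(FileChannel.MapMode.READ_WRITE, 0, sizeInBytes);
            LongBuffer longs = mapped.asLongBuffer();
            longs.put(data); // populate the mapped region
            long sum = 0;
            for (int i = 0; i < data.length; i++) {
                sum += longs.get(i); // absolute read straight from the mapping
            }
            return sum;
        }
    }

    public static void main(String[] args) throws Exception {
        long[] data = {1L, 2L, 3L, 4L};
        long[] copy = serializeRoundTrip(data);
        System.out.println("equal contents: " + Arrays.equals(data, copy));
        System.out.println("same object: " + (copy == data));
        System.out.println("sum via mapping: " + sumViaMappedFile(data));
    }
}
```

In this sketch, the round trip returns a distinct array with equal contents, which is the extra allocation and copying that S/D implies. The mapped variant still copies primitive values into a buffer by hand, a restriction that the abstract says TeraHeap avoids by placing cached objects directly in the memory-mapped heap.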