Using remote memory for the Java heap enables big data analytics frameworks to process large datasets. However, the Java Virtual Machine (JVM) runtime struggles to keep network traffic low during garbage collection (GC) while still reclaiming space efficiently. To reduce GC cost in big data analytics, systems group long-lived objects into regions and exclude them from frequent GC scans, regardless of whether the heap resides in local or remote memory. Recent work uses a dual-heap design, placing short-lived objects in a local heap and long-lived objects in a remote region-based heap, which limits GC activity to the local heap. However, these systems avoid scanning by reclaiming remote heap space only when a region consists entirely of garbage, an inefficient strategy that delays reclamation and risks out-of-memory (OOM) errors. In this paper, we propose SmartSweep, a system that uses approximate liveness information to balance network traffic and space reclamation in remote heaps. SmartSweep adopts a dual-heap design and avoids scanning or compacting objects in the remote heap. Instead, it estimates the amount of garbage in each region without accessing the remote heap and selectively transfers regions with many garbage objects back to the local heap for reclamation. Preliminary results with Spark and Neo4j show that SmartSweep achieves performance comparable to TeraHeap, which reclaims remote objects lazily, while reducing peak remote memory usage by up to 49% and avoiding OOM errors.
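
The region-selection idea described above can be illustrated with a minimal sketch. It is not SmartSweep's implementation: the abstract does not say how garbage is estimated or how regions are ranked, so the `RegionStats` fields, the `threshold` parameter, and the `budget_bytes` cap on transferred data are all hypothetical. The sketch only assumes that the local side keeps an approximate per-region garbage count, so a decision can be made without touching the remote heap.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Local-side metadata for one remote region (hypothetical layout)."""
    region_id: int
    used_bytes: int         # bytes allocated in the region
    est_garbage_bytes: int  # approximate dead bytes, tracked locally

    @property
    def garbage_ratio(self) -> float:
        # Empty regions have nothing to reclaim.
        if self.used_bytes == 0:
            return 0.0
        return self.est_garbage_bytes / self.used_bytes


def select_regions(regions, threshold=0.5, budget_bytes=None):
    """Return ids of regions to transfer back to the local heap.

    Regions whose estimated garbage ratio is at least `threshold` are
    candidates. They are considered most-garbage-first, and a region is
    skipped if moving it would exceed the optional transfer budget.
    """
    candidates = sorted(
        (r for r in regions if r.garbage_ratio >= threshold),
        key=lambda r: r.garbage_ratio,
        reverse=True,
    )
    chosen, moved = [], 0
    for r in candidates:
        # Assumption: transferring a region costs its full used size.
        if budget_bytes is not None and moved + r.used_bytes > budget_bytes:
            continue
        chosen.append(r.region_id)
        moved += r.used_bytes
    return chosen


if __name__ == "__main__":
    stats = [
        RegionStats(0, 100, 10),
        RegionStats(1, 100, 90),
        RegionStats(2, 100, 60),
    ]
    print(select_regions(stats))
```

In this sketch, a higher `threshold` trades slower reclamation for less network traffic, which is the balance the abstract describes; the fully-garbage-only policy of earlier systems corresponds to `threshold=1.0`.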