Insert-optimized implementation of streaming data sketches
2025
We present insert-optimized implementations of three fundamental data sketching algorithms: Count Sketch (CS), SpaceSaving (SS), and Karnin-Lang-Liberty (KLL).While these sketches are widely used for approximate query processing and stream analytics, their practical insert performance often falls short of their full potential. Through careful engineering and novel implementation strategies, we achieve substantial throughput improvements over both naïve and existing implementations. Our approach demonstrates speedups of up to 12.30x, 2.00x, and 1.52x for CS, SS, and KLL respectively when compared to the industry-standard Apache DataSketches library. When measured against naïve implementations, we achieve even more dramatic improvements: 8.63x for CS, 7.03x for SS, and 446.00x for KLL. We also measure against available open-source implementations used as baselines in other works and outperform them by a large margin. We detail the technical optimizations enabling these improvements, including fast hash range reduction and hash sharing for Count Sketch, SIMD vectorization for SpaceSaving, and preallocation and branching improvements for Karnin-Lang-Liberty. Our implementations maintain the theoretical guarantees of the original algorithms while providing substantially better practical insert performance, making them particularly valuable for high-throughput streaming applications where update speed is critical.
Research areas