Differential data quality verification for partitioned data
Modern companies and institutions rely on data to guide their decisions. Missing or incorrect information seriously compromises any decision process. In previous work, we presented Deequ, a Spark-based library for automating the verification of data quality at scale. Deequ provides a declarative API that combines common quality constraints with user-defined validation code, and thereby enables unit tests for data. However, we found that the previous computational model of Deequ is not flexible enough for many scenarios in modern data pipelines, which handle large, partitioned datasets. Such scenarios require the evaluation of dataset-level quality constraints after individual partition updates, without having to re-read already processed partitions. Additionally, such scenarios often require the verification of data quality on selected combinations of partitions. We therefore present a differential generalization of the computational model of Deequ, based on algebraic states with monoid properties. We detail how to efficiently implement the corresponding operators and aggregation functions in Apache Spark. Furthermore, we show how to optimize the resulting workloads to minimize the required number of passes over the data, and empirically validate that our approach reduces the runtime of updating data metrics when data changes and of computing them for different combinations of partitions.
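To illustrate the idea of algebraic states with monoid properties, the following is a minimal sketch in plain Python. It is not Deequ's API and does not use Spark. The names `CompletenessState`, `state_of`, and `merge_all`, as well as the partition keys and values, are hypothetical and chosen only for illustration. Each partition is summarized once into a small state, states are combined with an associative merge that has an identity element, and the dataset-level metric is derived from the merged state, so partitions that were already processed need not be re-read.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Optional


@dataclass(frozen=True)
class CompletenessState:
    """Hypothetical state for the fraction of non-null values in a column."""
    non_null: int = 0  # the default instance (0, 0) is the identity element
    total: int = 0

    def merge(self, other: "CompletenessState") -> "CompletenessState":
        # Associative and commutative: component-wise addition of counts.
        return CompletenessState(self.non_null + other.non_null,
                                 self.total + other.total)

    def metric(self) -> float:
        # Derive the metric from the state; undefined for empty data.
        return self.non_null / self.total if self.total else float("nan")


def state_of(values: Iterable[Optional[str]]) -> CompletenessState:
    """Summarize one partition with a single pass over its values."""
    values = list(values)
    return CompletenessState(sum(v is not None for v in values), len(values))


def merge_all(states: Iterable[CompletenessState]) -> CompletenessState:
    """Fold any collection of partition states into one state."""
    return reduce(CompletenessState.merge, states, CompletenessState())


# Made-up partitions keyed by date; each is scanned exactly once.
partitions = {
    "2024-01-01": ["a", None, "c", "d"],
    "2024-01-02": ["e", "f"],
}
states = {key: state_of(rows) for key, rows in partitions.items()}
overall = merge_all(states.values()).metric()

# A new partition arrives: only its own rows are read.
states["2024-01-03"] = state_of([None, None, "g", "h"])
updated = merge_all(states.values()).metric()

# The metric for a chosen combination of partitions reuses stored states.
subset = merge_all([states["2024-01-01"], states["2024-01-03"]]).metric()
```

Because the merge is associative and has an identity, the states can be combined in any grouping, which is what allows both incremental updates and metrics over arbitrary combinations of partitions. Metrics that cannot be expressed as simple counts would need a different state, and this sketch makes no claim about how the paper represents them.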