ClusterClean: A weak semi-supervised approach for cleaning data labels
2019
Every day, tens of thousands of Amazon customers message Amazon’s selling partners to inquire about orders, products and services. Unfortunately, we have observed over time that some buyers send unsolicited and often ill-intentioned messages to sellers through the buyer-seller messaging (BSM) service. Although the BSM service lets sellers report such messages, most sellers do not use this feature. As a result, it is extremely challenging to collect training and testing data with clean labels for building machine learning models that proactively block unsolicited messages and help prevent and mitigate losses for Amazon. To address this problem, we propose ClusterClean, an algorithm that automatically cleans data labels with little to no human effort. ClusterClean can a) accurately infer the labels of unlabeled data points from a small initial labeled set and b) detect new data patterns, such as previously unobserved spam attacks, and solicit feedback from users to decide on their labels. Experiments on approximately 150,000 real messages from Amazon customers showed that ClusterClean can accurately clean the labels of these messages in a few minutes, drastically reducing the human effort and time spent on this task.
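The abstract does not specify the clustering method, the message features, or the feedback mechanism, so the following is only an illustrative sketch of the two capabilities described above, not the ClusterClean algorithm itself. It assumes a simple greedy clustering over word-overlap (Jaccard) similarity; the function names, the 0.5 threshold, and the example messages are all hypothetical. Clusters that contain seed labels receive the majority seed label, and clusters with no seed labels are returned for human review as possible new patterns.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_messages(messages, threshold=0.5):
    """Greedy clustering (an assumption, not the paper's method): each
    message joins the first cluster whose first member is similar enough,
    otherwise it starts a new cluster. Returns lists of message indices."""
    leaders = []   # token set of the first message in each cluster
    clusters = []  # message indices per cluster
    for i, text in enumerate(messages):
        tokens = set(text.lower().split())
        for c, leader in enumerate(leaders):
            if jaccard(tokens, leader) >= threshold:
                clusters[c].append(i)
                break
        else:
            leaders.append(tokens)
            clusters.append([i])
    return clusters


def clean_labels(messages, seed_labels, threshold=0.5):
    """seed_labels maps a few message indices to known labels.

    Returns (inferred, needs_review):
      inferred      maps message index -> label for every message in a
                    cluster that contains at least one seed label
      needs_review  lists clusters with no seed label, i.e. candidate
                    new patterns to show to a human
    """
    inferred = {}
    needs_review = []
    for members in cluster_messages(messages, threshold):
        votes = Counter(seed_labels[i] for i in members if i in seed_labels)
        if votes:
            majority = votes.most_common(1)[0][0]
            for i in members:
                # Seed labels are kept as given; other members get the majority.
                inferred[i] = seed_labels.get(i, majority)
        else:
            needs_review.append(members)
    return inferred, needs_review


# Hypothetical example data, not taken from the paper.
messages = [
    "where is my order",
    "where is my order please",
    "win a free gift card now",
    "win a free gift card today",
    "click this link for cheap watches",
    "click this link for cheap watches now",
]
seed_labels = {0: "legitimate", 2: "spam"}
inferred, needs_review = clean_labels(messages, seed_labels)
```

In this toy run, messages 1 and 3 inherit the labels of the seeds they cluster with, while the last two messages form a cluster with no seed label and are set aside for review. A realistic system would need a stronger text representation and clustering method than word overlap.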