Federated self-supervised learning for acoustic event classification
Standard acoustic event classification (AEC) solutions require large-scale collection of customer data from client devices for model optimization, which inevitably risks compromising customer privacy. Federated learning (FL) is a compelling framework that decouples data collection from model training to protect customer privacy. In this work, we investigate the feasibility of applying FL to improve AEC performance under the strict constraint that no customer data can be uploaded to the server. We further assume that no pseudo labels can be inferred from on-device user inputs, in line with typical AEC use cases. We adapt self-supervised learning to the FL framework for on-device continual learning of representations. By training representation encoders on a growing and increasingly diverse pool of local customer data, we demonstrate improved performance of the downstream AEC classifiers without any labeled or pseudo-labeled data. Compared to a baseline without FL, the proposed method improves precision by up to 20.3% relative while maintaining recall. Our work differs from prior work on FL in that our approach does not require user-generated learning targets, and we use internal data from Amazon Alexa to closely simulate production settings.
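The training loop described above, where clients refine a shared representation encoder on unlabeled local audio and a server aggregates the updates, can be sketched as follows. This is a hypothetical, minimal illustration, not the paper's implementation: the encoder is reduced to a single linear map, the self-supervised objective is stood in for by linear denoising reconstruction (the paper's actual SSL objective and model are not specified here), and aggregation uses plain federated averaging (FedAvg).

```python
import numpy as np

def local_ssl_update(W, X, lr=0.05, steps=30, noise=0.1, seed=0):
    """Refine encoder W on a client's unlabeled data X with a simple
    self-supervised proxy task: reconstruct clean inputs from noisy ones.
    No labels or pseudo labels are required, matching the paper's constraint."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        Xn = X + noise * rng.standard_normal(X.shape)  # corrupt the inputs
        G = 2 * Xn.T @ (Xn @ W - X) / len(X)           # grad of mean ||Xn W - X||^2
        W = W - lr * G                                  # local gradient step
    return W

def fedavg_ssl(W_global, client_datasets, rounds=5):
    """One FL loop: broadcast the encoder, run local self-supervised
    training on each client, then average the returned weights (FedAvg).
    Raw client data never leaves the device; only weights are exchanged."""
    for r in range(rounds):
        updates = [local_ssl_update(W_global, X, seed=r * 100 + i)
                   for i, X in enumerate(client_datasets)]
        W_global = np.mean(updates, axis=0)  # server-side aggregation
    return W_global
```

In a full system, the resulting encoder would feed a downstream AEC classifier trained centrally on a small labeled set; here the sketch only shows the privacy-preserving representation-learning half.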