Amazon takes top three spots in Audio Anomaly Detection Challenge
Team from Amazon Web Services also wins the best-paper award at the Workshop on Detection and Classification of Acoustic Scenes and Events.
This week at Amazon Web Services’ re:Invent 2020 conference, Amazon announced Amazon Monitron, an end-to-end machine-monitoring system composed of sensors, a gateway, and a machine learning model that detects anomalies in vibration (structure-borne sound) or temperature and predicts when equipment may require maintenance.
Machine condition monitoring was also the topic of a challenge at the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020) in November, in which Amazon took the top three spots out of 117 submissions.
The challenge was to determine whether the sounds emitted by a machine — such as a fan, pump, or valve — were normal or anomalous. Forty academic and industry teams submitted entries, an average of almost three submissions per team.
In a pair of papers (paper 1|paper 2) we presented at the workshop, we describe the two different neural-network-based approaches we took in our submissions to the challenge. The first of those papers won the workshop’s best-paper award.
Auditory machine condition monitoring has been common in industrial settings for several decades. Seasoned maintenance experts can identify problems in the machines they monitor just by listening to them and realizing that “something doesn’t sound right.” But by the time anomalies are audible to the human ear, the underlying problems may already be well advanced.
With the advent of machine learning and big data, there has been considerable interest in teaching machines to detect anomalies sooner, to help predict when preventative maintenance might be necessary.
Data, labels, and rare failures
In general, anomaly detection is the problem of identifying abnormal inputs in a stream of inputs. Depending on the available data, there are three different ways to train anomaly detection systems: (i) fully supervised, in which labeled examples of normal and abnormal data are presented; (ii) semi-supervised, in which only normal data is presented; and (iii) unsupervised, in which there are no labels in the data set, and outliers have to be classified automatically.
Anomalies can manifest themselves in different ways. For instance, a signal can drift slowly over time (concept drift), or it can produce sudden, isolated outliers. Typically, the data is also highly imbalanced, with far more “normal” examples than “abnormal” ones.
Machines worth monitoring carefully — especially those that are critical or expensive — are usually also well maintained. This means that they rarely fail, and gathering anomalous data from them is challenging and may take many years and lots of effort.
Additionally, machines operate in different modes and under variable load or performance conditions, and their characteristics can change over time as they age and approach steady state. Some industries’ operational profiles have seasonal variations as well.
All of these factors make anomaly detection challenging in the industrial setting. When implementing an anomaly detection system, one has to depend mostly on “normal” data, gathering additional data over time and eliciting user feedback.
If accurate physical models of machines are available, it may be possible to simulate failures and generate “abnormal” data that way. One can also generate anomalous data by inducing hardware failures in the lab. But one has to be prepared to work with minimal data when a machine is instrumented for the first time (the so-called cold-start problem).
Anomaly detection and our two neural approaches
The first approach builds on recent advances in autoregressive neural density estimation, that is, estimating a data distribution by trying to predict each new data item on the basis of those that preceded it. As might be expected, such models are very sensitive to the order in which data arrives.
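In symbols, an autoregressive density estimator relies on the chain rule of probability. For an input x with D features taken in a fixed order (the notation is ours, introduced only for illustration), the joint density factorizes as:

```latex
p(\mathbf{x}) = \prod_{d=1}^{D} p\left(x_d \mid x_1, \dots, x_{d-1}\right)
```

A natural anomaly score under such a model is the negative log-likelihood, $-\log p(\mathbf{x})$, so that inputs the model finds improbable score high. That scoring choice is a common convention and an assumption here, not a detail taken from the papers.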
An earlier model, called the masked autoencoder for density estimation (MADE), makes a separate prediction for each feature — each dimension — of the input. With audio signals, however, the dimensions of the input are the energies in different frequency bands, which produce a composite picture of the signal that individual frequencies won’t capture.
We introduce a variation of MADE that bases its predictions on groups of input features — in this case, groups of frequency bands — and which we accordingly call Group MADE.
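To make the grouping idea concrete, here is a minimal sketch of how the connection masks of a one-hidden-layer masked autoencoder could be built when the ordering is over groups rather than individual features. It follows the standard MADE masking rule and is not the paper's implementation; the function names, the group sizes, and the single hidden layer are all assumptions chosen for illustration.

```python
import random

def group_made_masks(group_sizes, n_hidden, seed=0):
    """Build binary masks for a one-hidden-layer MADE ordered over groups.

    group_sizes: number of features in each group (at least two groups).
    Returns (in_deg, mask_in, mask_out), where mask_in[k][i] gates the
    input i -> hidden k weight and mask_out[o][k] gates hidden k -> output o.
    """
    rng = random.Random(seed)
    n_groups = len(group_sizes)
    # Every feature inherits the degree (1..G) of the group it belongs to.
    in_deg = [g + 1 for g, size in enumerate(group_sizes) for _ in range(size)]
    # Hidden units get degrees in 1..G-1, as in standard MADE.
    hid_deg = [rng.randint(1, n_groups - 1) for _ in range(n_hidden)]
    # Hidden unit k may see input i only if deg(k) >= deg(i).
    mask_in = [[int(hk >= d) for d in in_deg] for hk in hid_deg]
    # Output o may see hidden unit k only if deg(o) > deg(k).
    mask_out = [[int(d > hk) for hk in hid_deg] for d in in_deg]
    return in_deg, mask_in, mask_out

def connectivity(mask_in, mask_out):
    """conn[o][i] is 1 if some masked path links input i to output o."""
    n_in = len(mask_in[0])
    n_hid = len(mask_in)
    return [[int(any(row[k] and mask_in[k][i] for k in range(n_hid)))
             for i in range(n_in)] for row in mask_out]

# Hypothetical layout: 7 frequency bands split into groups of 2, 3, and 2.
in_deg, mask_in, mask_out = group_made_masks([2, 3, 2], n_hidden=16)
conn = connectivity(mask_in, mask_out)
```

With these masks, the prediction for any band can depend only on bands in strictly earlier groups, so bands within one group are modeled together rather than one at a time.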
In the second paper, we use a self-supervised approach for representation learning, which has been successful recently in solving problems in vision and speech. We believe that we are the first to apply it to audio anomaly detection.
In the absence of anomalies in the training data, we instead trained a network to distinguish among multiple instances of machines within a given machine type. We found that the features learned by such a network were sensitive enough to detect subtle, previously unseen anomalies in the evaluation set. We used spectral warping and random mixing to simulate new machine instances in addition to the ones provided in the dataset.
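The two augmentations can be illustrated with a small sketch. The exact recipes are not given in this article, so everything below is an assumption: `warp_spectrum` stretches or compresses one spectral frame along the frequency axis by linear interpolation, and `random_mix` blends two equal-length clips with a random weight. Both function names and the weight range are hypothetical.

```python
import random

def warp_spectrum(frame, alpha):
    """Warp one spectral frame along frequency by factor alpha.

    alpha > 1 stretches the spectrum toward higher bins, alpha < 1
    compresses it; output bins with no source content are set to 0.
    """
    n = len(frame)
    warped = []
    for k in range(n):
        src = k / alpha          # fractional source bin for output bin k
        lo = int(src)
        if lo >= n - 1:
            # Exactly on the last bin, or past the top edge of the frame.
            warped.append(frame[-1] if src <= n - 1 else 0.0)
            continue
        frac = src - lo
        warped.append((1 - frac) * frame[lo] + frac * frame[lo + 1])
    return warped

def random_mix(clip_a, clip_b, rng, low=0.3, high=0.7):
    """Blend two equal-length clips; returns (mixed_clip, weight_of_clip_a)."""
    if len(clip_a) != len(clip_b):
        raise ValueError("clips must have the same length")
    lam = rng.uniform(low, high)
    mixed = [lam * a + (1 - lam) * b for a, b in zip(clip_a, clip_b)]
    return mixed, lam

# Toy data: treat each augmented result as a new pseudo-instance.
rng = random.Random(0)
stretched = warp_spectrum([0.0, 2.0, 4.0, 6.0], alpha=2.0)
mixed, lam = random_mix([1.0, 1.0], [0.0, 0.0], rng)
```

In a training setup along these lines, each augmented variant would be given its own instance label, enlarging the set of classes the network has to tell apart.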
The DCASE challenge provided data from six machine types: fan, pump, slide rail, valve, toy car, and toy conveyor. DCASE also provided a development data set and a separate evaluation data set. Scoring was calculated using area under the ROC curve (AUC) and partial area under the ROC curve (pAUC). The ROC curve plots true-positive rate against false-positive rate as the detection threshold varies, so the area under the curve indicates how well a given system manages the trade-off between catching anomalies and raising false alarms; partial AUC is the AUC over a small false-positive-rate range, in this case [0, 0.1].
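AUC can be read as the probability that a randomly chosen anomalous clip receives a higher anomaly score than a randomly chosen normal clip. The sketch below computes it pairwise, and computes a partial AUC by restricting the comparison to the highest-scoring normal clips, which are the ones that cause the first false alarms. This is one common formulation; official scoring scripts may normalize partial AUC differently, and the function names and toy scores are ours.

```python
def auc(normal_scores, anomaly_scores):
    """Fraction of (normal, anomalous) pairs ranked correctly; ties count 0.5."""
    wins = sum((a > n) + 0.5 * (a == n)
               for n in normal_scores for a in anomaly_scores)
    return wins / (len(normal_scores) * len(anomaly_scores))

def partial_auc(normal_scores, anomaly_scores, max_fpr=0.1):
    """AUC restricted to the top max_fpr fraction of normal clips by score."""
    # Small epsilon guards against floating-point error in the product.
    k = int(max_fpr * len(normal_scores) + 1e-9)
    if k == 0:
        raise ValueError("too few normal clips for this false-positive range")
    hardest_normals = sorted(normal_scores, reverse=True)[:k]
    return auc(hardest_normals, anomaly_scores)

# Toy anomaly scores (higher means more anomalous); values are made up.
normal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.2, 0.1, 0.3, 0.4, 0.9]
anomalous = [0.8, 0.95, 0.6]
full = auc(normal, anomalous)
part = partial_auc(normal, anomalous, max_fpr=0.1)
```

In this toy case a single normal clip with a high score barely moves the full AUC but dominates the partial AUC, which is why the latter is informative when false alarms are costly.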
The table below shows the scores we were able to obtain, both for the challenge and since the challenge. We have developed a third approach that helped improve some of these numbers, which we will detail in a future publication.
The challenge ranking method involved two steps, to account for the disparate difficulty levels across machine types. First, machine-specific rankings were assigned to all submissions, based on AUC and pAUC. The submissions were then ranked by the average of their machine-specific ranks. Please see the full leaderboard here.
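The two-step ranking can be sketched as follows. We assume, for illustration, that each submission has one score per column, where a column stands for a (machine type, metric) pair; how the organizers combined AUC and pAUC within a machine type is not spelled out here. The function names and the scores are hypothetical.

```python
def rank_descending(values):
    """Rank 1 is the highest value; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks

def average_rank(scores):
    """scores[name] holds one score per column; returns mean rank per name."""
    names = list(scores)
    n_cols = len(scores[names[0]])
    totals = {name: 0.0 for name in names}
    for c in range(n_cols):
        col_ranks = rank_descending([scores[name][c] for name in names])
        for name, r in zip(names, col_ranks):
            totals[name] += r
    return {name: totals[name] / n_cols for name in names}

# Made-up scores for three submissions over three columns.
scores = {"A": [0.9, 0.6, 0.8], "B": [0.8, 0.7, 0.7], "C": [0.7, 0.5, 0.9]}
mean_ranks = average_rank(scores)
final_order = sorted(mean_ranks, key=mean_ranks.get)
```

Averaging ranks rather than raw scores keeps a machine type with an unusually wide score spread from dominating the final ordering.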
While our models won the challenge using the across-all-machine-types scoring described above, fine-tuning them for specific machine types yielded the results in the last row.
We believe that as more industrial machine data is accumulated and curated over the next few years, machine learning and neural-network-based approaches will start making a huge difference in the monitoring and maintenance of machines, and AWS and its services will be at the forefront of this revolution.