Learning new language-understanding tasks from just a few examples
New approach to few-shot learning improves on the state of the art by combining prototypical networks with data augmentation.
One of the first things a voice agent like Alexa does when receiving a new instruction is to classify its intent — playing music, getting the weather, turning on a smart-home device, and the like.
Alexa adds new intents all the time, as new skills are developed or old ones extended. Often, because the new intents correspond to newly envisioned use cases, training data is sparse. In such cases, it would be nice to be able to leverage Alexa’s existing capacity for intent classification to learn new intents from just a few examples — maybe five or 10.
Machine learning from limited examples is known as few-shot learning. At this year’s Spoken Language Technology Workshop, my colleagues and I presented a new approach to few-shot learning for intent classification that combines two techniques: prototypical networks, or ProtoNets, which have been widely used in image classification; and neural data augmentation, or using a neural network to generate new, synthetic training examples from the small number available in the few-shot-learning scenario.
In experiments, we first compared our ProtoNet, without data augmentation, to a neural network that used conventional transfer learning to adapt to new tasks. As measured by F1 score, which accounts for both false positives and false negatives, the ProtoNet outperformed the baseline by about 1% in the five-shot case and 5% in the 10-shot case.
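For readers who want the metric spelled out, here is a minimal sketch of F1 computed from raw counts for a single class. The function name and the counts are illustrative and are not figures from the experiments; how per-intent scores were averaged is not stated here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # penalized by false positives
    recall = tp / (tp + fn)     # penalized by false negatives
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one intent
print(round(f1_score(tp=80, fp=10, fn=20), 3))  # 0.842
```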
Then we added neural data augmentation to the ProtoNet and compared its performance to that of a ProtoNet in which we augmented data by the standard technique, adding noise to the real samples. Both augmented-data models outperformed the basic ProtoNet, but our model returned 8.4% fewer F1 errors in the five-shot case and 12.4% fewer in the 10-shot case.
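The noise-based baseline can be pictured with a short sketch. The text says only that noise is added to the real samples; the Gaussian distribution, the value of `sigma`, and the function name below are assumptions made for illustration.

```python
import random


def augment_with_noise(samples, copies=2, sigma=0.05, seed=0):
    """Return the real samples plus `copies` noisy variants of each one.

    Gaussian noise and the default `sigma` are assumptions of this sketch.
    """
    rng = random.Random(seed)
    augmented = [list(s) for s in samples]  # keep the real samples
    for sample in samples:
        for _ in range(copies):
            augmented.append([x + rng.gauss(0.0, sigma) for x in sample])
    return augmented


real = [[0.2, 0.9, -0.4], [0.1, 1.1, -0.5]]  # toy 3-dimensional embeddings
augmented = augment_with_noise(real)
print(len(augmented))  # 6: 2 real samples + 4 synthetic ones
```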
ProtoNets are used to do meta-learning, or learning how to learn. With ProtoNets, a machine learning model is trained to embed inputs, or represent them as points in a high-dimensional space. The goal of training is to learn an embedding that maximizes the distance between points representing instances of different classes and minimizes the distance between points representing instances of the same class. In our case, the classes are different intents, but they might be different types of objects, or different types of sounds, or the like.
ProtoNets are trained in batches, such that each batch contains multiple instances of several different classes. After each batch, stochastic gradient descent adjusts the parameters of the model to optimize the distances between embeddings.
It’s not necessary that each batch include instances of all the classes the model will see. This makes ProtoNets very flexible, in terms of both the number of classes they’re trained on and the number of instances per class.
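The training objective can be sketched for a single batch. The exact loss is not spelled out here, so this assumes the standard prototypical-network formulation: a softmax over negative squared Euclidean distances to the class prototypes. The function names and toy vectors are hypothetical, and in a real system the embeddings would come from the network and the gradient from an autodiff library.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototypical_loss(support, queries):
    """Average negative log-probability of the correct class over the queries.

    support: dict mapping class label -> list of embedded support examples
    queries: list of (embedding, label) pairs from the same classes
    """
    prototypes = {label: mean_vector(vecs) for label, vecs in support.items()}
    total = 0.0
    for embedding, label in queries:
        # Closer prototypes get larger logits
        logits = {c: -squared_distance(embedding, p) for c, p in prototypes.items()}
        top = max(logits.values())
        log_norm = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        total += log_norm - logits[label]
    return total / len(queries)


support = {
    "play_music": [[1.0, 0.0], [0.8, 0.2]],
    "get_weather": [[0.0, 1.0], [0.2, 0.8]],
}
well_placed = [([0.9, 0.1], "play_music"), ([0.1, 0.9], "get_weather")]
misplaced = [([0.9, 0.1], "get_weather"), ([0.1, 0.9], "play_music")]
print(prototypical_loss(support, well_placed))  # lower loss
print(prototypical_loss(support, misplaced))    # higher loss
```

Because the loss depends only on the classes present in the batch, nothing in this computation requires every class to appear every time.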
Doing few-shot learning with a trained ProtoNet is a matter of simply using it to embed, say, five or 10 examples of each new class. Then the embeddings for each class are averaged to produce a representative embedding, or prototype, of the class as a whole. Classifying a new input involves embedding it and then determining which prototype it’s closest to.
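The procedure just described fits in a few lines. This is a minimal sketch that assumes the examples have already been embedded and that closeness is measured by squared Euclidean distance; the intent names and two-dimensional vectors are made up for illustration.

```python
def build_prototypes(embedded_examples):
    """Average the embedded examples of each class into one prototype."""
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in embedded_examples.items()
    }


def classify(embedding, prototypes):
    """Return the label of the nearest prototype (squared Euclidean distance)."""
    def distance_to(label):
        return sum((x - y) ** 2 for x, y in zip(embedding, prototypes[label]))
    return min(prototypes, key=distance_to)


# Toy embeddings for two hypothetical new intents, a few shots each
few_shot = {
    "play_music": [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]],
    "get_weather": [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]],
}
prototypes = build_prototypes(few_shot)
print(classify([0.7, 0.3], prototypes))  # play_music
```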
To this general procedure, we add data augmentation (DA), to enable better separation between prototypes. (Hence the name of our model: ProtoDA.) During few-shot learning, the embedded samples for each new class pass to a neural-network-based generator, which produces additional embedded samples, labeled as belonging to the same classes as the input samples.
We train the sample generator using the same loss function we use to train the ProtoNet. That is, the generator learns to generate new samples that, when combined with the real samples, maximize the separation between instances of different classes and minimize the separation between instances of the same class.
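The data flow through the generator can be sketched as follows. The generator's architecture is not described here, so the one-layer network, the noise input, and the residual form below are all hypothetical, and the weights are random rather than trained. The sketch shows only how synthetic samples inherit the label of the real sample they come from.

```python
import math
import random


class ToyGenerator:
    """Hypothetical one-layer generator: embedding + noise -> synthetic embedding."""

    def __init__(self, dim, noise_dim=4, seed=0):
        self.rng = random.Random(seed)
        self.noise_dim = noise_dim
        # Untrained random weights; in practice these would be learned
        # by minimizing the same loss used for the ProtoNet.
        self.weights = [
            [self.rng.gauss(0.0, 0.1) for _ in range(dim + noise_dim)]
            for _ in range(dim)
        ]

    def generate(self, embedding):
        noise = [self.rng.gauss(0.0, 1.0) for _ in range(self.noise_dim)]
        inputs = list(embedding) + noise
        # Residual form (an assumption): the output stays near the real sample
        return [
            e + math.tanh(sum(w * x for w, x in zip(row, inputs)))
            for e, row in zip(embedding, self.weights)
        ]


def augment_support(support, generator, per_sample=2):
    """Add generated samples to each class, labeled like their source sample."""
    return {
        label: vecs + [generator.generate(v) for v in vecs for _ in range(per_sample)]
        for label, vecs in support.items()
    }


support = {"play_music": [[1.0, 0.0], [0.8, 0.2]], "get_weather": [[0.0, 1.0]]}
augmented = augment_support(support, ToyGenerator(dim=2))
print({label: len(vecs) for label, vecs in augmented.items()})
# {'play_music': 6, 'get_weather': 3}
```

Prototypes would then be computed from the combined real and synthetic samples of each class.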
Location, location, location
In our experiments, we positioned the sample generator at two different locations in our network (see diagram above). Before passing to the ProtoNet, textual inputs run through an encoder that performs an initial embedding. This embedding is a fixed-length representation of variable-length sentences, and it leverages bidirectional long short-term memory (LSTM) networks to capture contextual information about the inputs.
The output of the sentence encoder is a 768-dimensional embedding, in which spatial relationships represent semantic relationships. This passes to the ProtoNet, whose output is a 128-dimensional embedding, in which spatial relationships represent membership in different classes.
In one experiment, we positioned the sample generator between the sentence encoder and the ProtoNet, and in another, we positioned the generator between the ProtoNet and our model’s classification layer.
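The two placements differ in the space the generator works in, which a sketch with stand-in components makes concrete. Only the dimensions (768 and 128) come from the text. All three functions below are hypothetical placeholders: the real encoder, ProtoNet, and generator are learned networks, and the placeholder generator merely perturbs its input.

```python
import random

ENCODER_DIM, PROTONET_DIM = 768, 128  # dimensions given in the text


def sentence_encoder(utterance):
    """Placeholder encoder: a deterministic 768-dimensional vector per utterance."""
    rng = random.Random(utterance)
    return [rng.uniform(-1.0, 1.0) for _ in range(ENCODER_DIM)]


def protonet(vector):
    """Placeholder ProtoNet: reduces 768 dimensions to 128 by block averaging."""
    block = ENCODER_DIM // PROTONET_DIM  # 6 values per output dimension
    return [sum(vector[i:i + block]) / block for i in range(0, ENCODER_DIM, block)]


def generator(vector, rng):
    """Placeholder generator: one synthetic vector of the same dimensionality."""
    return [x + rng.gauss(0.0, 0.05) for x in vector]


def generator_before_protonet(utterance, rng):
    """Placement 1: the generator operates in the 768-dimensional encoder space."""
    encoded = sentence_encoder(utterance)
    return [protonet(encoded), protonet(generator(encoded, rng))]


def generator_after_protonet(utterance, rng):
    """Placement 2: the generator operates in the 128-dimensional ProtoNet space."""
    embedded = protonet(sentence_encoder(utterance))
    return [embedded, generator(embedded, rng)]


rng = random.Random(0)
before = generator_before_protonet("play some jazz", rng)
after = generator_after_protonet("play some jazz", rng)
print(len(before), len(before[1]), len(after), len(after[1]))  # 2 128 2 128
```

Either way, the samples that reach the classification step are 128-dimensional; what changes is whether the generator has to model 768 features or 128.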
We found that adding a neural sample generator to our model worked best when its inputs were the embeddings produced by the ProtoNet. That’s the model that reduced F1 errors by 8.4% in the five-shot case and 12.4% in the 10-shot case, relative to the model that produced synthetic samples by adding noise.
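The arithmetic behind a relative error reduction can be sketched as follows, assuming that "F1 error" means 1 minus the F1 score. The scores in the example are hypothetical and chosen only to make the calculation easy to follow; they are not results from the experiments.

```python
def relative_error_reduction(baseline_f1, new_f1):
    """Fraction by which the F1 error (1 - F1) shrinks relative to the baseline."""
    baseline_error = 1.0 - baseline_f1
    new_error = 1.0 - new_f1
    return (baseline_error - new_error) / baseline_error


# Hypothetical scores: the error drops from 0.20 to 0.18, a 10% relative reduction
print(round(relative_error_reduction(0.800, 0.820), 3))  # 0.1
```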
We believe that the lower dimensionality of the ProtoNet space (128 instead of 768 features) and proximity to the training objective function (ProtoNet loss) contribute to the difference in performance.