Adversarial training produces synthetic data for machine learning
Sentiment analysis is the attempt, computationally, to determine from someone’s words how he or she feels about something. It has a host of applications, in market research, media analysis, customer service, and product recommendation, among other things.
Sentiment classifiers are typically machine learning systems, and any given application of sentiment analysis may suffer from a lack of annotated data for training purposes. In a paper I’m presenting at the International Conference on Acoustics, Speech, and Signal Processing, I describe my efforts to build a system that will generate synthetic training data for applications of sentiment analysis where real data is scarce.
Although the results I report are modest — augmenting training sets with my synthetic data improved sentiment classifiers’ accuracy by around 2% — they do demonstrate the viability of the approach. And the paper includes an analysis of the data generated by my system that could point toward techniques for improving its quality.
One of my first design decisions was not to try to generate actual text. Instead, the generator produces embeddings, a uniform way of representing text strings that is ubiquitous in natural-language-understanding applications.
Embeddings represent texts as points — vectors — in a high-dimensional space, such that texts with similar meanings are grouped together. Most common embeddings are based on analyses of huge bodies of text, in which words are judged to have similar meanings if they commonly co-occur with the same groups of other words.
The embeddings of any number of words can be averaged to produce a new point in the embedding space, so the length of the embedding vector is fixed, no matter the length of the corresponding text string.
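The averaging step can be sketched in a few lines of Python. This is only an illustration: the three-dimensional `TOY_EMBEDDINGS` table and the `embed_text` helper are invented for the example, and real embeddings typically have hundreds of dimensions.

```python
# Toy word-embedding table; the values are invented for illustration only.
TOY_EMBEDDINGS = {
    "great": [0.9, 0.1, 0.0],
    "movie": [0.1, 0.8, 0.1],
    "boring": [-0.8, 0.1, 0.1],
    "plot": [0.0, 0.7, 0.3],
}

def embed_text(text):
    """Average the word vectors of a text into one fixed-length vector."""
    vectors = [TOY_EMBEDDINGS[w] for w in text.lower().split() if w in TOY_EMBEDDINGS]
    if not vectors:
        raise ValueError("no known words in text")
    dim = len(vectors[0])
    # Component-wise mean over all word vectors found in the text.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

short = embed_text("great movie")
longer = embed_text("great movie boring plot")
# Both vectors have the same length, regardless of the number of words.
print(len(short), len(longer))
```

A two-word text and a four-word text both map to vectors of the same size, which is what lets a fixed-size network consume texts of any length.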
To train my embedding generator, I used a generative adversarial network (GAN), an instance of an increasingly popular machine learning technique called adversarial training. The standard GAN consists of two neural networks, a generator and a discriminator. The discriminator is trained to distinguish between real data and the fake data produced by the generator; at the same time, the generator is trained to fool the discriminator. Hence the adversarial relationship.
In my case, the inputs to the discriminator, both real and fake, had two components: an embedding vector and a one-hot vector. A one-hot vector is a string of zeroes with, somewhere among them, a single one. The location of the one corresponds to a particular property — in this case, the sentiment of the text that the embedding vector (supposedly) represents.
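As an illustrative sketch of such an input, the snippet below builds a one-hot sentiment vector and appends it to an embedding. The two-class `SENTIMENTS` list, the helper names, and the choice to concatenate the two components are assumptions made for the example; the text above says only that each input has both components.

```python
# Hypothetical label set; the actual number of sentiment classes may differ.
SENTIMENTS = ["negative", "positive"]

def one_hot(label, classes=SENTIMENTS):
    """Return a vector of zeroes with a single one at the label's position."""
    vec = [0.0] * len(classes)
    vec[classes.index(label)] = 1.0
    return vec

def discriminator_input(embedding, label):
    """Assumed layout: embedding followed by the one-hot sentiment vector."""
    return list(embedding) + one_hot(label)

x = discriminator_input([0.5, 0.45, 0.05], "positive")
print(x)
```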
My goal was to train the generator to produce synthetic data that could augment the training data for another neural network, a sentiment classifier. The addition of data produced by this type of plain-vanilla GAN, however, did not improve the sentiment classifier’s performance.
So I made several modifications to the system. The first was to equip it with a simple sentiment classifier, trained only on the real data that the generator was intended to augment. During training, the generator tried not only to fool the discriminator but also to produce one-hot vectors that matched the outputs of the simple classifier. This ensured greater consistency between the semantic content of the synthetic embeddings and the associated sentiments.
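One plausible way to express this two-part objective is sketched below. It is not taken from the paper: the cross-entropy form of the consistency term, the `consistency_weight` parameter, and the function name are all assumptions chosen for illustration.

```python
import math

def generator_loss(d_prob_real, generated_label, classifier_probs,
                   consistency_weight=1.0):
    """Hypothetical generator objective with two terms.

    d_prob_real:      discriminator's probability that the fake sample is real
    generated_label:  label probabilities emitted by the generator
    classifier_probs: simple classifier's prediction for the fake embedding
    """
    eps = 1e-12  # avoids log(0)
    # Adversarial term: small when the discriminator is fooled.
    adversarial = -math.log(d_prob_real + eps)
    # Consistency term: cross-entropy with the classifier's output as target.
    consistency = -sum(c * math.log(g + eps)
                       for c, g in zip(classifier_probs, generated_label))
    return adversarial + consistency_weight * consistency

# Same discriminator score, but the labels agree in one case and not the other.
agree = generator_loss(0.8, [0.9, 0.1], [0.95, 0.05])
disagree = generator_loss(0.8, [0.1, 0.9], [0.95, 0.05])
print(agree < disagree)
```

Under these assumptions, a generated label that contradicts the simple classifier is penalized even when the discriminator is equally fooled in both cases.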
The other modifications addressed a problem called mode collapse, common in GANs. If the generator stumbles across a type of output that will reliably fool the discriminator, it has an incentive to restrict itself to outputs of that type. But this leads to very homogeneous outputs, and homogeneous data is not useful for training neural networks.
In my experiments, I was using two types of data, for both training and testing. One data set consisted of product reviews, the other of movie reviews. The training sets were small, to mimic the case in which training data is scarce.
To combat mode collapse, I first trained the GAN on a much larger set of texts, labeled according to sentiment — a set of Twitter posts, commonly used as a benchmark in the field. The tweets were shorter than the reviews and covered a wider range of topics, but they primed the generator to produce more diverse embeddings and the discriminator to recognize more subtle distinctions. After training the GAN on the tweets, I then fine-tuned it on review data.
As is typical in machine learning, I trained the GAN over several passes through the same training data, until further training no longer improved its performance. With each pass, I added a different, random noise pattern to each training example, so the data looked somewhat different each time around. That discouraged the generator from fixating on a single trick for fooling the discriminator.
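A minimal sketch of per-pass noise injection follows. The Gaussian distribution, the `stddev` value, and the `add_noise` helper are assumptions; the text above specifies only that a different random noise pattern was added on each pass.

```python
import random

def add_noise(example, rng, stddev=0.05):
    """Return a copy of the example with fresh Gaussian noise on each component."""
    return [x + rng.gauss(0.0, stddev) for x in example]

rng = random.Random(0)  # seeded so the sketch is reproducible
training_set = [[0.5, 0.45, 0.05], [-0.4, 0.4, 0.2]]

noisy_epochs = []
for epoch in range(3):
    # New noise is drawn on every pass, so no two passes see identical data.
    noisy_batch = [add_noise(ex, rng) for ex in training_set]
    noisy_epochs.append(noisy_batch)
    # A real training step on noisy_batch would go here.

print(noisy_epochs[0][0] != noisy_epochs[1][0])
```

The original examples are left untouched; only the copies fed to each pass are perturbed.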
Finally, I also used a technique called one-sided label smoothing. During training, instead of labeling the inputs to the discriminator as 0 or 1 (fake or real), I labeled them as 0 or 0.9 (fake or 90% likely to be real). If the discriminator is never more than 90% confident in its classification of real inputs, the generator will keep exploring new options, in an attempt to wring out that extra 10% of certainty.
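The effect of the smoothed target can be checked with a short sketch. The binary cross-entropy loss used here is the standard discriminator loss for GANs and is assumed for the example; the helper names are hypothetical.

```python
import math

def bce(prob, target):
    """Binary cross-entropy between a predicted probability and a target."""
    eps = 1e-12  # avoids log(0)
    return -(target * math.log(prob + eps)
             + (1 - target) * math.log(1 - prob + eps))

def discriminator_targets(is_real_flags, real_target=0.9):
    """One-sided smoothing: real examples get 0.9, fake examples stay at 0."""
    return [real_target if is_real else 0.0 for is_real in is_real_flags]

targets = discriminator_targets([True, True, False])
print(targets)

# With a 0.9 target, a 90%-confident prediction on a real example is
# penalized less than a near-certain one.
print(bce(0.9, 0.9) < bce(0.999, 0.9))
```

Because the loss on real examples is lowest at a prediction of 0.9, the discriminator is discouraged from becoming fully certain, while the targets for fake examples are left at 0 (hence "one-sided").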
With these modifications, data produced by the generator led to slight improvements in the sentiment classifier’s performance, 1.6% on the movie reviews and 1.7% on the product reviews.
After each experiment, I used a technique called t-SNE (t-distributed stochastic neighbor embedding) to project the high-dimensional embeddings into a two-dimensional space. As can be seen in the figure above, the fake data never exhibited as much diversity as the real data.
However, after my modifications to the GAN, the data diversity did improve, which suggests a correlation between the diversity of the synthetic data and the performance of the sentiment classifier. GANs were first introduced in 2014, and some more recent GAN architectures appear to do a better job of preventing mode collapse than the architecture I used. In future work, my colleagues and I will explore some of those architectures and experiment with other techniques for diversifying the generator’s outputs.