Neural style transfer is the use of neural networks to transfer the style of one input image — say, a famous painting — to another input image — say, a backyard photograph.
Researchers have proposed a number of different techniques for doing style transfer, but which one works best? There’s no right answer to that question; viewers’ opinions differ. In the results reported in prior papers on style transfer, the most-preferred methods rarely receive more than two-thirds of reviewers’ votes, while the least-preferred methods rarely receive less than 5%.
In a paper we presented at this year’s meeting of the Association for the Advancement of Artificial Intelligence (AAAI), my colleagues and I describe a new style transfer model that can output multiple options, controlled by a model parameter that the user selects.
We show that most prior approaches to style transfer can be rewritten in a standardized form that we call the assign-and-mix model. The model’s “assign” step involves an assignment matrix, which maps features of one input image to features of the other. In the paper, we show that the differences between style transfer techniques generally come down to the entropy of the assignment matrix, or the diversity of the matrix’s values.
Finally, we show that, given a user-specified setting of the model parameter, an algorithm called Sinkhorn-Knopp can efficiently calculate the associated assignment matrix, enabling a diversity of outputs from the same style transfer model.
In a series of experiments, we compared our approach to its predecessors. We found that, according to standard metrics, our method did a better job of preserving the content of the content input and the style of the style input, and it produced more diverse outputs. We also conducted a study with 10 human evaluators and found that — at a particular setting of our diversity parameter — subjects preferred images generated by our method to those produced by other methods.
Assign and mix
In style transfer, the first step is to pass both the content example and the style example to the same visual encoder, which is typically pretrained on a broad object recognition task. The encoder produces a representation of each image, in which each image region has an associated feature vector.
The feature vectors will typically encode visual information — about, say, colors and orientations of gradients — but also semantic information — indicating, say, that a particular image region depicts part of an eye.
Style transfer typically involves (1) reshuffling elements of the style image to reproduce the content of the content image, (2) warping the content image so that its aggregate statistics resemble those of the style image, or (3) some combination of the two. We assimilate all such approaches to the assign-and-mix model.
The “assign” step of assign-and-mix corresponds to approach (1). It involves the assignment matrix, which assigns feature vectors from the style representation to regions of a new image, guided by the content representation. Although prior style transfer approaches use a variety of techniques to find correspondences between style and content features, we analyze several of them in the paper and show that they can often be assimilated to the assignment-matrix model.
The assignment for a particular point in the new image may be a single vector from the style encoding, or it may be a weighted combination of vectors. In the first case, the assignment matrix is binary: every matrix entry is either 0 or 1. This is a minimal-entropy assignment.
By contrast, if every point in the new image consists of a weighted combination of every vector in the style encoding, the assignment matrix has higher entropy. Some existing style transfer approaches use binary assignment matrices, others use high-entropy matrices, and our method can approximate both.
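To make the contrast concrete, here is a minimal sketch in plain Python of a binary and a high-entropy assignment matrix applied to the same toy style features. The convention that each row sums to 1, the per-row Shannon entropy, the toy numbers, and the helper names (row_entropy, mean_entropy, assign) are illustrative assumptions, not details taken from the paper.

```python
import math

def row_entropy(row):
    """Shannon entropy (in nats) of one row of an assignment matrix."""
    return -sum(p * math.log(p) for p in row if p > 0)

def mean_entropy(matrix):
    """Average row entropy; 0 for a binary (one-hot) matrix."""
    return sum(row_entropy(row) for row in matrix) / len(matrix)

def assign(matrix, style_features):
    """Build each new-image feature as a weighted sum of style features.

    Assumption: matrix[i][j] is the weight of style vector j at region i,
    and every row sums to 1.
    """
    dim = len(style_features[0])
    return [
        [sum(w * s[d] for w, s in zip(row, style_features)) for d in range(dim)]
        for row in matrix
    ]

# Three toy style feature vectors (hypothetical values).
style = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Binary assignment: each region copies exactly one style vector.
binary = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# High-entropy assignment: each region averages all style vectors.
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]

binary_out = assign(binary, style)
uniform_out = assign(uniform, style)
```

In this toy setup the binary matrix has zero entropy and simply copies style vectors, while the uniform matrix reaches the maximum row entropy, log 3, and gives every region the same averaged feature.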
After the assignment step, we proceed to the mixing phase, which corresponds to approach (2), above. In this phase, we step through the encoding of the new, synthetic image, and for each image region, we measure the distance between its encoding and that of the original content example. Then we mix in the feature vectors from the original content encoding, in proportion to the degree of divergence. This ensures that the new image preserves the content of the original.
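The mixing phase can be sketched along the following lines. This is only one plausible reading of the description above: the Euclidean distance, the clipped linear weighting, and the strength parameter are hypothetical choices made for illustration, and the paper may define the mixing weights differently.

```python
import math

def mix(assigned, content, strength=0.5):
    """Blend content features back into the assigned (stylized) features.

    Assumptions (not from the paper): divergence is the Euclidean distance
    between feature vectors, and the mixing weight grows linearly with it,
    clipped to the range [0, 1].
    """
    mixed = []
    for a, c in zip(assigned, content):
        divergence = math.dist(a, c)
        w = min(1.0, strength * divergence)
        # w = 0 keeps the stylized feature; w = 1 restores the content feature.
        mixed.append([(1 - w) * ai + w * ci for ai, ci in zip(a, c)])
    return mixed

# Toy features for two regions (hypothetical values).
assigned = [[1.0, 0.0], [0.0, 0.0]]
content = [[1.0, 0.0], [3.0, 4.0]]

mixed = mix(assigned, content, strength=0.1)
```

Here the first region already matches the content encoding and is left untouched, while the second region, which has drifted a distance of 5 from the content, is pulled halfway back toward it.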
The computational bottleneck in this process is the creation of multiple assignment matrices with different degrees of entropy. But we show in our paper that the Sinkhorn-Knopp algorithm, which iteratively rescales the rows and columns of a matrix into a standardized form that can be computed efficiently, can be applied to the problem of constructing assignment matrices.
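For readers unfamiliar with Sinkhorn-Knopp, the sketch below shows the textbook entropy-regularized form of the algorithm in plain Python. It is not the implementation from the paper: the cost matrix, the uniform row and column targets, and the parameter name eps are assumptions chosen for illustration. The point is only that a single parameter moves the resulting matrix between near-binary and high-entropy assignments.

```python
import math

def sinkhorn(cost, eps, n_iters=200):
    """Entropy-regularized assignment via Sinkhorn-Knopp scaling.

    cost[i][j]: dissimilarity between content region i and style vector j.
    eps: hypothetical regularization parameter; larger values give
         higher-entropy assignments.
    Assumption: rows and columns are scaled toward uniform totals.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    row_target = [1.0 / n] * n
    col_target = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows, then columns, to match the targets.
        u = [row_target[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_target[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def entropy(plan):
    """Shannon entropy (in nats) of all entries of the assignment matrix."""
    return -sum(p * math.log(p) for row in plan for p in row if p > 0)

# Toy cost: region 0 matches style vector 0, region 1 matches style vector 1.
cost = [[0.0, 1.0], [1.0, 0.0]]

sharp = sinkhorn(cost, eps=0.05)   # close to a one-to-one assignment
smooth = sinkhorn(cost, eps=5.0)   # weights spread across both style vectors
```

With a small eps, almost all of the weight sits on the matching pairs; with a large eps, the weights spread out and the entropy rises. For very small eps the exponentials can underflow, so practical implementations usually run the same iteration in the log domain.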
In the paper, we rewrite three prior style transfer methods using the assign-and-mix format. We selected those methods because their assignment matrices cover the full spectrum of entropies. Our method should therefore also be able to approximate the output of any style transfer model whose assignment matrix entropy falls within that spectrum.