Virtual try-all: Visualizing any product in any personal setting

First model to work across a wide range of products uses a second U-Net encoder to capture fine-grained product details.

April 16, 2024

Letting online shoppers virtually try out products is a sought-after capability that can create a more immersive shopping experience. Examples include realistically draping clothes on an image of the shopper or inserting pieces of furniture into images of the shopper’s living space.

In the clothing category, this problem is traditionally known as virtual try-on; we call the more general problem, which targets any category of product in any personal setting, the virtual try-all problem.

In a paper we recently posted on arXiv, we present a solution to the virtual try-all problem called Diffuse-to-Choose (DTC). Diffuse-to-Choose is a novel generative-AI model that allows users to seamlessly insert any product at any location in any scene.

The customer starts with a personal scene image and a product and draws a mask in the scene to tell the model where to insert the object. The model then integrates the item into the scene, with realistic angles, lighting, shadows, and so on. If necessary, the model infers new perspectives on the item, and it preserves the item’s fine-grained visual-identity details.

The Diffuse-to-Choose model has a number of characteristics that set it apart from existing work on related problems. It is the first model to address the virtual try-all problem, as opposed to the virtual try-on problem: it is a single model that works across a wide range of product categories. It doesn’t require 3-D models or multiple views of the product, just a single 2-D reference image. Nor does it require sanitized, white-background, or professional-studio-grade images: it works with “in the wild” images, such as regular cellphone pictures. Finally, it is fast, cost-effective, and scalable, generating an image in approximately 6.4 seconds on a single AWS g5.xlarge instance (NVIDIA A10G with 24 GB of GPU memory).

In the top row are three images of couches available through the Amazon store, photographed at various angles and in different settings. In the bottom row, far left, is an image of a different couch in a living room. Next to it in the second row are three images in which the couches from the top row have been substituted for the one in the first image.

— Sofas, superimposed on a source image

In the top row are three images of dresses available through the Amazon store, photographed against a white background. In the bottom row, far left, is an image of a woman standing on a hillside with her arms crossed in front of her. Next to her in the second row are three images in which the dresses from the top row have been substituted for the one in the first image, with the woman's arms still crossed in front of them.

— Dresses, superimposed on a source image behind the model's crossed arms, which remain in the foreground

In the top row are three images of easy chairs available through the Amazon store, photographed in different settings. In the bottom row, far left, is an image of a different easy chair in a living room, photographed from behind. Next to it in the second row are three images in which the chairs from the top row have been substituted for the one in the first image, rotated accordingly, with the appearance of their backs accurately inferred.

— Easy chairs, rotated to preserve perspective, and with the appearance of their backs inferred, superimposed on a source image

In the top row are three images of men's pants available through the Amazon store, photographed at various angles and in different settings. In the bottom row, far left, is an image of a man walking down a dirt road before a hillside covered with pines. Next to it in the second row are three images in which the pants from the top row have been substituted for the ones in the first image.

— Men's pants, superimposed on a source image

In the top row are three images of women's tops available through the Amazon store, photographed against a white background. In the bottom row, far left, is an image of a woman standing in a marble-floored lobby. Next to it in the second row are three images in which the tops from the top row have been substituted for the one the model wears in the first image.

— Women's tops, superimposed on a source image

Under the hood, Diffuse-to-Choose is an inpainting latent-diffusion model, with architectural enhancements that allow it to preserve products’ fine-grained visual details. A diffusion model is one that’s incrementally trained to denoise increasingly noisy inputs, and a latent-diffusion model is one in which the denoising happens in the model’s representation (latent) space. Inpainting is a technique in which part of an image is masked, and the latent-diffusion inpainting model is trained to fill in (“inpaint”) the masked region with a realistic reconstruction, sometimes guided by a text prompt or an image reference.
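
To make the terms above concrete, here is a minimal sketch of how an inpainting diffusion model's noisy input can be built. It is an illustration under stated assumptions, not the DTC implementation: the latent is a flat list of numbers rather than an image encoding, `alpha_bar` (the fraction of signal kept at a given noise level) and both function names are hypothetical, and blending known values back in outside the mask is just one simple way to condition on the unmasked region.

```python
import math
import random


def add_noise(latent, alpha_bar, rng):
    """Forward diffusion: mix a clean latent with Gaussian noise.

    alpha_bar in (0, 1] is the share of signal kept; smaller means noisier.
    Returns the noisy latent and the noise that was added, which is a
    common training target for the denoiser.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noisy = [signal_scale * x + noise_scale * n for x, n in zip(latent, noise)]
    return noisy, noise


def inpainting_input(noisy, clean, mask):
    """Keep known values outside the mask; expose noisy values inside it.

    mask[i] == 1 marks a position the model has to fill in.
    """
    return [n if m else c for n, c, m in zip(noisy, clean, mask)]


rng = random.Random(0)           # fixed seed so the sketch is repeatable
clean = [0.5, -1.0, 2.0, 0.0]    # stand-in for a latent image encoding
mask = [0, 1, 1, 0]              # the two middle positions are masked
noisy, noise = add_noise(clean, alpha_bar=0.5, rng=rng)
model_input = inpainting_input(noisy, clean, mask)
```

Training repeats this at many noise levels, so the model sees increasingly noisy masked regions and learns to reconstruct them from the surrounding context and any reference it is given.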

Four rows of images. In the top row is a single image of a woman wearing a short-sleeved pink top and white pants. In the second row are four versions of the first image, in which different regions of the image have been masked out: (1) the existing sleeves and waistline; (2) the existing waistline but the entirety of the woman's arms; (3) the entirety of the woman's arms and part of her pants below the waistline; (4) the existing sleeves but part of the woman's pants below the waistline. In the next three rows, at left, are three different long-sleeved tops: a tight-fitting black top, a polka-dotted orange top, and a loose black top. Beside each of the tops are four images in which the tops have been superimposed on the model from the top row, with sleeve lengths and waistlines adjusted to match the masking in the second row.

Diffuse-to-Choose allows customers to control virtual-try-on features such as sleeve length and whether shirts are worn tucked or untucked, simply by specifying the region of the image to be modified.

Like most inpainting models, DTC uses an encoder-decoder model known as a U-Net to do the diffusion modeling. The U-Net’s encoder consists of a convolutional neural network, which divides the input image into small blocks of pixels and applies a battery of filters to each block, looking for particular image features. Each layer of the encoder steps down the resolution of the image representation; the decoder steps the resolution back up. (The U-shaped curve describing the resolution of the representation over successive layers gives the network its name.)
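
The U shape is easy to see by listing the spatial size of the representation layer by layer. The sketch below assumes that each encoder layer halves the resolution and each decoder layer doubles it, which is a common U-Net design but is not stated for DTC; the function name, the 64-pixel input, and the depth of 3 are illustrative choices.

```python
def unet_resolutions(input_size, depth):
    """Feature-map size after each encoder layer, then each decoder layer.

    Assumes every encoder layer halves the resolution and every decoder
    layer doubles it again.
    """
    encoder = [input_size // (2 ** level) for level in range(depth + 1)]
    decoder = encoder[-2::-1]  # climb back up, skipping the bottleneck itself
    return encoder + decoder


print(unet_resolutions(64, 3))  # [64, 32, 16, 8, 16, 32, 64]
```

Plotted against layer index, those sizes fall to the bottleneck and rise again, tracing the U.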

Our main innovation is to introduce a secondary U-Net encoder into the diffusion process. The input to this encoder is a rough copy-paste collage in which the product image, resized to match the scale of the background scene, has been inserted into the mask created by the customer. It’s a very crude approximation of the desired output, but the idea is that the encoding will preserve fine-grained details of the product image, which the final image reconstruction will incorporate.
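
A copy-paste collage of this kind can be sketched in a few lines. The version below is only an approximation of the idea under our own assumptions: images are nested lists of pixel values, the product is resized with nearest-neighbor sampling to the bounding box of the mask, and the helper names (`mask_bbox`, `resize_nearest`, `make_collage`) are hypothetical. The article does not specify how DTC resizes or pastes the product.

```python
def mask_bbox(mask):
    """Bounding box (top, left, bottom, right) of the nonzero mask region."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty")
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbor resize of a 2-D grid of pixels."""
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def make_collage(scene, product, mask):
    """Paste the resized product into the masked region of the scene."""
    top, left, bottom, right = mask_bbox(mask)
    patch = resize_nearest(product, bottom - top, right - left)
    collage = [row[:] for row in scene]  # copy so the scene is not modified
    for r in range(top, bottom):
        for c in range(left, right):
            if mask[r][c]:
                collage[r][c] = patch[r - top][c - left]
    return collage


scene = [[0] * 4 for _ in range(4)]   # 4x4 background
product = [[1, 2], [3, 4]]            # 2x2 product image
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
collage = make_collage(scene, product, mask)
```

The result ignores lighting, perspective, and occlusion entirely, which is why it serves only as a hint and not as the final image.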

We call the secondary encoder’s output a “hint signal”. Both it and the output of the primary U-Net’s encoder pass to a feature-wise linear-modulation (FiLM) module, which aligns the features of the two encodings. Then the encodings pass to the U-Net decoder.
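
In its general form, FiLM computes a per-channel scale and shift from a conditioning signal and applies them to a feature map. The sketch below shows that general operation only; how DTC derives the scale and shift from the hint signal is not described here, so the linear projections, their weights, and the function names are assumptions made for illustration.

```python
def linear(weights, bias, x):
    """Plain matrix-vector product plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, hint, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma * feature + beta, per channel.

    features: list of channels, each a flat list of activations.
    hint: conditioning vector, from which gamma and beta are projected.
    """
    gamma = linear(w_gamma, b_gamma, hint)  # one scale per channel
    beta = linear(w_beta, b_beta, hint)     # one shift per channel
    return [[gamma[ch] * value + beta[ch] for value in channel]
            for ch, channel in enumerate(features)]


features = [[1.0, 2.0], [3.0, 4.0]]   # two channels from the main encoder
hint = [1.0, 2.0]                     # stand-in for the hint signal
modulated = film(
    features, hint,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],  # gamma = [1, 2]
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[0.5, -1.0],   # beta = [0.5, -1]
)
print(modulated)  # [[1.5, 2.5], [5.0, 7.0]]
```

Because the scale and shift depend on the conditioning signal, the same feature map is modulated differently for different product images.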

The Diffuse-to-Choose (DTC) architecture, with sample input and output. The main difference between DTC and a typical inpainting diffusion model is the second U-Net encoder that produces a “hint signal” that carries additional information about details of the product image.

We trained Diffuse-to-Choose on AWS p4d.24xlarge instances (with NVIDIA A100 40 GB GPUs), with a dataset of a few million pairs of public images. In experiments, we compared its performance on the virtual try-all task with that of four different versions of a traditional image-conditioned inpainting model, and we compared it with the state-of-the-art model on the more specialized virtual try-on task.
