A little public data makes privacy-preserving AI models more accurate

Technique that mixes public and private training data can meet differential-privacy criteria while cutting error increase by 60%-70%.

By Alessandro Achille, Yu-Xiang Wang

June 24, 2022

Many useful computer vision models are trained on large corpora of public data, such as ImageNet. But some applications — models that analyze medical images for indications of disease, for instance — need to be trained on data whose owners might like to keep it private. In such cases, we want to be sure that no one can infer anything about specific training examples from the output of the trained model.

Differential privacy offers a way to quantify both the amount of private information that a machine learning model might leak and the effectiveness of countermeasures. The standard way to prevent data leakage is to add noise during the model training process. This can obscure the inferential pathway leading from model output to specific training examples, but it also tends to compromise model accuracy.
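As a rough illustration of what "adding noise during training" can look like, the sketch below follows the widely used DP-SGD recipe: clip each example's gradient to a fixed norm, then add Gaussian noise scaled to that norm before averaging. This is not the AdaMix algorithm. The logistic-regression model and the names `clip_per_example`, `dp_sgd_step`, `clip_norm`, and `noise_multiplier` are assumptions chosen for the example, and the code does no privacy accounting.

```python
import numpy as np


def clip_per_example(grads, clip_norm):
    """Rescale each row of `grads` so its L2 norm is at most `clip_norm`."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    return grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))


def dp_sgd_step(weights, X, y, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """One noisy gradient step for logistic regression (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng

    # Per-example gradient of the logistic loss: (sigmoid(x.w) - y) * x
    preds = 1.0 / (1.0 + np.exp(-(X @ weights)))
    per_example_grads = (preds - y)[:, None] * X  # shape (n, d)

    # Clipping bounds how much any single example can move the model.
    clipped = clip_per_example(per_example_grads, clip_norm)

    # Gaussian noise scaled to the clipping bound masks each example's
    # contribution; it is also what costs accuracy.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)
    noisy_mean_grad = (clipped.sum(axis=0) + noise) / len(X)

    return weights - lr * noisy_mean_grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 5))          # synthetic stand-in for private data
    y = (rng.random(32) > 0.5).astype(float)
    w = np.zeros(5)
    for _ in range(10):
        w = dp_sgd_step(w, X, y, rng=rng)
    print(w)
```

In this sketch, raising `noise_multiplier` hides individual examples more thoroughly but makes each update noisier, which is the accuracy trade-off described above.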

Figure: A differential-privacy guarantee means that it is statistically impossible to tell whether a given sample was or was not part of the dataset used to train a machine learning model.
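The article does not spell out the formal guarantee. For reference, the standard $(\varepsilon, \delta)$ definition of differential privacy is given below. The symbols $M$ (the randomized training procedure), $D$ and $D'$ (two datasets differing in a single example), and $S$ (any set of possible outputs) are notation introduced here, not taken from the article.

```latex
\[
  \Pr\bigl[M(D) \in S\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[M(D') \in S\bigr] + \delta
  \qquad \text{for all neighboring } D, D' \text{ and all output sets } S .
\]
```

Smaller values of $\varepsilon$ and $\delta$ make the two output distributions harder to tell apart, so an observer of the trained model gains little evidence about whether any one sample was in the training set.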

Natural-language-processing researchers have had success training models on a mixture of private and public training data, enforcing differential-privacy (DP) guarantees on the private data while compromising model accuracy very little. But attempts to generalize these methods to computer vision have fared badly. In fact, they have fared so badly that training a model on public data and then doing zero-shot learning on the private-data task tends to work better than training mixed-data models.

In a paper we presented at this year’s Conference on Computer Vision and Pattern Recognition (CVPR), we address this problem with an algorithm called AdaMix. We consider the case in which we have at least a little public data whose label set is the same as, or at least close to, that of the private data. In the medical-imaging example, we might have a small public dataset of images labeled to show evidence of the disease of interest, or something similar.

Information transfer and memorization

Computer vision models learn to identify image features relevant to particular tasks. A cat recognizer, for instance, might learn to identify image features that denote pointy ears when viewed from various perspectives. Since most of the images in the training data feature cats with pointy ears, the recognizer will probably model pointy ears in a very general way, which is not traceable to any particular training example.

If, however, the training data contains only a few images of Scottish Fold cats, with their distinctive floppy ears, the model might learn features particular to just those images, a process we call memorization. And memorization does open the possibility that a canny adversary could identify individual images used in the training data.

Information theory provides a way to quantify the amount of information that the model-training process transfers from any given training example to the model parameters, and the obvious way to prevent memorization would be to cap that information transfer.

But as one of us (Alessandro) explained in an essay for Amazon Science, “The importance of forgetting in artificial and animal intelligence”, neural networks begin training by memorizing a good deal of information about individual training examples and then, over time, forget most of the memorized details. That is, they develop abstract models by gradually subtracting extraneous details from more particularized models. (This finding was unsurprising to biologists, as the development of the animal brain involves a constant shedding of useless information and a consolidation of useful information.)

DP provably prevents unintended memorization of individual training examples. But this also imposes a universal cap on the information transfer between training examples and model parameters, which could inhibit the learning process. The characteristics of specific training examples are often needed to map out the space of possibilities that the learning algorithm should explore as examples accumulate.
