Amazon wins best-paper award for protecting privacy of training data

Calibrating noise addition to word density in the embedding space improves utility of privacy-protected text.

Differential privacy is a popular technique that provides a way to quantify the privacy risk of releasing aggregate statistics based on individual data. In the context of machine learning, differential privacy provides a way to protect privacy by adding noise to the data used to train a machine learning model. But the addition of noise can also reduce model performance.

In a pair of papers at the annual meeting of the Florida Artificial Intelligence Research Society (FLAIRS), the Privacy Engineering for Alexa team is presenting a new way to calibrate the noise added to the textual data used to train natural-language-processing (NLP) models. The idea is to distinguish cases where a little noise is enough to protect privacy from cases where more noise is necessary. This helps minimize the impact on model accuracy while maintaining privacy guarantees, which aligns with the team’s mission to measurably preserve customer privacy across Alexa.

One of the papers, “Density-aware differentially private textual perturbations using truncated Gumbel noise”, has won the conference’s best-paper award.

Calibrated noise addition.gif
A simplified example of the method proposed in the researchers' award-winning paper. Noise is added to the three nearest neighbors of a source word, A, and to A itself. After noise addition, the word closest to A's original position — B — is chosen as a substitute for A.
Credit: Glynis Condon

Differential privacy says that, given an aggregate statistic, the probability that the underlying dataset does or does not contain a particular item should be virtually the same. The addition of noise to the data helps enforce that standard, but it can also obscure relationships in the data that the model is trying to learn.

In NLP applications, a standard way to add noise involves embedding the words of the training texts. An embedding represents words as vectors, such that vectors that are close in the space have related meanings. 

Adding noise to an embedding vector produces a new vector, which would correspond to a similar but different word. Ideally, substituting the new words for the old should disguise the original data while preserving the attributes that the NLP model is trying to learn. 

However, words in an embedding space tend to form clusters, united by semantic similarity, with sparsely populated regions between clusters. Intuitively, within a cluster, much less noise should be required to ensure enough semantic distance to preserve privacy. However, if the noise added to each word is based on the average distance between embeddings — factoring in the sparser regions — it may be more than is necessary for words in dense regions.

Noise calibration.png
A simplified representation of words (red dots) in an embedding space. Adding noise to a source vector (A) produces a new vector, and the nearest (green circle) embedded word (B) is chosen as a substitute. In the graph at left, adding a lot of noise to the source word produces an output word that is far away and hence semantically dissimilar. In the middle graph, however, a lot of noise is needed to produce a semantically different output. In the graph at right, the amount of noise is calibrated to the density of the vectors around the source word.

This leads us to pose the following question in our FLAIRS papers: Can we recalibrate the noise added such that it varies for every word depending on the density of the surrounding space, rather than resorting to a single global sensitivity?

Calibration techniques

We study this question from two different perspectives. In the paper titled “Research challenges in designing differentially private text generation mechanisms”, my Alexa colleagues Oluwaseyi Feyisetan, Zekun Xu, Nathanael Teissier, and I discuss general techniques to enhance the privacy of text mechanisms by exploiting features such as local density in the embedding space.  

For example, one technique deduces a probability distribution (a prior) that assigns high probability to dense areas of the embedding and low probability to sparse areas. This prior can be produced using kernel density estimation, which is a popular technique for estimating distributions from limited data samples. 

However, these distributions are often highly nonlinear, which makes them difficult to sample from. In this case, we can either opt for an approximation to the distribution or adopt indirect sampling strategies such as the Metropolis–Hastings algorithm (which is based on well-known Monte Carlo Markov chain techniques). 

Another technique we discuss is to impose a limit on how far away a noisy embedding may be from its source. We explore two ways to do this: distance-based truncation and k-nearest-neighbor-based truncation. 

Distance-based truncation simply caps the distance between the noisy embedding and its source, according to some measure of distance in the space. This prevents the addition of a large amount of noise, which is useful in the dense regions of the embedding. But in the sparse regions, this can effectively mean zero perturbation, since there may not be another word within the distance limit. 

To avoid this drawback, we consider the alternate approach of k-nearest-neighbor-based truncation. In this approach, the  words closest to the source delineate the acceptable search area. We then execute a selection procedure to choose the new word from these candidates (plus the source word itself). This is the approach we adopt in our second paper.

Nearest-neighbor search.png
A schematic of distance-based (left and middle graphs) and nearest-neighbor-based (right graph) truncation techniques. In the first graph, the blue circle represents a limit on the distance from the source word, A. Randomly adding noise produces a vector within this limit, and the output word B is selected. In the middle graph, a large amount of noise has been randomly added, but it’s truncated at the boundary of the blue circle. The right graph shows k-nearest-neighbor truncation, where a random number of neighbors (in this case, three) are selected around the source word, A. Noise is added to each of these neighbors independently, and the nearest word after noise addition — B — is chosen (see animation, above).

In “Density-aware differentially private textual perturbations using truncated Gumbel noise”, Nan Xu, a summer intern with our group in 2020 and currently a PhD student in computer science at the University of Southern California, joins us to discuss a particular algorithm in detail. 

This algorithm calibrates noise by selecting a few neighbors of the source word and perturbing the distance to these neighbors using samples from the Gumbel distribution (the rightmost graph, above). We chose the Gumbel distribution because it is more computationally efficient than existing mechanisms for differentially private selection (e.g., the exponential mechanism). The number of neighbors is chosen randomly using Poisson samples.

Together, these two techniques, when calibrated appropriately, provide the required amount of differential privacy while enhancing utility. We call the resulting algorithm the truncated Gumbel mechanism, and it better preserves semantic meanings than multivariate Laplace mechanisms, a widely used method for adding noise to textual data. (The left and middle graphs of the top figure above depict the use of Laplace mechanisms). 

In tests, we found that this new algorithm provided improvements in accuracy of up to 9.9% for text classification tasks on two different datasets. Our paper also includes a formal proof of the privacy guarantees offered by this mechanism and analyzes relevant privacy statistics. 

Our ongoing research efforts continue to improve upon the techniques described above and enable Alexa to continue introducing new features and inventions that make customers’ lives easier while keeping their data private.

Related content

US, WA, Seattle
The retail pricing science and research group is a team of scientists and economists who design and implement the analytics powering pricing for Amazon’s on-line retail business. The team uses world-class analytics to make sure that the prices for all of Amazon’s goods and services are aligned with Amazon’s corporate goals. We are seeking an experienced high-energy Economist to help envision, design and build the next generation of retail pricing capabilities. You will work at the intersection of economic theory, statistical inference, and machine learning to design new methods and pricing strategies to deliver game changing value to our customers. Roughly 85% of previous intern cohorts have converted to full time scientist employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com. Key job responsibilities Amazon’s Pricing Science and Research team is seeking an Economist to help envision, design and build the next generation of pricing capabilities behind Amazon’s on-line retail business. As an economist on our team, you will work at the intersection of economic theory, statistical inference, and machine learning to design new methods and pricing strategies with the potential to deliver game changing value to our customers. This is an opportunity for a high-energy individual to work with our unprecedented retail data to bring cutting edge research into real world applications, and communicate the insights we produce to our leadership. This position is perfect for someone who has a deep and broad analytic background and is passionate about using mathematical modeling and statistical analysis to make a real difference. You should be familiar with modern tools for data science and business analysis. We are particularly interested in candidates with research background in applied microeconomics, econometrics, statistical inference and/or finance. A day in the life Discussions with business partners, as well as product managers and tech leaders to understand the business problem. Brainstorming with other scientists and economists to design the right model for the problem in hand. Present the results and new ideas for existing or forward looking problems to leadership. Deep dive into the data. Modeling and creating working prototypes. Analyze the results and review with partners. Partnering with other scientists for research problems. About the team The retail pricing science and research group is a team of scientists and economists who design and implement the analytics powering pricing for Amazon’s on-line retail business. The team uses world-class analytics to make sure that the prices for all of Amazon’s goods and services are aligned with Amazon’s corporate goals.
US, CA, San Francisco
The retail pricing science and research group is a team of scientists and economists who design and implement the analytics powering pricing for Amazon's on-line retail business. The team uses world-class analytics to make sure that the prices for all of Amazon's goods and services are aligned with Amazon's corporate goals. We are seeking an experienced high-energy Economist to help envision, design and build the next generation of retail pricing capabilities. You will work at the intersection of statistical inference, experimentation design, economic theory and machine learning to design new methods and pricing strategies for assessing pricing innovations. Roughly 85% of previous intern cohorts have converted to full time scientist employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com. Key job responsibilities Amazon's Pricing Science and Research team is seeking an Economist to help envision, design and build the next generation of pricing capabilities behind Amazon's on-line retail business. As an economist on our team, you will will have the opportunity to work with our unprecedented retail data to bring cutting edge research into real world applications, and communicate the insights we produce to our leadership. This position is perfect for someone who has a deep and broad analytic background and is passionate about using mathematical modeling and statistical analysis to make a real difference. You should be familiar with modern tools for data science and business analysis. We are particularly interested in candidates with research background in experimentation design, applied microeconomics, econometrics, statistical inference and/or finance. A day in the life Discussions with business partners, as well as product managers and tech leaders to understand the business problem. Brainstorming with other scientists and economists to design the right model for the problem in hand. Present the results and new ideas for existing or forward looking problems to leadership. Deep dive into the data. Modeling and creating working prototypes. Analyze the results and review with partners. Partnering with other scientists for research problems. About the team The retail pricing science and research group is a team of scientists and economists who design and implement the analytics powering pricing for Amazon's on-line retail business. The team uses world-class analytics to make sure that the prices for all of Amazon's goods and services are aligned with Amazon's corporate goals.
US, WA, Bellevue
We are looking for detail-oriented, organized, and responsible individuals who are eager to learn how to work with large and complicated data sets. Some knowledge of econometrics, as well as basic familiarity with Python is necessary, and experience with SQL and UNIX would be a plus. These are full-time positions at 40 hours per week, with compensation being awarded on an hourly basis. You will learn how to build data sets and perform applied econometric analysis at Internet speed collaborating with economists, scientists, and product managers. These skills will translate well into writing applied chapters in your dissertation and provide you with work experience that may help you with placement. Roughly 85% of interns from previous cohorts have converted to full time economics employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com.
US
The Amazon Supply Chain Optimization Technology (SCOT) organization is looking for an Intern in Economics to work on exciting and challenging problems related to Amazon's worldwide inventory planning. SCOT provides unique opportunities to both create and see the direct impact of your work on billions of dollars’ worth of inventory, in one of the world’s most advanced supply chains, and at massive scale. We are looking for detail-oriented, organized, and responsible individuals who are eager to learn how to work with large and complicated data sets. We are looking for a PhD candidate with exposure to Program Evaluation/Causal Inference. Knowledge of econometrics and Stata/R/or Python is necessary, and experience with SQL, Hadoop, and Spark would be a plus. These are full-time positions at 40 hours per week, with compensation being awarded on an hourly basis. You will learn how to build data sets and perform applied econometric analysis at Internet speed collaborating with economists, scientists, and product managers. These skills will translate well into writing applied chapters in your dissertation and provide you with work experience that may help you with placement. Roughly 85% of previous cohorts have converted to full time scientist employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com.
US, WA, Seattle
The Selling Partner Fees team owns the end-to-end fees experience for two million active third party sellers. We own the fee strategy, fee seller experience, fee accuracy and integrity, fee science and analytics, and we provide scalable technology to monetize all services available to third-party sellers. We are looking for an Intern Economist with excellent coding skills to design and develop rigorous models to assess the causal impact of fees on third party sellers’ behavior and business performance. As a Science Intern, you will have access to large datasets with billions of transactions and will translate ambiguous fee related business problems into rigorous scientific models. You will work on real world problems which will help to inform strategic direction and have the opportunity to make an impact for both Amazon and our Selling Partners.
US, WA, Bellevue
We are looking for detail-oriented, organized, and responsible individuals who are eager to learn how to work with large and complicated data sets. We are looking for a PhD candidate with exposure to Program Evaluation/Causal Inference. Some knowledge of econometrics, as well as basic familiarity with Stata or R is necessary, and experience with SQL, Hadoop, Spark and Python would be a plus. These are full-time positions at 40 hours per week, with compensation being awarded on an hourly basis. You will learn how to build data sets and perform applied econometric analysis at Internet speed collaborating with economists, scientists, and product managers. These skills will translate well into writing applied chapters in your dissertation and provide you with work experience that may help you with placement. Roughly 85% of previous cohorts have converted to full time scientist employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com.
US, MA, Westborough
Are you inspired by invention? Is problem solving through teamwork in your DNA? Do you like the idea of seeing how your work impacts the bigger picture? Answer yes to any of these and you’ll fit right in here at Amazon Robotics. We are a smart team of doers that work passionately to apply cutting edge advances in robotics and software to solve real-world challenges that will transform our customers’ experiences in ways we can’t even imagine yet. We invent new improvements every day. We are Amazon Robotics and we will give you the tools and support you need to invent with us in ways that are rewarding, fulfilling and fun. Amazon Robotics is seeking interns and co-ops with a passion for robotic research to work on cutting edge algorithms for robotics. Our team works on challenging and high-impact projects, including allocating resources to complete a million orders a day, coordinating the motion of thousands of robots, autonomous navigation in warehouses, identifying objects and damage, and learning how to grasp all the products Amazon sells. We are seeking internship candidates with backgrounds in computer vision, machine learning, resource allocation, discrete optimization, search, and planning/scheduling. You will be challenged intellectually and have a good time while you are at it! Please note that by applying to this role you would be considered for Applied Scientist summer intern, spring co-op, and fall co-op roles on various Amazon Robotics teams. These teams work on robotics research within areas such as computer vision, machine learning, robotic manipulation, navigation, path planning, perception, artificial intelligence, human-robot interaction, optimization and more.
US, MA, Westborough
Are you inspired by invention? Is problem solving through teamwork in your DNA? Do you like the idea of seeing how your work impacts the bigger picture? Answer yes to any of these and you’ll fit right in here at Amazon Robotics. We are a smart team of doers that work passionately to apply cutting edge advances in robotics and software to solve real-world challenges that will transform our customers’ experiences in ways we can’t even imagine yet. We invent new improvements every day. We are Amazon Robotics and we will give you the tools and support you need to invent with us in ways that are rewarding, fulfilling and fun. Amazon Robotics is seeking interns and co-ops with a passion for robotic research to work on cutting edge algorithms for robotics. Our team works on challenging and high-impact projects, including allocating resources to complete a million orders a day, coordinating the motion of thousands of robots, autonomous navigation in warehouses, identifying objects and damage, and learning how to grasp all the products Amazon sells. We are seeking internship candidates with backgrounds in computer vision, machine learning, resource allocation, discrete optimization, search, and planning/scheduling. You will be challenged intellectually and have a good time while you are at it! Please note that by applying to this role you would be considered for Applied Scientist summer intern, spring co-op, and fall co-op roles on various Amazon Robotics teams. These teams work on robotics research within areas such as computer vision, machine learning, robotic manipulation, navigation, path planning, perception, artificial intelligence, human-robot interaction, optimization and more.
US, WA, Bellevue
We are looking for detail-oriented, organized, and responsible individuals who are eager to learn how to work with large and complicated data sets. Some knowledge of econometrics, as well as basic familiarity with Python is necessary, and experience with SQL and UNIX would be a plus. These are full-time positions at 40 hours per week, with compensation being awarded on an hourly basis. You will learn how to build data sets and perform applied econometric analysis at Internet speed collaborating with economists, scientists, and product managers. These skills will translate well into writing applied chapters in your dissertation and provide you with work experience that may help you with placement. Roughly 85% of previous cohorts have converted to full time scientist employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com.
US, MA, Boston
Are you inspired by invention? Is problem solving through teamwork in your DNA? Do you like the idea of seeing how your work impacts the bigger picture? Answer yes to any of these and you’ll fit right in here at Amazon Robotics. We are a smart team of doers that work passionately to apply cutting edge advances in robotics and software to solve real-world challenges that will transform our customers’ experiences in ways we can’t even imagine yet. We invent new improvements every day. We are Amazon Robotics and we will give you the tools and support you need to invent with us in ways that are rewarding, fulfilling and fun. Amazon Robotics, a wholly owned subsidiary of Amazon.com, empowers a smarter, faster, more consistent customer experience through automation. Amazon Robotics automates fulfillment center operations using various methods of robotic technology including autonomous mobile robots, sophisticated control software, language perception, power management, computer vision, depth sensing, machine learning, object recognition, and semantic understanding of commands. Amazon Robotics has a dedicated focus on research and development to continuously explore new opportunities to extend its product lines into new areas. AR is seeking uniquely talented and motivated data scientists to join our Global Services and Support (GSS) Tools Team. GSS Tools focuses on improving the supportability of the Amazon Robotics solutions through automation, with the explicit goal of simplifying issue resolution for our global network of Fulfillment Centers. The candidate will work closely with software engineers, Fulfillment Center operation teams, system engineers, and product managers in the development, qualification, documentation, and deployment of new - as well as enhancements to existing - operational models, metrics, and data driven dashboards. As such, this individual must possess the technical aptitude to pick-up new BI tools and programming languages to interface with different data access layers for metric computation, data mining, and data modeling. This role is a 6 month co-op to join AR full time (40 hours/week) from July – December 2023. The Co-op will be responsible for: Diving deep into operational data and metrics to identify and communicate trends used to drive development of new tools for supportability Translating operational metrics into functional requirements for BI-tools, models, and reporting Collaborating with cross functional teams to automate AR problem detection and diagnostics