Amazon wins best-paper award for protecting privacy of training data

Calibrating noise addition to word density in the embedding space improves utility of privacy-protected text.

Differential privacy is a popular technique that provides a way to quantify the privacy risk of releasing aggregate statistics based on individual data. In the context of machine learning, differential privacy provides a way to protect privacy by adding noise to the data used to train a machine learning model. But the addition of noise can also reduce model performance.

In a pair of papers at the annual meeting of the Florida Artificial Intelligence Research Society (FLAIRS), the Privacy Engineering for Alexa team is presenting a new way to calibrate the noise added to the textual data used to train natural-language-processing (NLP) models. The idea is to distinguish cases where a little noise is enough to protect privacy from cases where more noise is necessary. This helps minimize the impact on model accuracy while maintaining privacy guarantees, which aligns with the team’s mission to measurably preserve customer privacy across Alexa.

One of the papers, “Density-aware differentially private textual perturbations using truncated Gumbel noise”, has won the conference’s best-paper award.

Calibrated noise addition.gif
A simplified example of the method proposed in the researchers' award-winning paper. Noise is added to the three nearest neighbors of a source word, A, and to A itself. After noise addition, the word closest to A's original position — B — is chosen as a substitute for A.
Credit: Glynis Condon

Differential privacy says that, given an aggregate statistic, the probability that the underlying dataset does or does not contain a particular item should be virtually the same. The addition of noise to the data helps enforce that standard, but it can also obscure relationships in the data that the model is trying to learn.
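
Stated a little more formally, the standard definition can be written as below. The symbols are the conventional textbook ones rather than notation taken from the papers: M is the randomized mechanism that computes the statistic, D and D′ are any two datasets that differ in a single item, S is any set of possible outputs, and ε is the privacy parameter.

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S]
```

The smaller ε is, the closer the two probabilities must be, and the more noise is generally needed to satisfy the bound.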

In NLP applications, a standard way to add noise involves embedding the words of the training texts. An embedding represents words as vectors, such that vectors that are close in the space have related meanings. 

Adding noise to an embedding vector produces a new vector, which would correspond to a similar but different word. Ideally, substituting the new words for the old should disguise the original data while preserving the attributes that the NLP model is trying to learn. 
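
As a rough illustration of this substitution step, here is a minimal Python sketch. The two-dimensional `EMBEDDINGS` table, the function names, and the `epsilon` value are all hypothetical. The noise is drawn with a uniformly random direction and a Gamma-distributed magnitude, which is one common way to sample multivariate Laplace-style noise; it is an assumption here, not a description of the exact mechanisms used in the papers.

```python
import math
import random

# Hypothetical 2-D embeddings; real embeddings have hundreds of dimensions.
EMBEDDINGS = {
    "happy": (1.0, 1.0),
    "glad": (1.2, 0.9),
    "joyful": (0.9, 1.3),
    "car": (5.0, 5.0),
}


def nearest_word(vector, embeddings):
    """Return the vocabulary word whose embedding is closest to `vector`."""
    return min(embeddings, key=lambda w: math.dist(vector, embeddings[w]))


def perturb_word(word, embeddings, epsilon, rng):
    """Add random noise to a word's embedding and map the result back to a word."""
    source = embeddings[word]
    dim = len(source)
    # Noise direction: uniform on the unit sphere (a normalized Gaussian sample).
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    # Noise magnitude: Gamma(shape=dim, scale=1/epsilon); smaller epsilon, more noise.
    magnitude = rng.gammavariate(dim, 1.0 / epsilon)
    noisy = [s + magnitude * x / norm for s, x in zip(source, direction)]
    return nearest_word(noisy, embeddings)


rng = random.Random(0)
substitutes = [perturb_word("happy", EMBEDDINGS, epsilon=5.0, rng=rng) for _ in range(5)]
print(substitutes)
```

Each call may return the source word itself or a different word, depending on where the noisy vector lands.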

However, words in an embedding space tend to form clusters, united by semantic similarity, with sparsely populated regions between clusters. Intuitively, within a cluster, much less noise should be required to ensure enough semantic distance to preserve privacy. But if the noise added to each word is based on the average distance between embeddings (factoring in the sparser regions), it may be more than is necessary for words in dense regions.

Noise calibration.png
A simplified representation of words (red dots) in an embedding space. Adding noise to a source vector (A) produces a new vector, and the nearest (green circle) embedded word (B) is chosen as a substitute. In the graph at left, adding a lot of noise to the source word produces an output word that is far away and hence semantically dissimilar. In the middle graph, however, a lot of noise is needed to produce a semantically different output. In the graph at right, the amount of noise is calibrated to the density of the vectors around the source word.

This leads us to pose the following question in our FLAIRS papers: Can we recalibrate the noise added such that it varies for every word depending on the density of the surrounding space, rather than resorting to a single global sensitivity?

Calibration techniques

We study this question from two different perspectives. In the paper titled “Research challenges in designing differentially private text generation mechanisms”, my Alexa colleagues Oluwaseyi Feyisetan, Zekun Xu, Nathanael Teissier, and I discuss general techniques to enhance the privacy of text mechanisms by exploiting features such as local density in the embedding space.  

For example, one technique deduces a probability distribution (a prior) that assigns high probability to dense areas of the embedding and low probability to sparse areas. This prior can be produced using kernel density estimation, which is a popular technique for estimating distributions from limited data samples. 
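
To make the idea concrete, the sketch below computes a Gaussian kernel density estimate over a handful of made-up two-dimensional word vectors. The vectors, the `bandwidth` value, and the function name are illustrative assumptions; the paper's actual prior may be constructed differently.

```python
import math


def gaussian_kde_density(point, samples, bandwidth):
    """Gaussian kernel density estimate at `point`, built from `samples`."""
    dim = len(point)
    norm_const = (2.0 * math.pi * bandwidth ** 2) ** (dim / 2.0)
    total = 0.0
    for s in samples:
        sq_dist = sum((p - q) ** 2 for p, q in zip(point, s))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(samples) * norm_const)


# Hypothetical embeddings: a tight cluster of four words plus one isolated word.
vectors = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1), (5.0, 5.0)]

dense = gaussian_kde_density((1.0, 1.0), vectors, bandwidth=0.5)
sparse = gaussian_kde_density((5.0, 5.0), vectors, bandwidth=0.5)
print(dense > sparse)  # prints True
```

The estimate is higher inside the cluster than at the isolated word, which is the property a density-aware prior relies on.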

However, these distributions are often highly nonlinear, which makes them difficult to sample from. In this case, we can either opt for an approximation to the distribution or adopt indirect sampling strategies such as the Metropolis–Hastings algorithm (a well-known Markov chain Monte Carlo technique). 
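
The following is a minimal sketch of the indirect-sampling idea, using random-walk Metropolis–Hastings with a symmetric Gaussian proposal on a one-dimensional kernel density. The `centers`, `bandwidth`, `step_size`, and function names are hypothetical, and a one-dimensional target is used only to keep the example short.

```python
import math
import random


def unnormalized_density(x, centers, bandwidth):
    """Unnormalized 1-D Gaussian kernel density; only density ratios are needed."""
    return sum(math.exp(-((x - c) ** 2) / (2.0 * bandwidth ** 2)) for c in centers)


def metropolis_hastings(density, start, n_steps, step_size, rng):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    current = start
    current_density = density(current)
    samples = []
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_density = density(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.random() * current_density < proposal_density:
            current, current_density = proposal, proposal_density
        samples.append(current)
    return samples


# Hypothetical 1-D positions: a dense cluster near 1 and a lone point at 3.
centers = [0.9, 1.0, 1.1, 1.2, 3.0]
rng = random.Random(0)
samples = metropolis_hastings(
    lambda x: unnormalized_density(x, centers, bandwidth=0.5),
    start=1.0, n_steps=5000, step_size=1.0, rng=rng,
)
cluster_fraction = sum(1 for s in samples if s < 2.0) / len(samples)
print(round(cluster_fraction, 2))
```

Because four of the five kernels sit in the cluster, most of the samples should fall on the cluster side, without the density ever being normalized or inverted.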

Another technique we discuss is to impose a limit on how far away a noisy embedding may be from its source. We explore two ways to do this: distance-based truncation and k-nearest-neighbor-based truncation. 

Distance-based truncation simply caps the distance between the noisy embedding and its source, according to some measure of distance in the space. This prevents the addition of a large amount of noise, which is useful in the dense regions of the embedding. But in the sparse regions, this can effectively mean zero perturbation, since there may not be another word within the distance limit. 
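
A small sketch of distance-based truncation shows both behaviors. The embeddings, the noisy vectors, and the `max_distance` cap are made up for illustration, and Euclidean distance is assumed as the distance measure.

```python
import math


def truncate_to_radius(source, noisy, max_distance):
    """Pull `noisy` back to within `max_distance` of `source` if it lies farther away."""
    distance = math.dist(source, noisy)
    if distance <= max_distance:
        return tuple(noisy)
    scale = max_distance / distance
    return tuple(s + scale * (n - s) for s, n in zip(source, noisy))


def nearest_word(vector, embeddings):
    """Return the vocabulary word whose embedding is closest to `vector`."""
    return min(embeddings, key=lambda w: math.dist(vector, embeddings[w]))


# Hypothetical embeddings: "happy"/"glad" are close together, "car"/"truck" are far apart.
EMBEDDINGS = {"happy": (1.0, 1.0), "glad": (1.3, 1.2), "car": (5.0, 5.0), "truck": (9.0, 5.0)}

# Dense region: a large noise draw is capped, so the output stays in the cluster.
dense_capped = truncate_to_radius(EMBEDDINGS["happy"], (4.6, 4.2), max_distance=0.3)
print(nearest_word(dense_capped, EMBEDDINGS))  # prints glad

# Sparse region: no other word lies within the cap, so the source word comes back.
sparse_capped = truncate_to_radius(EMBEDDINGS["car"], (8.5, 5.0), max_distance=0.3)
print(nearest_word(sparse_capped, EMBEDDINGS))  # prints car
```

In the second case the cap leaves the word unchanged, which is the zero-perturbation drawback described above.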

To avoid this drawback, we consider the alternate approach of k-nearest-neighbor-based truncation. In this approach, the k words closest to the source delineate the acceptable search area. We then execute a selection procedure to choose the new word from these candidates (plus the source word itself). This is the approach we adopt in our second paper.
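
The candidate set for k-nearest-neighbor truncation can be sketched as follows. The embeddings and the function name are hypothetical, Euclidean distance is assumed, and the selection procedure itself is left out.

```python
import math


def knn_candidates(word, embeddings, k):
    """Return the source word followed by its k nearest neighbors."""
    source = embeddings[word]
    others = sorted(
        (w for w in embeddings if w != word),
        key=lambda w: math.dist(source, embeddings[w]),
    )
    return [word] + others[:k]


# Hypothetical embeddings: a tight cluster plus two isolated words.
EMBEDDINGS = {
    "happy": (1.0, 1.0),
    "glad": (1.3, 1.2),
    "joyful": (0.9, 1.3),
    "car": (5.0, 5.0),
    "truck": (9.0, 5.0),
}

print(knn_candidates("happy", EMBEDDINGS, k=2))  # prints ['happy', 'joyful', 'glad']
print(knn_candidates("car", EMBEDDINGS, k=2))    # prints ['car', 'truck', 'glad']
```

Unlike a fixed distance cap, the candidate set always contains k alternatives, even for a word in a sparse region such as "car" in this toy vocabulary.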

Nearest-neighbor search.png
A schematic of distance-based (left and middle graphs) and nearest-neighbor-based (right graph) truncation techniques. In the first graph, the blue circle represents a limit on the distance from the source word, A. Randomly adding noise produces a vector within this limit, and the output word B is selected. In the middle graph, a large amount of noise has been randomly added, but it’s truncated at the boundary of the blue circle. The right graph shows k-nearest-neighbor truncation, where a random number of neighbors (in this case, three) are selected around the source word, A. Noise is added to each of these neighbors independently, and the nearest word after noise addition — B — is chosen (see animation, above).

In “Density-aware differentially private textual perturbations using truncated Gumbel noise”, Nan Xu, a summer intern with our group in 2020 and currently a PhD student in computer science at the University of Southern California, joins us to discuss a particular algorithm in detail. 

This algorithm calibrates noise by selecting a few neighbors of the source word and perturbing the distance to these neighbors using samples from the Gumbel distribution (the rightmost graph, above). We chose the Gumbel distribution because it is more computationally efficient than existing mechanisms for differentially private selection (e.g., the exponential mechanism). The number of neighbors is chosen randomly using Poisson samples.
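
The general shape of such a procedure can be sketched as below. This is a simplified illustration only: the parameter names `mean_neighbors` and `noise_scale`, the "one plus Poisson" neighbor count, and the toy embeddings are assumptions. The sketch uses plain (untruncated) Gumbel samples and makes no attempt to reproduce the paper's truncation or its calibration to a privacy budget, so it carries no privacy guarantee.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_gumbel(scale, rng):
    """Draw a standard Gumbel sample by inverse transform, multiplied by `scale`."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -scale * math.log(-math.log(u))


def gumbel_substitute(word, embeddings, mean_neighbors, noise_scale, rng):
    """Pick a substitute among the source word and a random number of neighbors."""
    source = embeddings[word]
    neighbors = sorted(
        (w for w in embeddings if w != word),
        key=lambda w: math.dist(source, embeddings[w]),
    )
    # Random neighbor count: at least one, at most the rest of the vocabulary.
    k = min(len(neighbors), 1 + sample_poisson(mean_neighbors, rng))
    candidates = [word] + neighbors[:k]
    # Perturb each candidate's distance independently with Gumbel noise.
    noisy_distance = {
        w: math.dist(source, embeddings[w]) + sample_gumbel(noise_scale, rng)
        for w in candidates
    }
    # The candidate with the smallest perturbed distance is the output word.
    return min(noisy_distance, key=noisy_distance.get)


# Hypothetical 2-D embeddings.
EMBEDDINGS = {
    "happy": (1.0, 1.0),
    "glad": (1.3, 1.2),
    "joyful": (0.9, 1.3),
    "car": (5.0, 5.0),
    "truck": (9.0, 5.0),
}

rng = random.Random(0)
outputs = [
    gumbel_substitute("happy", EMBEDDINGS, mean_neighbors=1.0, noise_scale=0.5, rng=rng)
    for _ in range(10)
]
print(outputs)
```

Because the candidates are the nearest neighbors rather than all words within a fixed radius, the same procedure yields alternatives in both dense and sparse regions.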

Together, these two techniques, when calibrated appropriately, provide the required amount of differential privacy while enhancing utility. We call the resulting algorithm the truncated Gumbel mechanism, and it better preserves semantic meaning than multivariate Laplace mechanisms, a widely used method for adding noise to textual data. (The left and middle graphs of the top figure above depict the use of Laplace mechanisms.)

In tests, we found that this new algorithm provided improvements in accuracy of up to 9.9% for text classification tasks on two different datasets. Our paper also includes a formal proof of the privacy guarantees offered by this mechanism and analyzes relevant privacy statistics. 

Our ongoing research efforts continue to improve upon the techniques described above and enable Alexa to continue introducing new features and inventions that make customers’ lives easier while keeping their data private.
