Privacy challenges in extreme gradient boosting

Scientists describe the use of privacy-preserving machine learning to address privacy challenges in XGBoost training and prediction.

(Editor’s note: This is the fourth in a series of articles Amazon Science is publishing related to the science behind products and services from companies in which the Amazon Alexa Fund has invested. The Alexa Fund completed a strategic investment in Inpher, Inc., earlier this year; the New York and Swiss-based company develops privacy-preserving machine learning and analytics solutions that help organizations unlock the value of sensitive, siloed data to enable secure collaboration across organizations. This article is co-authored by Dimitar Jetchev, the cofounder and chief technology officer of Inpher, and Joan Feigenbaum, an Amazon Scholar and the Grace Murray Hopper professor of computer science at Yale University.)

Joan Feigenbaum and Dimitar Jetchev
Dimitar Jetchev (left), the cofounder and chief technology officer of Inpher, and Joan Feigenbaum, the Grace Murray Hopper professor of computer science at Yale University, and an Amazon Scholar, describe the use of privacy-preserving machine learning to address privacy challenges in XGBoost training and prediction.
Credit: Glynis Condon

Machine learning (ML) is increasingly important in a wide range of applications, including market forecasting, service personalization, voice and facial recognition, autonomous driving, health diagnostics, education, and security analytics. Because ML touches so many aspects of our lives, it’s of vital concern that ML systems protect the privacy of the data used to train them, the confidential queries submitted to them, and the confidential predictions they return.

Privacy protection — and the protection of organizations’ intellectual property — motivates the study of privacy-preserving machine learning (PPML). In essence, the goal of PPML is to perform machine learning in a manner that does not reveal any unnecessary information about training-data sets, queries, and predictions.

Suppose, for example, that schools supplied encrypted student records to educational researchers who used them to train ML models. Suppose further that students, parents, teachers, and other researchers could feed encrypted queries to the models and receive encrypted predictions in return. By taking advantage of PPML techniques in this manner, all of the participants could mine the knowledge contained in educational-record databases without compromising the privacy of the data subjects or the data users.

PPML is a very active area, with an eponymous annual workshop and many strong papers in general-ML and security venues. Techniques have been developed for privacy-preserving training and prediction on a wide range of ML model types, e.g., neural nets, decision trees, and logistic-regression formulae.

In the sections below, we describe PPML methods for training and prediction in extreme gradient boosting.

Training

Gradient boosting is an ML method for regression and classification problems that yields a set of prediction trees, typically classification and regression trees (CARTs), which together constitute a model. A CART is a generalization of a binary decision tree; while a binary tree produces a binary output, classifying each input query as a “yes” or “no,” a CART assigns each input query a (real) numerical score.

Interpretation of scores is application dependent. If v is a query, then each CART in the model assigns a score to v, and the final prediction of the model on input v is the sum of these scores. In some applications, the softmax function may be used instead of sum to produce a probability distribution over the predicted output classes.

Extreme gradient boosting (XGBoost) is an optimized, distributed, gradient-boosting framework that is efficient, portable, and flexible. In this section, we consider confidentiality of training data in the creation of XGBoost models for disease prediction — specifically, for prediction of multiple sclerosis (MS).

Early diagnosis and treatment of MS is crucial to prevent degenerative progression of the disease and patient disabilities. A recent paper proposes an early-diagnosis method that applies XGBoost to electronic health records and uses three types of features: diagnostic, epidemiologic, and laboratory.

How cryptographic computing can accelerate the adoption of cloud computing

In a previous Amazon Science article, Joan Feigenbaum reviewed secure multiparty computation and privacy-preserving machine learning – two cryptographic techniques employed to address cloud-computing privacy concerns and accelerate enterprise cloud adoption.

The presence of another neurological disease (e.g., acute disseminated encephalomyelitis (ADEM)) is an example of a diagnostic feature. Epidemiologic features include age, gender, and total number of visits to a hospital. Two more features that are discovered by lab tests are used in the model and referred to as laboratory features: hyperlipidemia (abnormally elevated levels of any or all lipids) and hyperglycemia (elevated blood sugar). The proposed XGBoost model significantly outperforms other ML techniques (including naïve Bayes methods, k-nearest neighbor, and support vector machines) that have been proposed for early diagnosis of MS.

Collecting a sufficient number of high-quality data samples and features to train such a diagnostic model is quite challenging, because the data reside in different private locations. The training data can be split in different ways among these locations: horizontally split, vertically split, or both.

If the private data sources contain samples with the same feature set (as would be the case if, say, the same features are extracted from health records residing in different hospitals), the dataset is said to be horizontally split. The other extreme — vertically split data — occurs when a private data source contributes a new feature for all of the training samples. For example, a health-insurance company could supply reimbursement receipts for past medication (the new feature) to complement the features in clinical health records. In these scenarios, aggregating the training data on a central server violates GDPR regulations.

The figure below illustrates one possible CART in the trained model. The weights at the leaves might indicate probabilities of MS resulting from the various paths from root to leaf.

Classification and regression trees (CART)

Research on privacy-preserving training of XGBoost models for prediction of MS uses two distinct techniques: secure multiparty computation (SMPC) and privacy-preserving federated learning (PPFL). We briefly describe both of them here.

An SMPC protocol enables several parties, each of whom holds a private input, to jointly evaluate a publicly known function on these inputs without revealing anything about the inputs except what is implied by the output of the function. Private inputs are secret shared among the parties, e.g., via additive secret sharing, in which each owner of a private input v generates random “shares” that add up to v.

For instance, suppose that Alice’s private input is v = 5. She can secret share it among herself, Bob, and Charlie by generating two random integers SBob =125621 and SCharlie = 56872, sending Bob’s share to him and Charlie’s to him, and keeping SAlice = v - SBob - SCharlie = -182488. Unless an adversary controls all three parties, he cannot learn anything about Alice’s private input v.  
  
In an execution of an SMPC protocol, the inputs to each elementary operation (addition or multiplication) are secret shared, and the output of the operation is a set of secret shares of the result. We say that a secret-shared value y (which may be the final output of the computation) is revealed to party P if all the parties send their shares to P, thus enabling P to reconstruct y. Further discussion of SMPC and its relevance to cloud computing can be found here and in Inpher’s Secret Computing Explainer Series.

A recent paper by researchers at Inpher proposes an SMPC protocol, called XORBoost, for privacy-preserving training of XGBoost models. It improves the state of the art by several orders of magnitude and ensures that

  • The CARTs computed by the protocol are secret shared among the training-data owners and revealed only to a designated party, namely the data analyst.
  • The training algorithm not only protects the input data but also reveals no information about the paths in the CARTs taken by any of the training samples. 
  • XORBoost supports both numerical and categorical features, thus providing enough flexibility and generality to support the above model.    

XORBoost works well for training datasets of reasonable size — hundreds of thousands of samples and hundreds of features. However, many real-world applications require training on more than a million samples. To achieve that type of scale, one can use federated learning (FL), which is an ML technique used to train a model on data samples held locally by multiple, decentralized edge devices without requiring the devices to exchange the samples.

FL differs from XORBoost mainly in that FL does not perform the entire training exercise on secret-shared values. Rather, each device trains a local model on its local data samples and sends its local model to one or more servers for aggregation. The aggregation protocol typically uses simple operations such as sum, average, and oblivious comparisons but no complex optimization.

If the server receives the plaintext local-model updates from all of the devices, it could, in principle, recover the local training-data samples using model-inversion attacks. SMPC and other privacy-preserving computational techniques can be applied to aggregate local models without revealing them to the server. See the diagram below for the overall architecture. 

XORBoost architecture

Prediction

PPXGBoost is a privacy-preserving version of XGBoost prediction. More precisely, it is a system that supports encrypted queries to encrypted XGBoost models. PPXGBoost is designed for applications that start by training a plaintext model Ω on a suitable training-data set and then create, for each user U, a personalized, encrypted version ΩU of the model to which U will submit encrypted queries and from which she will receive encrypted results. 

PPXGBoost system architecture

The PPXGBoost system architecture is shown in the figure above. On the client side, there is an app with which a user encrypts queries and decrypts results. On the server side, there is a module called Proxy that runs in a trusted environment and is responsible for setup (i.e., creating, for each authorized user, a personalized, encrypted model and a set of cryptographic keys) and an ML module that executes the encrypted queries. PPXGBoost uses two specialized types of encryption schemes (symmetric-key, order-preserving encryption and public-key, additive, homomorphic encryption) to encrypt models and evaluate encrypted queries. Each user is issued keys for both schemes during the setup phase.

Note that PPXGBoost is a natural choice for researchers, clinicians, and patients who wish to make disease predictions repeatedly as the patients’ circumstances change. Potentially relevant changes include exposure to new environmental factors, experimental treatment for another condition, or simply aging. An individual patient can create a personalized, encrypted version of a disease-prediction model and store it on a server owned by the medical center at which he is receiving treatment. Patient and physician can then use it to monitor, in a privacy-preserving manner, changes in the patient’s likelihood of contracting the disease.

Conclusion

We have described the use of PPML to address privacy challenges in XGBoost training and prediction. In a future post, we will elaborate on how privacy-preserving federated learning enables researchers to train more-complex ML models on millions of samples stored on hundreds of thousands of devices.

Related content

US, NY, New York
We are looking for detail-oriented, organized, and responsible individuals who are eager to learn how to work with large and complicated data sets. Some knowledge of econometrics, as well as basic familiarity with Python is necessary, and experience with SQL and UNIX would be a plus. These are full-time positions at 40 hours per week, with compensation being awarded on an hourly basis. You will learn how to build data sets and perform applied econometric analysis at Internet speed collaborating with economists, scientists, and product managers. These skills will translate well into writing applied chapters in your dissertation and provide you with work experience that may help you with placement. Roughly 85% of previous cohorts have converted to full time economics employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com.
FR, Clichy
The role can be based in any of our EU offices. Amazon Supply Chain forms the backbone of the fastest growing e-commerce business in the world. The sheer growth of the business and the company's mission "to be Earth’s most customer-centric company” makes the customer fulfillment business bigger and more complex with each passing year. The EU SC Science Optimization team is looking for a Science leader to tackle complex and ambiguous forecasting and optimization problems for our EU fulfillment network. The team owns the optimization of our Supply Chain from our suppliers to our customers. We are also responsible for analyzing the performance of our Supply Chain end-to-end and deploying Statistics, Econometrics, Operations Research and Machine Learning models to improve decision making within our organization, including forecasting, planning and executing our network. We work closely with Supply Chain Optimization Technology (SCOT) teams, who own the systems and the inputs we rely on to plan our networks, the worldwide scientific community, and with our internal EU stakeholders within Supply Chain, Transportation, Store and Finance. The ideal candidate has a well-rounded-technical/science background as well as a history of leading large projects end-to-end, and is comfortable in developing long term research strategy while ensuring the delivery of incremental results in an ever-changing operational environment. As a Sr. Science Manager, you will lead and grow a high-performing team of data and research scientists, technical program managers and business intelligence engineers. You will partner with operations, finance, store, science and engineering leadership to identify opportunities to drive efficiency improvement in our Fulfillment Center network flows via optimization and scalable execution. As a science leader, you will not only develop optimization solutions, but also influence strategy and outcomes across multiple partner science teams such as forecasting, transportation network design, or modelling teams. You will identify new areas of investment and research and work to align roadmaps to deliver on these opportunities. This role is inherently cross-functional and requires an ability to communicate, influence and earn the trust of science, technical, operations and business leadership.
US, WA, Bellevue
We are looking for detail-oriented, organized, and responsible individuals who are eager to learn how to work with large and complicated data sets. Some knowledge of econometrics, as well as basic familiarity with Python is necessary, and experience with SQL and UNIX would be a plus. These are full-time positions at 40 hours per week, with compensation being awarded on an hourly basis. You will learn how to build data sets and perform applied econometric analysis at Internet speed collaborating with economists, scientists, and product managers. These skills will translate well into writing applied chapters in your dissertation and provide you with work experience that may help you with placement. Roughly 85% of previous cohorts have converted to full time economics employment at Amazon. If you are interested, please send your CV to our mailing list at econ-internship@amazon.com. Key job responsibilities Estimate econometric models using large datasets. Must know SQL and Matlab.
US, WA, Seattle
The AWS AI Labs team has a world-leading team of researchers and academics, and we are looking for world-class colleagues to join us and make the AI revolution happen. Our team of scientists have developed the algorithms and models that power AWS computer vision services such as Amazon Rekognition and Amazon Textract. As part of the team, we expect that you will develop innovative solutions to hard problems, and publish your findings at peer reviewed conferences and workshops. AWS is the world-leading provider of cloud services, has fostered the creation and growth of countless new businesses, and is a positive force for good. Our customers bring problems which will give Applied Scientists like you endless opportunities to see your research have a positive and immediate impact in the world. You will have the opportunity to partner with technology and business teams to solve real-world problems, have access to virtually endless data and computational resources, and to world-class engineers and developers that can help bring your ideas into the world. Our research themes include, but are not limited to: few-shot learning, transfer learning, unsupervised and semi-supervised methods, active learning and semi-automated data annotation, large scale image and video detection and recognition, face detection and recognition, OCR and scene text recognition, document understanding, 3D scene and layout understanding, and geometric computer vision. For this role, we are looking for scientist who have experience working in the intersection of vision and language. We are located in Seattle, Pasadena, Palo Alto (USA) and in Haifa and Tel Aviv (Israel).
US, WA, Seattle
Amazon Prime Video is changing the way millions of customers enjoy digital content. Prime Video delivers premium content to customers through purchase and rental of movies and TV shows, unlimited on-demand streaming through Amazon Prime subscriptions, add-on channels like Showtime and HBO, and live concerts and sporting events like NFL Thursday Night Football. In total, Prime Video offers nearly 200,000 titles and is available across a wide variety of platforms, including PCs and Macs, Android and iOS mobile devices, Fire Tablets and Fire TV, Smart TVs, game consoles, Blu-ray players, set-top-boxes, and video-enabled Alexa devices. Amazon believes so strongly in the future of video that we've launched our own Amazon Studios to produce original movies and TV shows, many of which have already earned critical acclaim and top awards, including Oscars, Emmys and Golden Globes. The Global Consumer Engagement team within Amazon Prime Video builds product and technology solutions that drive customer activation and engagement across all our supported devices and global footprint. We obsess over finding effective, programmatic and scalable ways to reach customers via a broad portfolio of both in-app and out-of-app experiences. We would love to have you join us to build models that can classify and detect content available on Prime Video. We need you to analyze the video, audio and textual signal streams and improve state-of-art solutions while being scalable to Amazon size data. We need to solve problems across many cultures and languages, working alongside an operations team generating labels across many languages to help us achieve these goals. Our team consistently strives to innovate, and holds several novel patents and inventions in the motion picture and television industry. We are highly motivated to extend the state of the art. As a member of our team, you will apply your deep knowledge of Computer Vision and Machine Learning to concrete problems that have broad cross-organizational, global, and technology impact. Your work will focus on addressing fundamental computer vision models like video understanding and video summarization in addition to building appropriate large scale datasets. You will work on large engineering efforts that solve significantly complex problems facing global customers. You will be trusted to operate with independence and are often assigned to focus on areas with significant impact on audience satisfaction. You must be equally comfortable with digging in to customer requirements as you are drilling into design with development teams and developing production ready learning models. You consistently bring strong, data-driven business and technical judgment to decisions. You will work with internal and external stakeholders, cross-functional partners, and end-users around the world at all levels. Our team makes a big impact because nothing is more important to us than pleasing our customers, continually earning their trust, and thinking long term. You are empowered to bring new technologies and deep learning approaches to your solutions. We embrace the challenges of a fast paced market and evolving technologies, paving the way to universal availability of content. You will be encouraged to see the big picture, be innovative, and positively impact millions of customers. This is a young and evolving business where creativity and drive will have a lasting impact on the way video is enjoyed worldwide.
US, CA, Palo Alto
Join a team working on cutting-edge science to innovate search experiences for Amazon shoppers! Amazon Search helps customers shop with ease, confidence and delight WW. We aim to transform Search from an information retrieval engine to a shopping engine. In this role, you will build models to generate and recommend search queries that can help customers fulfill their shopping missions, reduce search efforts and let them explore and discover new products. You will also build models and applications that will increase customer awareness of related products and product attributes that might be best suited to fulfill the customer needs. Key job responsibilities On a day-to-day basis, you will: Design, develop, and evaluate highly innovative, scalable models and algorithms; Design and execute experiments to determine the impact of your models and algorithms; Work with product and software engineering teams to manage the integration of successful models and algorithms in complex, real-time production systems at very large scale; Share knowledge and research outcomes via internal and external conferences and journal publications; Project manage cross-functional Machine Learning initiatives. About the team The mission of Search Assistance is to improve search feature by reducing customers’ effort to search. We achieve this through three customer-facing features: Autocomplete, Spelling Correction and Related Searches. The core capability behind the three features is backend service Query Recommendation.
US, CA, Palo Alto
Amazon is investing heavily in building a world class advertising business and we are responsible for defining and delivering a collection of self-service performance advertising products that drive discovery and sales. Our products are strategically important to our Retail and Marketplace businesses driving long term growth. We deliver billions of ad impressions and millions of clicks daily and are breaking fresh ground to create world-class products. We are highly motivated, collaborative and fun-loving with an entrepreneurial spirit and bias for action. With a broad mandate to experiment and innovate, we are growing at an unprecedented rate with a seemingly endless range of new opportunities. The Ad Response Prediction team in Sponsored Products organization build advanced deep-learning models, large-scale machine-learning (ML) pipelines, and real-time serving infra to match shoppers’ intent to relevant ads on all devices, for all contexts and in all marketplaces. Through precise estimation of shoppers’ interaction with ads and their long-term value, we aim to drive optimal ads allocation and pricing, and help to deliver a relevant, engaging and delightful ads experience to Amazon shoppers. As the business and the complexity of various new initiatives we take continues to grow, we are looking for energetic, entrepreneurial, and self-driven science leaders to join the team. Key job responsibilities As a Principal Applied Scientist in the team, you will: Seek to understand in depth the Sponsored Products offering at Amazon and identify areas of opportunities to grow our business via principled ML solutions. Mentor and guide the applied scientists in our organization and hold us to a high standard of technical rigor and excellence in ML. Design and lead organization wide ML roadmaps to help our Amazon shoppers have a delightful shopping experience while creating long term value for our sellers. Work with our engineering partners and draw upon your experience to meet latency and other system constraints. Identify untapped, high-risk technical and scientific directions, and simulate new research directions that you will drive to completion and deliver. Be responsible for communicating our ML innovations to the broader internal & external scientific community.
US, CA, Palo Alto
We’re working to improve shopping on Amazon using the conversational capabilities of large language models, and are searching for pioneers who are passionate about technology, innovation, and customer experience, and are ready to make a lasting impact on the industry. You'll be working with talented scientists, engineers, and technical program managers (TPM) to innovate on behalf of our customers. If you're fired up about being part of a dynamic, driven team, then this is your moment to join us on this exciting journey!"?
US, CA, Santa Clara
AWS AI/ML is looking for world class scientists and engineers to join its AI Research and Education group working on foundation models, large-scale representation learning, and distributed learning methods and systems. At AWS AI/ML you will invent, implement, and deploy state of the art machine learning algorithms and systems. You will build prototypes and innovate on new representation learning solutions. You will interact closely with our customers and with the academic and research communities. You will be at the heart of a growing and exciting focus area for AWS and work with other acclaimed engineers and world famous scientists. Large-scale foundation models have been the powerhouse in many of the recent advancements in computer vision, natural language processing, automatic speech recognition, recommendation systems, and time series modeling. Developing such models requires not only skillful modeling in individual modalities, but also understanding of how to synergistically combine them, and how to scale the modeling methods to learn with huge models and on large datasets. Join us to work as an integral part of a team that has diverse experiences in this space. We actively work on these areas: * Hardware-informed efficient model architecture, training objective and curriculum design * Distributed training, accelerated optimization methods * Continual learning, multi-task/meta learning * Reasoning, interactive learning, reinforcement learning * Robustness, privacy, model watermarking * Model compression, distillation, pruning, sparsification, quantization About Us Inclusive Team Culture Here at AWS, we embrace our differences. We are committed to furthering our culture of inclusion. We have ten employee-led affinity groups, reaching 40,000 employees in over 190 chapters globally. We have innovative benefit offerings, and host annual and ongoing learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences. Amazon’s culture of inclusion is reinforced within our 14 Leadership Principles, which remind team members to seek diverse perspectives, learn and be curious, and earn trust. Work/Life Balance Our team puts a high value on work-life balance. It isn’t about how many hours you spend at home or at work; it’s about the flow you establish that brings energy to both parts of your life. We believe striking the right balance between your personal and professional life is critical to life-long happiness and fulfillment. We offer flexibility in working hours and encourage you to find your own balance between your work and personal lives. Mentorship & Career Growth Our team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we’re building an environment that celebrates knowledge sharing and mentorship. Our senior members enjoy one-on-one mentoring and thorough, but kind, code reviews. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded engineer and enable them to take on more complex tasks in the future.
US, WA, Seattle
Do you want to join an innovative team of scientists who use machine learning to help Amazon provide the best experience to our Selling Partners by automatically understanding and addressing their challenges, needs and opportunities? Do you want to build advanced algorithmic systems that are powered by state-of-art ML, such as Natural Language Processing, Large Language Models, Deep Learning, Computer Vision and Causal Modeling, to seamlessly engage with Sellers? Are you excited by the prospect of analyzing and modeling terabytes of data and creating cutting edge algorithms to solve real world problems? Do you like to build end-to-end business solutions and directly impact the profitability of the company and experience of our customers? Do you like to innovate and simplify? If yes, then you may be a great fit to join the Selling Partner Experience Science team. Key job responsibilities Use statistical and machine learning techniques to create the next generation of the tools that empower Amazon's Selling Partners to succeed. Design, develop and deploy highly innovative models to interact with Sellers and delight them with solutions. Work closely with teams of scientists and software engineers to drive real-time model implementations and deliver novel and highly impactful features. Establish scalable, efficient, automated processes for large scale data analyses, model development, model validation and model implementation. Research and implement novel machine learning and statistical approaches. Lead strategic initiatives to employ the most recent advances in ML in a fast-paced, experimental environment. Drive the vision and roadmap for how ML can continually improve Selling Partner experience. About the team Selling Partner Experience Science (SPeXSci) is a growing team of scientists, engineers and product leaders engaged in the research and development of the next generation of ML-driven technology to empower Amazon's Selling Partners to succeed. We draw from many science domains, from Natural Language Processing to Computer Vision to Optimization to Economics, to create solutions that seamlessly and automatically engage with Sellers, solve their problems, and help them grow. Focused on collaboration, innovation and strategic impact, we work closely with other science and technology teams, product and operations organizations, and with senior leadership, to transform the Selling Partner experience.