A screenshot from SageMaker Clarify
SageMaker Clarify is integrated with Amazon SageMaker Data Wrangler, making it easier to identify bias during data preparation. You specify attributes of interest, such as gender or age, and SageMaker Clarify runs a set of algorithms to detect any presence of bias in those attributes.
Credit: AWS

How Clarify helps machine learning developers detect unintended bias

Learn why the science team behind Clarify turned to a concept from 1951 to address a modern complexity.

In his machine learning keynote at re:Invent on Tuesday, Swami Sivasubramanian, vice president of machine learning, Amazon Web Services (AWS), announced Amazon SageMaker Clarify, a new service that helps customers detect statistical bias in their data and machine learning models, and helps explain why their models are making specific predictions. Clarify saves developers time and effort by providing them the ability to better understand and explain how their machine learning models arrive at their predictions.

Developers today contend with both increasingly large volumes of data and more complex machine learning models. To detect bias in those complex models and data sets, developers must rely on open-source libraries replete with custom code recipes that are inconsistent across machine learning frameworks. This tedious approach requires a lot of manual effort and often arrives too late to correct unintended bias.

“If you care about this stuff, it's pretty much a roll-your-own situation right now,” said University of Pennsylvania computer science professor and Amazon Scholar Michael Kearns, who provided guidance to the team of scientists that developed SageMaker Clarify. “If you want to do some practical bias detection, you either need to implement it yourself or go to one of the open-source libraries, which vary in quality. They're frequently not well-maintained or documented. In many cases, it's just, ‘Here is the code we used to run our experiments for this academic paper, good luck.’”

SageMaker Clarify helps address the challenges of relying on multiple open-source libraries by offering robust, reliable code in an integrated, cloud-based framework.

Increasingly complex networks

The efficacy of machine learning models depends in part on understanding how much influence a given input has on the output.

“A lending model for consumer loans might include credit history, employment history, and how long someone has lived at their current address,” Kearns explained. “It might also utilize variables that aren't specifically financial, such as demographic variables. One thing you might naturally want to know is which of these variables is more important in the model’s predictions, which may be used in lending decisions, and which are less important.”

With linear models, each variable is assigned some weight, positive or negative, and the overall decision is a sum of those weighted inputs. In those cases, the inputs with the bigger weights clearly have more influence on the output.
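As a rough illustration of the linear case described above, here is a minimal sketch. The feature names, weights, and applicant values are hypothetical, chosen only to echo the lending example, and are not drawn from any real model.

```python
# Hypothetical weights for a toy linear lending score (illustrative values only).
weights = {
    "credit_history_years": 0.8,
    "employment_years": 0.5,
    "years_at_address": 0.1,
}
bias = -2.0


def linear_score(applicant):
    """Return the bias term plus the weighted sum of the inputs."""
    return bias + sum(weights[name] * applicant[name] for name in weights)


def rank_by_weight():
    """Rank features by the absolute size of their weight, largest first."""
    return sorted(weights, key=lambda name: abs(weights[name]), reverse=True)


# Hypothetical applicant record.
applicant = {
    "credit_history_years": 5.0,
    "employment_years": 3.0,
    "years_at_address": 2.0,
}
score = linear_score(applicant)
ranking = rank_by_weight()
print(score, ranking)
```

Reading influence directly off the weights assumes the inputs are on comparable scales; if they are not, the raw weights can be misleading even for a linear model.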

However, that approach falls short with neural networks or more complicated, non-linear models. “When you get to models like neural networks, it's no longer a simple matter of determining or measuring the influence of an input on the output,” Kearns said.

To help account for the growing complexity of modern machine learning models, the Amazon science team looked to the past — specifically to an idea from 1951.

Shapley values

The team wanted to design a solution that helps machine learning practitioners better explain their models’ decisions in the face of growing complexity. They found inspiration in a popular scientific method called Shapley values.

Shapley values are named in honor of Lloyd Shapley, who introduced the idea in 1951 and won the Nobel Prize in Economics in 2012. The Shapley value approach, which is rooted in game theory, considers the model's output across many possible combinations of inputs and offers “the average marginal contribution of a feature value across all possible coalitions.” The comprehensive nature of the approach means it can help provide a framework for understanding the relative weight of a set of inputs, even across complex models and multiple inputs.
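To make "the average marginal contribution of a feature value across all possible coalitions" concrete, here is a minimal sketch that computes exact Shapley values by brute-force enumeration. The toy `model`, the `baseline` and `applicant` records, and the convention of replacing absent features with baseline values are all assumptions made for illustration. This is not how SageMaker Clarify is implemented; exact enumeration grows exponentially with the number of features, so practical tools rely on approximations.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every coalition of the other features."""
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for a coalition S of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of f when it joins coalition s.
                total += weight * (value_fn(s | {f}) - value_fn(s))
        result[f] = total
    return result


# Hypothetical non-linear model with an interaction between two inputs.
def model(x):
    return (2.0 * x["credit"] + 1.0 * x["employment"]
            + 3.0 * x["credit"] * x["employment"] + 0.5 * x["address"])


baseline = {"credit": 0.0, "employment": 0.0, "address": 0.0}   # assumed reference input
applicant = {"credit": 1.0, "employment": 1.0, "address": 1.0}  # input being explained


def value_fn(coalition):
    """Model output when only the features in `coalition` take the applicant's values."""
    x = {name: (applicant[name] if name in coalition else baseline[name])
         for name in applicant}
    return model(x)


phi = shapley_values(value_fn, list(applicant))
print(phi)
```

In this toy setup the interaction term is split evenly between the two features that share it, and the attributions sum to the difference between the model's output for the applicant and its output for the baseline.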

“SageMaker Clarify utilizes Shapley values to essentially take your model and run a number of experiments on it or on your data set,” Kearns said. “It then uses that to help come up with a visualization and quantification of which of those inputs is more or less important.”

Nor does it matter which kind of model a developer uses. “One of the nice things about this approach is it is model agnostic,” Kearns said. “It performs input-output experiments and gives you some sense of the relative importance of the different inputs to the output decision.”

The science team also worked to be certain SageMaker Clarify had a comprehensive view. They designed it so everyday developers and data scientists can detect bias across the entire machine learning workflow — including data preparation, training, and inference. SageMaker Clarify is able to achieve that comprehensive view, Kearns explained, because (again) it is model agnostic. “Each of these steps has been designed to avoid making strong assumptions about the type of model that the user is building.”

Bias detection and explainability

Model builders who learn that their models are making predictions that are strongly correlated with a specific input may find those predictions fall short of their definition of fairness. Kearns offered the example of a lending company that discovers its model’s predictions are skewed. “That company will want to understand why its model is making predictions that might lead to decisions to give loans at a lower rate to group A than to group B, even if they're equally creditworthy.”

SageMaker Clarify can examine tabular data and help the modelers spot where gaps might exist. “This company would upload a spreadsheet of data showing who they gave loans to, what they knew about them, et cetera,” Kearns said. “What the data bias detection part does is say, ‘For these columns, there may be over- or underrepresentation of certain features, which could lead to a discriminatory outcome if not addressed.’”

A screenshot from SageMaker Clarify
SageMaker Clarify is integrated with SageMaker Model Monitor, enabling you to configure alerting systems like Amazon CloudWatch to notify you if your model exceeds certain bias metric thresholds. 
Credit: AWS

Such imbalances can stem from a number of factors, including simply lacking the right data to make accurate predictions. For example, SageMaker Clarify can indicate whether modelers have enough data on certain groups of applicants to expect an accurate prediction. The metrics provided by SageMaker Clarify can then be used to correct unintended bias in machine learning models, and to automatically monitor model predictions in production to help ensure they are not trending toward biased outcomes.
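The kind of representation check described above can be sketched with two simple dataset statistics: a normalized difference in group sizes, and the gap in positive-label rates between groups. The records, column names, and function names below are hypothetical, and the formulas are a simplified illustration rather than the service's own implementation.

```python
def class_imbalance(records, facet, group_value):
    """Normalized size difference between the other records and `group_value`.

    Returns a value in [-1, 1]; values near 0 mean the groups are similar in size.
    """
    n_d = sum(1 for r in records if r[facet] == group_value)
    n_a = len(records) - n_d
    if n_a + n_d == 0:
        raise ValueError("no records to compare")
    return (n_a - n_d) / (n_a + n_d)


def positive_rate_gap(records, facet, group_value, label):
    """Difference in the share of positive labels: other records minus `group_value`."""
    group = [r for r in records if r[facet] == group_value]
    rest = [r for r in records if r[facet] != group_value]
    if not group or not rest:
        raise ValueError("both groups need at least one record")
    rate_group = sum(r[label] for r in group) / len(group)
    rate_rest = sum(r[label] for r in rest) / len(rest)
    return rate_rest - rate_group


# Hypothetical loan records: 6 applicants from group A, 2 from group B.
records = (
    [{"group": "A", "approved": 1}] * 4
    + [{"group": "A", "approved": 0}] * 2
    + [{"group": "B", "approved": 1}] * 1
    + [{"group": "B", "approved": 0}] * 1
)

ci = class_imbalance(records, "group", "B")
gap = positive_rate_gap(records, "group", "B", "approved")
print(ci, gap)
```

Here group B makes up only a quarter of the records, which is the sort of underrepresentation that can leave a model with too little data to predict accurately for that group.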

Future applications

The SageMaker Clarify science team is already looking to the future.

Their research areas include algorithmic fairness and machine learning, as well as explainable AI. Team members have published widely in the academic literature on these topics, and worked hard in the development of SageMaker Clarify to balance the science of fairness with engineering solutions and practical product design. Their approaches are both statistical and causal, and focus not only on bias measurement in trained models, but also bias mitigation. It is that last part that has Kearns particularly excited about the future.

“The ability to not just identify problems in your models, but also have the tools to train them in a different way would go a long way toward mitigating that bias,” he said. “It’s good to know that you have a problem, but it's even better to have a solution to your problem.”

Best practices

“The notions of bias and fairness are highly application dependent and the choice of the attributes for which bias is to be measured, as well as the choice of the bias metrics, may need to be guided by social, legal, and other non-technical considerations,” said principal applied scientist Krishnaram Kenthapadi, who led the scientific effort behind SageMaker Clarify. “For successful adoption of fairness-aware machine learning and explainable AI approaches in practice, it’s important to build consensus and achieve collaboration across key stakeholders such as product, policy, legal, engineering, and AI/ML teams, as well as end users and communities,” he said. “Further, it’s good to take into account fairness and explainability considerations during each stage of the ML lifecycle, for example, Problem Formation, Dataset Construction, Algorithm Selection, Model Training Process, Testing Process, Deployment, and Monitoring/Feedback.”

Find more best practices on the AWS website.
