How SageMaker’s algorithms help democratize machine learning

System enables efficient updating and parallelization and stable scaling.

SageMaker is a service from Amazon Web Services that lets customers quickly and easily build machine learning models for deployment in the cloud. It includes a suite of standard machine learning algorithms such as k-means clustering, principal component analysis, neural topic modeling, and time series forecasting.

Last week at SIGMOD/PODS, the Association for Computing Machinery’s major conference on data systems, my colleagues and I described the design of the system that supports these algorithms.

The contexts in which cloud-based machine learning models operate are rarely static. Models often need updating as new training data becomes available or new use cases arise; some models are updated hourly.

Simply retraining a model on new data, however, risks eroding the knowledge the model has previously acquired. Retraining the model on a combination of both new and old data avoids this problem, but it can be prohibitively time consuming.

The SageMaker system design helps resolve this impasse. It also enables easier parallelization of model training and more efficient optimization of model “hyperparameters”, structural features of the model whose variation can affect performance.

In neural networks, for instance, hyperparameters include features like the number of network layers, the number of nodes per layer, and the network’s learning rate. The optimal settings of a model’s hyperparameters vary from task to task, and tuning hyperparameters to a particular task is typically a tedious, trial-and-error process.

Our system design addresses these problems by distinguishing between a model and the model state. In this context, the state is an executive summary of the data that the model has seen so far.

The system that supports the machine learning algorithms offered through the AWS SageMaker service stores the state of a machine learning model, an executive summary of the data that the model has seen so far (black square). This enables the rapid exploration of different hyperparameters for the model (grey squares).

To take a trivial example, suppose that a model is calculating a running average of an incoming stream of numbers. The state of the model would include both the sum of all the numbers it’s seen and their quantity. If the model stores that state, then, when a new stream of numbers comes in the next week, it can simply continue to increment both values, without needing to re-add the numbers it’s already seen.
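The running-average example can be sketched in a few lines of Python. This is only an illustration of the idea, not SageMaker code: the class name `MeanState` and its `update` and `finalize` methods are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass
class MeanState:
    """Fixed-size summary of every number seen so far (hypothetical name)."""
    total: float = 0.0
    count: int = 0

    def update(self, values):
        # Fold new numbers into the state; earlier numbers are never revisited.
        for v in values:
            self.total += v
            self.count += 1

    def finalize(self):
        # Turn the current state into a usable result at any breakpoint.
        if self.count == 0:
            raise ValueError("no data seen yet")
        return self.total / self.count


state = MeanState()
state.update([2.0, 4.0, 6.0])   # first week's stream
week_one = state.finalize()
state.update([8.0])             # next week's stream reuses the stored state
week_two = state.finalize()
```

The second call to `update` touches only the new number; the sum and count carried in the state stand in for everything that came before.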

Of course, most machine learning models perform tasks that are more complex than simple averaging, and the information that the state must capture will vary from task to task: it could, for instance, include representative samples from the data it’s seen. With SageMaker, we’ve identified separate state variables for each of the machine learning algorithms we support.

One of the advantages of tracking state is model stability. The state is of fixed size: the model may see more and more data, but the state’s summary of the data always takes up the same space in memory.

This means that the cost of training the model, in both time and system resources, scales linearly with the amount of new training data. If training time instead scaled superlinearly, a large enough volume of data could cause the training to time out and therefore fail.
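One well-known way for a state to hold representative samples while staying a fixed size is reservoir sampling. The sketch below assumes that technique purely for illustration; it is not drawn from the SageMaker implementation, and the class name `ReservoirState` and its `capacity` and `seed` parameters are hypothetical.

```python
import random


class ReservoirState:
    """Keeps a uniform random sample of at most `capacity` items from a stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self._rng = random.Random(seed)

    def update(self, item):
        # Constant work per item, so total cost grows linearly with the stream.
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(item)
        else:
            # Keep the new item with probability capacity / seen.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = item


reservoir = ReservoirState(capacity=5)
for x in range(1000):
    reservoir.update(x)
```

However long the stream runs, the state never holds more than `capacity` items, and each update does a constant amount of work.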

The averaging example illustrates another facet of our system: it needs to operate on streaming data. That is, it may see each training example only once, and the sequence of examples may break off at any point. At any such breakpoint, it should be able to synthesize what it’s learned to produce a working, up-to-date model.

Distributed state

Our system supports this learning paradigm. But it also works perfectly well in the standard machine learning setting, where training examples are broken into fixed-size batches, and the model runs through the same training set multiple times until its performance stops improving.

When the system trains a model in parallel, each parallel processor receives its own copy of the state, which it updates locally. To synchronize the locally stored state updates, we use an open-source framework called a parameter server.

The synchronization schedule is again algorithm specific. With k-means clustering and principal component analysis, for instance, a given processor doesn’t need to report its state update to the parameter server until it’s completed all its computations. With a neural network, whose training iteratively searches for a global optimum, synchronization would need to occur much more frequently.
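The "report once, at the end" pattern can be sketched with a single assignment step of textbook k-means on one-dimensional points. This is an assumption-laden illustration, not the SageMaker k-means algorithm and not a real parameter server: `local_update` and `merge` are hypothetical helpers, and the workers are simulated by a plain loop.

```python
def local_update(shard, centroids):
    """One worker: accumulate per-centroid sums and counts for its own shard."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for x in shard:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts


def merge(states):
    """Stand-in for the synchronization step: add the local states together."""
    k = len(states[0][0])
    sums = [sum(s[0][i] for s in states) for i in range(k)]
    counts = [sum(s[1][i] for s in states) for i in range(k)]
    return sums, counts


centroids = [0.0, 10.0]
shards = [[1.0, 2.0, 9.0], [3.0, 11.0, 12.0]]   # one shard per simulated worker

states = [local_update(shard, centroids) for shard in shards]
sums, counts = merge(states)

# New centroid = mean of assigned points; keep the old one if none were assigned.
new_centroids = [s / c if c else old for s, c, old in zip(sums, counts, centroids)]
```

Because the local states combine by simple addition, each worker can finish its whole shard before reporting anything, and the merged result matches what a single pass over all the data would produce.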

Just as the state’s data summaries enable efficient retraining of models, so they enable efficient estimates of the effects of different hyperparameter settings on the model’s performance. Hence SageMaker’s ability to automate hyperparameter tuning.
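To make the connection between state and hyperparameter search concrete, here is a minimal sketch using one-dimensional ridge regression without an intercept, where the hyperparameter is the regularization strength. The choice of model is an assumption made for illustration, not a description of how SageMaker tunes hyperparameters, and `RidgeState`, `fit`, and `squared_error` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RidgeState:
    """Fixed-size summary of (x, y) pairs: three running sums."""
    sxx: float = 0.0
    sxy: float = 0.0
    syy: float = 0.0

    def update(self, x, y):
        self.sxx += x * x
        self.sxy += x * y
        self.syy += y * y

    def fit(self, lam):
        # Closed-form ridge solution for a single weight, computed from the state.
        # Assumes sxx + lam > 0.
        return self.sxy / (self.sxx + lam)

    def squared_error(self, w):
        # Sum of (y - w*x)^2 over the summarized data, expanded into the sums.
        return self.syy - 2.0 * w * self.sxy + w * w * self.sxx


train, held_out = RidgeState(), RidgeState()
for x, y in [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]:
    train.update(x, y)
for x, y in [(4.0, 8.0), (5.0, 10.1)]:
    held_out.update(x, y)

# Try several regularization strengths without touching the raw data again.
candidates = [0.0, 0.1, 1.0, 10.0]
errors = {lam: held_out.squared_error(train.fit(lam)) for lam in candidates}
best_lam = min(errors, key=errors.get)
```

Each candidate setting costs a handful of arithmetic operations on the stored sums, regardless of how many examples the state has summarized.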

In the paper, we report the results of experiments in which we compared our system to some standard implementations of the same machine learning techniques.

We found that, on average, our approach was much more resource efficient. With the linear learner, for instance (an algorithm that learns linear models for tasks such as linear regression and multiclass classification), our approach enabled an eight-fold increase in parallelization efficiency.

And with k-means clustering, a technique for grouping similar data points, our approach enabled a nearly 10-fold increase in training efficiency. Indeed, in our experiments, data sets larger than 100 gigabytes caused existing implementations to crash.
