Near-linear scaling of gigantic-model training on AWS

A new distributed-training library achieves near-linear efficiency in scaling from tens to hundreds of GPUs.

State-of-the-art language models have billions of parameters. Training these models within a manageable time requires distributing the workload across a large computing cluster. Ideally, training time would decrease linearly as the cluster size scales up. However, linear scaling is difficult to achieve because the communication required to coordinate the work of the cluster nodes eats into the gains from parallelization.

Recently, we put some effort into optimizing the communication efficiency of Microsoft’s DeepSpeed distributed-training library, dramatically improving performance for up to 64 GPUs. However, when we scale from tens of GPUs to hundreds, in the public cloud environment, communication overhead again begins to overwhelm efficiency gains.

In a paper that we'll present in 2023 at the International Conference on Very Large Data Bases (VLDB), we propose a method to make model training scale efficiently on hundreds of GPUs in the cloud. We call this method MiCS, because it minimizes communication scale to bring down communication overhead.

Specifically, where existing distributed-training frameworks such as DeepSpeed and FairScale divide a model state across all GPUs, MiCS makes multiple replicas of the model state and partitions each replica within a subset of GPUs. Depending on the model size, a replica may fit on a single computing node — a single machine with high-speed connections between its GPUs — or on multiple nodes.

Thus, in MiCS, frequent communication operations, like parameter gathering, are restricted to a subset of GPUs. In this way, when we scale a cluster up — by adding new replicas across new nodes — the communication latency of frequent communication operations remains fixed, rather than growing with the size of the cluster.

We also reduce the data volume transmitted between nodes in the event that a copy of the model state won’t fit in a single node. Lastly, MiCS includes a gradient synchronization schedule that amortizes expensive gradient synchronization among all workers.

Our experimental results show significant improvement in throughput and scaling efficiency on different-sized BERT models evaluated on clusters consisting of p3dn.24xlarge instances. MiCS is able to achieve near-linear scalability (denoted by the rectangular frames in the figure below) and provides up to 2.82 times the throughput of the second and third stages of the three-stage zero-redundancy optimizer, or ZeRO, the communication management method built into DeepSpeed-v0.5.6.

We have also compared MiCS with our earlier optimizations of ZeRO’s third stage (see figure below), demonstrating improvements even at the lower GPU counts that we investigated previously. We report all these findings in greater detail in a preprint paper on the arXiv.

A comparison of MiCS and our earlier optimizations of DeepSpeed ZeRO’s third stage.

AWS P4d provides up to 400Gbps networking bandwidth for high-performance computing. Unfortunately, a distributed system may not be able to fully utilize that bandwidth because of communication overhead, especially latency, which increases as more GPUs are added to the cluster.

We have deployed MiCS to train proprietary models with up to 175 billion parameters on p4d.24xlarge (40GB A100) and p4de.24xlarge (80GB A100) instances. When training a 175-billion-parameter model with a sequence length of 2,048 on 16 p4de.24xlarge instances, we are able to achieve 169-teraflops (54.2% of the theoretical peak) performance on each GPU. When we train a 100-billion-parameter model on 64 p4d.24xlarge instances (512 A100 GPUs), MiCS maintains over 170 teraflops per GPU (54.5% of the theoretical peak).

When the size of the cluster is scaled from 128 GPUs to 512 GPUs, MiCS achieves 99.4% of the linear-scaling efficiency (as measured by the “weak scaling” metric). In contrast, DeepSpeed ZeRO’s third stage achieves only 72% weak-scaling efficiency and saturates at 62 teraflops per GPU (19.9% of the theoretical peak).
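The utilization percentages quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes a per-GPU theoretical peak of 312 teraflops (the dense FP16/BF16 tensor-core peak of an A100); the article itself does not state the peak value. The `weak_scaling_efficiency` helper uses one common definition (per-GPU throughput at the larger scale divided by per-GPU throughput at the smaller scale, with per-GPU workload held fixed), which may differ in detail from the measurement used in the paper.

```python
# Assumption: dense FP16/BF16 tensor-core peak of one A100, in teraflops.
A100_PEAK_TFLOPS = 312.0


def fraction_of_peak(tflops: float, peak: float = A100_PEAK_TFLOPS) -> float:
    """Achieved per-GPU throughput as a fraction of the theoretical peak."""
    return tflops / peak


def weak_scaling_efficiency(per_gpu_small: float, per_gpu_large: float) -> float:
    """Ratio of per-GPU throughput on the large cluster to the small cluster.

    A value of 1.0 means perfectly linear scaling under this definition.
    """
    return per_gpu_large / per_gpu_small


# Per-GPU throughput figures quoted in the text.
for tflops in (169, 170, 62):
    print(f"{tflops} teraflops -> {fraction_of_peak(tflops):.1%} of peak")
```

Under the 312-teraflops assumption, the three figures come out at 54.2%, 54.5%, and 19.9%, matching the percentages in the text.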

Scale-aware model partitioning

By default, DeepSpeed partitions model states across all devices, a strategy that lowers the memory consumption on each GPU in the cluster but incurs large communication overhead in training. More importantly, the overhead scales up with the size of the cluster, which causes the scalability to drop significantly at large scale.

Instead of partitioning model states to all GPUs, MiCS divides GPUs in the cluster into multiple groups and partitions model states within each group. We call these groups partition groups. Each group holds a complete replica of model states. The following figure gives an example of partition groups consisting of two consecutive GPUs. Those GPUs holding the same part of the model state form another kind of group, a replication group.

The relationship between partition groups and replication groups in MiCS.

Partitioning model states within each partition group confines the most frequent communications, parameter gathering and gradient synchronization, to a fixed number of GPUs. This strategy keeps communication overhead under control and prevents it from growing with the size of the cluster.
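The grouping described in this section can be sketched in a few lines. The function below is a hypothetical illustration, not the MiCS API: `mics_groups`, `world_size`, and `partition_size` are names chosen here, and the sketch assumes that partition groups are made of consecutive ranks, as in the two-GPU example.

```python
def mics_groups(world_size: int, partition_size: int):
    """Return (partition_groups, replication_groups) as lists of rank lists.

    Hypothetical helper: assumes each partition group is a run of
    `partition_size` consecutive ranks.
    """
    if world_size % partition_size != 0:
        raise ValueError("world_size must be a multiple of partition_size")
    num_replicas = world_size // partition_size

    # Each partition group holds one complete replica of the model state,
    # sharded across its members.
    partition_groups = [
        list(range(r * partition_size, (r + 1) * partition_size))
        for r in range(num_replicas)
    ]

    # Ranks at the same offset inside their partition group hold the same
    # shard, so together they form a replication group.
    replication_groups = [
        list(range(offset, world_size, partition_size))
        for offset in range(partition_size)
    ]
    return partition_groups, replication_groups


partition_groups, replication_groups = mics_groups(world_size=8, partition_size=2)
print(partition_groups)    # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(replication_groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Growing the cluster from 8 to 16 GPUs in this sketch adds more partition groups but leaves each group at two ranks, which is why the cost of operations confined to a partition group stays fixed.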

Hierarchical communication strategy

When the memory requirement for a single replica of the model state is larger than the total amount of GPU memory in a single node, we need to store the replica on GPUs spanning multiple nodes. In that case, we have to rely on less-efficient internode communication.

The volume of transmitted data and the latency in a collective communication are determined by the message size and the number of participants. Particularly, the communication volume is proportional to (p - 1)/p, where p denotes the number of participants, and if the participants use the standard ring-shaped communication pattern, the latency has a linear dependency on the number of participants.

The message size cannot be reduced without compromising data integrity, but we can reduce the number of participants in internode communications. This lowers the communication volume factor to (p - k)/p and reduces latency by a factor of p/(p/k + k), where k is the number of GPUs on a single node.

Consider the simple example below, involving two nodes with two GPUs each. The standard ring-shaped communication pattern would aggregate data across nodes (left) by passing messages from each GPU to the next, so a single internode communication involves four GPUs.

MiCS reduces the number of GPUs that participate in any given internode communication.

MiCS, by contrast, executes these internode operations in parallel, so each internode communication involves only two GPUs (right), which exchange only half the information that we want to communicate. Each node then aggregates the internode data locally to assemble the full message. In this case, the communication volume factor is reduced from ¾ ((4-1)/4) to ½ ((4-2)/4).
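The cost factors quoted in this section are easy to evaluate for other cluster shapes. The sketch below simply encodes the three expressions from the text, with p as the number of participating GPUs and k as the number of GPUs per node; the function names are chosen here for illustration. Exact fractions are used so the results match the hand calculation.

```python
from fractions import Fraction


def flat_volume_factor(p: int) -> Fraction:
    """Volume factor (p - 1)/p for a collective over all p participants."""
    return Fraction(p - 1, p)


def hierarchical_volume_factor(p: int, k: int) -> Fraction:
    """Volume factor (p - k)/p when internode traffic is split per node."""
    return Fraction(p - k, p)


def latency_reduction(p: int, k: int) -> Fraction:
    """Latency reduction factor p / (p/k + k) quoted in the text."""
    return Fraction(p) / (Fraction(p, k) + k)


# The two-node, two-GPUs-per-node example from the text.
print(flat_volume_factor(4), hierarchical_volume_factor(4, 2))  # 3/4 1/2

# Illustrative assumption: two nodes with eight GPUs each.
print(flat_volume_factor(16), hierarchical_volume_factor(16, 8))  # 15/16 1/2
print(latency_reduction(16, 8))  # 8/5
```

For the four-GPU toy example the latency factor evaluates to exactly 1, so the benefit there is entirely in volume; the latency gain only appears for larger p and k.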

Two-hop gradient synchronization

Synchronizing gradients among all workers is an expensive operation, required to keep workers working on the same model states. During the training of large neural nets, batch size is typically limited by GPU memory. Gradient accumulation is a technique that splits a batch of samples into several microbatches that will be run sequentially in multiple microsteps.

With MiCS, we can accumulate gradients inside each partition group in multiple microbatches until the last microbatch is processed. That is, for each microstep, we can accumulate the full set of gradients for each model replica inside a subset of GPUs (i.e., a partition group). Then, after the last microbatch is handled, each GPU synchronizes gradients with the other GPUs representing the same part of the model state.

This allows us to amortize the synchronization overhead across replication groups to multiple microsteps. The following figure gives an example of two-hop gradient synchronization for training with four microsteps.

Two-hop gradient synchronization.
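The two-hop schedule can be checked with a small single-process simulation. This is a sketch under stated assumptions, not the MiCS implementation: it assumes consecutive-rank partition groups, models the first hop as a reduction inside each partition group in which every rank keeps only its own shard, and models the second hop as a sum across the replication group. Gradients are plain Python integer lists, and `two_hop_sync` is a name invented for this example.

```python
import random


def two_hop_sync(grads, partition_size):
    """Simulate two-hop gradient synchronization.

    grads[rank][step] is the full-length gradient one rank computes for one
    microbatch. Returns, per rank, the synchronized shard that rank owns.
    """
    world_size = len(grads)
    num_steps = len(grads[0])
    shard_len = len(grads[0][0]) // partition_size

    # Hop 1, every microstep: reduce inside the partition group only; each
    # rank accumulates the shard at its own offset.
    accumulated = [[0] * shard_len for _ in range(world_size)]
    for step in range(num_steps):
        for rank in range(world_size):
            group_start = (rank // partition_size) * partition_size
            lo = (rank % partition_size) * shard_len
            for peer in range(group_start, group_start + partition_size):
                for i in range(shard_len):
                    accumulated[rank][i] += grads[peer][step][lo + i]

    # Hop 2, once after the last microstep: sum across the replication
    # group, i.e. the ranks holding the same shard in other replicas.
    synced = []
    for rank in range(world_size):
        peers = range(rank % partition_size, world_size, partition_size)
        synced.append(
            [sum(accumulated[q][i] for q in peers) for i in range(shard_len)]
        )
    return synced


random.seed(0)
WORLD, PART, STEPS, LENGTH = 4, 2, 4, 6  # illustrative sizes
grads = [
    [[random.randint(-5, 5) for _ in range(LENGTH)] for _ in range(STEPS)]
    for _ in range(WORLD)
]
synced = two_hop_sync(grads, PART)

# Reference: sum every gradient over all ranks and all microsteps.
total = [
    sum(grads[r][s][i] for r in range(WORLD) for s in range(STEPS))
    for i in range(LENGTH)
]
shard_len = LENGTH // PART
for rank in range(WORLD):
    lo = (rank % PART) * shard_len
    assert synced[rank] == total[lo:lo + shard_len]
print("two-hop result matches the global sum")
```

In this simulation the cluster-wide hop runs once per batch rather than once per microstep, which is the amortization the section describes; the per-microstep work stays inside a partition group.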

Together, these three techniques give MiCS strong scalability on large clusters and high training throughput, enabling state-of-the-art performance on AWS p4de.24xlarge machines.

We are working to open-source MiCS for public use, in the belief that it will greatly reduce the time and cost of large-model training on the Amazon EC2 platform. Please refer to our preprint for a more detailed explanation of our system and analysis of its performance.

Acknowledgements: Yida Wang, Justin Chiu, Roshan Makhijani, RJ, Stephen Rawls, Xin Jin
