Making DeepSpeed ZeRO run efficiently on more-affordable hardware

Amazon researchers optimize the distributed-training tool to run efficiently on the Elastic Fabric Adapter network interface.

Most modern natural-language-processing applications are built on top of pretrained language models, which encode the probabilities of word sequences for entire languages. Over time, these models have trended larger and larger, into the regime of billions or even trillions of parameters.

Training these models within a reasonable amount of time requires very large computing clusters, whose large communication volume can block computation, resulting in low or inefficient GPU utilization. Communication between the GPUs needs to be carefully managed to avoid becoming a performance bottleneck.

Related content
How SageMaker’s data-parallel and model-parallel engines make training neural networks easier, faster, and cheaper.

Microsoft’s DeepSpeed distributed-training library introduced one such management technique, called the Zero Redundancy Optimizer (ZeRO). ZeRO works by partitioning the state of a machine learning model across distributed workers and fetching the necessary model state from other workers as needed during training. ZeRO has several “stages,” each of which allows the training of larger models through reducing memory requirements, typically at the cost of additional communication volume.

While Microsoft researchers were able to achieve ideal scaling performance with this technique, they reported experiments only on a specialized hypercluster that uses expensive high-speed InfiniBand networking (specifically, an Nvidia DGX system).

To reduce costs for customers in need of high-performance computing, Amazon Web Services (AWS) uses an Elastic Fabric Adapter (EFA) network instead of InfiniBand. The EFA available on instances of AWS’s p4d.24xlarge computational infrastructure has less communication bandwidth than InfiniBand on the Nvidia DGX hypercluster, so we would expect some performance dropoff for bandwidth-intensive tasks. When we tried to reproduce Microsoft’s results, however, we found that the relative dropoff in ZeRO’s third stage was twice that of the dropoff in the second stage.

We profiled the training process to look for bottlenecks and observed that in ZeRO Stage 3, communication dominated training time. We have made a series of optimizations to ZeRO Stage 3 in order to close the performance gap relative to results obtained on InfiniBand-equipped DGX clusters. Below is a table showing the overall performance improvement conferred by our optimizations, measured when training a RoBERTa language model on AWS p4d.24xlarge instances.

ModelNumber of GPUsTFLOPS/GPU
RoBERTa-10B64Optimized: 123 teraflops
Unoptimized: 73 teraflops
RoBERTa-50B64Optimized: 154 teraflops
Unoptimized: 89 teraflops

In January, we merged our optimizations into the DeepSpeed code repository for public use.

Optimizations

Related content
Earlier this year, we reported a speech recognition system trained on a million hours of data, a feat possible through semi-supervised learning, in which training data is annotated by machines rather than by people. These sorts of massive machine learning projects are becoming more common, and they require distributing the training process across multiple processors. Otherwise, training becomes too time consuming.

Our optimizations can roughly be categorized as (1) improving overlap between communication and computation, (2) improving bandwidth utilization, and (3) improving memory efficiency.

Synchronization/Parallelism

Finer-grained synchronization between communication and computation streams

In lower-bandwidth or large clusters where communication times dominate, it is critical to mask communication costs by overlapping computation with communication. Through profiling, we found that this overlapping was limited by ZeRO’s overly coarse synchronization.

This resulted in a suboptimal level of overlapping for two distributed-computing operations: allgather, which aggregates data (in this case, model parameters) from all workers across the network, and reduce-scatter, which reduces data (in this case, summing gradients) across workers. These two operations were causing poor GPU utilization because communication was constantly blocking computation operations. In response, we made significant changes to the parameter gathering and gradient reduce-scatter paths to reduce or remove synchronization while maintaining correctness.

After these changes, we were able to achieve much better overlapping and thus much fewer and smaller computation bubbles.

Precomputation/caching of Python fetching and partitioning decisions 

Related content
“Anytime query” approach adapts to the available resources.

During training, many complex decisions need to be made, relating to which parameters should be fetched, which parameters will be used next, which parameters may be reused soon and should be kept, and which can be released. These operations were slow enough to frequently prevent the Python process from keeping GPUs fed with work, creating large computation bubbles.

We optimized this by precomputing or caching as many decisions as possible, speeding up their computation to the point that it has become a nonfactor for training throughput.

Communication/bandwidth use

Batching allgather/reduce-scatter calls

We found that batching the collective operations — allgather and reduce-scatter — uses bandwidth more efficiently and amortizes the fixed costs of running the computational kernels that execute the operations. To implement collective batching, we flatten tensor data into a single, contiguous buffer to be sent in a single transaction. Each collective requires a special interleaving scheme to ensure that each worker receives the correct data.

Allgather-interleaving.jpeg
Allgather interleaving scheme.
Reduce-scatter interleaving.jpeg
Reduce-scatter interleaving scheme.

Memory

Our implementation of ZeRO, like the Microsoft implementation, uses the Compute Unified Device Architecture (CUDA), Nvidia’s parallel-computing platform. CUDA memory allocations are both synchronous and slow (ignoring the stream-ordered alternatives cudaMallocAsync and cudaMemcpyAsync, which are not yet used in PyTorch), so PyTorch uses a caching allocator to avoid the large costs of constantly reallocating memory. If there are no cached or free blocks for an allocation request, the allocator will flush its cache. This is disastrous for a few reasons:

  • Before the flush can begin, several cudaEventSynchronize calls are necessary to allow computation on held memory to complete. This and the subsequent cudaFree calls can take multiple seconds.
  • Different workers are not guaranteed to flush their caches simultaneously. This means that for any collective, if even a single worker is currently flushing its cache, the other N-1 workers sit blocked waiting for that worker to join. As cluster size increases, so does the probability that at least one worker is flushing its cache for any given collective.
  • After the cache flush, subsequent allocations require cudaMalloc calls, which as mentioned earlier are both synchronous and slow.

For these reasons, memory efficiency is critical for performance.

Memory-efficient batched PyTorch collectives

Although our use of batched collectives significantly reduced kernel launch overhead and improved bandwidth utilization, it also increased memory consumption because of its flattening of batched tensors into an additional buffer.

Related content
New method enables two- to 14-fold speedups over best-performing predecessors.

To avoid redundant flatten operations in PyTorch collectives, we used the *_base variants of the collective operations, which accept pre-flattened tensors, avoiding the need to internally allocate additional flattened buffers. In future work, we plan to use group-based batching operations from the Nvidia Collective Communications Library (NCCL) to eliminate all flattening operations.

More aggressive initialization-time defragmentation of parameter partitions

Even with more than 10GB of free GPU memory, we continued to see evidence of allocator cache flushes, suggesting memory fragmentation. In order to reduce this, we made initialization-time defragmentation changes to move all persisted tensors into a single contiguous buffer.

Miscellaneous

In addition to the optimizations described above, we also

  • optimized gradient normalization by reducing host-device data movement and synchronization and pulling math operations out of a for-loop into a single kernel launch with parallelized computation; and
  • removed tensor operations (.norm()) that were being added to debug messages via string formatting. (These were causing copies from host to device, which meant data movement and host-device synchronization.)

By making DeepSpeed ZeRO Stage 3 performant on widely available public cloud offerings, we hope to further democratize the training of large language models.

Acknowledgments: Zhen Zhang, Stephen Rawls, Yida Wang

Related content

US, CA, Palo Alto
The Amazon Search team creates powerful, customer-focused search and advertising solutions and technologies. Whenever a customer visits an Amazon site worldwide and types in a query or browses through product categories, the Amazon Search services go to work. We design, develop, and deploy high performance, fault-tolerant distributed search systems used by millions of Amazon customers every day. Our team works to maximize the quality and effectiveness of the search experience for visitors to Amazon websites worldwide.
JP, Tokyo
The Amazon Logistics (AMZL) Team is responsible for the acquisition, design, construction, and management of all facilities in the Amazon Delivery Station Network. AMZL is looking for a talented and passionate Data Scientist to help shape its Last Mile business with technical strategies and solutions, by processing, analyzing and interpreting huge data sets. You should be comfortable with ambiguity, problem solving and enjoy working in a fast-paced, diverse and dynamic environment. Using analytical rigor and statistical methods, you mine through data to identify opportunities for Amazon and our delivery channels. And you collaborate with other scientists, engineers, Product and Program Managers to deploy new products and solutions. [More Information] Last Mile Department Data Analyst/BI Engineer Tokyo Office *Amazon is committed to a diverse and inclusive workplace. Amazon is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status. For individuals with disabilities who would like to request an accommodation, visit https://www.amazon.jobs/disability/jp Key job responsibilities Creating a roadmap of the most challenging business questions and use data to articulate possible root cause analysis and solutions Managing and executing entire projects or components of large projects from start to finish including project management, data gathering and manipulation, synthesis and modeling, problem solving, and communication of insights Partnering with Product, Program and Engineering teams to design and run models, research new algorithms, and prove incrementality and drive growth Understanding drivers, impacts, and key influences on seller growth dynamics Developing and scaling end-to-end ML Models and solutions Automating feedback loops for algorithms in production Utilizing Amazon systems and tools to effectively work with terabytes of data About the team Last Mile Execution Analytics (LMEA) team of JP works as an integral part of Amazon Logistics to ensure that its business intelligence, analytics, tools and planning needs are met. By providing information, insight, and decision support, we strive to enable success of all parts of AMZL. Our customer set includes senior management, station operations, external vendors, long-term planning, Ops technology (Voice of the Delivery Station, Voice of the Customer), network planning, and pretty much every BI and Ops teams. Voice of Employee [Work Life Harmony] We believe, it is important to spend private time such as spending time with your family or doing anything you like to spur innovation. Amazon promotes a fulfilling and flexible work style according to the work volume and lifestyle of each employee.
US, CA, San Francisco
About Twitch Launched in 2011, Twitch is a global community that comes together each day to create multiplayer entertainment: unique, live, unpredictable experiences created by the interactions of millions. We bring the joy of co-op to everything, from casual gaming to world-class esports to anime marathons, music, and art streams. Twitch also hosts TwitchCon, where we bring everyone together to celebrate, learn, and grow their personal interests and passions. We’re always live at Twitch. Stay up to date on all things Twitch on Linkedin, Twitter and on our Blog. About the role: Twitch builds data-driven machine learning solutions across several rich problem spaces: Natural Language Processing (NLP), Recommendations, Semantic Search, Classification/Categorization, Anomaly Detection, Forecasting, Safety, and HCI/Social Computing/Computational Social Science. As an Intern, you will work with a dedicated Mentor and Manager on a project in one of these problem areas. You will also be supported by an Advisor and participate in cohort activities such as research teach backs and leadership talks. This position can also be located in San Francisco, CA or virtual. You Will: Solve large-scale data problems. Design solutions for Twitch's problem spaces Explore ML and data research
US, CA, San Francisco
About Twitch Launched in 2011, Twitch is a global community that comes together each day to create multiplayer entertainment: unique, live, unpredictable experiences created by the interactions of millions. We bring the joy of co-op to everything, from casual gaming to world-class esports to anime marathons, music, and art streams. Twitch also hosts TwitchCon, where we bring everyone together to celebrate, learn, and grow their personal interests and passions. We’re always live at Twitch. Stay up to date on all things Twitch on Linkedin, Twitter and on our Blog. About the role: Twitch builds data-driven machine learning solutions across several rich problem spaces: Natural Language Processing (NLP), Recommendations, Semantic Search, Classification/Categorization, Anomaly Detection, Forecasting, Safety, and HCI/Social Computing/Computational Social Science. As an Intern, you will work with a dedicated Mentor and Manager on a project in one of these problem areas. You will also be supported by an Advisor and participate in cohort activities such as research teach backs and leadership talks. This position can also be located in San Francisco, CA or virtual. You Will: Solve large-scale data problems. Design solutions for Twitch's problem spaces Explore ML and data research
LU, Luxembourg
Are you a talented and inventive scientist with a strong passion about modern data technologies and interested to improve business processes, extracting value from the data? Would you like to be a part of an organization that is aiming to use self-learning technology to process data in order to support the management of the procurement function? The Global Procurement Technology, as a part of Global Procurement Operations, is seeking a skilled Data Scientist to help build its future data intelligence in business ecosystem, working with large distributed systems of data and providing Machine Learning (ML) and Predictive Modeling expertise. You will be a member of the Data Engineering and ML Team, joining a fast-growing global organization, with a great vision to transform the Procurement field, and become the role model in the market. This team plays a strategic role supporting the core Procurement business domains as well as it is the cornerstone of any transformation and innovation initiative. Our mission is to provide a high-quality data environment to facilitate process optimization and business digitalization, on a global scale. We are supporting business initiatives, including but not limited to, strategic supplier sourcing (e.g. contracting, negotiation, spend analysis, market research, etc.), order management, supplier performance, etc. We are seeking an individual who can thrive in a fast-paced work environment, be collaborative and share knowledge and experience with his colleagues. You are expected to deliver results, but at the same time have fun with your teammates and enjoy working in the company. In Amazon, you will find all the resources required to learn new skills, grow your career, and become a better professional. You will connect with world leaders in your field and you will be tackling Data Science challenges to ensure business continuity, by taking the right decisions for your customers. As a Data Scientist in the team, you will: -be the subject matter expert to support team strategies that will take Global Procurement Operations towards world-class predictive maintenance practices and processes, driving more effective procurement functions, e.g. supplier segmentation, negotiations, shipping supplies volume forecast, spend management, etc. -have strong analytical skills and excel in the design, creation, management, and enterprise use of large data sets, combining raw data from different sources -provide technical expertise to support the development of ML models to facilitate intelligent digital services, such as Contract Lifecycle Management (CLM) and Negotiations platform -cooperate closely with different groups of stakeholders, e.g. data/software engineers, product/program managers, analysts, senior leadership, etc. to evaluate business needs and objectives to set up the best data management environment -create and share with audiences of varying levels technical papers and presentations -deal with ambiguity, prioritizing needs, and delivering results in a dynamic environment Basic qualifications -Master’s Degree in Computer Science/Engineering, Informatics, Mathematics, or a related technical discipline -3+ years of industry experience in data engineering/science, business intelligence or related field -3+ years experience in algorithm design, engineering and implementation for very-large scale applications to solve real problems -Very good knowledge of data modeling and evaluation -Very good understanding of regression modeling, forecasting techniques, time series analysis, machine-learning concepts such as supervised and unsupervised learning, classification, random forest, etc. -SQL and query performance tuning skills Preferred qualifications -2+ years of proficiency in using R, Python, Scala, Java or any modern language for data processing and statistical analysis -Experience with various RDBMS, such as PostgreSQL, MS SQL Server, MySQL, etc. -Experience architecting Big Data and ML solutions with AWS products (Redshift, DynamoDB, Lambda, S3, EMR, SageMaker, Lex, Kendra, Forecast etc.) -Experience articulating business questions and using quantitative techniques to arrive at a solution using available data -Experience with agile/scrum methodologies and its benefits of managing projects efficiently and delivering results iteratively -Excellent written and verbal communication skills including data visualization, especially in regards to quantitative topics discussed with non-technical colleagues
US, CA, San Francisco
About Twitch Launched in 2011, Twitch is a global community that comes together each day to create multiplayer entertainment: unique, live, unpredictable experiences created by the interactions of millions. We bring the joy of co-op to everything, from casual gaming to world-class esports to anime marathons, music, and art streams. Twitch also hosts TwitchCon, where we bring everyone together to celebrate, learn, and grow their personal interests and passions. We’re always live at Twitch. Stay up to date on all things Twitch on Linkedin, Twitter and on our Blog. About the role: Twitch builds data-driven machine learning solutions across several rich problem spaces: Natural Language Processing (NLP), Recommendations, Semantic Search, Classification/Categorization, Anomaly Detection, Forecasting, Safety, and HCI/Social Computing/Computational Social Science. As an Intern, you will work with a dedicated Mentor and Manager on a project in one of these problem areas. You will also be supported by an Advisor and participate in cohort activities such as research teach backs and leadership talks. This position can also be located in San Francisco, CA or virtual. You Will: Solve large-scale data problems. Design solutions for Twitch's problem spaces Explore ML and data research
US, CA, San Francisco
About Twitch Launched in 2011, Twitch is a global community that comes together each day to create multiplayer entertainment: unique, live, unpredictable experiences created by the interactions of millions. We bring the joy of co-op to everything, from casual gaming to world-class esports to anime marathons, music, and art streams. Twitch also hosts TwitchCon, where we bring everyone together to celebrate, learn, and grow their personal interests and passions. We’re always live at Twitch. Stay up to date on all things Twitch on Linkedin, Twitter and on our Blog. About the role: Twitch builds data-driven machine learning solutions across several rich problem spaces: Natural Language Processing (NLP), Recommendations, Semantic Search, Classification/Categorization, Anomaly Detection, Forecasting, Safety, and HCI/Social Computing/Computational Social Science. As an Intern, you will work with a dedicated Mentor and Manager on a project in one of these problem areas. You will also be supported by an Advisor and participate in cohort activities such as research teach backs and leadership talks. This position can also be located in San Francisco, CA or virtual. You Will: Solve large-scale data problems. Design solutions for Twitch's problem spaces Explore ML and data research
US, CA, San Francisco
About Twitch Launched in 2011, Twitch is a global community that comes together each day to create multiplayer entertainment: unique, live, unpredictable experiences created by the interactions of millions. We bring the joy of co-op to everything, from casual gaming to world-class esports to anime marathons, music, and art streams. Twitch also hosts TwitchCon, where we bring everyone together to celebrate, learn, and grow their personal interests and passions. We’re always live at Twitch. Stay up to date on all things Twitch on Linkedin, Twitter and on our Blog. About the role: Twitch builds data-driven machine learning solutions across several rich problem spaces: Natural Language Processing (NLP), Recommendations, Semantic Search, Classification/Categorization, Anomaly Detection, Forecasting, Safety, and HCI/Social Computing/Computational Social Science. As an Intern, you will work with a dedicated Mentor and Manager on a project in one of these problem areas. You will also be supported by an Advisor and participate in cohort activities such as research teach backs and leadership talks. This position can also be located in San Francisco, CA or virtual. You Will: Solve large-scale data problems. Design solutions for Twitch's problem spaces Explore ML and data research
US, CA, San Francisco
About Twitch Launched in 2011, Twitch is a global community that comes together each day to create multiplayer entertainment: unique, live, unpredictable experiences created by the interactions of millions. We bring the joy of co-op to everything, from casual gaming to world-class esports to anime marathons, music, and art streams. Twitch also hosts TwitchCon, where we bring everyone together to celebrate, learn, and grow their personal interests and passions. We’re always live at Twitch. Stay up to date on all things Twitch on Linkedin, Twitter and on our Blog. About the role: Twitch builds data-driven machine learning solutions across several rich problem spaces: Natural Language Processing (NLP), Recommendations, Semantic Search, Classification/Categorization, Anomaly Detection, Forecasting, Safety, and HCI/Social Computing/Computational Social Science. As an Intern, you will work with a dedicated Mentor and Manager on a project in one of these problem areas. You will also be supported by an Advisor and participate in cohort activities such as research teach backs and leadership talks. This position can also be located in San Francisco, CA or virtual. You Will: Solve large-scale data problems. Design solutions for Twitch's problem spaces Explore ML and data research
US, WA, Seattle
We are a team of doers working passionately to apply cutting-edge advances in deep learning in the life sciences to solve real-world problems. As a Senior Applied Science Manager you will participate in developing exciting products for customers. Our team rewards curiosity while maintaining a laser-focus in bringing products to market. Competitive candidates are responsive, flexible, and able to succeed within an open, collaborative, entrepreneurial, startup-like environment. At the leading edge of both academic and applied research in this product area, you have the opportunity to work together with a diverse and talented team of scientists, engineers, and product managers and collaborate with others teams. Location is in Seattle, US Embrace Diversity Here at Amazon, we embrace our differences. We are committed to furthering our culture of inclusion. We have ten employee-led affinity groups, reaching 40,000 employees in over 190 chapters globally. We have innovative benefit offerings, and host annual and ongoing learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences. Amazon’s culture of inclusion is reinforced within our 14 Leadership Principles, which remind team members to seek diverse perspectives, learn and be curious, and earn trust Balance Work and Life Our team puts a high value on work-life balance. It isn’t about how many hours you spend at home or at work; it’s about the flow you establish that brings energy to both parts of your life. We believe striking the right balance between your personal and professional life is critical to life-long happiness and fulfillment. We offer flexibility in working hours and encourage you to find your own balance between your work and personal lives Mentor & Grow Careers Our team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we’re building an environment that celebrates knowledge sharing and mentorship. Our senior members enjoy one-on-one mentoring and thorough, but kind, code reviews. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded engineer and enable them to take on more complex tasks in the future. Key job responsibilities • Manage high performing engineering and science teams • Hire and develop top-performing engineers, scientists, and other managers • Develop and execute on project plans and delivery commitments • Work with business, data science, software engineer, biological, and product leaders to help define product requirements and with managers, scientists, and engineers to execute on them • Build and maintain world-class customer experience and operational excellence for your deliverables