The science behind Amazon SageMaker’s distributed-training engines
How SageMaker’s data-parallel and model-parallel engines make training neural networks easier, faster, and cheaper.
Yesterday at Amazon Web Services’ (AWS’s) annual re:Invent conference, Swami Sivasubramanian, vice president of machine learning at AWS, announced two new features that will make it cheaper and easier for AWS customers to train large, data-intensive neural networks through Amazon SageMaker — a fully managed service that makes it easy for everyday developers to build, train, and deploy machine learning models in the cloud and at the edge.
SageMaker’s data parallelism (SDP) library enables neural-network training to scale with near-linear efficiency, even when a large number of EC2 instances participate in the training. That makes training models on large data sets faster and more cost-effective for customers.
SageMaker’s model parallelism (SMP) library automatically coordinates the training of neural networks that are too large to fit on a single AWS server. Previously, distributing a large network across servers required customers to manually partition the network and hand-tune code. With SMP, all of that happens automatically.
As its name implies, SDP uses data parallelism, in which copies of the same neural network are sent to different distributed-computing nodes, and each node trains its copy on a different batch of data. The results of the separate trainings are then aggregated and distributed, so that all the nodes update their models in the same way.
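The idea can be sketched in a few lines of plain Python. This is a toy illustration, not SDP's implementation: it assumes a one-parameter model y = w * x, and the function names, learning rate, and data are all invented for the example. Each simulated node computes a gradient on its own batch, the gradients are averaged, and every node applies the same update.

```python
def local_gradient(w, batch):
    """Gradient of the mean squared error for y = w * x on one node's batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, batches, lr=0.005):
    """One synchronous step: per-node gradients, then one shared update."""
    grads = [local_gradient(w, b) for b in batches]  # runs in parallel in practice
    avg = sum(grads) / len(grads)                    # aggregation across nodes
    return w - lr * avg                              # identical update on every node


# Four simulated nodes, each holding a different batch drawn from y = 3x.
batches = [[(x, 3.0 * x) for x in range(n, n + 4)] for n in (1, 5, 9, 13)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, batches)
```

Because the batches are the same size, averaging the per-node gradients gives the same result as computing one gradient over all the data, which is why every copy of the model stays in sync. Here w converges toward 3.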
SMP uses model parallelism, in which the neural network itself is broken up across GPUs. The neural network’s operations are parceled out so that each of them is executed by only one of the GPUs.
During training, the GPUs exchange activations (the outputs that one layer of neurons passes to the next) and gradients (the values from which updates to the weights of the connections between neurons are computed). Both forward passes, in which the network produces outputs for specific training examples, and backward passes, in which the network computes gradients, are thus done in a distributed manner.
Data-parallel training often relies on the all-reduce algorithm to aggregate the gradients computed by different GPUs, with their separate batches of training data. With all-reduce, the GPUs themselves pass gradients around, add them together, and redistribute them.
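As a rough sketch of how workers can pass gradients around, add them, and redistribute them, the following simulates the ring variant of all-reduce, one common way to implement the operation. The article does not say which variant a given framework uses, so the ring layout is an assumption, as is the simplification that each gradient vector has exactly one element per worker.

```python
def ring_all_reduce(vectors):
    """Simulate a ring all-reduce (sum) over one equal-length list per worker.

    Simplifying assumption: each vector has exactly one element per worker,
    so "chunk c" is simply element c.
    """
    n = len(vectors)
    data = [list(v) for v in vectors]
    assert all(len(v) == n for v in data)

    # Phase 1, reduce-scatter: each worker sends one chunk to its right-hand
    # neighbor, which adds it in. After n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot the outgoing values so all transfers in a step are simultaneous.
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] += value

    # Phase 2, all-gather: the completed chunks travel around the ring,
    # overwriting the partial sums they meet.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] = value

    return data


grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
result = ring_all_reduce(grads)
```

After the call, every simulated worker holds the element-wise sum [111, 222, 333]. Note that the workers themselves do all the adding and forwarding, which is the work SDP moves off the GPUs.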
SDP instead takes advantage of the topology of the AWS network. An AWS p3dn.24xlarge machine, for instance, consists of eight Nvidia V100 GPUs and 96 virtual CPUs, all with high-speed connections.
SDP offloads most of the responsibility for aggregating gradients to the CPUs, which also transmit gradient updates to the CPUs of other computing nodes. While the CPUs are aggregating and transmitting one batch of gradients, the GPUs can get to work on the next batch. This lets distributed training scale more efficiently.
To communicate gradient updates between CPUs, SDP uses the all-reduce operation. Each virtual CPU waits until it has received a certain number of gradients from the GPUs before passing them along. This ensures that each virtual CPU participates equally in averaging the gradients across nodes, thereby using bandwidth efficiently.
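One way to picture this waiting-and-sharing behavior is a buffer that releases gradients only in fixed-size chunks, each of which is then divided equally among the workers. The sketch below is an interpretation, not SDP's code: the class name, the threshold of eight values, and the four workers are all hypothetical.

```python
class FusionBuffer:
    """Collects gradient values and releases them in fixed-size chunks."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.pending = []

    def add(self, gradient):
        """Append one flattened gradient; return any chunks that are now full."""
        self.pending.extend(gradient)
        ready = []
        while len(self.pending) >= self.threshold:
            ready.append(self.pending[:self.threshold])
            self.pending = self.pending[self.threshold:]
        return ready


def shard_evenly(chunk, num_workers):
    """Split a full chunk into equal shards, one per worker."""
    assert len(chunk) % num_workers == 0, "chunk size must divide evenly"
    size = len(chunk) // num_workers
    return [chunk[i * size:(i + 1) * size] for i in range(num_workers)]


buffer = FusionBuffer(threshold=8)
flushed = []
# Gradients of uneven sizes arrive from the GPUs as the backward pass proceeds.
for grad in ([0.1] * 3, [0.2] * 7, [0.3] * 6):
    flushed.extend(buffer.add(grad))

shards = [shard_evenly(chunk, num_workers=4) for chunk in flushed]
```

Although the incoming gradients have sizes 3, 7, and 6, every released chunk has exactly eight values and every worker receives exactly two of them, so no worker carries more of the averaging load than another.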
In a paper presented in November at the Supercomputing Conference (SC20), AWS researchers described experiments in which they compared their data parallelism scheme to one that used all-reduce within clusters. When training a BERT language model on 512 GPUs, the scheme reduced training time by 44%.
The researchers also conducted experiments in which they used SDP to train Mask-RCNN, a neural network with roughly 44 million parameters, on a computer vision task with about 118,000 training examples. The training time was six minutes and 45 seconds with PyTorch and six minutes and 12 seconds with TensorFlow, approximately 24% better than the previous record.
With model parallelism, the first question is how to divide a neural network up across computing nodes. The answer to that question should balance two objectives. The first is an even distribution of the computational burden: each node should take about as long as each of the others to do its part for the same batch of training data.
The other is a minimization of inter-node communication. In a neural network, the weights of the connections between neurons are represented as tensors, higher-dimension analogues of matrices. To minimize communication overhead, the network should be cut across smaller tensors.
To learn enough about the network to partition it in a principled way, SMP does an initial tracing run to determine both the model topology and important metadata such as the sizes of the trainable parameters, the sizes of exchanged tensors, and the time it takes to execute each component of the model.
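To show how traced metadata could drive a partitioning decision, here is a small brute-force sketch. It is not SMP's algorithm, which the article does not describe: it assumes the model is a simple chain of layers, and the cost function (slowest partition's load plus weighted communication), the parameter names, and the numbers are all made up for illustration.

```python
from itertools import combinations


def best_partition(layer_times, boundary_sizes, num_parts, comm_weight=1.0):
    """Choose cut points for a chain of layers.

    layer_times[i]    : traced execution time of layer i
    boundary_sizes[i] : size of the tensor passed from layer i to layer i + 1
    Returns (cost, cuts); a cut at i separates layer i from layer i + 1.
    """
    n = len(layer_times)
    assert len(boundary_sizes) == n - 1
    best = None
    for cuts in combinations(range(n - 1), num_parts - 1):
        edges = [0] + [c + 1 for c in cuts] + [n]
        # Objective 1: balance. The slowest partition sets the pace.
        loads = [sum(layer_times[a:b]) for a, b in zip(edges, edges[1:])]
        # Objective 2: communication. Each cut ships one tensor between devices.
        comm = sum(boundary_sizes[c] for c in cuts)
        cost = max(loads) + comm_weight * comm
        if best is None or cost < best[0]:
            best = (cost, cuts)
    return best


cost, cuts = best_partition([4, 2, 2, 4], [8, 1, 8], num_parts=2)
```

With these numbers the search cuts between the second and third layers: both halves then take six time units, and the tensor crossing the cut is the smallest of the three candidates. Exhaustive search only works for tiny models; it is used here purely to make the trade-off concrete.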
With model parallelism, the model operations have a sequential dependency: the outputs of the first node pass to the second node, and so on. The only way to achieve parallelism, then, is through pipelining: node 1 processes a batch of inputs and sends its outputs to node 2; as node 2 begins work, node 1 starts on the next batch of inputs; and so on.
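The timing of such a pipeline can be sketched as follows. This is a simplified, forward-pass-only illustration that assumes every node takes exactly one time step per batch; the function name and the choice of three nodes and four batches are hypothetical.

```python
def pipeline_schedule(num_stages, num_batches):
    """Return, for each time step, the (stage, batch) pairs that run in it."""
    steps = []
    for t in range(num_stages + num_batches - 1):
        # Stage s can start batch b only once stage s - 1 has finished it,
        # so batch b reaches stage s at time step s + b.
        active = [(s, t - s) for s in range(num_stages) if 0 <= t - s < num_batches]
        steps.append(active)
    return steps


schedule = pipeline_schedule(num_stages=3, num_batches=4)
for t, active in enumerate(schedule):
    print(t, active)
```

With three nodes and four batches the pipeline finishes in six time steps, where running the batches strictly one after another would take twelve. The first and last steps, in which some nodes sit idle, are the cost of filling and draining the pipeline.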
SMP creates optimized pipeline schedules for a given partition, where forward- and backward-pass computations can be jointly pipelined. For instance, as one GPU works on the forward pass of one batch of data, another might work on the backward pass of another batch. Given the pipeline schedule, SMP orchestrates each training step under the hood, managing all the work across GPUs and transmitting the necessary tensors as needed, using a communication backend optimized for the AWS infrastructure.
Previously, training a three-billion-parameter model on 256 instances would require weeks of manual effort to split the model across GPUs. With SageMaker automating and optimizing the model partitioning, it takes six days.