How to train large graph neural networks efficiently
New method enables two- to 14-fold speedups over best-performing predecessors.
As Amazon Scholar Chandan Reddy recently observed, graph neural networks are a hot topic at this year’s Conference on Knowledge Discovery and Data Mining (KDD). Graph neural networks create embeddings, or vector representations, of the nodes and edges of a graph, enabling new analyses, such as link prediction.
The embedding of a node typically factors in information not only about that node but also about its immediate neighbors and, often, their neighbors, too. In many real-world cases — the graph of Twitter users and their followers, for instance — a given node can have thousands or even millions of connections. In such cases, it’s not practical to account for all of a node’s neighbors. Instead, researchers have developed sampling methods that select subsets of the immediate neighbors for use in embedding.
In a paper we presented at KDD, my colleagues and I describe a new sampling strategy for training graph neural network models with a combination of CPUs and GPUs. In that context — which is common in real-world applications — our method reduces the amount of data transferred from CPU to GPU, greatly improving efficiency. In experiments, our method was two to 14 times as fast as prior methods, depending on the datasets used, while achieving accuracies that were as high or even higher.
Mixed CPU-GPU training
GPUs offer the most efficient way to perform the tensor operations used to train neural networks, but they have limited memory. To train graph neural networks on graphs that are too large to fit in GPU memory, we typically use the CPU to create minibatches of randomly selected graph nodes and edges, which we send to the GPU, along with data describing each node — the node features.
To generate a minibatch, we need to sample neighbors for each target node — each node that’s being embedded — and, if necessary, neighbors of the sampled neighbors as well. This recursive neighbor sampling generates minibatches that require a large transfer of data between the CPU and the GPU. In our paper, we report a set of experiments showing that, with existing sampling strategies, copying node features of a minibatch from CPU to GPUs is the single most time-consuming aspect of model training.
Global neighbor sampling
Our sampling approach, which we call GNS, for global neighbor sampling, dramatically reduces the amount of data transferred from the CPU to the GPU.
The basic strategy is that, before creating a minibatch, we sample a set of nodes from the entire graph and load their features into GPU memory; we call this collection of node features the cache. When creating a minibatch, we sample neighbors of a node by simply retrieving the neighbors that are already in the cache. Only if the cache doesn’t contain enough neighbor nodes do we fetch additional nodes from the CPU.
To increase the likelihood that the relevant neighbors will be found in the cache, we preferentially sample nodes of high degree — that is, nodes with a large number of connections to other nodes. Sampling likelihood is proportional to the node degree, so the cache will still include a number of relatively low-degree nodes.
Because we know the probabilities of sampling the nodes in the cache, during embedding, we can weight the cached nodes to ensure a good approximation of the embedding that would have resulted from factoring in all of the neighbors.
In the paper, we prove that this approach will converge to the optimal model performance as efficiently as one that uses truly random sampling. This means that neither the bias toward high-degree nodes nor the reuse of the same cache for many minibatches should compromise model performance.
One important consideration is how to efficiently identify the nodes in the cache that are relevant for a given minibatch. Potentially, we could compute the overlap between the list of neighbors for a given node and the list of nodes in the cache. However, that computation is expensive. Instead, on the CPU, we create a subgraph that consists of all the nodes in the cache and all their immediate neighbors. When we assemble a minibatch, we simply look up the cached neighbors from the subgraph for each of its nodes.
In experiments, we compared our sampling strategy to three other methods on five datasets and found that, in the mixed CPU-GPU setting, ours was at least twice as fast as the second-best strategy on every data set. Two of the three sampling strategies were consistently an order of magnitude slower than ours when trained to achieve comparable accuracies.
In our experiments, we restricted ourselves to a single CPU and a single GPU. In ongoing work, we are considering how to generalize the method to multiple GPUs and distributed training. For instance, can we cache different sets of nodes on different GPUs and efficiently target each minibatch to the GPU whose cache offers the best match?