Amazon scientist Li Zhang and two former colleagues will be honored at INFOCOM 2021 on May 11, 2021, for a paper they wrote 11 years ago that has had a significant impact on the computer networking research community.
The paper, “Improving the Scalability of Data Center Networks with Traffic-aware Virtual Machine Placement,” was first published in the 2010 INFOCOM proceedings and recently received the 2021 IEEE INFOCOM Test of Time Paper Award.
Zhang wrote the award-winning paper with colleagues from IBM Research, where for more than 20 years he focused his research on optimizing the performance of individual computers, clusters of machines, and eventually cloud data centers. The paper, he says, exemplifies the motto he has followed throughout his long research career: Faster, Stronger, and More Efficient. The idea for the paper emerged from a visit Zhang made to a large IBM hosting center, where he witnessed firsthand the challenge of optimizing the utilization of virtual machines.
In their paper, the authors noted that the placement of virtual machines (VMs) on host machines within data centers was typically optimized to save CPU, physical memory, and power, yet failed to consider network resources. As a result, the authors said, VM pairs with heavy traffic between them could end up on host machines with large network costs between them.
“To understand how often this happens in practice,” the authors wrote, “we conducted a measurement study in operational data centers and observed three apparent trends: there is a low correlation between the average pairwise traffic rate and the end-to-end cost; traffic distribution for individual VMs is highly uneven; VM pairs with relatively heavier traffic rate tend to constantly exhibit the higher rate and conversely VM pairs with low traffic rate tend to exhibit the low rate. These three observations suggest that there is a great potential in optimizing VM placement to save bandwidth and realizing such potential is feasible.”
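The trade-off the authors describe can be illustrated with a small sketch. It assumes a simple cost model in which the network cost of a placement is the sum, over all VM pairs, of their traffic rate multiplied by the communication cost between their hosts. The traffic and cost numbers, the function names, and the brute-force search are hypothetical and chosen only for illustration; they are not the algorithm or the data from the paper.

```python
from itertools import permutations


def placement_cost(traffic, cost, placement):
    """Sum of traffic[i][j] * cost between the hosts of VM i and VM j."""
    n = len(traffic)
    return sum(
        traffic[i][j] * cost[placement[i]][placement[j]]
        for i in range(n)
        for j in range(i + 1, n)
    )


def best_placement(traffic, cost):
    """Exhaustive search over one-VM-per-host placements (tiny inputs only)."""
    n = len(traffic)
    return min(
        permutations(range(n)),
        key=lambda p: placement_cost(traffic, cost, p),
    )


# Hypothetical traffic rates between 4 VMs: pairs (0, 1) and (2, 3) are heavy.
traffic = [
    [0, 10, 0, 1],
    [10, 0, 1, 0],
    [0, 1, 0, 10],
    [1, 0, 10, 0],
]

# Hypothetical host-to-host costs: hosts 0 and 1 share a rack, as do 2 and 3.
# Same-rack cost is 1, cross-rack cost is 5.
cost = [
    [0, 1, 5, 5],
    [1, 0, 5, 5],
    [5, 5, 0, 1],
    [5, 5, 1, 0],
]

# A traffic-unaware placement: placement[i] is the host assigned to VM i.
# Here both heavy pairs end up split across racks.
naive = (0, 2, 1, 3)
naive_cost = placement_cost(traffic, cost, naive)

# A traffic-aware placement found by brute force.
aware = best_placement(traffic, cost)
aware_cost = placement_cost(traffic, cost, aware)

print("naive:", naive, naive_cost)
print("aware:", aware, aware_cost)
```

With these made-up numbers, the traffic-unaware placement costs 110 and the best placement costs 30, because it keeps each heavy-traffic pair within a single rack. Exhaustive search is only practical for a handful of VMs; at data center scale a heuristic or approximation would be needed.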
More than 10 years later, Zhang says he’s impressed with how rapidly enterprises have transitioned their workloads to the cloud, and while data center system utilization has improved significantly, “I still think there is room to improve.”
Zhang joined Amazon last year as a principal product manager (technical) for SageMaker JumpStart and Amazon SageMaker built-in algorithms, which help data scientists and machine learning practitioners get started with training and deploying their models, and for the use of reinforcement learning (RL) with Amazon SageMaker.
Zhang says he’s enjoying his role at Amazon because it’s a natural extension of his previous research, which evolved from data center networking to big data analytics, machine learning, and efficient, scale-out training of deep neural networks. Where his research previously focused on the infrastructure level, Zhang says he’s now applying his mathematics and optimization expertise “more at the algorithm or application level, which has more direct benefit to end users. I’m really enjoying that.”
Zhang’s co-authors were Xiaoqiao Meng, now an engineering manager at Facebook, and Vasilis Pappas, now a software engineer at Google. Guoliang Xue, professor of computer science and engineering at Arizona State University, and chair of the IEEE INFOCOM steering committee, informed the authors of their award in a January 29 email.
“This is an extraordinary achievement,” said Xue, adding that the authors will be honored during the opening session of INFOCOM 2021, which will be held virtually this year.