How to reduce communication overhead for database queries by up to 97%

Amazon researchers describe new method for distributing database tables across servers.

Relational databases are typically composed of many different tables: one table might store contact data for a company’s customers; another might store data about all the company’s retail stores; another might store individual customers’ purchase histories; another might log details about customer service calls; and so on.

Customers who use the Amazon Redshift cloud data warehouse service from Amazon Web Services often have databases that consist of thousands of tables, which are constantly being updated and expanded. These tables naturally have to be distributed across multiple servers in AWS data centers.

At the 46th International Conference on Very Large Data Bases (VLDB), my colleagues (Yonatan Naamad, Peter Van Bouwel, Christos Faloutsos, and Michalis Petropoulos) and I presented a new method for allocating data across servers. In experiments involving queries that retrieved data from multiple tables, our method reduced communication overhead by as much as 97% relative to the original, unoptimized configuration.

For the last year, Amazon Redshift Advisor has used this method to recommend data storage configurations to our customers, enabling them to perform more efficient database queries.

An example of a real join multigraph. The thickness of the lines indicates the data transfer required by joins on particular attributes.

To get a sense of the problem our method addresses, consider a company that wishes to inform customers about sales at their local stores. That requires a database query that pulls customer data from the customer table and sale data from a store table.

To find the right store for each customer, the query matches entries from both tables by city. The query thus performs a join operation using the attribute “city”.

One standard way to distribute database tables across servers is to use distribution keys. For each data entry in a table (that is, each row of the table), a hash function is applied to one value of the entry — the distribution key. The hash function maps that value to the address of a server on the network, which is where the table row is stored.
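As a rough illustration of hash-based placement, here is a minimal Python sketch. The function name `server_for`, the choice of SHA-256, and the sample rows are assumptions made for the example; they are not a description of how Amazon Redshift hashes rows internally.

```python
import hashlib

def server_for(key_value: str, num_servers: int) -> int:
    """Map a distribution-key value to a server index using a stable hash."""
    digest = hashlib.sha256(key_value.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and fold it into the server range.
    return int.from_bytes(digest[:8], "big") % num_servers

# Hypothetical customer rows; "city" plays the role of the distribution key.
rows = [
    {"customer": "Ana", "city": "Seattle"},
    {"customer": "Bo", "city": "Austin"},
    {"customer": "Cy", "city": "Seattle"},
]

placement = {row["customer"]: server_for(row["city"], 4) for row in rows}
print(placement)
```

Rows that share a key value always land on the same server, which is the property the rest of the method relies on.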

In our example of a join operation, if the distribution key for both the customer table and the store table is the attribute “city”, then all customer entries and store entries that share a city will be stored on the same server. Each server then contains enough information to perform the join independently and in parallel with the other servers, without the need for data reshuffling at query time.
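The co-location argument can be checked on a toy example. The sketch below is illustrative only: the helper names (`server_for`, `distribute`, `local_join`), the sample tables, and the server count are all assumptions. It distributes both tables by "city", joins each server's rows locally, and compares the combined result with a join over the undistributed tables.

```python
import hashlib
from collections import defaultdict

def server_for(value: str, num_servers: int) -> int:
    """Stable hash of a distribution-key value to a server index."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers

def distribute(rows, key, num_servers):
    """Group rows by the server that their distribution key hashes to."""
    shards = defaultdict(list)
    for row in rows:
        shards[server_for(row[key], num_servers)].append(row)
    return shards

def local_join(customers, stores, key):
    """Nested-loop equi-join on `key`, run on one server's rows only."""
    return [
        (customer["name"], store["store"])
        for customer in customers
        for store in stores
        if customer[key] == store[key]
    ]

customers = [
    {"name": "Ana", "city": "Seattle"},
    {"name": "Bo", "city": "Austin"},
    {"name": "Cy", "city": "Seattle"},
]
stores = [
    {"store": "Pike St", "city": "Seattle"},
    {"store": "6th St", "city": "Austin"},
    {"store": "Main St", "city": "Boston"},
]

NUM_SERVERS = 3
customer_shards = distribute(customers, "city", NUM_SERVERS)
store_shards = distribute(stores, "city", NUM_SERVERS)

# Each server joins only the rows it already holds; no rows move between servers.
distributed = sorted(
    pair
    for server in range(NUM_SERVERS)
    for pair in local_join(customer_shards[server], store_shards[server], "city")
)
centralized = sorted(local_join(customers, stores, "city"))
print(distributed == centralized)
```

Because matching rows hash to the same server, the union of the per-server joins equals the full join.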

This is the basis of our method. Essentially, we analyze the query data for a particular database and identify the attributes whose joins involve the largest data transfers; then we use those attributes as the distribution keys for the associated tables.

The join multigraph

The first step in this process is to create what we call a join multigraph. This is a graph in the graph-theoretical sense: a data structure consisting of vertices — often depicted as circles — and edges — usually depicted as line segments connecting vertices. The edges may also have numbers associated with them, known as weights.

In the join multigraph, the vertices are tables of a database. The edges connect attributes of separate tables on which join operations have been performed, and the edge weights indicate the data transfer required by joins between these attributes.
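One way such a graph could be represented is as a mapping from attribute pairs to accumulated weights. The following is a sketch under assumptions: the log format, the table and attribute names, and the byte counts are hypothetical, and the paper may define edge weights differently.

```python
from collections import defaultdict

# Hypothetical join log: (table_a, attr_a, table_b, attr_b, bytes_transferred)
join_log = [
    ("customer", "city", "store", "city", 500),
    ("customer", "id", "purchase", "customer_id", 900),
    ("customer", "city", "store", "city", 300),
    ("store", "id", "purchase", "store_id", 200),
]

def build_join_multigraph(log):
    """Return {((table, attr), (table, attr)): total_weight} for the observed joins."""
    weights = defaultdict(int)
    for table_a, attr_a, table_b, attr_b, cost in log:
        # Sort the endpoints so that A.x = B.y and B.y = A.x map to the same edge.
        edge = tuple(sorted([(table_a, attr_a), (table_b, attr_b)]))
        weights[edge] += cost
    return dict(weights)

graph = build_join_multigraph(join_log)
for (end_a, end_b), weight in graph.items():
    print(end_a, end_b, weight)
```

Two tables can be linked by several edges, one per attribute pair, which is what makes the structure a multigraph rather than a simple graph.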

Our goal is now to partition the graph into pairs of vertices, each connected by a single edge (a single attribute pair), such that we maximize the cumulative weight of all the selected edges. Unfortunately, in our paper, we show that this problem is NP-complete, meaning that no known algorithm is guaranteed to solve every instance exactly in a practical amount of time.

An example of a simple join multigraph (left) and two different partitions of it, using different distribution keys. The nodes (circles) are tables, and the smaller letters indicate the attributes on which join operations have been performed. In the first graph, the thickness of the lines indicates the data transfer required by joins between the associated attributes. The distribution keys selected for the first partition (red circles) yield greater savings in communication overhead than those selected in the second partition (green circles).
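To make the comparison between key selections concrete, the sketch below scores a selection and finds the best one by exhaustive search on a tiny graph. It assumes one particular reading of the objective: each table gets one distribution key, and an edge's weight counts as saved only when both of its endpoint attributes are the chosen keys. The paper gives the formal definition; the names `saved_weight` and `best_keys_brute_force` and all weights here are hypothetical.

```python
from itertools import product

# Edges of a toy join multigraph: ((table, attr), (table, attr)) -> weight.
edges = {
    (("A", "x"), ("B", "x")): 10,
    (("A", "x"), ("C", "x")): 6,
    (("B", "y"), ("C", "y")): 12,
    (("A", "z"), ("C", "z")): 4,
}

def saved_weight(edges, keys):
    """Total weight of edges whose two endpoint attributes are both chosen keys."""
    return sum(
        weight
        for ((t1, a1), (t2, a2)), weight in edges.items()
        if keys.get(t1) == a1 and keys.get(t2) == a2
    )

def best_keys_brute_force(edges):
    """Try every combination of one key per table; feasible only for tiny graphs."""
    candidates = {}
    for end1, end2 in edges:
        for table, attr in (end1, end2):
            candidates.setdefault(table, set()).add(attr)
    tables = sorted(candidates)
    best, best_score = None, -1
    for combo in product(*(sorted(candidates[t]) for t in tables)):
        keys = dict(zip(tables, combo))
        score = saved_weight(edges, keys)
        if score > best_score:
            best, best_score = keys, score
    return best, best_score

best, score = best_keys_brute_force(edges)
print(best, score)
```

The number of combinations grows exponentially with the number of tables, which is why exhaustive search does not scale to databases with thousands of tables.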

We also show, however, that the optimization technique known as integer linear programming may, for any given instance of the problem, yield an optimal solution in a reasonable amount of time. So the first step in our method is to try to partition the graph using integer linear programming, with a limit on how much time the solver can spend on the problem.
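For readers unfamiliar with the technique, one plausible way to write the problem as an integer linear program is sketched below. This is an assumed formulation for illustration, not necessarily the one used in the paper. Here $T$ is the set of tables, $A_t$ the candidate attributes of table $t$, $E$ the set of edges, $w_e$ the weight of edge $e$, and $(t_1(e), a_1(e))$ and $(t_2(e), a_2(e))$ the two endpoints of $e$; the variable $x_{t,a}$ is 1 when attribute $a$ is chosen as the key of table $t$, and $y_e$ is 1 when edge $e$ is covered by the chosen keys.

```latex
\begin{aligned}
\max \quad & \sum_{e \in E} w_e \, y_e \\
\text{s.t.} \quad & \sum_{a \in A_t} x_{t,a} \le 1 && \forall t \in T \\
& y_e \le x_{t_1(e),\,a_1(e)}, \quad y_e \le x_{t_2(e),\,a_2(e)} && \forall e \in E \\
& x_{t,a} \in \{0,1\}, \quad y_e \in \{0,1\}
\end{aligned}
```

The first constraint allows at most one distribution key per table, and the second lets an edge contribute its weight only if both of its endpoint attributes are selected.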

If the solver times out, then our next step is to use four different heuristics to partition the graph, and we select the one that yields the greatest cumulative weight. We call our method the best-of-all-worlds approach, since it canvasses five different possibilities and chooses the one that works best.
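The control flow of this fallback can be sketched as follows. Everything here is a stand-in: `exact_with_deadline` is an exhaustive search used in place of a real integer-linear-programming solver, and `heaviest_edge_first` and `first_attribute` are hypothetical toy heuristics, not the four heuristics from the paper. The objective is the assumed one in which an edge counts only when both endpoint attributes are chosen keys.

```python
import time
from itertools import product

edges = {
    (("A", "x"), ("B", "x")): 10,
    (("A", "x"), ("C", "x")): 6,
    (("B", "y"), ("C", "y")): 12,
    (("A", "z"), ("C", "z")): 4,
}

def saved_weight(edges, keys):
    """Weight of edges whose two endpoint attributes are both chosen keys."""
    return sum(w for ((t1, a1), (t2, a2)), w in edges.items()
               if keys.get(t1) == a1 and keys.get(t2) == a2)

def candidate_attrs(edges):
    """Collect, per table, the attributes that appear in at least one join."""
    attrs = {}
    for end1, end2 in edges:
        for table, attr in (end1, end2):
            attrs.setdefault(table, set()).add(attr)
    return {t: sorted(a) for t, a in attrs.items()}

def exact_with_deadline(edges, deadline):
    """Exhaustive search standing in for an ILP solver; None means timed out."""
    attrs = candidate_attrs(edges)
    tables = sorted(attrs)
    best, best_score = None, -1
    for combo in product(*(attrs[t] for t in tables)):
        if time.monotonic() >= deadline:
            return None
        keys = dict(zip(tables, combo))
        score = saved_weight(edges, keys)
        if score > best_score:
            best, best_score = keys, score
    return best

def heaviest_edge_first(edges):
    """Toy heuristic: accept edges by descending weight when keys stay consistent."""
    keys = {}
    for ((t1, a1), (t2, a2)), _ in sorted(edges.items(), key=lambda kv: -kv[1]):
        if keys.get(t1, a1) == a1 and keys.get(t2, a2) == a2:
            keys[t1], keys[t2] = a1, a2
    return keys

def first_attribute(edges):
    """Toy heuristic: give each table its alphabetically first candidate attribute."""
    return {t: a[0] for t, a in candidate_attrs(edges).items()}

def best_of_all_worlds(edges, heuristics, time_limit_s):
    """Use the exact answer if it arrives in time, else the best heuristic answer."""
    deadline = time.monotonic() + time_limit_s
    exact = exact_with_deadline(edges, deadline)
    if exact is not None:
        return exact, "exact"
    best = max((h(edges) for h in heuristics), key=lambda k: saved_weight(edges, k))
    return best, "heuristic"

heuristics = [heaviest_edge_first, first_attribute]
print(best_of_all_worlds(edges, heuristics, time_limit_s=60))
print(best_of_all_worlds(edges, heuristics, time_limit_s=0))  # forces the fallback
```

Passing a time limit of zero forces the timeout path, which makes the fallback easy to exercise in isolation.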

All four heuristics are approximate solutions of the maximum-weight matching problem, which we prove to be a special case of the problem we’re trying to solve (the distribution key recommendation problem).
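For background, the textbook greedy approximation for maximum-weight matching is sketched below: sort the edges by weight and keep each edge whose endpoints are both still unmatched. It is guaranteed to reach at least half of the optimal weight. This is the classic algorithm on a simple graph with hypothetical weights, shown only for intuition; it is not necessarily identical to the heuristic the paper calls greedy matching.

```python
def greedy_matching(weighted_edges):
    """Classic greedy 1/2-approximation for maximum-weight matching."""
    matched = set()
    matching = []
    for u, v, weight in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        # Keep the edge only if neither endpoint is already paired.
        if u not in matched and v not in matched:
            matching.append((u, v, weight))
            matched.update((u, v))
    return matching

# Hypothetical tables A..E with one weighted edge per joined pair.
weighted_edges = [
    ("A", "B", 10),
    ("B", "C", 12),
    ("A", "C", 6),
    ("C", "D", 5),
    ("D", "E", 8),
]

matching = greedy_matching(weighted_edges)
total = sum(weight for _, _, weight in matching)
print(matching, total)
```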

Heuristics

We begin with two empty sets of distribution key recommendations. Then we select a vertex of the graph (a table) at random and identify its most heavily weighted edge. The attributes that define that edge become the recommended distribution key for the tables the edge connects, and that recommendation is added to the first empty set.

Then we repeat the process, with another randomly selected vertex, and add the resulting recommendation to the second empty set of distribution key recommendations. We repeat this process, alternating between the two sets of recommendations, until none of the vertices in the graph remain unaccounted for.

Now we have two different sets of recommendations, with two different sets of vertices, and we select the one with the greater cumulative edge weights. The differences between our four heuristics lie in the processes we use to add back the edges missing from the recommendation set we’ve selected — processes we’ve dubbed greedy matching, random choice, random neighbor, and naïve greedy. (Details are in the paper.)
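The shared first phase of the heuristics can be sketched roughly as below. Several details are assumptions made to obtain runnable code: a chosen edge removes both of its tables from further consideration, a table with no usable edge is simply dropped, and ties are broken arbitrarily. The function names and weights are hypothetical, and the final step that adds back missing edges is not shown.

```python
import random

edges = {
    (("A", "x"), ("B", "x")): 10,
    (("A", "x"), ("C", "x")): 6,
    (("B", "y"), ("C", "y")): 12,
    (("A", "z"), ("C", "z")): 4,
    (("C", "y"), ("D", "y")): 5,
    (("D", "k"), ("E", "k")): 8,
}

def alternating_seed_sets(edges, rng):
    """Grow two recommendation sets by alternating heaviest-edge picks."""
    remaining = {table for ends in edges for (table, _) in ends}
    sets = ([], [])
    turn = 0
    while remaining:
        table = rng.choice(sorted(remaining))
        # Edges touching `table` whose endpoints are both still unaccounted for.
        incident = [
            (weight, ends)
            for ends, weight in edges.items()
            if table in (ends[0][0], ends[1][0])
            and ends[0][0] in remaining
            and ends[1][0] in remaining
        ]
        remaining.discard(table)
        if incident:
            weight, ends = max(incident)
            sets[turn].append((ends, weight))
            remaining -= {ends[0][0], ends[1][0]}
            turn = 1 - turn  # the next recommendation goes to the other set
    return sets

def pick_heavier(sets):
    """Return the recommendation set with the greater cumulative edge weight."""
    return max(sets, key=lambda s: sum(weight for _, weight in s))

rng = random.Random(0)  # fixed seed so the run is repeatable
first, second = alternating_seed_sets(edges, rng)
chosen = pick_heavier((first, second))
print(chosen)
```

Because the starting vertices are random, different seeds can produce different sets; the selection step only guarantees that the heavier of the two is kept.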

In tests on four different data sets, our method reduced communication overhead by between 80% and 97%, savings that would directly translate to performance improvements for our customers.
