How to reduce communication overhead for database queries by up to 97%

Amazon researchers describe new method for distributing database tables across servers.

Relational databases are typically composed of many different tables: one table might store contact data for a company’s customers; another might store data about all the company’s retail stores; another might store individual customers’ purchase histories; another might log details about customer service calls; and so on.

Customers who use the Amazon Redshift cloud data warehouse service from Amazon Web Services often have databases that consist of thousands of tables, which are constantly being updated and expanded. These tables naturally have to be distributed across multiple servers in AWS data centers.

At the 46th International Conference on Very Large Databases (VLDB), my colleagues — Yonatan Naamad, Peter Van Bouwel, Christos Faloutsos, and Michalis Petropoulos — and I presented a new method for allocating data across servers. In experiments involving queries that retrieved data from multiple tables, our method reduced communications overhead by as much as 97% relative to the original, unoptimized configuration.

For the last year, Amazon Redshift Advisor has used this method to recommend data storage configurations to our customers, enabling them to perform more-efficient database queries.

Example of a real join multigraph
An example of a real join multigraph. The thickness of the lines indicates the data transfer required by joins on particular attributes.

To get a sense of the problem our method addresses, consider a company that wishes to inform customers about sales at their local stores. That requires a database query that pulls customer data from the customer table and sale data from a store table.

To find the right store for each customer, the query matches entries from both tables by city. The query thus performs a join operation using the attribute “city”.

One standard way to distribute database tables across servers is to use distribution keys. For each data entry in a table (that is, each row of the table), a hash function is applied to one value of the entry — the distribution key. The hash function maps that value to the address of a server on the network, which is where the table row is stored.

In our example of a join operation, if the distribution key for both the customer table and the store table is the attribute “city”, then all customer entries and store entries that share a city will be stored on the same server. Each server then contains enough information to perform the join independently and in parallel with the other servers, without the need for data reshuffling at query time.

This is the basis of our method. Essentially, we analyze the query data for a particular database and identify the attributes whose joins involve the largest data transfers; then we use those attributes as the distribution keys for the associated tables.

The join multigraph

The first step in this process is to create what we call a join multigraph. This is a graph in the graph-theoretical sense: a data structure consisting of vertices — often depicted as circles — and edges — usually depicted as line segments connecting vertices. The edges may also have numbers associated with them, known as weights.

In the join multigraph, the vertices are tables of a database. The edges connect attributes of separate tables on which join operations have been performed, and the edge weights indicate the data transfer required by joins between these attributes.

Our goal is now to partition the graph into pairs of vertices, each connected by a single edge (a single attribute pair), such that we maximize the cumulative weight of all the edges. Unfortunately, in our paper, we show that this problem is NP-complete, meaning that solving it exactly isn’t computationally practical.

Example of a simple join multigraph (left) and two different partitions of it, using different distribution keys.
An example of a simple join multigraph (left) and two different partitions of it, using different distribution keys. The nodes (circles) are tables, and the smaller letters indicate the attributes on which join operations have been performed. In the first graph, the thickness of the lines indicates the data transfer required by joins between the associated attributes. The distribution keys selected for the first partition (red circles) yield greater savings in communication overhead than those selected in the second partition (green circles).

We also show, however, that the optimization technique known as integer linear programming may, for any given instance of the problem, yield an optimal solution in a reasonable amount of time. So the first step in our method is to try to partition the graph using integer linear programming, with a limit on how much time the linear-programming solver can spend on the problem.

If the solver times out, then our next step is to use four different heuristics to partition the graph, and we select the one that yields the greatest cumulative weight. We call our method the best-of-all-worlds approach, since it canvasses five different possibilities and chooses the one that works best.

All four heuristics are approximate solutions of the maximum-weight matching problem, which we prove to be a special case of the problem we’re trying to solve (the distribution key recommendation problem).

Heuristics

We begin with two empty sets of distribution key recommendations. Then we select a vertex of the graph (a table) at random and identify its most heavily weighted edge. The attributes that define that edge become the recommended distribution key for the tables the edge connects, and that recommendation is added to the first empty set.

Then we repeat the process, with another randomly selected vertex, and add the resulting recommendation to the second empty set of clustering recommendations. We repeat this process, alternating between the two sets of recommendations, until none of the vertices in the graph remain unaccounted for.

Now we have two different sets of recommendations, with two different sets of vertices, and we select the one with the greater cumulative edge weights. The differences between our four heuristics lie in the processes we use to add back the edges missing from the recommendation set we’ve selected — processes we’ve dubbed greedy matching, random choice, random neighbor, and naïve greedy. (Details are in the paper.)

In tests on four different data sets, our method reduced communication overhead by between 80% and 97%, savings that would directly translate to performance improvements for our customers.

Related content

US, CA, Santa Clara
Job summaryAmazon is looking for a passionate, talented, and inventive Applied Scientist with a strong machine learning background to help build industry-leading language technology.Our mission is to provide a delightful experience to Amazon’s customers by pushing the envelope in Natural Language Processing (NLP), Natural Language Understanding (NLU), Dialog management, conversational AI and Machine Learning (ML).As part of our AI team in Amazon AWS, you will work alongside internationally recognized experts to develop novel algorithms and modeling techniques to advance the state-of-the-art in human language technology. Your work will directly impact millions of our customers in the form of products and services, as well as contributing to the wider research community. You will gain hands on experience with Amazon’s heterogeneous text and structured data sources, and large-scale computing resources to accelerate advances in language understanding.We are hiring primarily in Conversational AI / Dialog System Development areas: NLP, NLU, Dialog Management, NLG.This role can be based in NYC, Seattle or Palo Alto.Inclusive Team CultureHere at AWS, we embrace our differences. We are committed to furthering our culture of inclusion. We have ten employee-led affinity groups, reaching 40,000 employees in over 190 chapters globally. We have innovative benefit offerings, and host annual and ongoing learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences.Work/Life BalanceOur team puts a high value on work-life balance. It isn’t about how many hours you spend at home or at work; it’s about the flow you establish that brings energy to both parts of your life. We believe striking the right balance between your personal and professional life is critical to life-long happiness and fulfillment. We offer flexibility in working hours and encourage you to find your own balance between your work and personal lives.Mentorship & Career GrowthOur team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we’re building an environment that celebrates knowledge sharing and mentorship. Our senior members enjoy one-on-one mentoring and thorough, but kind, code reviews. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded engineer and enable them to take on more complex tasks in the future.
US, NY, New York
Job summaryAmazon is looking for a passionate, talented, and inventive Applied Scientist with a strong machine learning background to help build industry-leading language technology.Our mission is to provide a delightful experience to Amazon’s customers by pushing the envelope in Natural Language Processing (NLP), Natural Language Understanding (NLU), Dialog management, conversational AI and Machine Learning (ML).As part of our AI team in Amazon AWS, you will work alongside internationally recognized experts to develop novel algorithms and modeling techniques to advance the state-of-the-art in human language technology. Your work will directly impact millions of our customers in the form of products and services, as well as contributing to the wider research community. You will gain hands on experience with Amazon’s heterogeneous text and structured data sources, and large-scale computing resources to accelerate advances in language understanding.We are hiring primarily in Conversational AI / Dialog System Development areas: NLP, NLU, Dialog Management, NLG.This role can be based in NYC, Seattle or Palo Alto.Inclusive Team CultureHere at AWS, we embrace our differences. We are committed to furthering our culture of inclusion. We have ten employee-led affinity groups, reaching 40,000 employees in over 190 chapters globally. We have innovative benefit offerings, and host annual and ongoing learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences.Work/Life BalanceOur team puts a high value on work-life balance. It isn’t about how many hours you spend at home or at work; it’s about the flow you establish that brings energy to both parts of your life. We believe striking the right balance between your personal and professional life is critical to life-long happiness and fulfillment. We offer flexibility in working hours and encourage you to find your own balance between your work and personal lives.Mentorship & Career GrowthOur team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we’re building an environment that celebrates knowledge sharing and mentorship. Our senior members enjoy one-on-one mentoring and thorough, but kind, code reviews. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded engineer and enable them to take on more complex tasks in the future.
US, CA, Santa Clara
Job summaryAWS AI/ML is looking for world class scientists and engineers to join its AI Research and Education group working on building automated ML solutions for planetary-scale sustainability and geospatial applications. Our team's mission is to develop ready-to-use and automated solutions that solve important sustainability and geospatial problems. We live in a time wherein geospatial data, such as climate, agricultural crop yield, weather, landcover, etc., has become ubiquitous. Cloud computing has made it easy to gather and process the data that describes the earth system and are generated by satellites, mobile devices, and IoT devices. Our vision is to bring the best ML/AI algorithms to solve practical environmental and sustainability-related R&D problems at scale. Building these solutions require a solid foundation in machine learning infrastructure and deep learning technologies. The team specializes in developing popular open source software libraries like AutoGluon, GluonCV, GluonNLP, DGL, Apache/MXNet (incubating). Our strategy is to bring the best of ML based automation to the geospatial and sustainability area.We are seeking an experienced Applied Scientist for the team. This is a role that combines science knowledge (around machine learning, computer vision, earth science), technical strength, and product focus. It will be your job to develop ML system and solutions and work closely with the engineering team to ship them to our customers. You will interact closely with our customers and with the academic and research communities. You will be at the heart of a growing and exciting focus area for AWS and work with other acclaimed engineers and world famous scientists. You are also expected to work closely with other applied scientists and demonstrate Amazon Leadership Principles (https://www.amazon.jobs/en/principles). Strong technical skills and experience with machine learning and computer vision are required. Experience working with earth science, mapping, and geospatial data is a plus. Our customers are extremely technical and the solutions we build for them are strongly coupled to technical feasibility.About the teamInclusive Team CultureAt AWS, we embrace our differences. We are committed to furthering our culture of inclusion. We have ten employee-led affinity groups, reaching 40,000 employees in over 190 chapters globally. We have innovative benefit offerings, and host annual and ongoing learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences. Amazon’s culture of inclusion is reinforced within our 14 Leadership Principles, which remind team members to seek diverse perspectives, learn and be curious, and earn trust. Work/Life BalanceOur team puts a high value on work-life balance. It isn’t about how many hours you spend at home or at work; it’s about the flow you establish that brings energy to both parts of your life. We believe striking the right balance between your personal and professional life is critical to life-long happiness and fulfillment. We offer flexibility in working hours and encourage you to find your own balance between your work and personal lives.Mentorship & Career GrowthOur team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we’re building an environment that celebrates knowledge sharing and mentorship. Our senior members enjoy one-on-one mentoring. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded scientist and enable them to take on more complex tasks in the future.Interested in this role? Reach out to the recruiting team with questions or apply directly via amazon.jobs.
US, CA, Santa Clara
Job summaryAWS AI/ML is looking for world class scientists and engineers to join its AI Research and Education group working on building automated ML solutions for planetary-scale sustainability and geospatial applications. Our team's mission is to develop ready-to-use and automated solutions that solve important sustainability and geospatial problems. We live in a time wherein geospatial data, such as climate, agricultural crop yield, weather, landcover, etc., has become ubiquitous. Cloud computing has made it easy to gather and process the data that describes the earth system and are generated by satellites, mobile devices, and IoT devices. Our vision is to bring the best ML/AI algorithms to solve practical environmental and sustainability-related R&D problems at scale. Building these solutions require a solid foundation in machine learning infrastructure and deep learning technologies. The team specializes in developing popular open source software libraries like AutoGluon, GluonCV, GluonNLP, DGL, Apache/MXNet (incubating). Our strategy is to bring the best of ML based automation to the geospatial and sustainability area.We are seeking an experienced Applied Scientist for the team. This is a role that combines science knowledge (around machine learning, computer vision, earth science), technical strength, and product focus. It will be your job to develop ML system and solutions and work closely with the engineering team to ship them to our customers. You will interact closely with our customers and with the academic and research communities. You will be at the heart of a growing and exciting focus area for AWS and work with other acclaimed engineers and world famous scientists. You are also expected to work closely with other applied scientists and demonstrate Amazon Leadership Principles (https://www.amazon.jobs/en/principles). Strong technical skills and experience with machine learning and computer vision are required. Experience working with earth science, mapping, and geospatial data is a plus. Our customers are extremely technical and the solutions we build for them are strongly coupled to technical feasibility.About the teamInclusive Team CultureAt AWS, we embrace our differences. We are committed to furthering our culture of inclusion. We have ten employee-led affinity groups, reaching 40,000 employees in over 190 chapters globally. We have innovative benefit offerings, and host annual and ongoing learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences. Amazon’s culture of inclusion is reinforced within our 14 Leadership Principles, which remind team members to seek diverse perspectives, learn and be curious, and earn trust. Work/Life BalanceOur team puts a high value on work-life balance. It isn’t about how many hours you spend at home or at work; it’s about the flow you establish that brings energy to both parts of your life. We believe striking the right balance between your personal and professional life is critical to life-long happiness and fulfillment. We offer flexibility in working hours and encourage you to find your own balance between your work and personal lives.Mentorship & Career GrowthOur team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we’re building an environment that celebrates knowledge sharing and mentorship. Our senior members enjoy one-on-one mentoring. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded scientist and enable them to take on more complex tasks in the future.Interested in this role? Reach out to the recruiting team with questions or apply directly via amazon.jobs.
US, NY, New York
Job summaryAmazon Web Services is looking for world class scientists to join the Security Analytics and AI Research team within AWS Security Services. This group is entrusted with researching and developing core data mining and machine learning algorithms for various AWS security services like GuardDuty (https://aws.amazon.com/guardduty/) and Macie (https://aws.amazon.com/macie/). In this group, you will invent and implement innovative solutions for never-before-solved problems. If you have passion for security and experience with large scale machine learning problems, this will be an exciting opportunity.The AWS Security Services team builds technologies that help customers strengthen their security posture and better meet security requirements in the AWS Cloud. The team interacts with security researchers to codify our own learnings and best practices and make them available for customers. We are building massively scalable and globally distributed security systems to power next generation services.Inclusive Team Culture Here at AWS, we embrace our differences. We are committed to furthering our culture of inclusion. We have ten employee-led affinity groups, reaching 40,000 employees in over 190 chapters globally. We have innovative benefit offerings, and host annual and ongoing learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences. Amazon’s culture of inclusion is reinforced within our 14 Leadership Principles, which remind team members to seek diverse perspectives, learn and be curious, and earn trust. Work/Life Balance Our team puts a high value on work-life balance. It isn’t about how many hours you spend at home or at work; it’s about the flow you establish that brings energy to both parts of your life. We believe striking the right balance between your personal and professional life is critical to life-long happiness and fulfillment. We offer flexibility in working hours and encourage you to find your own balance between your work and personal lives. Mentorship & Career Growth Our team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we’re building an environment that celebrates knowledge sharing and mentorship. We care about your career growth and strive to assign projects based on what will help each team member develop and enable them to take on more complex tasks in the future.A day in the lifeAbout the hiring groupJob responsibilities* Rapidly design, prototype and test many possible hypotheses in a high-ambiguity environment, making use of both quantitative and business judgment.* Collaborate with software engineering teams to integrate successful experiments into large scale, highly complex production services.* Report results in a scientifically rigorous way.* Interact with security engineers, product managers and related domain experts to dive deep into the types of challenges that we need innovative solutions for.
ES, B, Barcelona
Job summaryAre you excited to help customers discover the hottest and best reviewed products?Through the enablement of intelligent campaigns that leverage machine-learning models, you will help to deliver the best possible shopping experience for Amazon’s customers all over the globe.We are looking for experienced scientist who will work with business leaders, scientists, and engineers to translate business and functional requirements into concrete deliverables. Your domain spans the design, development, testing, and deployment of data driven and highly scalable solutions using data processing and machine learning in product recommendation. You will partner with scientists, product managers, and engineers to help invent and implement scalable Data processing and ML models while inventing tools on our customers behalf. A day in the lifeThis is a unique, high visibility opportunity for someone who wants to have business impact, dive deep into large-scale problems, and work closely with scientists and engineers. We are particularly interested in candidates with experience building large scale machine learning solutions and working with distributed systems to 1) help us build robust ensemble of ML systems that can drive classification and recommendation of products with a high precision and recall utilizing various signals and scale to new marketplaces and languages and 2) design optimal or near optimal supervised and unsupervised machine learning models and solutions for moderately complex projects in business, science, or engineering.About the hiring groupThe Discovery Tech team helps customers discover and engage with new, popular and relevant products across Amazon worldwide. We do this by combining technology, science, and innovation to build new customer-facing features and experiences alongside cutting edge tools for marketers. You will be responsible for creating and building critical services that automatically generate, target, and optimize Amazon’s cross-category marketing and merchandising. Job responsibilitiesAs a Principal Applied Scientist, you bring business and industry context to science and technology decisions. You set the standard for scientific excellence and make decisions that affect the way we build and integrate algorithms. Your solutions are exemplary in terms of algorithm design, clarity, model structure, efficiency, and extensibility. You tackle intrinsically hard problems, acquiring expertise as needed. You decompose complex problems into straightforward solutions.
US, MA, Boston
Job summaryAmazon Pharmacy is a fast-growing part of Amazon. We are a passionate team working to build a best-in-class healthcare product designed to make high-quality healthcare easy to access. The economics group at is new, and there are great opportunities for high-impact projects touching many areas of the business. And as an economist at Amazon, you’ll be part of a vibrant community working on challenging applied problems in pricing, casual inference and market design (among others). We are looking for a pragmatic and driven economist with modeling skills, to drive change in how we do business. We are open to remote work. About you:Problem identifier and solver: is able to identify flaws in our current economic thinking, size them, prioritize them, and find solutions.Doer: can deliver end-to-end processes, making high-value judgements about the need for speed versus scientific rigor and state-of-the-art thinking.Collaborative: happy to work across disciplines with our business and finance teams, and able to collaborate and divide responsibilities in projects so as to leverage the different strengths of the team.Own and Simplify: able to take ownership of complex problems, strip them down to their core essentials, and deliver solutions that can be easily communicated and whose impact can be measured.
US, CA, Palo Alto
Job summaryAmazon Advertising is one of Amazon's fastest growing and most profitable businesses, responsible for defining and delivering a collection of advertising products that drive discovery and sales. Our products are strategically important to our Retail and Marketplace businesses driving long term growth. We deliver billions of ad impressions and millions of clicks and break fresh ground in product and technical innovations every day!Our Forecasting Products team builds end-to-end solutions for publishers, advertisers and Amazon DSP. This includes data pipelines, machine learning models, large scale data structures and indexes, advertiser recommendations (bids, products) and data visualizations. We match supply (human eyeballs) and demand (advertisers interests) in thousands of audience targeting dimensions, and recommend optimal prices.As a Data Scientist on this team, you will:Solve real-world problems by analyzing large amounts of business data, diving deep to identify business insights and opportunities, designing simulations and experiments, developing statistical and ML models by tailoring to business needs, and collaborating with Scientists, Engineers, BIE's, and Product Managers.Write code (Python, R, Scala, etc.) to analyze data and build statistical models to solve specific business problems.Apply statistical and machine learning knowledge to specific business problems and data.Build decision-making models and propose solution for the business problem you define.Retrieve, synthesize, and present critical data in a format that is immediately useful to answering specific questions or improving system performance.Analyze historical data to identify trends and support decision making.Provide requirements to develop analytic capabilities, platforms, and pipelines.Formalize assumptions about how our systems are expected to work, create statistical definition of the outlier, and develop methods to systematically identify outliers. Work out why such examples are outliers and define if any actions needed.Given anecdotes about anomalies or generate automatic scripts to define anomalies, deep dive to explain why they happen, and identify fixes.Conduct written and verbal presentation to share insights and recommendations to audiences of varying levels of technical sophistication. Why you will love this opportunity: Amazon is investing heavily in building a world-class advertising business. This team defines and delivers a collection of advertising products that drive discovery and sales. Our solutions generate billions in revenue and drive long-term growth for Amazon’s Retail and Marketplace businesses. We deliver billions of ad impressions, millions of clicks daily, and break fresh ground to create world-class products. We are a highly motivated, collaborative, and fun-loving team with an entrepreneurial spirit - with a broad mandate to experiment and innovate. Impact and Career Growth: You will invent new experiences and influence customer-facing shopping experiences to help suppliers grow their retail business and the auction dynamics that leverage native advertising; this is your opportunity to work within the fastest-growing businesses across all of Amazon! Define a long-term science vision for our advertising business, driven from our customers' needs, translating that direction into specific plans for research and applied scientists, as well as engineering and product teams. This role combines science leadership, organizational ability, technical strength, product focus, and business understanding. Team video https://youtu.be/zD_6Lzw8raE
US, WA, Bellevue
Job summaryAmazon is seeking a candidate to identify, develop and integrate innovative solutions and programs that lead to improvements that redefine the standards for customer experience in our North American transportation network. Amazon transportation encompasses all of the operations that deliver shipments from our fulfillment centers and third party locations to customers worldwide.As an Applied Scientist for the Middle Mile Sort Center team, you will work with business, science and analytics teams across the company to build state of the art data products to improve delivery speed for our customers. Because we strive for faster delivery to customers a successful candidate must be passionate about identifying and building solutions that will help drive a more efficient network and lower cost of operations. In this role, you will develop scalable mathematical models to derive solution to existing network structure and create evaluation methods to track the performance and identify areas of improvements.Key job responsibilitiesResearch and use of statistical techniques to create scalable solutions for business problemsAnalyze and extract relevant information from large amounts of Amazon's historical business data to help automate and optimize key features and processes and build new data analysis productsWork closely with scientists and engineering teams to create and deploy new featuresEstablish scalable, efficient, automated processes for large scale data analyses, model development, validation and implementation.Evaluate cross-team perspectives, use quantitative methods to derive justification, and build consensus on a roadmap on the required level of analyses to meet goal
US, Virtual
Job summaryWorkforce Staffing is responsible for hiring hourly associates into our global fulfillment operation. Each year we hire hundreds of thousands of associates across the globe. Market Intelligence, a subsidiary of Workforce Staffing, applies data and insights to optimize the experience and operations for Amazon’s largest candidate population who bring the magic of Amazon’s industry-leading brand promise for customer fulfillment to life. We are an interdisciplinary team that combines the talents of economics, science, analytics, and engineering to develop and deliver solutions that are continually shaping and writing the future of the hourly worker job offering landscape. We are looking for an economist with expertise in applying causal inference, experimental design, or causal machine learning techniques to topics in labor or related applied economics. They will collaborate with business partners to define and deliver economic thinking that guide strategic decisions. They will work closely with data scientists and engineers to estimate and validate their models on large scale data, and will help business partners turn the results of their analysis to business actions that have a major impact.Ideal candidates will own key inputs to all stages of research projects, including data requirements, model development, experimental design, and data analysis. They will be customer-centric, working closely with business partners to define key research questions, communicate scientific approaches and findings, listen to and incorporate partner feedback, and deliver successful solutions.**Inclusive Team Culture** Here at Amazon, we embrace our differences. We are committed to furthering our culture of inclusion. We have 12 affinity groups (employee resource groups) with more than 87,000 employees across hundreds of chapters around the world. We have innovative benefit offerings, and host annual and ongoing learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences. Amazon’s culture of inclusion is reinforced within our 14 Leadership Principles, which reminds team members to seek diverse perspectives, learn and be curious, and earn trust.*Flexibility* It isn’t about which hours you spend at home or at work; it’s about the flow you establish that brings energy to both parts of your life. We offer flexibility and encourage you to find your own balance between your work and personal lives.*Mentorship & Career Growth* We care about your career growth too. Whether your goals are to explore new technologies, take on bigger opportunities, or get to the next level, we'll help you get there. Our business is growing fast and our people will grow with it.