The "Amazon Redshift re-invented" research paper will be presented at a leading database conference next month. Two of the paper's authors, Rahul Pathak (top right), vice president of analytics at AWS, and Ippokratis Pandis (bottom right), an AWS senior principal engineer, discuss the origins of Redshift, how the system has evolved in the past decade, and where they see the service evolving in the years ahead.

Amazon Redshift: Ten years of continuous reinvention

Two authors of the Amazon Redshift research paper that will be presented at a leading international forum for database researchers reflect on how far the first petabyte-scale cloud data warehouse has advanced since it was announced ten years ago.

Nearly ten years ago, in November 2012 at the first-ever Amazon Web Services (AWS) re:Invent, Andy Jassy, then AWS senior vice president, announced the preview of Amazon Redshift, the first fully managed, petabyte-scale cloud data warehouse. The service represented a significant leap forward from traditional on-premises data warehousing solutions, which were expensive, inflexible, and required significant human and capital resources to operate.

In a blog post on November 28, 2012, Werner Vogels, Amazon chief technical officer, highlighted the news: “Today, we are excited to announce the limited preview of Amazon Redshift, a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud.”

Further in the post, Vogels added, “The result of our focus on performance has been dramatic. Amazon.com’s data warehouse team has been piloting Amazon Redshift and comparing it to their on-premise data warehouse for a range of representative queries against a two billion row data set. They saw speedups ranging from 10x – 150x!”

That’s why, on the day of the announcement, Rahul Pathak, then a senior product manager, and the entire Amazon Redshift team were confident the product would be popular.

“But we didn’t really understand how popular,” he recalls.

“At preview we asked customers to sign up and give us some indication of their data volume and workloads,” Pathak, now vice president of Relational Engines at AWS, said. “Within about three days we realized that we had ten times more demand for Redshift than we had planned for the entire first year of the service. So we scrambled right after re:Invent to accelerate our hardware orders to ensure we had enough capacity on the ground for when the product became generally available in early 2013. If we hadn’t done that preview, we would have been caught short.”

The Redshift team has been sprinting to keep pace with customer demand ever since. Today, the service is used by tens of thousands of customers to process exabytes of data daily. In June a subset of the team will present the paper “Amazon Redshift re-invented” at a leading international forum for database researchers, practitioners, and developers, the ACM SIGMOD/PODS Conference in Philadelphia.

The paper highlights four key areas where Amazon Redshift has evolved in the past decade, provides an overview of the system architecture, describes its high-performance transactional storage and compute layers, details how smart autonomics are provided, and discusses how AWS and Redshift make it easy for customers to use the best set of services to meet their needs.

Amazon Science recently connected with two of the paper’s authors, Pathak and Ippokratis Pandis, an AWS senior principal engineer, to discuss the origins of Redshift, how the system has evolved over the past decade, and where they see the service evolving in the years ahead.

  1. Q. 

    Can you provide some background on the origin story for Redshift? What were customers seeking, and how did the initial version address those needs?

    A. 

    Rahul: We had been meeting with customers who in the years leading up to the launch of Amazon Redshift had moved just about every workload they had to the cloud except for their data warehouse. In many cases, it was the last thing they were running on premises, and they were still dealing with all of the challenges of on-premises data warehouses. They were expensive, had punitive licensing, were hard to scale, and customers couldn’t analyze all of their data. Customers told us they wanted to run data warehousing at scale in the cloud, that they didn’t want to compromise on performance or functionality, and that it had to be cost-effective enough for them to analyze all of their data.

    So, this is what we started to build, operating under the code name Cookie Monster. This was at a time when customers’ data volumes were exploding, and not just from relational databases, but from a wide variety of sources. One of our early private beta customers tried it and the results came back so fast they thought the system was broken. It was about 10 to 20 times faster than what they had been using before. Another early customer was pretty unhappy with gaps in our early functionality. When I heard about their challenges, I got in touch, understood their feedback, and incorporated it into the service before we made it generally available in February 2013. This customer soon turned into one of our biggest advocates.

    When we launched the service and announced our pricing at $1,000 per terabyte per year, people just couldn’t believe we could offer a product with that much capability at such a low price point. The fact that you could provision a data warehouse in minutes instead of months also caught everyone’s attention. It was a real game-changer for this industry segment.

    Ippokratis: I was at IBM Research at the time working on database technologies there, and we recognized that providing data warehousing as a cloud service was a game changer. It was disruptive. We were working with customers’ on-premises systems where it would take us several days or weeks to resolve an issue, whereas with a cloud data warehouse like Redshift, it would take minutes. It was also apparent that the rate of innovation would accelerate in the cloud.

    In the on-premises world, it was taking months if not years to get new functionality into a software release, whereas in the cloud new capabilities could be introduced in weeks, without customers having to change a single line of code in their consuming applications. The Redshift announcement was an inflection point; I got really interested in the cloud, and cloud data warehouses, and eventually joined Amazon [Ippokratis joined the Redshift team as a principal engineer in Oct. 2015].

  2. Q. 

    How has Amazon Redshift evolved in the decade since its launch?

    A. 

    Ippokratis: As we highlight in the paper, the service has evolved at a rapid pace in response to customers’ needs. We focused on four main areas: 1) customers’ demand for high-performance execution of increasingly complex analytical queries; 2) our customers’ need to process more data and significantly increase the number of users who need to derive insights from that data; 3) customers’ need for us to make the system easier to use; and 4) our customers’ desire to integrate Redshift with other AWS services, and the AWS ecosystem. That’s a lot, so we’ll provide some examples across each dimension.

    Offering the leading price performance has been our primary focus since Rahul first began working on what would become Redshift. From the beginning, the team has focused on making core query execution latency as low as possible so customers can run more workloads, issue more jobs into the system, and run their daily analysis. To do this, Redshift generates C++ code that is highly optimized and then sends it to the distributor in the parallel database and executes this highly optimized code. This makes Redshift unique in the way it executes queries, and it has always been the core of the service.
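
The idea of generating code for each query can be illustrated with a small Python sketch. This is only a toy analogue under stated assumptions: Redshift generates C++, while this sketch emits Python source, and it leaves out the step of distributing the code across a parallel database. The function names (`generate_filter_sum`, `compile_query`) and the sample rows are hypothetical.

```python
# Toy analogue of query-specific code generation. Not Redshift's engine:
# Redshift emits C++, and all names below are hypothetical.

def generate_filter_sum(column, threshold):
    """Emit source specialized for: SELECT SUM(column) WHERE column > threshold."""
    return (
        "def compiled_query(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        f"        value = row[{column!r}]\n"
        f"        if value > {threshold!r}:\n"
        "            total += value\n"
        "    return total\n"
    )

def compile_query(source):
    """Compile the generated source and return the specialized function."""
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["compiled_query"]

rows = [{"price": 5}, {"price": 12}, {"price": 30}]
source = generate_filter_sum("price", 10)
query = compile_query(source)
result = query(rows)  # sums the prices above 10: 12 + 30
```

The column name and the constant are baked into the generated function, so the inner loop carries no per-row interpretation of the query plan. That is the general appeal of compiling a query rather than interpreting it.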

    We have never stopped innovating here to deliver our customers the best possible performance. Another thing that’s been interesting to me is that in the traditional business intelligence (BI) world, you optimize your system for very long-running jobs. But as we observe the behavior of our customers in aggregate, what’s surprising is that 90 percent of our queries among the billions we run daily in our service execute in less than one second. That’s not what people had traditionally expected from a data warehouse, and that has changed the areas of the code that we optimize.

    Rahul: As Ippokratis mentioned, the second area we focused on in the paper was customers’ need to process more data and to use that data to drive value throughout the organization. Analytics has always been super important, but eight or ten years ago it wasn’t necessarily mission critical for customers in the same way transactional databases were. That has definitely shifted. Today, core business processes rely on Redshift being highly available and performant. The biggest architectural change in the past decade in support of this goal was the introduction of Redshift Managed Storage, which allowed us to separate compute and storage, and focus a lot of innovation in each area.

    Diagram of the Redshift Managed Storage
    The Redshift managed storage layer (RMS) is designed for a durability of 99.999999999% and 99.99% availability over a given year, across multiple availability zones. RMS manages both user data as well as transaction metadata.

    Another big trend has been the desire of customers to query across and integrate disparate datasets. Redshift was the first data warehouse in the cloud to query Amazon S3 data, with Redshift Spectrum in 2017. Then we demonstrated the ability to run a query that scanned an exabyte of data in S3 as well as data in the cluster. That was a game changer.

    Customers like NASDAQ have used this extensively to query data that’s on local disk for the highest performance, but also take advantage of Redshift’s ability to integrate with the data lake and query their entire history of data with high performance. In addition to querying the data lake, integrated querying of transactional data stores like Aurora and RDS has been another big innovation, so customers can really have a high-performance analytics system that’s capable of transparently querying all of the data that matters to them without having to manage these complex integration processes that other systems require.

    Illustration of how a query flows through Redshift.
    This diagram from the research paper illustrates how a query flows through Redshift. The sequence is described in detail on pages 2 and 3 of the paper.

    Ippokratis: The third area we focused on in the paper was ease of use. One change that stands out for me is that on-premises data warehousing required IT departments to have a DBA (database administrator) who would be responsible for maintaining the environment. Over the past decade, the expectation from customers has evolved. Now, if you are offering data warehousing as a service, the systems must be capable of auto tuning, auto healing, and auto optimizing. This has become a big area of focus for us where we incorporate machine learning and automation into the system to make it easier to use, and to reduce the amount of involvement required of administrators.

    Rahul: In terms of ease of use, three innovations come to mind. One is concurrency scaling. As with workload management, customers previously had to manually tweak concurrency, reset clusters, or manually split workloads. Now, the system automatically provisions new resources and scales up and down without customers having to take any action. This is a great example of how Redshift has gotten much more dynamic and elastic.

    The second ease-of-use innovation is automated table optimization. This is another place where the system is able to observe workloads and data layouts and automatically suggest how data should be sorted and distributed across nodes in the cluster. This is valuable because it’s a continuously learning system, and workloads are never static over time.
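
Why the choice of distribution matters can be seen in a minimal Python sketch. It assumes simple hash distribution on a join column; the cluster size, table contents, and helper names are hypothetical, and it does not show how Redshift itself chooses or applies a distribution.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size

def node_for(key):
    """Map a key to a node with a stable hash, so equal keys always co-locate."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_NODES

def distribute(rows, dist_key):
    """Place each row on the node chosen by hashing its distribution key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[dist_key])].append(row)
    return nodes

orders = [{"customer_id": c, "amount": a} for c, a in [(1, 10), (2, 20), (1, 5), (3, 7)]]
customers = [{"customer_id": c, "name": n} for c, n in [(1, "Ann"), (2, "Bo"), (3, "Cy")]]

orders_by_node = distribute(orders, "customer_id")
customers_by_node = distribute(customers, "customer_id")

def local_join(node):
    """Join orders to customers using only the rows stored on one node."""
    names = {c["customer_id"]: c["name"] for c in customers_by_node.get(node, [])}
    return [(names[o["customer_id"]], o["amount"]) for o in orders_by_node.get(node, [])]

joined = sorted(pair for node in range(NUM_NODES) for pair in local_join(node))
```

Because both tables are distributed on `customer_id`, every order sits on the same node as its customer and the join needs no data movement between nodes. If the two tables were distributed on different columns, matching rows could land on different nodes and would have to be shuffled across the network first.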

    Customers are always adding more datasets, and adding more users, so what was optimal yesterday might not be optimal tomorrow. Redshift observes this and modifies what's happening under the covers to balance that. This was the focus of a really interesting graph optimization paper that we wrote a few years ago about how to analyze for optimal distribution keys for how data is laid out within a multi-node parallel-processing system. We've coupled this with automated optimization and then table encoding. In an analytics system, how you compress data has a big impact because the less data you scan, the faster your queries go. Customers had to reason about this in the past. Now Redshift can automatically determine how to encode data correctly to deliver the best possible performance for the data and the workload.
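
The link between encoding and scan cost can be shown with a short Python sketch of run-length encoding, one common column-compression scheme. It is a generic illustration, not a description of how Redshift selects or implements encodings; the column values and function names are hypothetical.

```python
def rle_encode(values):
    """Collapse consecutive repeated values into [value, run_length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def count_matching(runs, target):
    """Count rows equal to target by reading runs instead of individual rows."""
    return sum(length for value, length in runs if value == target)

# A hypothetical low-cardinality column, stored in sorted-ish order.
column = ["US"] * 5 + ["DE"] * 3 + ["US"] * 2
runs = rle_encode(column)
us_rows = count_matching(runs, "US")
```

Here ten stored values shrink to three runs, and the count is answered by reading three entries rather than ten. How much an encoding helps depends on the data, which is why choosing it automatically per column and workload is useful.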

    The third innovation I want to highlight here is Amazon Redshift Serverless, which we launched in public preview at re:Invent last fall. Redshift Serverless removes all of the management of instances and clusters, so customers can focus on getting to insights from data faster and not spend time managing infrastructure. With Redshift Serverless, customers can simply provision an endpoint and begin to interact with their data, and Redshift Serverless will auto scale and automatically manage the system to essentially remove all of that complexity from customers.

    Customers can just focus on their data, set limits to manage their budgets, and we deliver optimal performance between those limits. This is another massive step forward in terms of ease of use because it eliminates any operations for customers. The early response to the preview has been tremendous. Thousands of customers have been excited to put Amazon Redshift Serverless through its paces over the past few months, and we’re excited about making it generally available in the near future.

    Amazon Redshift architecture diagram
    The Amazon Redshift architecture as presented in the research paper.

    Ippokratis: A fourth area of focus in the paper is on integration with other AWS services, and the AWS ecosystem. Integration is another area where customer behavior has evolved from traditional BI use cases. Today, cloud data warehouses are a central hub with tight integration with a broader set of AWS services. We provided the ability for customers to join data from the warehouse with the data lake. Then customers said they needed access to high-velocity business data in operational databases like Aurora and RDS, so we provided access to these operational data stores. Then we added support for streams, as well as integration with SageMaker and Lambda so customers can run machine learning training and inference without moving their data, and do generic compute. As a result, we’ve converted the traditional BI system into a well-integrated set of AWS services.

    Rahul: One big area of integration has been with our machine-learning ecosystem. With Redshift ML we have enabled anyone who knows SQL to take advantage of all of our machine-learning innovation. We built the ability to create a model from the SQL prompt, which gets the data into Amazon S3 and calls Amazon SageMaker, to use automated machine learning to build the most appropriate model to provide predictions on the data.

    This model is compiled efficiently and brought back into the data warehouse for customers to run very high-performance parallel inferences with no additional compute and no extra cost. The beauty of this integration is that every innovation we make within SageMaker means that Redshift ML gets better as well. This is just another means by which customers benefit from us connecting our services together.

    Another big area for integration has been data sharing. Once we separated storage and compute layers with RA3 instances, we could enable data sharing, giving customers the ability to share data with clusters in the same account, in other accounts, or across regions. This allows us to separate consumers from producers of data, which enables things like modern data mesh architectures. Customers can share data without copying it, so the shared data stays transactionally consistent across accounts.

    For example, users within a data-science organization can securely work from the shared data, as can users within the reporting or marketing organization. We’ve also integrated data sharing with AWS Data Exchange, so now customers can search for — and subscribe to — third-party datasets that are live, up to date, and can be queried immediately in Redshift. This has been another game changer from the perspective of setting data free, enabling data monetization for third-party providers, and secure and live data access and licensing for subscribers for high-performance analytics within and across organizations. The fact that Redshift is part of an incredibly rich data ecosystem is a huge win for customers, and in keeping with customers’ desire to make data more pervasively available across the company.

  3. Q. 

    You indicate in the paper that Redshift innovation is continuing at an accelerated pace. How do you see the cloud data warehouse segment, and more specifically Redshift, evolving over the next several years?

    A. 

    Rahul: A few things will continue to be true as we head into the future. Customers will be generating ever more amounts of data, and they’re going to want to analyze that data more cost effectively. Data volumes are growing exponentially, but obviously customers don't want their costs growing exponentially. This requires that we continue to innovate, and find new levels of performance to ensure that the cost of processing a unit of data continues to go down.

    We’ll continue innovating in software, in hardware, in silicon, and in using machine learning to make sure we deliver on that promise for customers. We’ve delivered on that promise for the past 10 years, and we’ll focus on making sure we deliver on that promise into the future.

    Also, customers are always going to want better availability, they’re always going to want their data to be secure, and they’re always going to want more integrations with more data sources, and we intend to continue to deliver on all of those. What will stay the same is our ability to offer the best-in-segment price performance and capabilities, and the best-in-segment integration and security, because they will always deliver value for customers.

    Ippokratis: It has been an incredible journey; we have been rebuilding the plane as we’ve been flying it with customers onboard, and this would not have happened without the support of AWS leadership, but most importantly the tremendous engineers, managers, and product people who have worked on the team.

    As we did in the paper, I want to recognize the contributions of Nate Binkert and Britt Johnson, who have passed, but whose words of wisdom continue to guide us. We’ve taken data warehousing, what we learned from books in school [Ippokratis earned his PhD in electrical and computer engineering from Carnegie Mellon University], and brought it to the cloud. In the process, we’ve been able to innovate, and write new pages in the book. I’m very proud of what the team has accomplished, but equally excited about all the things we’re going to do to improve Redshift in the future.

    View from space of a connected network around planet Earth representing the Internet of Things.
    Sign up for our newsletter

Research areas

Related content

US, MA, North Reading
Are you excited about developing generative AI and foundation models to revolutionize automation, robotics and computer vision? Are you looking for opportunities to build and deploy them on real problems at truly vast scale? At Amazon Fulfillment Technologies and Robotics we are on a mission to build high-performance autonomous systems that perceive and act to further improve our world-class customer experience - at Amazon scale. We are looking for scientists, engineers and program managers for a variety of roles. The Research team at Amazon Robotics is seeking a passionate, hands-on Sr. Applied Scientist to help create the world’s first foundation model for a many-robot system. The focus of this position is how to predict the future state of our warehouses that feature a thousand or more mobile robots in constant motion making deliveries around the building. It includes designing, training, and deploying large-scale models using data from hundreds of warehouses under different operating conditions. This work spans from research such as alternative state representations of the many-robot system for training, to experimenting using simulation tools, to running large-scale A/B tests on robots in our facilities. Key job responsibilities * Research vision - Where should we be focusing our efforts * Research delivery - Proving/dis-proving strategies in offline data or in simulation * Production studies - Insights from production data or ad-hoc experimentation * Production implementation - Building key parts of deployed algorithms or models About the team You would join our multi-disciplinary science team that includes scientists with backgrounds in planning and scheduling, grasping and manipulation, machine learning, and operations research. 
We develop novel planning algorithms and machine learning methods and apply them to real-word robotic warehouses, including: - Planning and coordinating the paths of thousands of robots - Dynamic allocation and scheduling of tasks to thousands of robots - Learning how to adapt system behavior to varying operating conditions - Co-design of robotic logistics processes and the algorithms to optimize them Our team also serves as a hub to foster innovation and support scientists across Amazon Robotics. We also coordinate research engagements with academia, such as the Robotics section of the Amazon Research Awards. We are open to hiring candidates to work out of one of the following locations: North Reading, MA, USA | Westborough, MA, USA
US, WA, Bellevue
Are you excited about developing state-of-the-art deep learning foundation models, applied to the automation of labor for the future of Amazon’s Fulfillment network? Are you looking for opportunities to build and deploy them on real problems at truly vast scale? At Amazon Fulfillment Technologies and Robotics we are on a mission to build high-performance autonomous systems that perceive and act to further improve our world-class customer experience - at Amazon scale. To this end, we are looking for an Applied Scientist who will build and deploy models that help automate labor utilizing a wide array of multi-modal signals. Together, we will be pushing beyond the state of the art in optimization of one of the most complex systems in the world: Amazon's Fulfillment Network. Key job responsibilities In this role, you will build models that can identify potential problems with Amazon’s vast inventory, including discrepancies between the physical and virtual manifest and efficient execution of inventory audit operations. You will work with a diverse set of real world structured, unstructured and potentially multimodal datasets to train deep learning models that identify current inventory management problems and anticipate future ones. Datasets include multiple separate inventory management event streams, item images and natural language. You will face a high level of research ambiguity and problems that require creative, ambitious, and inventive solutions. About the team Amazon Fulfillment Technologies (AFT) powers Amazon’s global fulfillment network. We invent and deliver software, hardware, and data science solutions that orchestrate processes, robots, machines, and people. We harmonize the physical and virtual world so Amazon customers can get what they want, when they want it. The AFT AI team has deep expertise developing cutting edge AI solutions at scale and successfully applying them to business problems in the Amazon Fulfillment Network. 
These solutions typically utilize machine learning and computer vision techniques, applied to text, sequences of events, images or video from existing or new hardware. We influence each stage of innovation from inception to deployment, developing a research plan, creating and testing prototype solutions, and shepherding the production versions to launch. We are open to hiring candidates to work out of one of the following locations: Bellevue, WA, USA
US, CA, Santa Clara
About Amazon Health Amazon Health’s mission is to make it dramatically easier for customers to access the healthcare products and services they need to get and stay healthy. Towards this mission, we (Health Storefront and Shared Tech) are building the technology, products and services, that help customers find, buy, and engage with the healthcare solutions they need. Job summary We are seeking an exceptional Applied Scientist to join a team of experts in the field of machine learning, and work together to break new ground in the world of healthcare to make personalized and empathetic care accessible, convenient, and cost-effective. We leverage and train state-of-the-art large-language-models (LLMs) and develop entirely new experiences to help customers find the right products and services to address their health needs. We work on machine learning problems for intent detection, dialogue systems, and information retrieval. You will work in a highly collaborative environment where you can pursue both near-term productization opportunities to make immediate, meaningful customer impacts while pursuing ambitious, long-term research. You will work on hard science problems that have not been solved before, conduct rapid prototyping to validate your hypothesis, and deploy your algorithmic ideas at scale. You will get the opportunity to pursue work that makes people's lives better and pushes the envelop of science. Key job responsibilities - Translate product and CX requirements into science metrics and rigorous testing methodologies. - Invent and develop scalable methodologies to evaluate LLM outputs against metrics and guardrails. - Design and implement the best-in-class semantic retrieval system by creating high-quality knowledge base and optimizing embedding models and similarity measures. - Conduct tuning, training, and optimization of LLMs to achieve a compelling CX while reducing operational cost to be scalable. 
A day in the life In a fast-paced innovation environment, you work closely with product, UX, and business teams to understand customer's challenges. You translate product and business requirements into science problems. You dive deep into challenging science problems, enabling entirely new ML and LLM-driven customer experiences. You identify hypothesis and conduct rapid prototyping to learn quickly. You develop and deploy models at scale to pursue productizations. You mentor junior science team members and help influence our org in scientific best practices. About the team We are the ML Science and Engineering team, with a strong focus on Generative AI. The team consists of top-notch ML Scientists with diverse background in healthcare, robotics, customer analytics, and communication. We are committed to building and deploying the most advanced scientific capabilities and solutions for the products and services at Amazon Health. We are open to hiring candidates to work out of one of the following locations: Santa Clara, CA, USA
US, WA, Seattle
We are designing the future. If you are in quest of an iterative fast-paced environment, where you can drive innovation through scientific inquiry, and provide tangible benefit to hundreds of thousands of our associates worldwide, this is your opportunity. Come work on the Amazon Worldwide Fulfillment Design & Engineering Team! We are looking for an experienced and senior Research Scientist with background in Ergonomics and Industrial Human Factors, someone that is excited to work on complex real-world challenges for which a comprehensive scientific approach is necessary to drive solutions. Your investigations will define human factor / ergonomic thresholds resulting in design and implementation of safe and efficient workspaces and processes for our associates. Your role will entail assessment and design of manual material handling tasks throughout the entire Amazon network. You will identify fundamental questions pertaining to the human capabilities and tolerances in a myriad of work environments, and will initiate and lead studies that will drive decision making on an extreme scale. .You will provide definitive human factors/ ergonomics input and participate in design with every single design group in our network, including Amazon Robotics, Engineering R&D, and Operations Engineering. You will work closely with our Worldwide Health and Safety organization to gain feedback on designs and work tenaciously to continuously improve our associate’s experience. Key job responsibilities - Collaborating and designing work processes and workspaces that adhere to human factors / ergonomics standards worldwide. - Producing comprehensive and assessments of workstations and processes covering biomechanical, physiological, and psychophysical demands. - Effectively communicate your design rationale to multiple engineering and operations entities. 
- Identifying gaps in current human factors standards and guidelines, and lead comprehensive studies to redefine “industry best practices” based on solid scientific foundations. - Continuously strive to gain in-depth knowledge of your profession, as well as branch out to learn about intersecting fields, such as robotics and mechatronics. - Travelling to our various sites to perform thorough assessments and gain in-depth operational feedback, approximately 25%-50% of the time. We are open to hiring candidates to work out of one of the following locations: Seattle, WA, USA