Making DeepSpeed ZeRO run efficiently on more-affordable hardware

Amazon researchers optimize the distributed-training tool to run efficiently on the Elastic Fabric Adapter network interface.

Most modern natural-language-processing applications are built on top of pretrained language models, which encode the probabilities of word sequences for entire languages. Over time, these models have trended larger and larger, into the regime of billions or even trillions of parameters.

Training these models in a reasonable amount of time requires very large computing clusters, but the resulting communication volume can block computation, leaving GPUs underutilized. Communication between the GPUs must therefore be carefully managed to keep it from becoming a performance bottleneck.


Microsoft’s DeepSpeed distributed-training library introduced one such management technique, called the Zero Redundancy Optimizer (ZeRO). ZeRO works by partitioning the state of a machine learning model across distributed workers and fetching the necessary model state from other workers during training. ZeRO has several “stages,” each of which allows the training of larger models by reducing memory requirements, typically at the cost of additional communication volume.

While Microsoft researchers were able to achieve ideal scaling performance with this technique, they reported experiments only on a specialized hypercluster that uses expensive high-speed InfiniBand networking (specifically, an Nvidia DGX system).

To reduce costs for customers in need of high-performance computing, Amazon Web Services (AWS) uses an Elastic Fabric Adapter (EFA) network instead of InfiniBand. The EFA available on AWS p4d.24xlarge instances has less communication bandwidth than the InfiniBand on the Nvidia DGX hypercluster, so we would expect some performance dropoff on bandwidth-intensive tasks. When we tried to reproduce Microsoft’s results, however, we found that the relative dropoff in ZeRO’s third stage was twice the dropoff in the second stage.

We profiled the training process to look for bottlenecks and observed that in ZeRO Stage 3, communication dominated training time. We have made a series of optimizations to ZeRO Stage 3 in order to close the performance gap relative to results obtained on InfiniBand-equipped DGX clusters. Below is a table showing the overall performance improvement conferred by our optimizations, measured when training a RoBERTa language model on AWS p4d.24xlarge instances.

Model         Number of GPUs   TFLOPS/GPU (optimized)   TFLOPS/GPU (unoptimized)
RoBERTa-10B   64               123                      73
RoBERTa-50B   64               154                      89

In January, we merged our optimizations into the DeepSpeed code repository for public use.

Optimizations


Our optimizations can roughly be categorized as (1) improving overlap between communication and computation, (2) improving bandwidth utilization, and (3) improving memory efficiency.

Synchronization/Parallelism

Finer-grained synchronization between communication and computation streams

In lower-bandwidth or large clusters where communication times dominate, it is critical to mask communication costs by overlapping computation with communication. Through profiling, we found that this overlapping was limited by ZeRO’s overly coarse synchronization.

This resulted in a suboptimal level of overlapping for two distributed-computing operations: allgather, which aggregates data (in this case, model parameters) from all workers across the network, and reduce-scatter, which reduces data (in this case, summing gradients) across workers. These two operations were causing poor GPU utilization because communication was constantly blocking computation operations. In response, we made significant changes to the parameter gathering and gradient reduce-scatter paths to reduce or remove synchronization while maintaining correctness.

After these changes, we were able to achieve much better overlap and thus far fewer and smaller computation bubbles.
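As a simplified illustration of the direction these changes take, the sketch below uses PyTorch's asynchronous collectives, whose work handles synchronize only the CUDA stream that consumes the result rather than the whole device. Here gather_param_async is a hypothetical helper, not DeepSpeed's actual code:

```python
import torch
import torch.distributed as dist

def gather_param_async(shard, full_buffer, group=None):
    """Issue the allgather asynchronously and return a work handle,
    instead of synchronizing the whole device (illustrative sketch)."""
    return dist.all_gather_into_tensor(full_buffer, shard,
                                       group=group, async_op=True)

# Usage sketch:
#   handle = gather_param_async(shard, full_buffer)
#   run_independent_layers()   # overlaps with the in-flight gather
#   handle.wait()              # enqueues a wait on the current CUDA
#                              # stream only; the host never blocks
#   use_param(full_buffer)
```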

Precomputation/caching of Python fetching and partitioning decisions 


During training, many complex decisions need to be made, relating to which parameters should be fetched, which parameters will be used next, which parameters may be reused soon and should be kept, and which can be released. These operations were slow enough to frequently prevent the Python process from keeping GPUs fed with work, creating large computation bubbles.

We optimized this by precomputing or caching as many decisions as possible, speeding them up to the point that they are no longer a factor in training throughput.
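The sketch below illustrates the general record-and-replay idea with a hypothetical FetchSchedule class; DeepSpeed's actual bookkeeping is considerably more involved:

```python
class FetchSchedule:
    """Run the slow Python decision logic once, record the result per
    module, and replay the cached schedule on later steps (sketch)."""

    def __init__(self):
        self.trace = []        # (module id, params to fetch), in order
        self.recording = True
        self.cursor = 0

    def params_for(self, module, decide_fn):
        if self.recording:
            params = decide_fn(module)   # expensive Python bookkeeping
            self.trace.append((id(module), params))
            return params
        mid, params = self.trace[self.cursor]
        assert mid == id(module), "traversal order changed; re-record"
        self.cursor = (self.cursor + 1) % len(self.trace)
        return params

    def finish_recording(self):
        """Call at the end of the first step to switch to replay mode."""
        self.recording = False
```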

Communication/bandwidth use

Batching allgather/reduce-scatter calls

We found that batching the collective operations — allgather and reduce-scatter — uses bandwidth more efficiently and amortizes the fixed costs of running the computational kernels that execute the operations. To implement collective batching, we flatten tensor data into a single, contiguous buffer to be sent in a single transaction. Each collective requires a special interleaving scheme to ensure that each worker receives the correct data.
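As a rough illustration of both steps, here is a minimal sketch of a batched allgather, assuming every rank holds equal-sized, rank-ordered shards of each parameter; batched_allgather is a hypothetical helper, not DeepSpeed's implementation:

```python
import torch
import torch.distributed as dist

def batched_allgather(shards, group=None):
    """Gather many parameter shards in one collective call (sketch)."""
    world = dist.get_world_size(group)
    # Flatten all shards into one contiguous buffer: one transaction
    # instead of one collective per parameter.
    flat = torch.cat([s.reshape(-1) for s in shards])
    out = torch.empty(world * flat.numel(),
                      dtype=flat.dtype, device=flat.device)
    dist.all_gather_into_tensor(out, flat, group=group)

    # Output is rank-major: [rank 0's shards | rank 1's shards | ...].
    # Un-interleave so each parameter's shards become contiguous again.
    per_rank = out.view(world, flat.numel())
    gathered, offset = [], 0
    for s in shards:
        n = s.numel()
        gathered.append(torch.cat([per_rank[r, offset:offset + n]
                                   for r in range(world)]))
        offset += n
    return gathered
```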

[Figure: Allgather interleaving scheme.]
[Figure: Reduce-scatter interleaving scheme.]

Memory

Our implementation of ZeRO, like the Microsoft implementation, uses the Compute Unified Device Architecture (CUDA), Nvidia’s parallel-computing platform. CUDA memory allocations are both synchronous and slow (setting aside the stream-ordered alternatives cudaMallocAsync and cudaFreeAsync, which are not yet used in PyTorch), so PyTorch uses a caching allocator to avoid the large costs of constantly reallocating memory. If there are no cached or free blocks that can serve an allocation request, the allocator flushes its cache. This is disastrous for a few reasons:

  • Before the flush can begin, several cudaEventSynchronize calls are necessary to allow computation on held memory to complete. This and the subsequent cudaFree calls can take multiple seconds.
  • Different workers are not guaranteed to flush their caches simultaneously. This means that for any collective, if even a single worker is currently flushing its cache, the other N-1 workers sit blocked waiting for that worker to join. As cluster size increases, so does the probability that at least one worker is flushing its cache for any given collective.
  • After the cache flush, subsequent allocations require cudaMalloc calls, which as mentioned earlier are both synchronous and slow.

For these reasons, memory efficiency is critical for performance.
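For readers who want to check whether cache flushes are happening in their own training runs, PyTorch's allocator statistics expose a counter for exactly this event; the following is a small debugging sketch (assuming a recent PyTorch with the num_alloc_retries counter):

```python
import torch

def alloc_retries():
    """num_alloc_retries increments each time a cudaMalloc failed and
    the caching allocator had to flush its cache and retry."""
    return torch.cuda.memory_stats().get("num_alloc_retries", 0)

before = alloc_retries()
# ... run one training step ...
if alloc_retries() > before:
    print("allocator flushed its cache this step; expect a latency spike")
```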

Memory-efficient batched PyTorch collectives

Although our use of batched collectives significantly reduced kernel launch overhead and improved bandwidth utilization, it also increased memory consumption, because batching flattens tensors into an additional buffer.


To avoid redundant flatten operations in PyTorch collectives, we used the *_base variants of the collective operations, which accept pre-flattened tensors, avoiding the need to internally allocate additional flattened buffers. In future work, we plan to use group-based batching operations from the Nvidia Collective Communications Library (NCCL) to eliminate all flattening operations.
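In recent PyTorch, these variants are exposed publicly as all_gather_into_tensor and reduce_scatter_tensor; here is a minimal sketch of the allgather case:

```python
import torch
import torch.distributed as dist

def gather_into_preallocated(shard, group=None):
    """Tensor-based ("*_base") collective sketch: the caller supplies
    one pre-flattened output buffer, so no internal flatten buffer is
    allocated the way the list-based all_gather requires."""
    world = dist.get_world_size(group)
    out = torch.empty(world * shard.numel(),
                      dtype=shard.dtype, device=shard.device)
    # all_gather_into_tensor / reduce_scatter_tensor are the public
    # names of _all_gather_base / _reduce_scatter_base in recent PyTorch.
    dist.all_gather_into_tensor(out, shard, group=group)
    return out
```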

More aggressive initialization-time defragmentation of parameter partitions

Even with more than 10 GB of free GPU memory, we continued to see evidence of allocator cache flushes, suggesting memory fragmentation. To reduce fragmentation, we made initialization-time changes that move all persisted tensors into a single contiguous buffer.
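A minimal sketch of the packing idea, with pack_into_contiguous as a hypothetical helper rather than our actual initialization code:

```python
import torch

def pack_into_contiguous(tensors):
    """Copy persisted tensors into one contiguous buffer and rebind
    each tensor to a view of it, so the caching allocator sees a single
    block instead of many scattered ones (illustrative sketch)."""
    total = sum(t.numel() for t in tensors)
    buf = torch.empty(total, dtype=tensors[0].dtype,
                      device=tensors[0].device)
    offset = 0
    for t in tensors:
        n = t.numel()
        view = buf.narrow(0, offset, n).view_as(t)
        view.copy_(t)
        t.data = view   # t now lives inside the packed buffer
        offset += n
    return buf
```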

Miscellaneous

In addition to the optimizations described above, we also

  • optimized gradient normalization by reducing host-device data movement and synchronization and by pulling math operations out of a for-loop into a single parallelized kernel launch (see the sketch after this list); and
  • removed tensor operations (.norm()) that were being added to debug messages via string formatting. (Formatting a GPU tensor into a string forces a device-to-host copy, which meant data movement and host-device synchronization.)
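As promised above, a sketch of the gradient-norm idea; torch._foreach_norm is a private PyTorch API, and this illustrates the approach rather than DeepSpeed's actual kernel:

```python
import torch

def global_grad_norm(grads):
    """Compute the global L2 norm with device-side math only (sketch)."""
    # torch._foreach_norm fuses the per-tensor norms into a few
    # multi-tensor kernel launches (private API; a plain list
    # comprehension of torch.linalg.vector_norm also works).
    per_tensor = torch.stack(torch._foreach_norm(grads))
    total = torch.linalg.vector_norm(per_tensor)
    return total   # still a device tensor; call .item() only when the
                   # scalar is truly needed, to avoid extra host syncs
```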

By making DeepSpeed ZeRO Stage 3 performant on widely available public cloud offerings, we hope to further democratize the training of large language models.

Acknowledgments: Zhen Zhang, Stephen Rawls, Yida Wang
