Automated evaluation of RAG pipelines with exam generation

The fight against hallucination in retrieval-augmented-generation models starts with a method for accurately assessing it.

In the swiftly evolving domain of large language models (LLMs), the accurate evaluation of retrieval-augmented-generation (RAG) models is paramount. In this blog, we introduce a pioneering methodology that employs an automated exam generation process, enhanced by item response theory (IRT), to evaluate the factual accuracy of RAG models on specific tasks. Our approach is not only robust and interpretable but also cost efficient, strategically identifying model strengths and refining exams to optimize their evaluative utility. We describe our methodology in a paper we will present in July at the 2024 International Conference on Machine Learning (ICML).

Exam generation process

RAG is a method for handling natural-language queries by retrieving relevant documents and using text from them to seed the response generated by an LLM. The expectation is that factual assertions from reliable documents will curb the LLM’s tendency to “hallucinate”, or generate reasonable-sounding but false sentences.

To evaluate a RAG model on a particular task, we use an LLM to generate multiple-choice questions from a task-specific knowledge corpus. Our method is agnostic to the retriever and generative model used in both the RAG system and the exam generation task.

RAG diagram.png
Summary of the proposed exam generation, evaluation, and iterative-improvement processes.

Our approach has two steps. For each document in the knowledge corpus, we use an LLM and several prompt-engineering strategies to create candidate questions. Then we use several natural-language-processing filters to remove low-quality questions along various axes, such as length, incorrectness, and self-containment.

We note an interesting asymmetry: given a document corpus, it is relatively easy for an LLM to generate a question and the correct answer, as the content of both is contained in the prompt. However, it is considerably more difficult to create high-quality incorrect answers, commonly referred to as discriminators.

To filter out degenerate questions, we use the Jaccard similarity coefficient and embedding-based similarity metrics.

Here is the prompt that we used for exam generation:

Human: Here is some documentation from {task_domain}: {documentation}.\n
From this generate a difficult multi-form question for an exam.
It should have 4 candidates, 1 correct answer, and explanations.

Syntax should be Question: {question}\n
A){candidate A}\n
B){candidate B}\n
C){candidate C}\n
D){candidate D}

Correct Answer: {correct answer}\n
### Assistant:"

In our research, we analyzed several RAG pipeline variants, including closed-book (no knowledge from the document corpus is provided to the LLM), oracle (the exam taker has access to the specific document used to generate the question-and-answer pair, in addition to the question itself and all possible candidate answers), and classical retrieval models such as MultiQA embeddings, Siamese network embeddings, and BM25. Our evaluations also extended to different scales of language models, from 7 billion parameters to 70 billion, to understand the impact of model scale on performance.

To demonstrate the practical utility of this methodology, we deployed it across a wide range of domains. These include Amazon Web Services (AWS) DevOps, where troubleshooting guides for cloud-based services tests the models' operational effectiveness; arXiv abstracts, which challenge the models' ability to parse and generate insights from dense scientific texts; StackExchange questions, which probe the models' responsiveness and accuracy; and SEC filings, where the complexity of financial reporting tests the models’ capacity to extract nuanced information from structured corporate documents. This multi-domain approach not only enhances the robustness of our evaluations but also ensures that our models are versatile and reliable across various real-world applications.

Evaluating the exam generation model

The following figure shows granular results of our evaluation method for the task of AWS DevOps troubleshooting. We report accuracy for different retrieval approaches and retriever sizes, on a percentage scale. Labels on the diameter show the AWS resources we’re using. Colors correspond to different retrieval approaches (Oracle, DPRV2, MultiQA, ClosedBook), and solid and broken lines correspond to different base LLM sizes (7B, 13B, and 70B). For instance, we observe that a small model such as Mistral-7B with MultiQA embeddings has an accuracy of around 80% for the AWS resource Relational Database Service (RDS).

Granular results of our exam evaluation for the task of AWS DevOps troubleshooting.png
A comparison of several different models, at a range of sizes, on the task of DevOps troubleshooting for eight different AWS resources.

Our experiments yielded four key findings. First, there’s no one-size-fits-all solution; the optimal choice of retrieval method, and to a lesser extent LLM, is typically task dependent. For example, in tasks such as SEC filings and arXiv abstracts, BM25 outperforms MultiQA and Siamese network embeddings, indicating that sparse retrieval is generally more effective than dense retrieval. This could be because such tasks often contain easily identifiable terms (e.g., AWS service names in AWS DevOps) that can be retrieved with keyword search, while other tasks, such as StackExchange, mostly contain common words.

Second, the right choice of retrieval method can lead to greater performance improvements than simply using larger LLMs. For instance, in SEC filings, we observed a greater performance gain from switching from Siamese network embeddings to DPRV2 than from switching to larger LLMs.

Third, for tasks involving closed-source knowledge, the accuracy bottleneck is typically the LLM rather than the retrieval method. Finally, a poorly aligned retriever component can result in worse accuracy than having no retrieval at all.

Exam enhancements through item response theory

Integrating item response theory (IRT) into our process has significantly improved the quality of the exams. IRT models the likelihood of a correct response based on characteristics of a question and the capabilities of a model. It uses three factors — difficulty, discrimination, and guessing chance — to create exams that more accurately reflect and predict model performance.

IRT posits that a model’s probability of correctly answering a question is correlated with a latent variable known as ability, and it provides a method for estimating the value of that variable. As such, it offers a way to quantify a model’s ability level.

Our process begins with an initial exam assessment, identifying and removing questions that contribute minimally to discriminative insights. The exam is then refined iteratively, based on updated IRT parameters, which helps it accurately gauge nuanced model behaviors.

By continuously analyzing and adjusting exams based on IRT parameters, we have seen substantial improvements in the exams’ ability to discriminate among models. For instance, we use Fisher information to quantify the informativeness of exam questions. Fisher information measures the amount of information that an observable random variable provides about an unknown parameter, offering a way to gauge the precision of statistical estimators in parameter estimation theory.

During iterative improvements for the arXiv task, the Fisher information function consistently showed progress, marking a considerable enhancement of the exams' capacity to differentiate model capabilities. This iterative process ensures that each new version of the exam is more informative than the last and effectively evaluates the RAG model’s abilities.

Evaluating the generated exams

To further enhance the assessment of RAG models, we categorize exam questions using both semantic analysis and Bloom’s revised taxonomy, devised by the University of Chicago psychologist Benjamin Bloom. Bloom’s taxonomy helps classify questions by cognitive complexity — from basic recall to analytical tasks — enabling structured evaluation of model capabilities.

Different levels in Bloom's taxonomy differentiate between the knowledge dimension (factual, conceptual, procedural, and meta-cognitive) and the cognitive-process dimension (remember, understand, apply, analyze, evaluate, and create). Additionally, we classify questions semantically by identifying keywords like “what” and “which.” These additional classifications allow us to assess how well models perform at different ability levels.

Bloom's Taxonomy.png
Average Fisher information for each category in Bloom’s taxonomy category (left) and semantic category (right) for the StackExchange task.

The above two figures present the average Fisher information value for each Bloom category (left) and semantic category (right) for the StackExchange task. For this specific task, we observe that “evaluating” and “understanding” are the most discriminate dimensions in Bloom’s taxonomy across different ability levels, while “remembering” is the least discriminatory.

On the semantic categories, we observe that “what” and “which” were the most discriminatory terms for lower ability levels, and “when” discriminated more at higher ability levels. One interpretation is that “what” and “how” questions tend to be more factual and syntax-based in the StackExchange domain, so at lower ability levels, RAG struggles more with these genres of questions.

The following figure illustrates the maximization process for the arXiv task as the exam and IRT estimation evolve. We show the results for three incremental steps. We observe a 0.05 increase in Fisher information even with a single iteration. This progress reaches a 0.1 increase in the subsequent steps.

Exam Information Curve.png
The maximization process, as the exam and IRT estimation evolve, for the task of generating abstracts for arXiv papers.

To expand our approach beyond Q&A applications, our future research will focus on domains such as summarization, translation, and sentiment analysis. We are also addressing the complex task of meta-evaluation, comparing and refining our evaluation methods to account for the multidimensional nature of LLM performance. Additionally, we will continuously update our methodologies to accommodate the rapid evolution of LLM technology, ensuring robust and comprehensive assessment of emerging models.

Acknowledgments: Laurent Callot

Research areas

Related content

US, WA, Redmond
Project Kuiper is an initiative to launch a constellation of Low Earth Orbit satellites that will provide low-latency, high-speed broadband connectivity to unserved and under-served communities around the world. We are looking for an accomplished Applied Scientist who will deliver science applications such as anomaly detection, advanced calibration methods, space engineering simulations, and performance analytics -- to name a few. Key job responsibilities • Translate ambiguous problems into well defined mathematical problems • Prototype, test, and implement state-of-the-art algorithms for antenna pointing calibration, anomaly detection, predictive failure models, and ground terminal performance evaluation • Provide actionable recommendations for system design/definition by defining, running, and summarizing physically-accurate simulations of ground terminal functionality • Collaborate closely with engineers to deploy performant, scalable, and maintainable applications in the cloud Export Control Requirement: Due to applicable export control laws and regulations, candidates must be a U.S. citizen or national, U.S. permanent resident (i.e., current Green Card holder), or lawfully admitted into the U.S. as a refugee or granted asylum. A day in the life In this role as an Applied Scientist, you will design, implement, optimize, and operate systems critical to the uptime and performance of Kuiper ground terminals. Your contributions will have a direct impact on customers around the world. About the team This role will be part of the Ground Software & Analytics team, part of Ground Systems Engineering. Our team is responsible for: • Design, development, deployment, and support of a Tier-1 Monitoring and Remediation System (MARS) needed to maintain high availability of hundreds of ground terminals deployed around the world • Ground systems integration/test (I&T) automation • Ground terminal configuration, provisioning, and acceptance automation • Systems analysis • Algorithm development (pointing/tracking/calibration/monitoring) • Software interface definition for supplier-provided hardware and development of software test automation
US, VA, Arlington
Are you looking to work at the forefront of Machine Learning and AI? Would you be excited to apply Generative AI algorithms to solve real world problems with significant impact? The Generative AI Innovation Center helps AWS customers implement Generative AI solutions and realize transformational business opportunities. This is a team of strategists, scientists, engineers, and architects working step-by-step with customers to build bespoke solutions that harness the power of generative AI. Starting in 2024, the Innovation Center launched a new Custom Model and Optimization program to help customers develop and scale highly customized generative AI solutions. The team helps customers imagine and scope bespoke use cases that will create the greatest value for their businesses, define paths to navigate technical or business challenges, develop and optimize models to power their solutions, and make plans for launching solutions at scale. The GenAI Innovation Center team provides guidance on best practices for applying generative AI responsibly and cost efficiently. You will work directly with customers and innovate in a fast-paced organization that contributes to game-changing projects and technologies. You will design and run experiments, research new algorithms, and find new ways of optimizing risk, profitability, and customer experience. We’re looking for Applied Scientists capable of using GenAI and other techniques to design, evangelize, and implement state-of-the-art solutions for never-before-solved problems. Key job responsibilities • Collaborate with AI/ML scientists and architects to research, design, develop, and evaluate generative AI solutions to address real-world challenges • Interact with customers directly to understand their business problems, aid them in implementation of generative AI solutions, brief customers and guide them on adoption patterns and paths to production • Help customers optimize their solutions through approaches such as model selection, training or tuning, right-sizing, distillation, and hardware optimization • Provide customer and market feedback to product and engineering teams to help define product direction
US, CA, Culver City
Prime Video is a first-stop entertainment destination offering customers a vast collection of premium programming in one app available across thousands of devices. Prime members can customize their viewing experience and find their favorite movies, series, documentaries, and live sports – including Amazon MGM Studios-produced series and movies; licensed fan favorites; and programming from Prime Video add-on subscriptions such as Apple TV+, Max, Crunchyroll and MGM+. All customers, regardless of whether they have a Prime membership or not, can rent or buy titles via the Prime Video Store, and can enjoy even more content for free with ads. Are you interested in shaping the future of entertainment? Prime Video's technology teams are creating best-in-class digital video experience. As a Prime Video technologist, you’ll have end-to-end ownership of the product, user experience, design, and technology required to deliver state-of-the-art experiences for our customers. You’ll get to work on projects that are fast-paced, challenging, and varied. You’ll also be able to experiment with new possibilities, take risks, and collaborate with remarkable people. We’ll look for you to bring your diverse perspectives, ideas, and skill-sets to make Prime Video even better for our customers. With global opportunities for talented technologists, you can decide where a career Prime Video Tech takes you! We are forming a new organization within Prime Video to redefine our operational landscape through the power of artificial intelligence. As a Applied Scientist within this initiative, you will be a technical leader helping to design and build the intelligent systems that power our vision. You will tackle complex and ambiguous problems, designing and delivering scalable and resilient agentic AI and ML solutions from the ground up. You will not only write high-quality, maintainable software and models, but also mentor other scientists, influence our technical strategy, and drive engineering best practices across the team. Your work will directly contribute to making Prime Video's operations more efficient and will set the technical foundation for years to come. Key job responsibilities • Lead the design and architecture of highly scalable, available, and resilient services for our AI automation platform. • Write high-quality, maintainable, and robust code to solve complex business problems, building flexible systems without over-engineering. • Act as a technical leader and mentor for other engineers on the team, assisting with career growth and encouraging excellence. • Work through ambiguous requirements, cut through complexity, and translate business needs into scalable technical solutions. • Take ownership of the full software development lifecycle, including design, testing, deployment, and operations. • Work closely with product managers, scientists, and other engineers to build and launch new features and systems. About the team This role offers a unique opportunity to shape the future of one of Amazon's most exciting businesses through the application of AI technologies. If you're passionate about leveraging AI to drive real-world impact at massive scale, we want to hear from you.
US, CA, Sunnyvale
Prime Video is a first-stop entertainment destination offering customers a vast collection of premium programming in one app available across thousands of devices. Prime members can customize their viewing experience and find their favorite movies, series, documentaries, and live sports – including Amazon MGM Studios-produced series and movies; licensed fan favorites; and programming from Prime Video add-on subscriptions such as Apple TV+, Max, Crunchyroll and MGM+. All customers, regardless of whether they have a Prime membership or not, can rent or buy titles via the Prime Video Store, and can enjoy even more content for free with ads. Are you interested in shaping the future of entertainment? Prime Video's technology teams are creating best-in-class digital video experience. As a Prime Video technologist, you’ll have end-to-end ownership of the product, user experience, design, and technology required to deliver state-of-the-art experiences for our customers. You’ll get to work on projects that are fast-paced, challenging, and varied. You’ll also be able to experiment with new possibilities, take risks, and collaborate with remarkable people. We’ll look for you to bring your diverse perspectives, ideas, and skill-sets to make Prime Video even better for our customers. With global opportunities for talented technologists, you can decide where a career Prime Video Tech takes you! We are looking for a self-motivated, passionate and resourceful Applied Scientist to bring diverse perspectives, ideas, and skill-sets to make Prime Video even better for our customers. You will spend your time as a hands-on machine learning practitioner and a research leader. You will play a key role on the team, building and guiding machine learning models from the ground up. At the end of the day, you will have the reward of seeing your contributions benefit millions of Amazon.com customers worldwide. Key job responsibilities - Develop AI solutions for various Prime Video Search systems using Deep learning, GenAI, Reinforcement Learning, and optimization methods; - Work closely with engineers and product managers to design, implement and launch AI solutions end-to-end; - Design and conduct offline and online (A/B) experiments to evaluate proposed solutions based on in-depth data analyses; - Effectively communicate technical and non-technical ideas with teammates and stakeholders; - Stay up-to-date with advancements and the latest modeling techniques in the field; - Publish your research findings in top conferences and journals. About the team Prime Video Search Science team owns science solution to power search experience on various devices, from sourcing, relevance, ranking, to name a few. We work closely with the engineering teams to launch our solutions in production.
US, WA, Seattle
The People eXperience and Technology Central Science (PXTCS) team uses economics, behavioral science, statistics, and machine learning to proactively identify mechanisms and process improvements which simultaneously improve Amazon and the lives, wellbeing, and the value of work to Amazonians. PXTCS is an interdisciplinary team that combines the talents of science and engineering to develop and deliver solutions that measurably achieve this goal. PXTCS is looking for an economist who can apply economic methods to address business problems. The ideal candidate will work with engineers and computer scientists to estimate models and algorithms on large scale data, design pilots and measure impact, and transform successful prototypes into improved policies and programs at scale. PXTCS is looking for creative thinkers who can combine a strong technical economic toolbox with a desire to learn from other disciplines, and who know how to execute and deliver on big ideas as part of an interdisciplinary technical team. Ideal candidates will work in a team setting with individuals from diverse disciplines and backgrounds. They will work with teammates to develop scientific models and conduct the data analysis, modeling, and experimentation that is necessary for estimating and validating models. They will work closely with engineering teams to develop scalable data resources to support rapid insights, and take successful models and findings into production as new products and services. They will be customer-centric and will communicate scientific approaches and findings to business leaders, listening to and incorporate their feedback, and delivering successful scientific solutions. A day in the life The Economist will work with teammates to apply economic methods to business problems. This might include identifying the appropriate research questions, writing code to implement a DID analysis or estimate a structural model, or writing and presenting a document with findings to business leaders. Our economists also collaborate with partner teams throughout the process, from understanding their challenges, to developing a research agenda that will address those challenges, to help them implement solutions. About the team PXTCS is a multidisciplinary science team that develops innovative solutions to make Amazon Earth's Best Employer
US, NY, New York
The Ads Measurement Science team in the Measurement, Ad Tech, and Data Science (MADS) team of Amazon Ads serves a centralized role developing solutions for a multitude of performance measurement products. We create solutions which measure the comprehensive impact of advertiser's ad spend, including sales impacts both online and offline and across timescales, and provide actionable insights that enable our advertisers to optimize their media portfolios. We also own the science solutions for AI tools that unlock new insights and automate high-effort customer workflows, such as custom query and report generation based on natural language user requests. We leverage a host of scientific technologies to accomplish this mission, including Generative AI, classical ML, Causal Inference, Natural Language Processing, and Computer Vision. As an Applied Scientist on the team, you will lead measurement solutions end-to-end from inception to production. You will propose, design, analyze, and productionize models to provide novel measurement insights to our customers. Key job responsibilities - Leverage deep expertise in one or more scientific disciplines to invent solutions to ambiguous ads measurement problems - Disambiguate problems to propose clear evaluation frameworks and success criteria - Work autonomously and write high quality technical documents - Implement a significant portion of critical-path code, and partner with engineers to directly carry solutions into production - Partner closely with other scientists to deliver large, multi-faceted technical projects - Share and publish works with the broader scientific community through meetings and conferences - Communicate clearly to both technical and non-technical audiences - Contribute new ideas that shape the direction of the team's work - Mentor more junior scientists and participate in the hiring process About the team We are a team of scientists across Applied, Research, Data Science and Economist disciplines. You will work with colleagues with deep expertise in ML, NLP, CV, Gen AI, and Causal Inference with a diverse range of backgrounds. We partner closely with top-notch engineers, product managers, sales leaders, and other scientists with expertise in the ads industry and on building scalable modeling and software solutions.
US, WA, Bellevue
Are you inspired by invention? Do you like the idea of seeing how your work impacts the bigger picture? Answer yes to any of these and you’ll fit right in here at Amazon Last Mile Simulations and Analytics Engineering team. WW AMZL Simulations and Analytics Engineering team is looking to build out our Simulation team to drive innovation across our Last Mile network. We start with the customer and work backwards in everything we do. If you’re interested in joining a rapidly growing team working to build a unique, solutions advisory group with a relentless focus on the customer, you’ve come to the right place. This is a blue-sky role that gives you a chance to roll up your sleeves and dive into big data sets in order to build discrete event 3D simulations using tools like Flexsim, Anylogic, Emulate 3D etc and experimentation systems at scale, build optimization algorithms and leverage cutting-edge technologies across Amazon. This is an opportunity to think big about how to solve a challenging problem for the customers. As a Simulation Scientist, you are expected to deep dive into complex problems and drive relentlessly towards innovative solutions working with cross functional teams. Be comfortable interfacing and influencing various functional teams and individuals at all levels of the organization in order to be successful. Lead strategic modelling and simulation projects related to drive process design decisions. Your expertise in synthesizing and communicating insights and recommendations to audiences of varying levels of technical sophistication will enable you to answer specific business questions and innovate for the future. You will apply cutting edge designs and methodologies for complex use cases across Last Mile network to drive innovation. In addition, you will contribute to the end state vision for simulation and experimentation of future delivery stations at Amazon. Key job responsibilities Key job responsibilities • Lead the design, implementation, and delivery of the simulation data science solutions to perform system of systems discrete event simulations for significantly complex operational processes that have a long-term impact on a product, business, or function using FlexSim, Demo 3D, AnyLogic or any other Discrete Event Simulation (DES) software packages • Lead strategic modeling and simulation research projects to drive process design decisions • Be an exemplary practitioner in simulation science discipline to establish best practices and simplify problems to develop discrete event simulations faster with higher standards • Identify and tackle intrinsically hard process flow simulation problems (e.g., highly complex, ambiguous, undefined, with less existing structure, or having significant business risk or potential for significant impact • Deliver artifacts that set the standard in the organization for excellence, from process flow control algorithm design to validation to implementations to technical documents using simulations • Be a pragmatic problem solver by applying judgment and simulation experience to balance cross-organization trade-offs between competing interests and effectively influence, negotiate, and communicate with internal and external business partners, contractors and vendors for multiple simulation projects • Provide simulation data and measurements that influence the business strategy of an organization. Write effective white papers and artifacts while documenting your approach, simulation outcomes, recommendations, and arguments • Lead and actively participate in reviews of simulation research science solutions. You bring clarity to complexity, probe assumptions, illuminate pitfalls, and foster shared understanding within simulation data science discipline • Pay a significant role in the career development of others, actively mentoring and educating the larger simulation data science community on trends, technologies, and best practices • Use advanced statistical /simulation tools and develop codes (python or another object oriented language) for data analysis , simulation, and developing modeling algorithms • Lead and coordinate simulation efforts between internal teams and outside vendors to develop optimal solutions for the network, including equipment specification, material flow control logic, process design, and site layout • Deliver results according to project schedules and quality A day in the life If you are not sure that every qualification on the list above describes you exactly, we'd still love to hear from you! At Amazon, we value people with unique backgrounds, experiences, and skillsets. If you’re passionate about this role and want to make an impact on a global scale, please apply!
IN, KA, Bengaluru
Do you want to join an innovative team of scientists who use machine learning and statistical techniques to create state-of-the-art solutions for providing better value to Amazon’s customers? Do you want to build and deploy advanced ML systems that help optimize millions of transactions every day? Are you excited by the prospect of analyzing and modeling terabytes of data to solve real-world problems? Do you like to own end-to-end business problems/metrics and directly impact the profitability of the company? Do you like to innovate and simplify? If yes, then you may be a great fit to join the Machine Learning team for India Consumer Businesses. Machine Learning, Big Data and related quantitative sciences have been strategic to Amazon from the early years. Amazon has been a pioneer in areas such as recommendation engines, ecommerce fraud detection and large-scale optimization of fulfillment center operations. As Amazon has rapidly grown and diversified, the opportunity for applying machine learning has exploded. We have a very broad collection of practical problems where machine learning systems can dramatically improve the customer experience, reduce cost, and drive speed and automation. These include product bundle recommendations for millions of products, safeguarding financial transactions across by building the risk models, improving catalog quality via extracting product attribute values from structured/unstructured data for millions of products, enhancing address quality by powering customer suggestions We are developing state-of-the-art machine learning solutions to accelerate the Amazon India growth story. Amazon India is an exciting place to be at for a machine learning practitioner. We have the eagerness of a fresh startup to absorb machine learning solutions, and the scale of a mature firm to help support their development at the same time. As part of the India Machine Learning team, you will get to work alongside brilliant minds motivated to solve real-world machine learning problems that make a difference to millions of our customers. We encourage thought leadership and blue ocean thinking in ML. Key job responsibilities Use machine learning and analytical techniques to create scalable solutions for business problems Analyze and extract relevant information from large amounts of Amazon’s historical business data to help automate and optimize key processes Design, develop, evaluate and deploy, innovative and highly scalable ML models Work closely with software engineering teams to drive real-time model implementations Work closely with business partners to identify problems and propose machine learning solutions Establish scalable, efficient, automated processes for large scale data analyses, model development, model validation and model maintenance Work proactively with engineering teams and product managers to evangelize new algorithms and drive the implementation of large-scale complex ML models in production Leading projects and mentoring other scientists, engineers in the use of ML techniques About the team International Machine Learning Team is responsible for building novel ML solutions that attack India first (and other Emerging Markets across MENA and LatAm) problems and impact the bottom-line and top-line of India business. Learn more about our team from https://www.amazon.science/working-at-amazon/how-rajeev-rastogis-machine-learning-team-in-india-develops-innovations-for-customers-worldwide
US, WA, Bellevue
We are seeking a passionate, talented, and inventive individual to join the Applied AI team and help build industry-leading technologies that customers will love. This team offers a unique opportunity to make a significant impact on the customer experience and contribute to the design, architecture, and implementation of a highly innovative product. The mission of the Applied AI team is to enable organizations within Worldwide Amazon.com Stores to accelerate the adoption of AI technologies across various parts of our business. We are looking for a Senior Applied Science manager to join our Applied AI team and lead a cross-functional team of scientists and engineers who work on LLM-based solutions. On our team you will push the boundaries of ML and Generative AI techniques to scale the inputs for hundreds of billions of dollars of annual revenue for our eCommerce business. If you have a passion for AI technologies, a drive to innovate and a desire to make a meaningful impact, we invite you to become a valued member of our team. You will be responsible for leading a cross functional team of scientists and engineer and developing and maintaining the systems and tools that enable us to accelerate knowledge operations and work in the intersection of Science and Engineering. You will push the boundaries of ML and Generative AI techniques to scale the inputs for hundreds of billions of dollars of annual revenue for our eCommerce business. If you have a passion for AI technologies, a drive to innovate and a desire to make a meaningful impact, we invite you to become a valued member of our team. We are seeking an experienced Senior Applied Science Manager who combines superb technical, research, analytical and leadership capabilities with a demonstrated ability to get the right things done quickly and effectively. This person must be comfortable working with a team of top-notch developers and collaborating with our research teams. We’re looking for someone who innovates, and loves solving hard problems. You will be expected to have an established background in leading teams that build highly scalable systems and system design, have excellent project management skills, great communication skills, and a motivation to achieve results in a fast-paced environment. You should be somebody who enjoys working on complex problems, is customer-centric, and feels strongly about building good software as well as making that software achieve its operational goals. You will leverage Amazon’s heterogeneous data sources and large-scale computing resources to accelerate advances in artificial intelligence. Your work will directly impact our customers in the form of novel products and services.
US, WA, Seattle
The Sponsored Products and Brands (SPB) team at Amazon Ads is re-imagining the advertising landscape through generative AI technologies, revolutionizing how millions of customers discover products and engage with brands across Amazon.com and beyond. We are at the forefront of re-inventing advertising experiences, bridging human creativity with artificial intelligence to transform every aspect of the advertising lifecycle from ad creation and optimization to performance analysis and customer insights. We are a passionate group of innovators dedicated to developing responsible and intelligent AI technologies that balance the needs of advertisers, enhance the shopping experience, and strengthen the marketplace. If you're energized by solving complex challenges and pushing the boundaries of what's possible with AI, join us in shaping the future of advertising. This position will be part of the Conversational Ad Experiences team within the Amazon Advertising organization. Our cross-functional team focuses on designing, developing and launching innovative ad experiences delivered to shoppers in conversational contexts. We utilize leading-edge engineering and science technologies in generative AI to help shoppers discover new products and brands through intuitive, conversational, multi-turn interfaces. We also empower advertisers to reach shoppers, using their own voice to explain and demonstrate how their products meet shoppers' needs. We collaborate with various teams across multiple Amazon organizations to push the boundary of what's possible in these fields. We are seeking a science leader for our team within the Sponsored Products & Brands organization. You'll be working with talented scientists, engineers, and product managers to innovate on behalf of our customers. An ideal candidate is able to navigate through ambiguous requirements, working with various partner teams, and has experience in generative AI, large language models (LLMs), information retrieval, and ads recommendation systems. Using a combination of generative AI and online experimentation, our scientists develop insights and optimizations that enable the monetization of Amazon properties while enhancing the experience of hundreds of millions of Amazon shoppers worldwide. If you're fired up about being part of a dynamic, driven team, then this is your moment to join us on this exciting journey! Key job responsibilities - Serve as a tech lead for defining the science roadmap for multiple projects in the conversational ad experiences space powered by LLMs. - Build POCs, optimize and deploy models into production, run experiments, perform deep dives on experiment data to gather actionable learnings and communicate them to senior leadership - Work closely with software engineers on detailed requirements, technical designs and implementation of end-to-end solutions in production. - Work closely with product managers to contribute to our mission, and proactively identify opportunities where science can help improve customer experience - Research new machine learning approaches to drive continued scientific innovation - Be a member of the Amazon-wide machine learning community, participating in internal and external meetups, hackathons and conferences - Help attract and recruit technical talent, mentor scientists and engineers in the team