Automated evaluation of RAG pipelines with exam generation

The fight against hallucination in retrieval-augmented-generation models starts with a method for accurately assessing it.

In the swiftly evolving domain of large language models (LLMs), the accurate evaluation of retrieval-augmented-generation (RAG) models is paramount. In this blog, we introduce a pioneering methodology that employs an automated exam generation process, enhanced by item response theory (IRT), to evaluate the factual accuracy of RAG models on specific tasks. Our approach is not only robust and interpretable but also cost efficient, strategically identifying model strengths and refining exams to optimize their evaluative utility. We describe our methodology in a paper we will present in July at the 2024 International Conference on Machine Learning (ICML).

Exam generation process

RAG is a method for handling natural-language queries by retrieving relevant documents and using text from them to seed the response generated by an LLM. The expectation is that factual assertions from reliable documents will curb the LLM’s tendency to “hallucinate”, or generate reasonable-sounding but false sentences.

To evaluate a RAG model on a particular task, we use an LLM to generate multiple-choice questions from a task-specific knowledge corpus. Our method is agnostic to the retriever and generative model used in both the RAG system and the exam generation task.

Summary of the proposed exam generation, evaluation, and iterative-improvement processes.

Our approach has two steps. First, for each document in the knowledge corpus, we use an LLM and several prompt-engineering strategies to create candidate questions. Then we apply several natural-language-processing filters that screen the candidates along various axes, such as length, correctness, and self-containment, and remove the low-quality ones.

We note an interesting asymmetry: given a document corpus, it is relatively easy for an LLM to generate a question and the correct answer, as the content of both is contained in the prompt. However, it is considerably more difficult to create high-quality incorrect answers, commonly referred to as discriminators.

To filter out degenerate questions, we use the Jaccard similarity coefficient and embedding-based similarity metrics.
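As a minimal sketch of the Jaccard part of such a filter: the post does not say which texts are compared or what cutoff is used, so the pairwise comparison of candidate answers, the `is_degenerate` helper, and the 0.8 threshold below are all illustrative assumptions. The embedding-based metrics are not shown.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_degenerate(candidates: list[str], threshold: float = 0.8) -> bool:
    """Flag a question if any two candidate answers are near-duplicates.

    Hypothetical rule for illustration; the threshold is an assumption.
    """
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if jaccard(candidates[i], candidates[j]) >= threshold:
                return True
    return False
```

A question whose candidates include both "Restart the instance" and "restart the instance." would be flagged, since the two token sets are identical.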

Here is the prompt that we used for exam generation:

Human: Here is some documentation from {task_domain}: {documentation}.\n
From this generate a difficult multi-form question for an exam.
It should have 4 candidates, 1 correct answer, and explanations.

Syntax should be Question: {question}\n
A){candidate A}\n
B){candidate B}\n
C){candidate C}\n
D){candidate D}

Correct Answer: {correct answer}\n
### Assistant:
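The sketch below shows one way to fill in a template like this and parse a response that follows the requested syntax. The helper names `fill_prompt` and `parse_exam_item` are hypothetical, and the regular expression assumes each candidate starts on its own line; real LLM output may deviate from the format, in which case the parser returns `None`.

```python
import re
from typing import Optional


def fill_prompt(template: str, task_domain: str, documentation: str) -> str:
    """Substitute the two inputs into the prompt template.

    str.replace is used instead of str.format because the template also
    contains literal braces (e.g. {candidate A}) that must reach the LLM as-is.
    """
    return (template.replace("{task_domain}", task_domain)
                    .replace("{documentation}", documentation))


# Assumes "Question:", each "X)" candidate, and "Correct Answer:" start new lines.
_ITEM = re.compile(
    r"Question:\s*(?P<question>.+?)\s*\n\s*"
    r"A\)\s*(?P<A>.+?)\s*\n\s*"
    r"B\)\s*(?P<B>.+?)\s*\n\s*"
    r"C\)\s*(?P<C>.+?)\s*\n\s*"
    r"D\)\s*(?P<D>.+?)\s*\n\s*"
    r"Correct Answer:\s*(?P<answer>[A-D])",
    re.DOTALL,
)


def parse_exam_item(text: str) -> Optional[dict]:
    """Parse one generated question; return None if the syntax is not followed."""
    m = _ITEM.search(text)
    if m is None:
        return None
    return {
        "question": m["question"],
        "candidates": {k: m[k] for k in "ABCD"},
        "answer": m["answer"],  # letter of the correct candidate
    }
```

Responses that fail to parse are natural candidates for removal in the filtering step.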

In our research, we analyzed several RAG pipeline variants, including closed-book (no knowledge from the document corpus is provided to the LLM), oracle (the exam taker has access to the specific document used to generate the question-and-answer pair, in addition to the question itself and all possible candidate answers), and classical retrieval models such as MultiQA embeddings, Siamese network embeddings, and BM25. Our evaluations also extended to different scales of language models, from 7 billion parameters to 70 billion, to understand the impact of model scale on performance.

To demonstrate the practical utility of this methodology, we deployed it across a wide range of domains. These include Amazon Web Services (AWS) DevOps, where troubleshooting guides for cloud-based services test the models’ operational effectiveness; arXiv abstracts, which challenge the models’ ability to parse and generate insights from dense scientific texts; StackExchange questions, which probe the models’ responsiveness and accuracy; and SEC filings, where the complexity of financial reporting tests the models’ capacity to extract nuanced information from structured corporate documents. This multi-domain approach not only enhances the robustness of our evaluations but also ensures that our models are versatile and reliable across various real-world applications.

Evaluating the exam generation model

The following figure shows granular results of our evaluation method for the task of AWS DevOps troubleshooting. We report accuracy for different retrieval approaches and LLM sizes, on a percentage scale. Labels around the perimeter show the AWS resources we’re using. Colors correspond to different retrieval approaches (Oracle, DPRV2, MultiQA, ClosedBook), and solid and broken lines correspond to different base LLM sizes (7B, 13B, and 70B). For instance, we observe that a small model such as Mistral-7B with MultiQA embeddings has an accuracy of around 80% for the AWS resource Relational Database Service (RDS).

A comparison of several different models, at a range of sizes, on the task of DevOps troubleshooting for eight different AWS resources.

Our experiments yielded four key findings. First, there’s no one-size-fits-all solution; the optimal choice of retrieval method, and to a lesser extent LLM, is typically task dependent. For example, in tasks such as SEC filings and arXiv abstracts, BM25 outperforms MultiQA and Siamese network embeddings, indicating that sparse retrieval is more effective than dense retrieval on these tasks. This could be because such tasks often contain easily identifiable terms (e.g., AWS service names in AWS DevOps) that can be retrieved with keyword search, while other tasks, such as StackExchange, mostly contain common words.

Second, the right choice of retrieval method can lead to greater performance improvements than simply using larger LLMs. For instance, in SEC filings, we observed a greater performance gain from switching from Siamese network embeddings to DPRV2 than from switching to larger LLMs.

Third, for tasks involving closed-source knowledge, the accuracy bottleneck is typically the LLM rather than the retrieval method. Finally, a poorly aligned retriever component can result in worse accuracy than having no retrieval at all.

Exam enhancements through item response theory

Integrating item response theory (IRT) into our process has significantly improved the quality of the exams. IRT models the likelihood of a correct response based on characteristics of a question and the capabilities of a model. It uses three factors — difficulty, discrimination, and guessing chance — to create exams that more accurately reflect and predict model performance.

IRT posits that a model’s probability of correctly answering a question is correlated with a latent variable known as ability, and it provides a method for estimating the value of that variable. As such, it offers a way to quantify a model’s ability level.
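To make the three factors and the latent ability concrete, here is a minimal sketch using the standard three-parameter logistic (3PL) form of IRT. The exact parameterization and fitting procedure in the paper may differ; the function names, the grid search, and the ability range of -4 to 4 are assumptions chosen for illustration.

```python
import math


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability of a correct answer.

    theta: ability of the exam taker; a: discrimination; b: difficulty;
    c: guessing chance (the floor on the probability of a correct answer).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, items, grid=None):
    """Maximum-likelihood ability estimate by grid search.

    responses: list of 0/1 outcomes, one per question.
    items: list of (a, b, c) tuples, one per question, assumed already known.
    """
    if grid is None:
        grid = [i / 10 for i in range(-40, 41)]  # -4.0 to 4.0 in steps of 0.1

    def log_likelihood(theta):
        total = 0.0
        for r, (a, b, c) in zip(responses, items):
            # Clamp to avoid log(0) at extreme abilities.
            p = min(max(p_correct(theta, a, b, c), 1e-12), 1.0 - 1e-12)
            total += math.log(p) if r else math.log(1.0 - p)
        return total

    return max(grid, key=log_likelihood)
```

With a four-option multiple-choice exam, a guessing chance of about 0.25 is a natural starting value for `c`. In practice the item parameters and the abilities are estimated jointly from the responses of many models, which this sketch does not attempt.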

Our process begins with an initial exam assessment, identifying and removing questions that contribute minimally to discriminative insights. The exam is then refined iteratively, based on updated IRT parameters, which helps it accurately gauge nuanced model behaviors.

By continuously analyzing and adjusting exams based on IRT parameters, we have seen substantial improvements in the exams’ ability to discriminate among models. For instance, we use Fisher information to quantify the informativeness of exam questions. Fisher information measures the amount of information that an observable random variable provides about an unknown parameter, offering a way to gauge the precision of statistical estimators in parameter estimation theory.
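For the standard 3PL model, the Fisher information of a single question at ability theta has a closed form, sketched below. Averaging it over questions to score an exam mirrors the "average Fisher information" reported in the figures, but the function names and the choice of averaging are assumptions here rather than the paper's exact definition.

```python
import math


def item_information(theta: float, a: float, b: float, c: float) -> float:
    """Fisher information of one 3PL item at ability theta:

    I(theta) = a^2 * (Q / P) * ((P - c) / (1 - c))^2, with Q = 1 - P.
    """
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    q = 1.0 - p
    return a * a * (q / p) * ((p - c) / (1.0 - c)) ** 2


def exam_information(theta: float, items) -> float:
    """Average item information over an exam; items is a list of (a, b, c)."""
    return sum(item_information(theta, a, b, c) for a, b, c in items) / len(items)
```

Two properties follow directly from the formula: information grows with the square of the discrimination `a`, and a nonzero guessing chance `c` lowers it, since lucky guesses make a correct answer less telling about ability.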

During iterative improvements for the arXiv task, the Fisher information function consistently showed progress, marking a considerable enhancement of the exams' capacity to differentiate model capabilities. This iterative process ensures that each new version of the exam is more informative than the last and effectively evaluates the RAG model’s abilities.
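One pruning step of such an iterative process might look like the sketch below. It assumes that per-question information values have already been computed on a grid of ability levels; the `prune_least_informative` name, the mean-information ranking, and the 10% drop fraction are illustrative assumptions, not the paper's exact criterion. Refitting the IRT parameters between steps is outside the sketch.

```python
def prune_least_informative(info_by_question: dict[str, list[float]],
                            drop_fraction: float = 0.1) -> list[str]:
    """Return the ids of the questions kept after one pruning step.

    info_by_question maps a question id to its Fisher information values
    evaluated at several ability levels. The questions with the lowest mean
    information are dropped; the original question order is preserved.
    """
    mean_info = {q: sum(v) / len(v) for q, v in info_by_question.items()}
    ranked = sorted(mean_info, key=mean_info.get)  # least informative first
    n_drop = int(len(ranked) * drop_fraction)      # rounds down
    dropped = set(ranked[:n_drop])
    return [q for q in info_by_question if q not in dropped]
```

Because the dropped questions have the lowest mean information, the average information of the remaining exam under the same parameters cannot decrease, which is the direction of change the exam information curves are meant to show.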

Evaluating the generated exams

To further enhance the assessment of RAG models, we categorize exam questions using both semantic analysis and Bloom’s revised taxonomy, devised by the University of Chicago psychologist Benjamin Bloom. Bloom’s taxonomy helps classify questions by cognitive complexity — from basic recall to analytical tasks — enabling structured evaluation of model capabilities.

Different levels in Bloom's taxonomy differentiate between the knowledge dimension (factual, conceptual, procedural, and meta-cognitive) and the cognitive-process dimension (remember, understand, apply, analyze, evaluate, and create). Additionally, we classify questions semantically by identifying keywords like “what” and “which.” These additional classifications allow us to assess how well models perform at different ability levels.
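The keyword-based semantic classification can be sketched in a few lines. The keyword list below contains only the four interrogatives named in this post, and the first-match rule and the "other" fallback are assumptions for illustration. Classification by Bloom's taxonomy is a separate step that is not shown.

```python
import re

# Interrogatives mentioned in the post; a real keyword list may be longer.
QUESTION_WORDS = ("what", "which", "when", "how")


def semantic_category(question: str) -> str:
    """Return the first interrogative keyword found in the question, else 'other'."""
    for word in re.findall(r"[a-z]+", question.lower()):
        if word in QUESTION_WORDS:
            return word
    return "other"
```

Grouping questions by this label makes it possible to average a per-question metric, such as Fisher information, within each semantic category.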

Average Fisher information for each Bloom’s-taxonomy category (left) and semantic category (right) for the StackExchange task.

The figure above presents the average Fisher information value for each Bloom category (left) and semantic category (right) for the StackExchange task. For this specific task, we observe that “evaluating” and “understanding” are the most discriminative dimensions in Bloom’s taxonomy across different ability levels, while “remembering” is the least discriminative.

In the semantic categories, we observe that “what” and “which” were the most discriminative terms for lower ability levels, and “when” discriminated more at higher ability levels. One interpretation is that “what” and “how” questions tend to be more factual and syntax-based in the StackExchange domain, so at lower ability levels, RAG struggles more with these types of questions.

The following figure illustrates the maximization process for the arXiv task as the exam and IRT estimation evolve. We show the results for three incremental steps. We observe a 0.05 increase in Fisher information even with a single iteration. This progress reaches a 0.1 increase in the subsequent steps.

The maximization process, as the exam and IRT estimation evolve, for the arXiv task.

To expand our approach beyond Q&A applications, our future research will focus on domains such as summarization, translation, and sentiment analysis. We are also addressing the complex task of meta-evaluation, comparing and refining our evaluation methods to account for the multidimensional nature of LLM performance. Additionally, we will continuously update our methodologies to accommodate the rapid evolution of LLM technology, ensuring robust and comprehensive assessment of emerging models.

Acknowledgments: Laurent Callot

Research areas

Related content

US, CA, San Francisco
If you are interested in this position, please apply on Twitch's Career site About Us: Twitch is the world’s biggest live streaming service, with global communities built around gaming, entertainment, music, sports, cooking, and more. It is where thousands of communities come together for whatever, every day. We’re about community, inside and out. You’ll find coworkers who are eager to team up, collaborate, and smash (or elegantly solve) problems together. We’re on a quest to empower live communities, so if this sounds good to you, see what we’re up to on LinkedIn and Twitter, and discover the projects we’re solving on our Blog. Be sure to explore our Interviewing Guide to learn how to ace our interview process. About the Role: We are looking for an Applied Scientist to solve challenging and open-ended problems in the domain of recommendations, search, ranking and information retrieval. As an Applied Scientist on Twitch's Community team, you will use ML to help viewers find streamers and communities they’ll love. You will collaborate with a team of passionate scientists and engineers to develop these models and put them into production, where they can help Twitch's creators and viewers succeed and build communities. You will report to the Applied Science Manager on the Community Discovery Team. This position is located in San Francisco, CA. You Will: - Develop and Productionize ML algorithms for recommendations, ranking and search problems that can improve discovery on Twitch. - Collaborate with our Product and Engineering teams to work backwards from customer discovery problems, to determine the ML solution (algorithm and pipeline) to have the biggest impact on our user base in the real world. - Participate in the scientific community at Twitch, Amazon, and the broader ML and risk community. Perks - Medical, Dental, Vision & Disability Insurance - 401(k) - Maternity & Parental Leave - Flexible PTO - Amazon Employee Discount
US, WA, Bellevue
We are building a world-class last mile delivery ecosystem with Amazon Flex as a cornerstone of this strategy. Amazon Flex works directly with independent contractors, to make deliveries to our customers. With Amazon Flex, delivery partners are their own boss, build their own schedule, and choose from different types of delivery opportunities (e.g. Amazon Fresh, Whole Foods Market, and Amazon Logistics). Amazon Flex is powered by a mobile app that works in sync with our advanced systems and processes, allowing delivery partners to secure delivery offers, track their delivery progress, and more. Economists at Amazon Flex partner closely with senior management, business stakeholders, scientists and engineers, and economist leadership to solve key business problems including pricing, promotions, offer optimization, recruiting, capacity planning, and beyond. Amazon Flex Economists build econometric models using our world class data systems and apply approaches from a variety of skillsets – applied macro/time series, applied micro, econometric theory, empirical IO, empirical labor, or related fields are all highly valued skillsets at Amazon. You will work in a fast moving environment to solve business problems as a member of a cross-functional team that supports all of Amazon Last Mile Delivery Tech. You will be expected to develop techniques that apply econometrics to large data sets, address quantitative problems, and contribute to the design of automated systems across the business.
US, GA, Atlanta
Machine learning (ML) has been strategic to Amazon from the early years. We are pioneers in areas such as recommendation engines, product search, eCommerce fraud detection, and large-scale optimization of fulfillment center operations. The Generative AI team helps AWS customers accelerate the use of Generative AI to solve business and operational challenges and promote innovation in their organization. As an applied scientist, you are proficient in designing and developing advanced ML models to solve diverse challenges and opportunities. You will be working with terabytes of text, images, and other types of data to solve real- world problems. You'll design and run experiments, research new algorithms, and find new ways of optimizing risk, profitability, and customer experience. We’re looking for talented scientists capable of applying ML algorithms and cutting-edge deep learning (DL) and reinforcement learning approaches to areas such as drug discovery, customer segmentation, fraud prevention, capacity planning, predictive maintenance, pricing optimization, call center analytics, player pose estimation, event detection, and virtual assistant among others. AWS Sales, Marketing, and Global Services (SMGS) is responsible for driving revenue, adoption, and growth from the largest and fastest growing small- and mid-market accounts to enterprise-level customers including public sector. The AWS Global Support team interacts with leading companies and believes that world-class support is critical to customer success. AWS Support also partners with a global list of customers that are building mission-critical applications on top of AWS services. 
Key job responsibilities The primary responsibilities of this role are to: Design, develop, and evaluate innovative ML models to solve diverse challenges and opportunities across industries Interact with customer directly to understand their business problems, and help them with defining and implementing scalable Generative AI solutions to solve them Work closely with account teams, research scientist teams, and product engineering teams to drive model implementations and new solution A day in the life N/A About the team Diverse Experiences AWS values diverse experiences. Even if you do not meet all of the qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying. Why AWS? Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform. We pioneered cloud computing and never stopped innovating — that’s why customers from the most successful startups to Global 500 companies trust our robust suite of products and services to power their businesses. Inclusive Team Culture Here at AWS, it’s in our nature to learn and be curious. Our employee-led affinity groups foster a culture of inclusion that empower us to be proud of our differences. Ongoing events and learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences, inspire us to never stop embracing our uniqueness. Mentorship & Career Growth We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional. Work/Life Balance We value work-life harmony. 
Achieving success at work should never come at the expense of sacrifices at home, which is why flexible work hours and arrangements are part of our culture. When we feel supported in the workplace and at home, there’s nothing we can’t achieve in the cloud.
US, WA, Bellevue
We are a part of Amazon Alexa Devices organization with the mission “delight customers through contextual and personalized proactive experiences that keep customers informed, engaged, and productive without cognitive burden”. We are developing an advanced system using Large Language Model (LLM) technologies to deliver engaging, intuitive, and adaptive content recommendations across all Amazon surfaces. We aim to facilitate seamless reasoning and customer experiences, surpassing the capabilities of previous machine learning models. We are looking for a passionate, talented, and resourceful Applied Scientist in the field of Natural Language Processing (NLP), Recommender Systems and/or Information Retrieval, to invent and build scalable solutions for a state-of-the-art context-aware speech assistant. A successful candidate will have strong machine learning background and a desire to push the envelope in one or more of the above areas. The ideal candidate would also enjoy operating in dynamic environments, be self-motivated to take on challenging problems to deliver big customer impact, shipping solutions via rapid experimentation and then iterating on user feedback and interactions. Key job responsibilities As an Applied Scientist on the team, you will collaborate with other applied scientists and engineers to develop novel algorithms to enable timely, relevant and delightful recommendations and conversations. Your work will directly impact our customers in the form of products and services that make use of various machine learning, deep learning and language model technologies. You will leverage Amazon’s heterogeneous data sources and large-scale computing resources to accelerate advances in the state of art.
US, WA, Bellevue
The Fulfillment by Amazon (FBA) team is looking for a passionate, curious, and creative Research Scientist, with expertise and experience in operations research, operations management, supply chains, and revenue management, to join our top-notch cross-domain FBA science team. As a research scientist you will be responsible for designing and implementing cutting edge optimization models and machine learning models and building automated inventory management system to solve key challenges facing the worldwide FBA Seller business, including 1) improving FBA Seller inventory efficiency, 2) efficiently balancing the supply and demand of FBA Seller capacity, 3) closing worldwide selection gap by enabling global selling profitability, and 4) driving out costs across the FBA supply chain to spin the flywheel. Unlike many companies who buy existing off-the-shelf planning systems, we are responsible for studying, designing, and building systems to suit Amazon’s needs. Our team members have an opportunity to be on the forefront of thought leadership by working on some of the most difficult problems in the industry with some of the best product managers, research scientists/statisticians/economists and software developers in the business. This role will work with other senior and principal scientists, and partner with engineering and product teams to integrate scientific work into production systems. 
Key job responsibilities • Interact with engineering, operations, science and business teams to develop an understanding and domain knowledge of processes, system structures, and business requirements • Apply domain knowledge and business judgment to identify opportunities and quantify the impact aligning research direction to business requirements and make the right judgment on research project prioritization • Develop scalable mathematical models to derive optimal or near-optimal solutions to existing and new inventory planning challenges • Create prototypes and simulations to test devised solutions • Advocate technical solutions to business stakeholders, engineering teams, as well as executive level decision makers • Work closely with engineers to integrate prototypes into production systems • Create policy evaluation methods to track the actual performance of devised solutions in production systems, identify areas with potential for improvement and work with internal teams to improve the solution with new features A day in the life As a Research Scientist, you will solve real world large inventory problems by analyzing large amounts of business data, defining new metrics and business cases, designing simulations and experiments, applying supply chain modeling techniques, creating optimization models, and collaborating with teammates in business, software, and research. The successful candidate has solid research experience in Operations Research preferably with focus on Operations Management or other closely related areas or in area of Machine Learning. He or she will lead the research where we are responsible for developing solutions to better manage and optimize worldwide FBA inventory capacity, while providing the best experience to our Sellers to growth their business. 
About the team Fulfillment by Amazon (FBA) is a service that allows sellers to outsource order fulfillment to Amazon, allowing sellers to leverage Amazon’s world-class facilities to provide customers Prime delivery promise. Sellers gain access to Prime members worldwide, see their sales lift, and are free to focus their time and resources on what they do best while Amazon manages fulfillment. Over the last several years, sellers have enjoyed strong business growth with FBA shipping more than half of all products offered by Amazon. FBA focuses on helping sellers with automating and optimizing the third-party supply chain. FBA sellers leverage Amazon’s expertise in machine learning, optimization, data analytics, econometrics, and market design to deliver the best inventory management experience to sellers. We work full-stack, from foundational backend systems to future-forward user interfaces. Our culture is centered on rapid prototyping, rigorous experimentation, and data-driven decision-making.
US, WA, Seattle
Are you passionate about delighting hundreds of millions of customers and building the best search experience to help customers make well-informed purchase decisions on Amazon? Are you passionate about building the next generation product shopping and search experience? The Search and Discover experience on Amazon is central to every customer’s shopping mission and purchasing journey. Amazon Search is looking for a self-driven, customer obsessed, and seasoned research scientist to drive the overall search customer insights efforts and measure customer perceptions for Amazon Search. If you are passionate about using user research & customer insights to influence the future direction of Amazon Search and building a small but top notch user research science team, this is a job for you. In this highly visible role, you will work across cross-functional teams and collaborate with partners to drive user research planning, align research goals to the product roadmap, and own user research execution and final deliverables to make sure that we are always positioned to exceed customer expectations. You will present the search customer insights to various stakeholders including senior executives. Key job responsibilities * Design and conduct significantly complex research studies that impact long-term product strategy and the future of customer experience. * Build customer perception measurements for Amazon search experience and develop the methods to correlate customer perception with search experience improvements. * Define search customer insights research strategy, own the research roadmap and prioritize research opportunities across different areas. * Identify customer segments and latent customer needs, define and improve methodologies, data collection, analysis/synthesis, and identify opportunities to improve customer experience. 
* Manage multiple customer insights research project execution, prioritization, and ensure research projects timely delivery at the highest quality levels. * Adapt and/or create new customer insights research methodologies and workflows to support product goals at scale and work effectively with agencies and vendors. * Work cross functionally and collaborate with technical product managers, technical program managers, UX designers, science, and engineering teams to proactively plan research and align research goals to the product roadmap. * Work with data analysts/data scientists to correlate qualitative research with quantitative data analysis, and interpret complicated data across quantitive and behavioral analysis. * Own customer insights research results and prioritization communication with all stakeholders including senior executives. * Build, manage, and grow a small team of research scientists. About the team Our team operate in a friendly, fast-paced, and diverse and inclusive work environment. We are driven by the excitement of inventing products, building technologies, and providing services that change lives. We embrace new ways of doing things, make decisions quickly, and are not afraid to fail. We have the scope, benefits, and support of a large company and the spirit and heart of a small startup. At Amazon, our mission is to be Earth’s most customer-centric company. Our actions, goals, projects, programs, and inventions begin and end with the customer top of mind.
US, WA, Seattle
Amazon Advertising is one of Amazon's fastest growing and most profitable businesses. Amazon's advertising portfolio helps merchants, retail vendors, and brand owners succeed via native advertising, which grows incremental sales of their products sold through Amazon. The primary goals are to help shoppers discover new products they love, be the most efficient way for advertisers to meet their business objectives, and build a sustainable business that continuously innovates on behalf of customers. Our products and solutions are strategically important to enable our Retail and Marketplace businesses to drive long-term growth. We deliver billions of ad impressions and millions of clicks and break fresh ground in product and technical innovations every day! The Creative X org within Amazon Advertising aims to democratize access to high-quality creative assets, including copy, images and video, by building and productizing generative AI-driven tools for advertisers. We are investing in latent-diffusion and DiT models, LLMs, computer vision, reinforcement learning, and image + video synthesis. The solutions we develop will be deployed for use by self-service advertisers and agencies, as well as available to premium brands that advertise on Amazon. We are seeking an experienced science leader who is adept at a variety of skills; especially in generative AI, computer vision, and large language models that will accelerate our plans to generate high-quality creatives on behalf of advertisers. The right candidate will be an inventor at heart, provide science leadership, establish the right direction and vision, build team mechanisms, foster the spirit of collaboration and innovation within the org, and execute against a roadmap. The leader will provide both technical direction as well as manage a sizable team of scientists. 
They will need to be adept at recruiting, launching AI models into production, writing vision/direction documents, and building team mechanisms that will foster innovation and execution. Key job responsibilities * Drive end-to-end applied science projects that have a high degree of ambiguity, scale, complexity * Provide technical / science leadership related to computer vision, large language models, and generative image + video. * Research new and innovative machine learning approaches. * Recruit high performing Applied Scientists to the team and provide mentorship. * Establish team mechanisms, including team building, planning, and document reviews.
CA, BC, Vancouver
Technology is giving the beauty industry a makeover! Are you interested to disrupt and redefine the way customers buy Beauty products online? Are you interested in using the latest advances in machine learning, computer vision, and big-data technologies to build online customer experiences for Beauty products that can equal or even surpass an in-store experience? Amazon Beauty is reinventing the shopping experience for all beauty customers across the largest selection of brands to become the most trusted beauty destination. Beauty is unique in retail with a diverse customer set along with products that are emotional, fun, and creative. This is your chance to get in on the ground floor to build something entirely new and transform an industry! To achieve our vision, we think big and tackle technological challenges every day. We need builders and disruptors who are not afraid to innovate! Our architecture and development processes support rapid experimentation, global deployments, and self-service capabilities that allow us to scale better. We build: - Amazon scale systems: All our technology needs to work at Amazon scale, serving millions of customers with millisecond-level latency. - Immersive customer experiences: We will create elevated and immersive customer experiences that using cutting-edge UI-technologies and user-centric design patterns. - Computer Vision and augmented reality (AR) experiences: We bring exciting experiences directly to the customer's mobile phone using their cameras and combinations of computer vision and AR. - Personalization using machine learning: We use latest advances in ML and GenAI to provide better-personalized shopping experiences. - Data & analytics pipelines: Amazon is data-driven, and a robust data backbone is necessary for our systems. We build on core AWS services such as EC2, S3, DynamoDB, SageMaker, StepFunctions, etc. 
- Multi-device support: We build for all traditional surfaces - desktop browsers, mobile browsers, and mobile applications. Key job responsibilities We are looking for talented and innovation-driven scientists who are passionate about leveraging the latest advances in Generative AI, Diffusion Models, Computer Vision (CV), Graphics, AR/VR, Virtual Try-On, Image Processing, and related technologies, to solve customer problems in the Beauty space. You will have an opportunity to revolutionize the customer shopping experience across the world's most extensive catalog of beauty products. You will be directly responsible for leading the ideation, design, prototyping, development, and launch of innovative scientific solutions that address customer problem in the beauty and shopping space. You will closely partner with product managers, UX designers, engineers, and the broader Amazon scientific community to pioneer state-of-the-art solutions to extremely challenging problems in machine learning and CV. You will be our organization's Tech Evangelist and represent our organization in key internal and external AI, ML, or Vision conferences. About the team Amazon Beauty Tech is a key and essential part of the Consumables organization and North America Stores. We are a passionate group of engineers, scientists, product managers, and designers who drive technological innovation to improve the customer shopping experience. We have a startup-like work culture where innovation is encouraged; we are never afraid to propose big ideas for fear of failing!