Automated evaluation of RAG pipelines with exam generation

The fight against hallucination in retrieval-augmented-generation models starts with a method for accurately assessing it.

In the swiftly evolving domain of large language models (LLMs), the accurate evaluation of retrieval-augmented-generation (RAG) models is paramount. In this blog, we introduce a pioneering methodology that employs an automated exam generation process, enhanced by item response theory (IRT), to evaluate the factual accuracy of RAG models on specific tasks. Our approach is not only robust and interpretable but also cost efficient, strategically identifying model strengths and refining exams to optimize their evaluative utility. We describe our methodology in a paper we will present in July at the 2024 International Conference on Machine Learning (ICML).

Exam generation process

RAG is a method for handling natural-language queries by retrieving relevant documents and using text from them to seed the response generated by an LLM. The expectation is that factual assertions from reliable documents will curb the LLM’s tendency to “hallucinate”, or generate reasonable-sounding but false sentences.

To evaluate a RAG model on a particular task, we use an LLM to generate multiple-choice questions from a task-specific knowledge corpus. Our method is agnostic to the retriever and generative model used in both the RAG system and the exam generation task.

RAG diagram.png
Summary of the proposed exam generation, evaluation, and iterative-improvement processes.

Our approach has two steps. For each document in the knowledge corpus, we use an LLM and several prompt-engineering strategies to create candidate questions. Then we use several natural-language-processing filters to remove low-quality questions along various axes, such as length, incorrectness, and self-containment.

We note an interesting asymmetry: given a document corpus, it is relatively easy for an LLM to generate a question and the correct answer, as the content of both is contained in the prompt. However, it is considerably more difficult to create high-quality incorrect answers, commonly referred to as discriminators.

To filter out degenerate questions, we use the Jaccard similarity coefficient and embedding-based similarity metrics.

Here is the prompt that we used for exam generation:

Human: Here is some documentation from {task_domain}: {documentation}.\n
From this generate a difficult multi-form question for an exam.
It should have 4 candidates, 1 correct answer, and explanations.

Syntax should be Question: {question}\n
A){candidate A}\n
B){candidate B}\n
C){candidate C}\n
D){candidate D}

Correct Answer: {correct answer}\n
### Assistant:"

In our research, we analyzed several RAG pipeline variants, including closed-book (no knowledge from the document corpus is provided to the LLM), oracle (the exam taker has access to the specific document used to generate the question-and-answer pair, in addition to the question itself and all possible candidate answers), and classical retrieval models such as MultiQA embeddings, Siamese network embeddings, and BM25. Our evaluations also extended to different scales of language models, from 7 billion parameters to 70 billion, to understand the impact of model scale on performance.

To demonstrate the practical utility of this methodology, we deployed it across a wide range of domains. These include Amazon Web Services (AWS) DevOps, where troubleshooting guides for cloud-based services tests the models' operational effectiveness; arXiv abstracts, which challenge the models' ability to parse and generate insights from dense scientific texts; StackExchange questions, which probe the models' responsiveness and accuracy; and SEC filings, where the complexity of financial reporting tests the models’ capacity to extract nuanced information from structured corporate documents. This multi-domain approach not only enhances the robustness of our evaluations but also ensures that our models are versatile and reliable across various real-world applications.

Evaluating the exam generation model

The following figure shows granular results of our evaluation method for the task of AWS DevOps troubleshooting. We report accuracy for different retrieval approaches and retriever sizes, on a percentage scale. Labels on the diameter show the AWS resources we’re using. Colors correspond to different retrieval approaches (Oracle, DPRV2, MultiQA, ClosedBook), and solid and broken lines correspond to different base LLM sizes (7B, 13B, and 70B). For instance, we observe that a small model such as Mistral-7B with MultiQA embeddings has an accuracy of around 80% for the AWS resource Relational Database Service (RDS).

Granular results of our exam evaluation for the task of AWS DevOps troubleshooting.png
A comparison of several different models, at a range of sizes, on the task of DevOps troubleshooting for eight different AWS resources.

Our experiments yielded four key findings. First, there’s no one-size-fits-all solution; the optimal choice of retrieval method, and to a lesser extent LLM, is typically task dependent. For example, in tasks such as SEC filings and arXiv abstracts, BM25 outperforms MultiQA and Siamese network embeddings, indicating that sparse retrieval is generally more effective than dense retrieval. This could be because such tasks often contain easily identifiable terms (e.g., AWS service names in AWS DevOps) that can be retrieved with keyword search, while other tasks, such as StackExchange, mostly contain common words.

Second, the right choice of retrieval method can lead to greater performance improvements than simply using larger LLMs. For instance, in SEC filings, we observed a greater performance gain from switching from Siamese network embeddings to DPRV2 than from switching to larger LLMs.

Third, for tasks involving closed-source knowledge, the accuracy bottleneck is typically the LLM rather than the retrieval method. Finally, a poorly aligned retriever component can result in worse accuracy than having no retrieval at all.

Exam enhancements through item response theory

Integrating item response theory (IRT) into our process has significantly improved the quality of the exams. IRT models the likelihood of a correct response based on characteristics of a question and the capabilities of a model. It uses three factors — difficulty, discrimination, and guessing chance — to create exams that more accurately reflect and predict model performance.

IRT posits that a model’s probability of correctly answering a question is correlated with a latent variable known as ability, and it provides a method for estimating the value of that variable. As such, it offers a way to quantify a model’s ability level.

Our process begins with an initial exam assessment, identifying and removing questions that contribute minimally to discriminative insights. The exam is then refined iteratively, based on updated IRT parameters, which helps it accurately gauge nuanced model behaviors.

By continuously analyzing and adjusting exams based on IRT parameters, we have seen substantial improvements in the exams’ ability to discriminate among models. For instance, we use Fisher information to quantify the informativeness of exam questions. Fisher information measures the amount of information that an observable random variable provides about an unknown parameter, offering a way to gauge the precision of statistical estimators in parameter estimation theory.

During iterative improvements for the arXiv task, the Fisher information function consistently showed progress, marking a considerable enhancement of the exams' capacity to differentiate model capabilities. This iterative process ensures that each new version of the exam is more informative than the last and effectively evaluates the RAG model’s abilities.

Evaluating the generated exams

To further enhance the assessment of RAG models, we categorize exam questions using both semantic analysis and Bloom’s revised taxonomy, devised by the University of Chicago psychologist Benjamin Bloom. Bloom’s taxonomy helps classify questions by cognitive complexity — from basic recall to analytical tasks — enabling structured evaluation of model capabilities.

Different levels in Bloom's taxonomy differentiate between the knowledge dimension (factual, conceptual, procedural, and meta-cognitive) and the cognitive-process dimension (remember, understand, apply, analyze, evaluate, and create). Additionally, we classify questions semantically by identifying keywords like “what” and “which.” These additional classifications allow us to assess how well models perform at different ability levels.

Bloom's Taxonomy.png
Average Fisher information for each category in Bloom’s taxonomy category (left) and semantic category (right) for the StackExchange task.

The above two figures present the average Fisher information value for each Bloom category (left) and semantic category (right) for the StackExchange task. For this specific task, we observe that “evaluating” and “understanding” are the most discriminate dimensions in Bloom’s taxonomy across different ability levels, while “remembering” is the least discriminatory.

On the semantic categories, we observe that “what” and “which” were the most discriminatory terms for lower ability levels, and “when” discriminated more at higher ability levels. One interpretation is that “what” and “how” questions tend to be more factual and syntax-based in the StackExchange domain, so at lower ability levels, RAG struggles more with these genres of questions.

The following figure illustrates the maximization process for the arXiv task as the exam and IRT estimation evolve. We show the results for three incremental steps. We observe a 0.05 increase in Fisher information even with a single iteration. This progress reaches a 0.1 increase in the subsequent steps.

Exam Information Curve.png
The maximization process, as the exam and IRT estimation evolve, for the task of generating abstracts for arXiv papers.

To expand our approach beyond Q&A applications, our future research will focus on domains such as summarization, translation, and sentiment analysis. We are also addressing the complex task of meta-evaluation, comparing and refining our evaluation methods to account for the multidimensional nature of LLM performance. Additionally, we will continuously update our methodologies to accommodate the rapid evolution of LLM technology, ensuring robust and comprehensive assessment of emerging models.

Acknowledgments: Laurent Callot

Research areas

Related content

US, CA, Santa Clara
Amazon is looking for a passionate, talented, and inventive Applied Scientist with a strong machine learning background to help build industry-leading language technology. Our mission is to provide a delightful experience to Amazon’s customers by pushing the envelope in Natural Language Processing (NLP), Generative AI, Large Language Model (LLM), Natural Language Understanding (NLU), Machine Learning (ML), Retrieval-Augmented Generation, Responsible AI, Agent, Evaluation, and Model Adaptation. As part of our AI team in Amazon AWS, you will work alongside internationally recognized experts to develop novel algorithms and modeling techniques to advance the state-of-the-art in human language technology. Your work will directly impact millions of our customers in the form of products and services, as well as contributing to the wider research community. You will gain hands on experience with Amazon’s heterogeneous text and structured data sources, and large-scale computing resources to accelerate advances in language understanding. The Science team at AWS Bedrock builds science foundations of Bedrock, which is a fully managed service that makes high-performing foundation models available for use through a unified API. We are adamant about continuously learning state-of-the-art NLP/ML/LLM technology and exploring creative ways to delight our customers. In our daily job we are exposed to large scale NLP needs and we apply rigorous research methods to respond to them with efficient and scalable innovative solutions. At AWS Bedrock, you’ll experience the benefits of working in a dynamic, entrepreneurial environment, while leveraging AWS resources, one of the world’s leading cloud companies and you’ll be able to publish your work in top tier conferences and journals. We are building a brand new team to help develop a new NLP service for AWS. You will have the opportunity to conduct novel research and influence the science roadmap and direction of the team. Come join this greenfield opportunity! About the team AWS Bedrock Science Team is a part of AWS Utility Computing AWS Utility Computing (UC) provides product innovations — from foundational services such as Amazon’s Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2), to consistently released new product innovations that continue to set AWS’s services and features apart in the industry. As a member of the UC organization, you’ll support the development and management of Compute, Database, Storage, Internet of Things (Iot), Platform, and Productivity Apps services in AWS, including support for customers who require specialized security solutions for their cloud services. About the team Diverse Experiences AWS values diverse experiences. Even if you do not meet all of the qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying. Why AWS? Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform. We pioneered cloud computing and never stopped innovating — that’s why customers from the most successful startups to Global 500 companies trust our robust suite of products and services to power their businesses. Inclusive Team Culture Here at AWS, it’s in our nature to learn and be curious. Our employee-led affinity groups foster a culture of inclusion that empower us to be proud of our differences. Ongoing events and learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences, inspire us to never stop embracing our uniqueness. Mentorship & Career Growth We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional. Work/Life Balance We value work-life harmony. Achieving success at work should never come at the expense of sacrifices at home, which is why we strive for flexibility as part of our working culture. When we feel supported in the workplace and at home, there’s nothing we can’t achieve in the cloud.
GB, MLN, Edinburgh
We’re looking for a Machine Learning Scientist in the Personalization team for our Edinburgh office experienced in generative AI and large models. You will be responsible for developing and disseminating customer-facing personalized recommendation models. This is a hands-on role with global impact working with a team of world-class engineers and scientists across the Edinburgh offices and wider organization. You will lead the design of machine learning models that scale to very large quantities of data, and serve high-scale low-latency recommendations to all customers worldwide. You will embody scientific rigor, designing and executing experiments to demonstrate the technical efficacy and business value of your methods. You will work alongside a science team to delight customers by aiding in recommendations relevancy, and raise the profile of Amazon as a global leader in machine learning and personalization. Successful candidates will have strong technical ability, focus on customers by applying a customer-first approach, excellent teamwork and communication skills, and a motivation to achieve results in a fast-paced environment. Our position offers exceptional opportunities for every candidate to grow their technical and non-technical skills. If you are selected, you have the opportunity to make a difference to our business by designing and building state of the art machine learning systems on big data, leveraging Amazon’s vast computing resources (AWS), working on exciting and challenging projects, and delivering meaningful results to customers world-wide. Key job responsibilities Develop machine learning algorithms for high-scale recommendations problems. Rapidly design, prototype and test many possible hypotheses in a high-ambiguity environment, making use of both quantitative analysis and business judgement. Collaborate with software engineers to integrate successful experimental results into large-scale, highly complex Amazon production systems capable of handling 100,000s of transactions per second at low latency. Report results in a manner which is both statistically rigorous and compellingly relevant, exemplifying good scientific practice in a business environment.
US, MA, Westborough
Amazon is looking for talented Postdoctoral Scientists to join our Fulfillment Technology and Robotics team for a one-year, full-time research position. The Innovation Lab in BOS27 is a physical space in which new ideas can be explored, hands-on. The Lab provides easier access to tools and equipment our inventors need while also incubating critical technologies necessary for future robotic products. The Lab is intended to not only develop new technologies that can be used in future Fulfillment, Technology, and Robotics products but additionally promote deeper technical collaboration with universities from around the world. The Lab’s research efforts are focused on highly autonomous systems inclusive of robotic manipulation of packages and ASINs, multi-robot systems utilizing vertical space, Amazon integrated gantries, advancements in perception, and collaborative robotics. These five areas of research represent an impactful set of technical capabilities that when realized at a world class level will unlock our desire for a highly automated and adaptable fulfillment supply chain. As a Postdoctoral Scientist you will be developing a coordinated multi-agent system to achieve optimized trajectories under realistic constraints. The project will explore the utility of state-of-the-art methods to solve multi-agent, multi-objective optimization problems with stochastic time and location constraints. The project is motivated by a new technology being developed in the Innovation Lab to introduce efficiencies in the last-mile delivery systems. Key job responsibilities In this role you will: * Work closely with a senior science advisor, collaborate with other scientists and engineers, and be part of Amazon’s diverse global science community. * Publish your innovation in top-tier academic venues and hone your presentation skills. * Be inspired by challenges and opportunities to invent new techniques in your area(s) of expertise.
US, WA, Seattle
Here at Amazon, we embrace our differences. We are committed to furthering our culture of diversity and inclusion of our teams within the organization. We’re working on the future. If you are seeking an iterative fast-paced environment where you can drive innovation, apply state-of-the-art technologies to solve extreme-scale real world delivery challenges, and provide visible benefit to end-users, this is your opportunity. Come work on the Amazon Prime Air team We're looking for outstanding scientists and engineers who combine superb technical, research and analytical capabilities with a demonstrated ability architect complex hardware, software, embedded, mobile and mission-critical systems to ensure they can be found compliant to DO-178C. This person must be comfortable working with a team of top-notch software, hardware and applied science Engineers. We’re looking for people who innovate and love solving hard problems. You will work hard, have fun, and of course, make history! Export License Control This position may require a deemed export control license for compliance with applicable laws and regulations. Placement is contingent on Amazon’s ability to apply for and obtain an export control license on your behalf. Key job responsibilities The manager of the High Fidelity Modeling group will lead a group of engineers and scientists that provide computational fluid dynamics modeling, as well as aerodynamic and other surrogate models used in flight simulation of the Prime Air drones.
US, WA, Seattle
Prime Video is a first-stop entertainment destination offering customers a vast collection of premium programming in one app available across thousands of devices. Prime members can customize their viewing experience and find their favorite movies, series, documentaries, and live sports – including Amazon MGM Studios-produced series and movies; licensed fan favorites; and programming from Prime Video add-on subscriptions such as Apple TV+, Max, Crunchyroll and MGM+. All customers, regardless of whether they have a Prime membership or not, can rent or buy titles via the Prime Video Store, and can enjoy even more content for free with ads. Are you interested in shaping the future of entertainment? Prime Video's technology teams are creating best-in-class digital video experience. As a Prime Video technologist, you’ll have end-to-end ownership of the product, user experience, design, and technology required to deliver state-of-the-art experiences for our customers. You’ll get to work on projects that are fast-paced, challenging, and varied. You’ll also be able to experiment with new possibilities, take risks, and collaborate with remarkable people. We’ll look for you to bring your diverse perspectives, ideas, and skill-sets to make Prime Video even better for our customers. With global opportunities for talented technologists, you can decide where a career Prime Video Tech takes you! In Prime Video READI, our mission is to automate infrastructure scaling and operational readiness. We are growing a team specialized in time series modeling, forecasting, and release safety. This team will invent and develop algorithms for forecasting multi-dimensional related time series. The team will develop forecasts on key business dimensions with optimization recommendations related to performance and efficiency opportunities across our global software environment. As a founding member of the core team, you will apply your deep coding, modeling and statistical knowledge to concrete problems that have broad cross-organizational, global, and technology impact. Your work will focus on retrieving, cleansing and preparing large scale datasets, training and evaluating models and deploying them to production where we continuously monitor and evaluate. You will work on large engineering efforts that solve significantly complex problems facing global customers. You will be trusted to operate with complete independence and are often assigned to focus on areas where the business and/or architectural strategy has not yet been defined. You must be equally comfortable digging in to business requirements as you are drilling into design with development teams and developing production ready learning models. You consistently bring strong, data-driven business and technical judgment to decisions. You will work with internal and external stakeholders, cross-functional partners, and end-users around the world at all levels. Our team makes a big impact because nothing is more important to us than delivering for our customers, continually earning their trust, and thinking long term. You are empowered to bring new technologies to your solutions. If you crave a sense of ownership, this is the place to be.
US, WA, Seattle
Amazon Advertising operates at the intersection of eCommerce and advertising, and is investing heavily in building a world-class advertising business. We are defining and delivering a collection of self-service performance advertising products that drive discovery and sales. Our products are strategically important to our Retail and Marketplace businesses driving long-term growth. We deliver billions of ad impressions and millions of clicks daily and are breaking fresh ground to create world-class products to improve both shopper and advertiser experience. With a broad mandate to experiment and innovate, we grow at an unprecedented rate with a seemingly endless range of new opportunities. The Ad Response Prediction team in Sponsored Products organization build advanced deep-learning models, large-scale machine-learning pipelines, and real-time serving infra to match shoppers’ intent to relevant ads on all devices, for all contexts and in all marketplaces. Through precise estimation of shoppers’ interaction with ads and their long-term value, we aim to drive optimal ads allocation and pricing, and help to deliver a relevant, engaging and delightful ads experience to Amazon shoppers. As the business and the complexity of various new initiatives we take continues to grow, we are looking for talented Applied Scientists to join the team. Key job responsibilities As a Applied Scientist II, you will: * Conduct hands-on data analysis, build large-scale machine-learning models and pipelines * Work closely with software engineers on detailed requirements, technical designs and implementation of end-to-end solutions in production * Run regular A/B experiments, gather data, perform statistical analysis, and communicate the impact to senior management * Establish scalable, efficient, automated processes for large-scale data analysis, machine-learning model development, model validation and serving * Provide technical leadership, research new machine learning approaches to drive continued scientific innovation * Be a member of the Amazon-wide Machine Learning Community, participating in internal and external MeetUps, Hackathons and Conferences
IN, HR, Gurugram
Do you want to join an innovative team of scientists who use machine learning and statistical techniques to create state-of-the-art solutions for providing better value to Amazon’s customers? Do you want to build and deploy advanced ML systems that help optimize millions of transactions every day? Are you excited by the prospect of analyzing and modeling terabytes of data to solve real-world problems? Do you like to own end-to-end business problems/metrics and directly impact the profitability of the company? Do you like to innovate and simplify? If yes, then you may be a great fit to join the Machine Learning team for India Consumer Businesses. Machine Learning, Big Data and related quantitative sciences have been strategic to Amazon from the early years. Amazon has been a pioneer in areas such as recommendation engines, ecommerce fraud detection and large-scale optimization of fulfillment center operations. As Amazon has rapidly grown and diversified, the opportunity for applying machine learning has exploded. We have a very broad collection of practical problems where machine learning systems can dramatically improve the customer experience, reduce cost, and drive speed and automation. These include product bundle recommendations for millions of products, safeguarding financial transactions across by building the risk models, improving catalog quality via extracting product attribute values from structured/unstructured data for millions of products, enhancing address quality by powering customer suggestions We are developing state-of-the-art machine learning solutions to accelerate the Amazon India growth story. Amazon India is an exciting place to be at for a machine learning practitioner. We have the eagerness of a fresh startup to absorb machine learning solutions, and the scale of a mature firm to help support their development at the same time. As part of the India Machine Learning team, you will get to work alongside brilliant minds motivated to solve real-world machine learning problems that make a difference to millions of our customers. We encourage thought leadership and blue ocean thinking in ML. Key job responsibilities Use machine learning and analytical techniques to create scalable solutions for business problems Analyze and extract relevant information from large amounts of Amazon’s historical business data to help automate and optimize key processes Design, develop, evaluate and deploy, innovative and highly scalable ML models Work closely with software engineering teams to drive real-time model implementations Work closely with business partners to identify problems and propose machine learning solutions Establish scalable, efficient, automated processes for large scale data analyses, model development, model validation and model maintenance Work proactively with engineering teams and product managers to evangelize new algorithms and drive the implementation of large-scale complex ML models in production Leading projects and mentoring other scientists, engineers in the use of ML techniques About the team International Machine Learning Team is responsible for building novel ML solutions that attack India first (and other Emerging Markets across MENA and LatAm) problems and impact the bottom-line and top-line of India business. Learn more about our team from https://www.amazon.science/working-at-amazon/how-rajeev-rastogis-machine-learning-team-in-india-develops-innovations-for-customers-worldwide
US, VA, Arlington
Are you passionate about programming languages, applying formal verification, program analysis, constraint-solving, and/or theorem proving to real world problems? Do you want to create products that help customers? If so, then we have an exciting opportunity for you. In this role, you will interact with internal teams and external customers to understand their requirements. You will apply your knowledge to propose innovative solutions, create software prototypes, and productize prototypes into production systems using software development tools and methodologies. In addition, you will support and scale your solutions to meet the ever growing demand of customer use. Technical Responsibilities: - Interact with various teams to develop an understanding of their security and safety requirements. - Apply the acquired knowledge to build tools find problems, or show the absence of security/safety problems. - Implement these tools through the use of SAT, SMT, and various concepts from programming languages, theorem proving, formal verification and constraint solving. - Perform analysis of the customer systems using tools developed in-house or externally provided - Create software prototypes to verify and validate the devised solutions methodologies; integrate the prototypes into production systems using standard software development tools and methodologies. Leadership Responsibilities: - Can present and defend company-wide technical decisions to the internal technical community and represent the company effectively at technical conferences. - Functional thought leader, sought after for key tech decisions. Can successfully sell ideas to an executive level decision maker. - Mentors and trains the research scientist community on complex technical issues. AWS has the most services and more features within those services, than any other cloud provider–from infrastructure technologies like compute, storage, and databases–to emerging technologies, such as machine learning and artificial intelligence, data lakes and analytics, and Internet of Things. Whether its Identity features such as access management and sign on, cryptography, console, builder & developer tools, and even projects like automating all of our contractual billing systems, AWS Platform is always innovating with the customer in mind. The AWS Platform team sustains over 750 million transactions per second. We have a formal mentor search application that lets you find a mentor that works best for you based on location, job family, job level etc. Your manager can also help you find a mentor or two, because two is better than one. In addition to formal mentors, we work and train together so that we are always learning from one another, and we celebrate and support the career progression of our team members. Key job responsibilities Technical Responsibilities: - Interact with various teams to develop an understanding of their security and safety requirements. - Apply the acquired knowledge to build tools find problems, or show the absence of security/safety problems. - Implement these tools through the use of SAT, SMT, BDDs, and various concepts from programming languages, theorem proving, formal verification and constraint solving. - Perform analysis of the customer systems using tools developed in-house or externally provided - Create software prototypes to verify and validate the devised solutions methodologies; integrate the prototypes into production systems using standard software development tools and methodologies. Leadership Responsibilities: - Can present and defend company-wide technical decisions to the internal technical community and represent the company effectively at technical conferences. - Functional thought leader, sought after for key tech decisions. Can successfully sell ideas to an executive level decision maker. - Mentors and trains the research scientist community on complex technical issues. A day in the life You will be working on cutting edge technology related to formal methods, automated reasoning, automated testing, and adjacent areas. You will work with fellow applied scientists to solve challenging problems that provide value to customers by improving the quality of software. You will have an opportunity to publish your work. Why AWS Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform. We pioneered cloud computing and never stopped innovating — that’s why customers from the most successful startups to Global 500 companies trust our robust suite of products and services to power their businesses. AWS Utility Computing (UC) provides product innovations — from foundational services such as Amazon’s Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2), to consistently released new product innovations that continue to set AWS’s services and features apart in the industry. As a member of the UC organization, you’ll support the development and management of Compute, Database, Storage, Internet of Things (Iot), Platform, and Productivity Apps services in AWS, including support for customers who require specialized security solutions for their cloud services. Inclusive Team Culture Here at AWS, it’s in our nature to learn and be curious. Our employee-led affinity groups foster a culture of inclusion that empower us to be proud of our differences. Ongoing events and learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences, inspire us to never stop embracing our uniqueness. Work/Life Balance We value work-life harmony. Achieving success at work should never come at the expense of sacrifices at home, which is why we strive for flexibility as part of our working culture. When we feel supported in the workplace and at home, there’s nothing we can’t achieve in the cloud. Mentorship and Career Growth We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional. Diverse Experiences Amazon values diverse experiences. Even if you do not meet all of the preferred qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying. About the team The Automated Reasoning in Identity (ARI) team is growing fast. It works on applying automated reasoning techniques to services within AWS's Identity organization, building on initial successes of the Zelkova and Access Analyzer projects. The reach of AR within Identity is growing, with more scientists joining all the time.
US, CA, Sunnyvale
The Artificial General Intelligence (AGI) team is looking for a highly skilled and experienced Senior Applied Scientist, to lead the development and implementation of cutting-edge algorithms and models for supervised fine-tuning and reinforcement learning through human feedback; with a focus across text, image, and video modalities. As a Senior Applied Scientist, you will play a critical role in driving the development of Generative AI (GenAI) technologies that can handle Amazon-scale use cases and have a significant impact on our customers' experiences. Key job responsibilities - Collaborate with cross-functional teams of engineers, product managers, and scientists to identify and solve complex problems in GenAI - Design and execute experiments to evaluate the performance of different algorithms and models, and iterate quickly to improve results - Think big about the arc of development of GenAI over a multi-year horizon, and identify new opportunities to apply these technologies to solve real-world problems - Communicate results and insights to both technical and non-technical audiences, including through presentations and written reports - Mentor and guide junior scientists and engineers, and contribute to the overall growth and development of the team
US, CA, Sunnyvale
The Artificial General Intelligence (AGI) team is looking for a highly skilled and experienced Senior Applied Scientist, to lead the development and implementation of cutting-edge algorithms and models for supervised fine-tuning and reinforcement learning through human feedback; with a focus across text, image, and video modalities. As a Senior Applied Scientist, you will play a critical role in driving the development of Generative AI (GenAI) technologies that can handle Amazon-scale use cases and have a significant impact on our customers' experiences. Key job responsibilities - Collaborate with cross-functional teams of engineers, product managers, and scientists to identify and solve complex problems in GenAI - Design and execute experiments to evaluate the performance of different algorithms and models, and iterate quickly to improve results - Think big about the arc of development of GenAI over a multi-year horizon, and identify new opportunities to apply these technologies to solve real-world problems - Communicate results and insights to both technical and non-technical audiences, including through presentations and written reports - Mentor and guide junior scientists and engineers, and contribute to the overall growth and development of the team