Automated evaluation of RAG pipelines with exam generation

The fight against hallucination in retrieval-augmented-generation models starts with a method for accurately assessing it.

In the swiftly evolving domain of large language models (LLMs), the accurate evaluation of retrieval-augmented-generation (RAG) models is paramount. In this blog, we introduce a pioneering methodology that employs an automated exam generation process, enhanced by item response theory (IRT), to evaluate the factual accuracy of RAG models on specific tasks. Our approach is not only robust and interpretable but also cost efficient, strategically identifying model strengths and refining exams to optimize their evaluative utility. We describe our methodology in a paper we will present in July at the 2024 International Conference on Machine Learning (ICML).

Exam generation process

RAG is a method for handling natural-language queries by retrieving relevant documents and using text from them to seed the response generated by an LLM. The expectation is that factual assertions from reliable documents will curb the LLM’s tendency to “hallucinate”, or generate reasonable-sounding but false sentences.

To evaluate a RAG model on a particular task, we use an LLM to generate multiple-choice questions from a task-specific knowledge corpus. Our method is agnostic to the retriever and generative model used in both the RAG system and the exam generation task.

Summary of the proposed exam generation, evaluation, and iterative-improvement processes.

Our approach has two steps. For each document in the knowledge corpus, we use an LLM and several prompt-engineering strategies to create candidate questions. Then we use several natural-language-processing filters to remove low-quality questions along various axes, such as length, incorrectness, and self-containment.

We note an interesting asymmetry: given a document corpus, it is relatively easy for an LLM to generate a question and the correct answer, as the content of both is contained in the prompt. However, it is considerably more difficult to create high-quality incorrect answers, commonly referred to as discriminators.

To filter out degenerate questions, we use the Jaccard similarity coefficient and embedding-based similarity metrics.
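As a rough illustration of the Jaccard part of this filter, the sketch below flags a question whose candidate answers are near-duplicates of one another. The tokenization, the 0.8 threshold, and the choice to compare candidates pairwise are all assumptions made for the example, not details taken from our pipeline.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)

def is_degenerate(candidates: list[str], threshold: float = 0.8) -> bool:
    """Flag a question if any two candidate answers are near-duplicates.

    The threshold is an illustrative assumption, not a published value.
    """
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if jaccard(candidates[i], candidates[j]) >= threshold:
                return True
    return False
```

An embedding-based check would follow the same shape, with cosine similarity between sentence embeddings in place of `jaccard`.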

Here is the prompt that we used for exam generation:

Human: Here is some documentation from {task_domain}: {documentation}.\n
From this generate a difficult multi-form question for an exam.
It should have 4 candidates, 1 correct answer, and explanations.

Syntax should be Question: {question}\n
A){candidate A}\n
B){candidate B}\n
C){candidate C}\n
D){candidate D}

Correct Answer: {correct answer}\n
### Assistant:
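Because the prompt fixes the output syntax, a generated question can be checked mechanically before any quality filter runs. The parser below is a minimal sketch under that assumption; `ExamQuestion` and `parse_question` are hypothetical names introduced for illustration, and the regular expressions assume that each candidate sits on its own line.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExamQuestion:
    question: str
    candidates: dict[str, str]  # letter -> candidate text
    correct: str                # letter of the correct candidate

def parse_question(raw: str) -> Optional[ExamQuestion]:
    """Parse text in the 'Question / A) .. D) / Correct Answer' layout.

    Returns None when the text does not follow the layout, so that
    malformed generations can be discarded.
    """
    # Question text runs from "Question:" up to the line that starts with "A)".
    q = re.search(r"Question:\s*(.+?)\s*(?=^A\))", raw, flags=re.S | re.M)
    # One candidate per line, each introduced by its letter and ")".
    cands = dict(re.findall(r"^([A-D])\)\s*(.+?)\s*$", raw, flags=re.M))
    # The correct answer must begin with a standalone letter A-D.
    ans = re.search(r"Correct Answer:\s*([A-D])\b", raw)
    if not q or set(cands) != set("ABCD") or not ans:
        return None
    return ExamQuestion(q.group(1), cands, ans.group(1))
```

Returning `None` rather than raising keeps the parser usable as a first-pass filter over a large batch of generations.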

In our research, we analyzed several RAG pipeline variants, including closed-book (no knowledge from the document corpus is provided to the LLM), oracle (the exam taker has access to the specific document used to generate the question-and-answer pair, in addition to the question itself and all possible candidate answers), and classical retrieval models such as MultiQA embeddings, Siamese network embeddings, and BM25. Our evaluations also extended to different scales of language models, from 7 billion parameters to 70 billion, to understand the impact of model scale on performance.
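One way to picture these variants is that they differ only in the context handed to the exam taker alongside the question. The sketch below makes that explicit; the item fields (`question`, `source_doc`, `candidates`, `correct`) and the `taker` callable, which stands in for an LLM call, are hypothetical and chosen for illustration.

```python
from typing import Callable

# An exam item is a dict. Keys assumed here (hypothetical):
# "question", "source_doc", "candidates" (letter -> text), "correct" (letter).
Item = dict
# Maps an exam item to the context shown alongside the question.
ContextFn = Callable[[Item], str]
# Maps (context, item) to a chosen answer letter; stands in for an LLM call.
TakerFn = Callable[[str, Item], str]

def closed_book(item: Item) -> str:
    """Closed-book: no knowledge from the corpus is provided."""
    return ""

def oracle(item: Item) -> str:
    """Oracle: the document the question was generated from is provided."""
    return item["source_doc"]

def make_retrieval(retrieve: Callable[[str], str]) -> ContextFn:
    """Wrap any retriever (sparse or dense) that maps a query to context text."""
    return lambda item: retrieve(item["question"])

def accuracy(exam: list[Item], context_fn: ContextFn, taker: TakerFn) -> float:
    """Fraction of exam items the taker answers correctly under one variant."""
    if not exam:
        raise ValueError("exam must contain at least one item")
    hits = sum(1 for item in exam if taker(context_fn(item), item) == item["correct"])
    return hits / len(exam)
```

Holding `taker` fixed and swapping `context_fn` isolates the effect of retrieval; holding `context_fn` fixed and swapping `taker` isolates the effect of the language model.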

To demonstrate the practical utility of this methodology, we deployed it across a wide range of domains. These include Amazon Web Services (AWS) DevOps, where troubleshooting guides for cloud-based services test the models' operational effectiveness; arXiv abstracts, which challenge the models' ability to parse and generate insights from dense scientific texts; StackExchange questions, which probe the models' responsiveness and accuracy; and SEC filings, where the complexity of financial reporting tests the models’ capacity to extract nuanced information from structured corporate documents. This multi-domain approach not only enhances the robustness of our evaluations but also ensures that our models are versatile and reliable across various real-world applications.

Evaluating the exam generation model

The following figure shows granular results of our evaluation method for the task of AWS DevOps troubleshooting. We report accuracy for different retrieval approaches and base LLM sizes, on a percentage scale. Labels around the perimeter show the AWS resources we’re using. Colors correspond to different retrieval approaches (Oracle, DPRV2, MultiQA, ClosedBook), and solid and broken lines correspond to different base LLM sizes (7B, 13B, and 70B). For instance, we observe that a small model such as Mistral-7B with MultiQA embeddings has an accuracy of around 80% for the AWS resource Relational Database Service (RDS).

A comparison of several different models, at a range of sizes, on the task of DevOps troubleshooting for eight different AWS resources.

Our experiments yielded four key findings. First, there’s no one-size-fits-all solution; the optimal choice of retrieval method, and to a lesser extent LLM, is typically task dependent. For example, in tasks such as SEC filings and arXiv abstracts, BM25 outperforms MultiQA and Siamese network embeddings, indicating that sparse retrieval is generally more effective than dense retrieval. This could be because such tasks often contain easily identifiable terms (e.g., AWS service names in AWS DevOps) that can be retrieved with keyword search, while other tasks, such as StackExchange, mostly contain common words.

Second, the right choice of retrieval method can lead to greater performance improvements than simply using larger LLMs. For instance, in SEC filings, we observed a greater performance gain from switching from Siamese network embeddings to DPRV2 than from switching to larger LLMs.

Third, for tasks involving closed-source knowledge, the accuracy bottleneck is typically the LLM rather than the retrieval method. Finally, a poorly aligned retriever component can result in worse accuracy than having no retrieval at all.

Exam enhancements through item response theory

Integrating item response theory (IRT) into our process has significantly improved the quality of the exams. IRT models the likelihood of a correct response based on characteristics of a question and the capabilities of a model. It uses three factors — difficulty, discrimination, and guessing chance — to create exams that more accurately reflect and predict model performance.

IRT posits that a model’s probability of correctly answering a question is correlated with a latent variable known as ability, and it provides a method for estimating the value of that variable. As such, it offers a way to quantify a model’s ability level.
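The three factors named above match the standard three-parameter logistic (3PL) IRT model, in which the probability of a correct answer is c + (1 - c) / (1 + exp(-a(θ - b))), where θ is ability, a is discrimination, b is difficulty, and c is the guessing chance. The sketch below implements that textbook form; the exact parameterization and fitting procedure we used may differ.

```python
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability that a taker with ability `theta` answers an item correctly.

    a: discrimination, b: difficulty, c: guessing chance (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# For a four-candidate question, blind guessing suggests c near 0.25 (an
# illustrative value). When ability equals difficulty, the probability sits
# halfway between c and 1.
print(p_correct(theta=0.0, a=1.0, b=0.0, c=0.25))  # 0.625
```

Estimating θ for each model, and a, b, and c for each question, from the observed right and wrong answers is what turns raw exam results into an ability score.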

Our process begins with an initial exam assessment, identifying and removing questions that contribute minimally to discriminative insights. The exam is then refined iteratively, based on updated IRT parameters, which helps it accurately gauge nuanced model behaviors.
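The refit-and-prune loop can be sketched as follows. Here `fit_fn` is a hypothetical stand-in for an IRT fitting routine that returns a discrimination estimate per question, and the 0.1 cutoff and three-round limit are illustrative assumptions rather than values from our experiments.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for IRT fitting: given the question ids still on the
# exam, return one discrimination estimate per id.
FitFn = Callable[[Sequence[str]], dict[str, float]]

def refine_exam(
    question_ids: Sequence[str],
    fit_fn: FitFn,
    min_discrimination: float = 0.1,  # illustrative cutoff
    max_rounds: int = 3,              # illustrative iteration budget
) -> list[str]:
    """Repeatedly refit IRT parameters and drop low-discrimination questions."""
    kept = list(question_ids)
    for _ in range(max_rounds):
        discrimination = fit_fn(kept)
        survivors = [q for q in kept if discrimination[q] >= min_discrimination]
        if len(survivors) == len(kept):
            break  # nothing was removed, so the exam is stable
        kept = survivors
    return kept
```

Refitting after each pruning round matters because the parameter estimates for the remaining questions can shift once uninformative questions are gone.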

By continuously analyzing and adjusting exams based on IRT parameters, we have seen substantial improvements in the exams’ ability to discriminate among models. For instance, we use Fisher information to quantify the informativeness of exam questions. Fisher information measures the amount of information that an observable random variable provides about an unknown parameter, offering a way to gauge the precision of statistical estimators in parameter estimation theory.
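For a 3PL item, the Fisher information at ability θ has the standard closed form a² · (q/p) · ((p - c)/(1 - c))², where p is the probability of a correct answer and q = 1 - p. The sketch below computes it per item and averages over an exam; averaging is an assumption made to mirror the per-category averages reported later in this post, and a sum over items is the other common convention.

```python
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability of a correct answer (a: discrimination, b: difficulty, c: guessing)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float, c: float) -> float:
    """Fisher information that one 3PL item carries about ability at `theta`."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return a * a * (q / p) * ((p - c) / (1.0 - c)) ** 2

def exam_information(theta: float, items: list[tuple[float, float, float]]) -> float:
    """Mean item information at `theta`; each item is an (a, b, c) tuple."""
    return sum(item_information(theta, a, b, c) for a, b, c in items) / len(items)
```

With no guessing (c = 0) the expression reduces to a²pq, which peaks where ability equals difficulty. A nonzero guessing chance lowers the information an item can provide, and a higher discrimination raises it.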

During iterative improvements for the arXiv task, the Fisher information function consistently showed progress, marking a considerable enhancement of the exams' capacity to differentiate model capabilities. This iterative process ensures that each new version of the exam is more informative than the last and effectively evaluates the RAG model’s abilities.

Evaluating the generated exams

To further enhance the assessment of RAG models, we categorize exam questions using both semantic analysis and Bloom’s revised taxonomy, devised by the University of Chicago psychologist Benjamin Bloom. Bloom’s taxonomy helps classify questions by cognitive complexity — from basic recall to analytical tasks — enabling structured evaluation of model capabilities.

Different levels in Bloom's taxonomy differentiate between the knowledge dimension (factual, conceptual, procedural, and meta-cognitive) and the cognitive-process dimension (remember, understand, apply, analyze, evaluate, and create). Additionally, we classify questions semantically by identifying keywords like “what” and “which.” These additional classifications allow us to assess how well models perform at different ability levels.
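The keyword-based semantic classification can be approximated with a few lines of code. In this sketch a question is labeled by the first interrogative word it contains; the keyword list beyond “what” and “which” and the first-match rule are assumptions made for illustration.

```python
import re
from collections import Counter

# "what" and "which" come from the text above; the rest are illustrative additions.
QUESTION_WORDS = ("what", "which", "when", "how", "why", "where", "who")

def semantic_category(question: str) -> str:
    """Return the first interrogative keyword found in the question, else 'other'."""
    for word in re.findall(r"[a-z]+", question.lower()):
        if word in QUESTION_WORDS:
            return word
    return "other"

def category_counts(questions: list[str]) -> Counter:
    """Count how many exam questions fall into each semantic category."""
    return Counter(semantic_category(q) for q in questions)
```

Grouping per-question statistics, such as accuracy or Fisher information, by these labels gives a per-category view of an exam.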

Average Fisher information for each category in Bloom’s taxonomy (left) and each semantic category (right) for the StackExchange task.

The figure above presents the average Fisher information value for each Bloom category (left) and semantic category (right) for the StackExchange task. For this specific task, we observe that “evaluating” and “understanding” are the most discriminative dimensions in Bloom’s taxonomy across different ability levels, while “remembering” is the least discriminative.

For the semantic categories, we observe that “what” and “which” were the most discriminative terms for lower ability levels, while “when” discriminated more at higher ability levels. One interpretation is that “what” and “which” questions tend to be more factual and syntax-based in the StackExchange domain, so at lower ability levels, RAG struggles more with these types of questions.

The following figure illustrates the maximization process for the arXiv task as the exam and IRT estimation evolve. We show the results for three incremental steps. We observe a 0.05 increase in Fisher information even with a single iteration. This progress reaches a 0.1 increase in the subsequent steps.

The maximization process, as the exam and IRT estimation evolve, for the task of generating abstracts for arXiv papers.

To expand our approach beyond Q&A applications, our future research will focus on domains such as summarization, translation, and sentiment analysis. We are also addressing the complex task of meta-evaluation, comparing and refining our evaluation methods to account for the multidimensional nature of LLM performance. Additionally, we will continuously update our methodologies to accommodate the rapid evolution of LLM technology, ensuring robust and comprehensive assessment of emerging models.

Acknowledgments: Laurent Callot
