Understanding the training dynamics of transformers

Theoretical analysis provides insight into the optimization process during model training and reveals that for some optimizations, the Gaussian attention kernel may work better than softmax.

Most of today’s breakthrough AI models are based on the transformer architecture, which is distinguished by its use of an attention mechanism. In a large language model (LLM), for instance, the transformer decides which words in the text string to pay particular attention to when generating the next word; in a vision-language model, it might decide which words of an instruction to attend to when computing pixel values.

Given the increasing importance of transformer models, we naturally want to better understand their dynamics — whether the training process will converge on a useful model, for instance, and how fast, or which architectural variations work best for what purposes. The complexity of the attention mechanism, however, makes traditional analytic tools difficult to apply.

Last week, at the 2024 Conference on Neural Information Processing Systems (NeurIPS), we presented a new analysis of the transformer architecture. First, we identified hyperparameters and initialization conditions that provide a probabilistic guarantee of convergence to a globally optimal solution.

Through ablation studies, we also showed that the choice of attention kernel — the function used to compute the attention weights — influences the convergence rate. Specifically, the Gaussian kernel will sometimes enable convergence when the more common softmax kernel will not. Finally, we conducted an empirical study that showed that, in some specific settings, models trained using the Gaussian kernel converged more rapidly than models trained using the softmax kernel, due to a smoother optimization landscape.

The optimization landscapes of both the Gaussian kernel and the softmax kernel for two different machine learning tasks. Because of their smoother optimization landscapes, models trained using the Gaussian kernel converged more rapidly than models trained using the softmax kernel.

A tale of three matrices

In a transformer, the attention weight computation involves three matrices: the query matrix, the key matrix, and the value matrix. All three are used to produce encodings of input data. In a self-attention mechanism, which compares an input to itself (as in the case of LLMs), the query and key matrices are applied to the same input. In a cross-attention mechanism, they’re applied to different inputs: in a multimedia model, for instance, one matrix may be used to encode texts, while the other is used to encode images.

The attention kernel defines an operation performed on the query and key encodings; the result of the operation indicates the relevance of one set of inputs to another (or to itself). The encoding produced by the value matrix represents semantic properties of the data. The result of the kernel operation is multiplied by the encodings produced by the value matrix, emphasizing some semantic features and deemphasizing others. The result is, essentially, a recipe for the semantic content of the model’s next output.
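As a rough illustration of the computation described above, here is a minimal NumPy sketch of single-head self-attention with a pluggable kernel. The function names, the toy dimensions, and the exact form of the Gaussian kernel (squared query-key distance with a bandwidth parameter sigma and no row normalization) are assumptions made for illustration; the paper's definitions may differ in scaling and normalization.

```python
import numpy as np


def softmax_kernel(q, k):
    """Softmax attention weights: each row is a probability distribution over keys."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def gaussian_kernel(q, k, sigma=1.0):
    """Gaussian attention weights from squared query-key distance (assumed form)."""
    sq_dist = ((q[:, None, :] - k[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma**2))


def attention(x, w_q, w_k, w_v, kernel):
    """Self-attention: queries, keys, and values are all encodings of the same input x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return kernel(q, k) @ v  # kernel weights re-weight the value encodings


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # toy input: 4 tokens, 8 features each
w_q, w_k, w_v = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))

out_softmax = attention(x, w_q, w_k, w_v, softmax_kernel)
out_gaussian = attention(x, w_q, w_k, w_v, gaussian_kernel)
```

Swapping the kernel changes only how relevance is scored; the value encodings and the final weighted combination stay the same.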

Typically, during model training, all three matrices are updated together. But we analyzed the results of updating only subsets of the matrices while the others remain fixed. This enabled us to identify which matrices and kernel functions exert the largest influence on convergence rate. The results were as follows:

  • If all three matrices can be updated, ordinary gradient descent (GD) can achieve global optimality, with either Gaussian or softmax attention kernels;
  • If only the value matrix can be updated, GD can still achieve global optimality, with either kernel;
  • If only the query matrix can be updated, GD convergence is guaranteed only with the Gaussian kernel.
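To make the value-only case concrete, the following toy sketch freezes the query and key matrices and runs plain gradient descent on the value matrix alone. The single-layer setup, the squared-error loss, the synthetic target, and all variable names are assumptions for illustration, not the experimental setup of the paper. With the kernel weights fixed, the problem reduces to least squares in the value matrix, so a step size of one over the gradient's Lipschitz constant gives a non-increasing loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 6, 4
x = rng.normal(size=(n_tokens, dim))
w_q = rng.normal(size=(dim, dim)) * 0.5  # frozen query matrix
w_k = rng.normal(size=(dim, dim)) * 0.5  # frozen key matrix

# Kernel weights depend only on the frozen matrices, so they are computed once.
q, k = x @ w_q, x @ w_k
sq_dist = ((q[:, None, :] - k[None, :, :]) ** 2).sum(axis=-1)
weights = np.exp(-0.5 * sq_dist)  # Gaussian kernel (assumed form)

features = weights @ x  # model output is features @ w_v
target = features @ rng.normal(size=(dim, dim))  # synthetic, realizable target

w_v = np.zeros((dim, dim))
# Step size 1/L, where L is the largest eigenvalue of features.T @ features.
step = 1.0 / np.linalg.norm(features, 2) ** 2
losses = []
for _ in range(200):
    residual = features @ w_v - target
    losses.append(0.5 * np.sum(residual**2))
    w_v -= step * features.T @ residual  # gradient of the squared-error loss
```

The query-only case has no such reduction, because the query matrix enters through the nonlinear kernel.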

This suggests that in some cases, the commonly used softmax kernel may have drawbacks, and we conducted a set of experiments that bear that intuition out. On two different datasets — one for a text classification task and one for an image interpretation and segmentation task — we trained pairs of transformer models, one with a Gaussian kernel and one with a softmax kernel. On both tasks, the Gaussian kernel enabled faster convergence rates and higher accuracy in the resulting model.

Results of experiments on a text classification task (top) and an image interpretation and segmentation task (bottom). Results with the Gaussian kernel are in blue, results with the softmax kernel in red. In the accuracy measurements (left), higher scores are better; in the training loss measurements (right), lower scores are better.

Our analysis also indicates that, theoretically, convergence depends centrally on updates to the value matrix, since the multiplication of the value matrix and the results of the kernel operation is a linear operation, whereas the kernel operation is nonlinear.
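The linear-versus-nonlinear distinction can be checked numerically. The sketch below assumes a Gaussian kernel of the form exp(-½‖q − k‖²) and random toy encodings, both chosen for illustration only. It tests whether the attention output respects linear combinations of the value encodings and whether it does so for the query encodings.

```python
import numpy as np


def gaussian_weights(q, k):
    """Gaussian kernel weights (assumed form, unit bandwidth)."""
    sq_dist = ((q[:, None, :] - k[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist)


rng = np.random.default_rng(0)
q, q2, k = (rng.normal(size=(5, 3)) for _ in range(3))
v1, v2 = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
a, b = 2.0, -0.5

# Linear in the values: combining values first or outputs first gives the same result.
weights = gaussian_weights(q, k)
linear_in_v = np.allclose(
    weights @ (a * v1 + b * v2),
    a * (weights @ v1) + b * (weights @ v2),
)

# Not linear in the queries: the kernel's exponential breaks superposition.
linear_in_q = np.allclose(
    gaussian_weights(a * q + b * q2, k) @ v1,
    a * (gaussian_weights(q, k) @ v1) + b * (gaussian_weights(q2, k) @ v1),
)
```

Here `linear_in_v` holds while `linear_in_q` does not, which is the asymmetry the analysis relies on.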

Finally, our paper also sets out a group of initialization conditions that are necessary to guarantee convergence. These include the requirements that the matrix of the kernel operations have full rank — that is, that its columns are linearly independent — and that the ratio of the query and key matrices’ eigenvalues to the value matrix’s eigenvalue fall above a specified threshold.
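The full-rank requirement is straightforward to test numerically. The sketch below is a simplified illustration: it uses raw token vectors as both queries and keys, assumes a Gaussian kernel with unit bandwidth, and uses hypothetical helper names. It covers only the rank condition; the eigenvalue-ratio threshold is specified in the paper and is not reproduced here.

```python
import numpy as np


def gaussian_kernel_matrix(q, k):
    """Matrix of Gaussian kernel values between every query and key (assumed form)."""
    sq_dist = ((q[:, None, :] - k[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist)


def has_full_rank(matrix):
    """True if the numerical rank equals the smaller matrix dimension."""
    return np.linalg.matrix_rank(matrix) == min(matrix.shape)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 3))  # 5 distinct toy tokens

distinct = gaussian_kernel_matrix(tokens, tokens)

# Duplicating a token yields two identical rows, so the matrix loses rank.
duplicated_tokens = tokens.copy()
duplicated_tokens[1] = duplicated_tokens[0]
degenerate = gaussian_kernel_matrix(duplicated_tokens, duplicated_tokens)
```

In this toy setting, `has_full_rank(distinct)` is true and `has_full_rank(degenerate)` is false, showing how repeated inputs can violate the condition.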

Further details can be found in our paper. We hope that other members of the AI community will build on our analyses, deepening our understanding of transformers as they play a larger and larger role in our everyday lives.
