Distributed computing

Producing proofs of unsatisfiability with distributed clause-sharing SAT solvers

Dawn Michaelson, Dominik Schreiber, Marijn J.H. Heule, Benjamin Kiesl-Reiter, Mike Whalen

Journal of Automated Reasoning

2025

Distributed clause-sharing SAT solvers can solve challenging problems hundreds of times faster than sequential SAT solvers by sharing derived information among multiple sequential solvers. Unlike sequential solvers, however, distributed solvers have not been able to produce proofs of unsatisfiability in a scalable manner, which limits their use in critical applications. In this work, we present a method

Automated reasoning

Distributed training of large language models on AWS Trainium

Xinwei Fu, Zhen Zhang, Haozheng Fan, Guangtai Huang, Randy Huang, Rahul Solanki, Fei Wu, Ron Diamant, Yida Wang

ACM SoCC 2024

2024

Large language models (LLMs) are ubiquitously powerful but prohibitively expensive to train, often requiring thousands of compute devices, typically GPUs. To reduce the cost of training LLMs for customers, Amazon Web Services (AWS) launched the Amazon EC2 trn1 instances, powered by AWS Trainium, Amazon’s homegrown deep-learning accelerator, as an alternative to distributed LLM training. The trn1 instances

Cloud and systems

Unsatisfiability proofs for distributed clause-sharing SAT solvers

Dawn Michaelson, Dominik Schreiber, Marijn J.H. Heule, Benjamin Kiesl-Reiter, Mike Whalen

TACAS 2023

2023

Distributed clause-sharing SAT solvers can solve problems up to one hundred times faster than sequential SAT solvers by sharing derived information among multiple sequential solvers working on the same problem. Unlike sequential solvers, however, distributed solvers have not been able to produce proofs of unsatisfiability in a scalable manner, which has limited their use in critical applications. In this

Automated reasoning

Scaling to trillion-parameter model training on AWS

Shuai Zheng, Zhen Zhang

September 26, 2022

Trillian-parameter training - social.png

Contiguous parameter management and prefetched activation offloading expand the MiCS tool kit.

Cloud and systems

Automated reasoning at Amazon: A conversation

Larry Hardesty

August 08, 2022

To mark the occasion of the eighth Federated Logic Conference (FloC), Amazon’s Byron Cook, Daniel Kröning, and Marijn Heule discussed automated reasoning’s prospects.

Automated reasoning

Near-linear scaling of gigantic-model training on AWS

Zhen Zhang, Shuai Zheng

June 27, 2022

A new distributed-training library achieves near-linear efficiency in scaling from tens to hundreds of GPUs.

Cloud and systems

Making DeepSpeed ZeRO run efficiently on more-affordable hardware

Justin Chiu, Shuai Zheng

March 23, 2022

Amazon researchers optimize the distributed-training tool to run efficiently on the Elastic Fabric Adapter network interface.

Cloud and systems

DAOS: Data access-aware operating system

SeongJae Park, Madhuparna Bhowmik, Alexandru Uta

HPDC 2022

2022

In data-intensive workloads, data placement and memory management are inherently difficult: the programmer and the operating system have to choose between (combinations of) DRAM and storage, replacement policies, as well as paging sizes. Efficient memory management is based on fine-grained data access patterns driving placement decisions. Current solutions in this space cannot be applied to general workloads

Cloud and systems

Automated reasoning's scientific frontiers

Byron Cook

February 10, 2022