Yesterday at Amazon Web Services’ annual re:Invent conference, Amazon CEO Andy Jassy announced Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. The Amazon Nova models include understanding models in three different sizes for varying latency, cost, and accuracy needs. We also announced two new creative-content generation models — Amazon Nova Canvas and Amazon Nova Reel — that can generate studio-quality images and videos from input text prompts and images.
The Amazon Nova Canvas model enables a wide range of practical capabilities, including
- text-to-image generation: input a text prompt and generate a new image as output;
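- image editing, including inpainting (adding, replacing, or removing visual elements inside a selected region of the image), outpainting (extending the image beyond its original content, for example to replace the background), automatic editing through text prompts, and background removal;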
- image variation: input one to five images and an optional text prompt, and the model generates a new image that preserves the content of the input images but varies their style and background;
- image conditioning: input a reference image and a text prompt, and the model generates an image whose layout and composition follow the reference image but whose content follows the text prompt;
- color-guided content: provide a list of one to ten hex color codes along with a text prompt, and the generated image will incorporate the prescribed color palette.
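To make the color-guided input concrete, here is a minimal sketch of checking a palette against the constraint stated above (one to ten hex color codes plus a text prompt). The function name and the dictionary field names are illustrative placeholders, not the service's actual request schema, and the six-digit `#RRGGBB` format is an assumption.

```python
import re

# Assumption: colors are written as six-digit hex codes such as "#FF8800".
HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def build_color_guided_request(prompt, colors):
    """Validate the inputs and return a hypothetical request dictionary.

    The field names below are placeholders for illustration only.
    """
    if not prompt.strip():
        raise ValueError("prompt must be non-empty")
    if not 1 <= len(colors) <= 10:
        raise ValueError("expected between one and ten colors")
    for color in colors:
        if not HEX_COLOR.match(color):
            raise ValueError(f"not a six-digit hex color: {color!r}")
    return {
        "task": "color_guided",
        "prompt": prompt,
        "colors": [color.upper() for color in colors],
    }


request = build_color_guided_request(
    "a lighthouse at dusk", ["#ff8800", "#003366"]
)
print(request["colors"])
```

Validating the palette on the client side catches malformed input before any generation request is sent.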
The Amazon Nova Reel model supports two features: (1) text to video and (2) text and image to video. With both features, Amazon Nova Reel generates video at 1280 x 720 resolution and 24 frames per second, with a duration of six seconds.
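As a quick worked example of what those numbers imply, the short calculation below derives the frame count and the uncompressed size of one clip. The 8-bit RGB assumption (three bytes per pixel) is ours and is used only to give a sense of scale.

```python
WIDTH, HEIGHT = 1280, 720   # resolution stated above
FPS = 24                    # frames per second
DURATION_S = 6              # clip length in seconds


def clip_stats(width=WIDTH, height=HEIGHT, fps=FPS, duration_s=DURATION_S):
    """Return (frames, pixels per frame, raw bytes) for one clip."""
    frames = fps * duration_s
    pixels_per_frame = width * height
    # Assumption: uncompressed 8-bit RGB, i.e. three bytes per pixel.
    raw_bytes = frames * pixels_per_frame * 3
    return frames, pixels_per_frame, raw_bytes


frames, pixels_per_frame, raw_bytes = clip_stats()
print(frames)            # 144
print(pixels_per_frame)  # 921600
print(raw_bytes)         # 398131200
```

Under that assumption a single six-second clip holds roughly 398 MB of raw pixel data, which gives a sense of why generating in a compressed representation is attractive.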
Amazon Nova Canvas samples
Amazon Nova Reel samples
Model architecture
Both Amazon Nova Canvas and Amazon Nova Reel are latent diffusion models with transformer backbones, or diffusion transformers. A diffusion model is trained to iteratively remove noise from a sample to which noise has been incrementally added, and a latent diffusion model is one in which the denoising happens in a compressed latent (representational) space rather than in pixel space.
The major components of Amazon Nova Canvas and Amazon Nova Reel include
- a variational autoencoder (VAE) that maps raw pixels to visual tokens (encoder) and vice versa (decoder); VAEs are trained to output the same data they receive as input but with an intervening bottleneck that forces them to produce a low-dimensional latent representation (encoding);
- a text encoder; and
- a transformer-based denoising network (or denoiser for short).
The inference process for Nova Canvas/Reel to generate images/videos from a text input is as follows:
- the text encoder converts the input text to a sequence of text tokens;
- with the text tokens as guidance, the denoiser iteratively removes noise from a set of randomly initialized visual tokens, resulting in noise-free visual tokens;
- the VAE decoder converts the noise-free visual tokens to color images/video frames.
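The three steps above can be sketched as a toy loop. Everything in it is a stand-in: `encode_text`, `toy_denoiser`, and `decode` are hypothetical placeholders for the text encoder, the transformer denoiser, and the VAE decoder, and the update rule is a deliberately simple one rather than the sampler the models actually use.

```python
import random


def encode_text(prompt, dim=4):
    """Toy stand-in for a text encoder: one small vector per word."""
    words = prompt.lower().split()
    if not words:
        raise ValueError("prompt must contain at least one word")
    tokens = []
    for word in words:
        rng = random.Random(word)  # string seed gives a repeatable vector
        tokens.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return tokens


def toy_denoiser(visual_tokens, text_tokens):
    """Toy stand-in for the denoiser: estimate the noise in each token.

    The mean text token plays the role of the clean target, so the noise
    estimate is the gap between the current token and that target.
    """
    dim = len(visual_tokens[0])
    target = [
        sum(tok[i] for tok in text_tokens) / len(text_tokens)
        for i in range(dim)
    ]
    return [[v[i] - target[i] for i in range(dim)] for v in visual_tokens]


def decode(visual_tokens):
    """Toy stand-in for the VAE decoder: map [-1, 1] to 0..255 values."""
    return [
        [max(0, min(255, round((x + 1.0) * 127.5))) for x in tok]
        for tok in visual_tokens
    ]


def generate(prompt, num_tokens=6, dim=4, steps=20, seed=0):
    rng = random.Random(seed)
    text_tokens = encode_text(prompt, dim)                      # step 1
    # Randomly initialized visual tokens (pure Gaussian noise).
    x = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
         for _ in range(num_tokens)]
    for step in range(steps):                                   # step 2
        estimate = toy_denoiser(x, text_tokens)
        # Remove a growing fraction of the estimated noise; the last
        # iteration removes all that remains.
        fraction = 1.0 / (steps - step)
        x = [[xi - fraction * ni for xi, ni in zip(tok, est)]
             for tok, est in zip(x, estimate)]
    return decode(x)                                            # step 3


print(generate("a red fox in the snow"))
```

In a real system the denoiser is a large learned network and the decoder reconstructs full-resolution frames, but the control flow (encode the text, denoise iteratively, decode) has the same shape.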
During training, image-text or video-text pairs are sampled from the training dataset, and the diffusion transformer learns to associate the visual signals with their paired textual descriptions. This enables the model to use natural language to guide the synthesis of visual signals at inference.
Specifically, during training, the VAE encoder maps the input visual signal to visual tokens, and the text encoder converts the prompt to text tokens. Noise is artificially added to the visual tokens at various sampling time steps, dictated by a predefined noise scheduler. The denoising network, conditioned on the text tokens, is then trained to predict the amount of noise injected into the visual tokens at each time step.
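A minimal sketch of one such training step follows, assuming a DDPM-style formulation (a linear noise schedule and a mean-squared-error loss on the predicted noise). The post does not state which scheduler or loss the models use, so the schedule values, function names, and the flat list-of-floats token representation are all assumptions; the optimizer update is omitted.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed noise scheduler: per-step noise variances, linearly spaced."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Fraction of the clean signal that survives up to each time step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(clean_tokens, t, alpha_bars, rng):
    """Mix clean visual tokens with Gaussian noise for time step t."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in clean_tokens]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
             for x, e in zip(clean_tokens, noise)]
    return noisy, noise


def noise_prediction_loss(predicted, actual):
    """Mean squared error between predicted and injected noise."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def training_step(clean_tokens, text_tokens, denoiser, alpha_bars, rng):
    """Compute the loss for one sample at a randomly drawn time step."""
    t = rng.randrange(len(alpha_bars))
    noisy, noise = add_noise(clean_tokens, t, alpha_bars, rng)
    predicted = denoiser(noisy, t, text_tokens)  # conditioned on the text
    return noise_prediction_loss(predicted, noise)


alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)


def untrained_denoiser(noisy, t, text_tokens):
    """Placeholder denoiser that always predicts zero noise."""
    return [0.0] * len(noisy)


loss = training_step([0.5, -0.25, 0.1], [0.3, 0.7], untrained_denoiser,
                     alpha_bars, rng)
print(loss)
```

Training would repeat this step over many sampled pairs and time steps, adjusting the denoiser's parameters to reduce the loss.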
Training
The training process for both models had two phases: pretraining and fine-tuning. Pretraining establishes a foundational model that performs well on generic tasks, and fine-tuning further improves visual quality and text-image and text-video alignment, particularly in domains of high interest.
Inference
Runtime optimization is critical for both Amazon Nova Canvas and Amazon Nova Reel, because the iterative inference process of large diffusion transformers is computationally demanding. We used a number of techniques to improve inference efficiency, including ahead-of-time (AOT) compilation, multi-GPU inference, model distillation, and a more efficient sampling strategy that samples the solution trajectory densely only when necessary. These optimizations were selected and tailored to the specific requirements of each model, enabling faster and more efficient inference.
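The post does not describe the sampling strategy in detail. As an illustration of the general idea of sampling a trajectory densely only where needed, the sketch below integrates a toy one-dimensional ODE with an adaptive step size: it compares an Euler step with a Heun step and shrinks the step only where the two disagree. This is a generic numerical technique chosen for illustration, not the method used in Amazon Nova Canvas or Amazon Nova Reel, and the function `adaptive_integrate`, its tolerances, and the toy dynamics are all assumptions.

```python
import math


def adaptive_integrate(f, x0, t0, t1, tol=1e-3, h_init=0.1, h_min=1e-4):
    """Integrate dx/dt = f(x, t) from t0 to t1 with adaptive step size.

    Returns the final value and the list of time points actually visited.
    """
    t, x, h = t0, x0, h_init
    times = [t0]
    while t < t1 - 1e-12:
        h = min(h, t1 - t)                   # do not step past the end
        k1 = f(x, t)
        euler = x + h * k1                   # first-order estimate
        k2 = f(euler, t + h)
        heun = x + 0.5 * h * (k1 + k2)       # second-order estimate
        err = abs(heun - euler)              # local error indicator
        if err <= tol or h <= h_min:
            t, x = t + h, heun               # accept the step
            times.append(t)
            if err < tol / 4:
                h *= 2.0                     # smooth region: take bigger steps
        else:
            h *= 0.5                         # rough region: sample more densely
    return x, times


def toy_dynamics(x, t):
    """Toy ODE whose solution changes quickly only near t = 0.5."""
    return -x * (1.0 + 30.0 * math.exp(-((t - 0.5) ** 2) / 0.002))


final_x, times = adaptive_integrate(toy_dynamics, 1.0, 0.0, 1.0)
steps = [b - a for a, b in zip(times, times[1:])]
print(len(steps), min(steps), max(steps))
```

The appeal of this kind of scheme is that each accepted step costs model evaluations, so taking large steps through smooth stretches of the trajectory reduces the total number of evaluations.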