Computer vision

Helping devices see and understand our visual world.

One token to seg them all: Language instructed reasoning segmentation in videos

Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou

NeurIPS 2024

2024

We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing image-based

Computer vision
Unified lexical representation for interpretable visual-language alignment

Yifan Li, Yikai Wang, Yanwei Fu, Dongyu Ru, Zheng Zhang, Tong He

NeurIPS 2024

2024

Visual-Language Alignment (VLA) has gained a lot of attention since CLIP’s groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. On the other hand, lexical representation, a vector in which each element represents the similarity between the sample and a word from the vocabulary, is a natural sparse representation

Computer vision
The Amazon Nova family of models: Technical report and model card

Amazon Artificial General Intelligence

Amazon Technical Reports

2024

We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text

Related: Amazon Nova Canvas examples

Conversational AI
SD2: Synthetic doppler spectrum denoiser using SSM

Koushik Manjunatha, Morris Hsu, Rohit Kumar

MLTEC 2024

2024

The increasing popularity of wireless sensing applications has led to a growing demand for large datasets of realistic wireless data. However, collecting such wireless data is often time-consuming and expensive. To address this challenge, we propose a synthetic data generation pipeline that uses human meshes generated from videos and can generate data at scale. The pipeline first generates a 3D mesh of the human

Computer vision
Video token merging for long-form video understanding

Seon Ho Lee, Jue Wang, Zhikang Zhang, David Fan, Xinyu (Arthur) Li

NeurIPS 2024

2024

As the scale of data and models for video understanding rapidly expands, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information loss, token merging shows promising results when used in collaboration with transformers. However, the application of token merging for long-form video

Computer vision

Johns Hopkins and Amazon announce new AI2AI research awards

Staff writer

October 17, 2023

Research award recipients named as part of the JHU + Amazon Initiative for Interactive AI (AI2AI), now in its second year.

Machine learning
Teaching household robots where to find requested objects

Gunnar Sigurdsson

October 06, 2023

Leveraging a large vision-language foundation model enables state-of-the-art performance in remote-object grounding.

Robotics
A quick guide to Amazon’s papers at ICCV 2023

Staff writer

September 29, 2023

From classic problems like image segmentation and object detection to theoretical topics like data representation and “machine unlearning”, Amazon researchers’ ICCV papers showcase the diversity of their work in computer vision.

Computer vision
Making automated visual-inspection systems practical

Tryambak Gangopadhyay

September 05, 2023

Benchmarking framework that includes a product-agnostic public dataset, guidelines for model selection, and an evaluation approach helps bridge the gap between research and real-world implementation.

Computer vision
“I don't remember a time in my life when I wasn't interested in science”

Sean O'Neill

August 24, 2023

From the urgent challenge of "machine unlearning" to overcoming the problem of critical learning periods in deep neural networks, Alessandro Achille is tackling fundamental issues on behalf of Amazon customers.

Computer vision
Automatically generating labeled training images

Austin Xu, Arjun Seshadri

August 22, 2023

Inverting generative adversarial networks to learn label assignments enables a high-quality labeled-image generator that’s trained on 50 images or fewer.

Computer vision
