- 2024: We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing image-based …
- 2024: Visual-Language Alignment (VLA) has attracted considerable attention since CLIP's groundbreaking work. Although CLIP performs well, its typical direct latent-feature alignment lacks clarity in its representation and similarity scores. In contrast, a lexical representation, a vector whose elements represent the similarity between the sample and each word in the vocabulary, is a natural sparse representation … (a toy sketch of this lexical-representation idea appears after this list).
- Amazon Technical Reports, 2024: We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text …
- MLTEC 2024: The increasing popularity of wireless sensing applications has led to a growing demand for large datasets of realistic wireless data. However, collecting such wireless data is often time-consuming and expensive. To address this challenge, we propose a synthetic data generation pipeline that uses human meshes generated from videos to produce data at scale. The pipeline first generates a 3D mesh of the human … (a toy sketch of this mesh-to-signal idea appears after this list).
- 2024: As the scale of data and models for video understanding rapidly expands, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information loss, token merging shows promising results when used in collaboration with transformers. However, the application of token merging to long-form video … (a minimal sketch of the token-merging operation appears after this list).
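A minimal sketch of the lexical-representation idea from the visual-language alignment entry above: project a sample embedding onto a vocabulary of word embeddings and keep only the strongest activations, yielding a sparse vector whose nonzero entries score similarity to individual words. The vocabulary, random stand-in embeddings, and top-k sparsification here are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def lexical_representation(sample_emb, word_embs, vocab, top_k=8):
    """Map a unit-normalized sample embedding to a sparse lexical vector.

    Each entry scores the similarity between the sample and one vocabulary
    word; all but the top_k entries are zeroed to keep the vector sparse.
    (Illustrative sketch -- the real method's scoring and sparsification
    may differ.)
    """
    sims = word_embs @ sample_emb                 # cosine similarity per word
    lexical = np.zeros_like(sims)
    keep = np.argsort(sims)[-top_k:]              # indices of the strongest words
    lexical[keep] = sims[keep]
    top_words = [(vocab[i], float(sims[i])) for i in reversed(keep)]
    return lexical, top_words

# Toy usage with random unit vectors standing in for CLIP-style embeddings.
rng = np.random.default_rng(0)
vocab = [f"word_{i}" for i in range(1000)]
word_embs = rng.normal(size=(1000, 512))
word_embs /= np.linalg.norm(word_embs, axis=1, keepdims=True)
image_emb = rng.normal(size=512)
image_emb /= np.linalg.norm(image_emb)
vec, top = lexical_representation(image_emb, word_embs, vocab)
print(top[:3])   # the words most activated by this sample
```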
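For the MLTEC 2024 entry, a toy sketch of how video-derived body geometry could drive synthetic wireless data: treat a few mesh keypoints as moving reflectors and sum their multipath contributions into one channel sample per frame. The ray model, keypoint trajectory, carrier frequency, and transmitter/receiver placement are all assumptions for illustration; the paper's actual simulator is not reproduced here.

```python
import numpy as np

C = 3e8          # speed of light (m/s)
FREQ = 5.8e9     # assumed carrier frequency (Hz)
WAVELEN = C / FREQ

def synthetic_channel(keypoints, tx=np.array([0.0, -2.0, 1.0]),
                      rx=np.array([0.0, 2.0, 1.0])):
    """Sum multipath components reflected off body keypoints for one frame.

    keypoints: (N, 3) array of 3D body points taken from a human mesh.
    Returns one complex channel sample (toy ray model, not the paper's pipeline).
    """
    h = 0.0 + 0.0j
    for p in keypoints:
        d = np.linalg.norm(p - tx) + np.linalg.norm(rx - p)  # reflected path length
        amp = 1.0 / max(d, 1e-6)                              # crude path loss
        h += amp * np.exp(-2j * np.pi * d / WAVELEN)          # phase rotation
    return h

# Toy trajectory: three keypoints swaying as if a person were moving in place.
frames = []
for t in np.linspace(0, 2, 200):
    pts = np.array([[0.1 * np.sin(2 * np.pi * t), 0.0, z] for z in (0.5, 1.0, 1.5)])
    frames.append(synthetic_channel(pts))
csi_trace = np.array(frames)   # complex time series usable as synthetic sensing data
print(np.abs(csi_trace)[:5])
```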
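For the token-merging entry, a minimal sketch of the core operation: split the token sequence into two alternating sets, match each token to its most similar partner, and average the best matches so the transformer processes fewer tokens. This follows the published bipartite soft-matching recipe in spirit; the merge count r, tensor shapes, and duplicate handling are simplifications, and the entry's own scheduling for long-form video may differ.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """Reduce a token sequence x of shape (N, D) by merging r similar pairs.

    Tokens are split into two alternating sets; each token in set A is matched
    to its most similar token in set B, the r best matches are averaged into B,
    and the rest are kept. (If two A tokens pick the same partner, only one
    merge lands here; the full method resolves this with a scatter-reduce.)
    """
    a, b = x[::2], x[1::2]                        # alternating split
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T
    best_sim, best_idx = sim.max(dim=-1)          # best partner in b for each a token
    merge_order = best_sim.argsort(descending=True)
    merged_a, kept_a = merge_order[:r], merge_order[r:]
    b = b.clone()
    b[best_idx[merged_a]] = (b[best_idx[merged_a]] + a[merged_a]) / 2
    return torch.cat([a[kept_a], b], dim=0)       # N - r tokens remain

# Toy usage: 16 frames x 196 patch tokens flattened into one long sequence.
tokens = torch.randn(16 * 196, 768)
reduced = merge_tokens(tokens, r=1000)
print(reduced.shape)   # torch.Size([2136, 768])
```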
Related content
- October 17, 2023: Research award recipients named as part of the JHU + Amazon Initiative for Interactive AI (AI2AI), now in its second year.
- October 06, 2023: Leveraging a large vision-language foundation model enables state-of-the-art performance in remote-object grounding.
- September 29, 2023: From classic problems like image segmentation and object detection to theoretical topics like data representation and “machine unlearning”, Amazon researchers’ ICCV papers showcase the diversity of their work in computer vision.
- September 05, 2023: Benchmarking framework that includes a product-agnostic public dataset, guidelines for model selection, and an evaluation approach helps bridge the gap between research and real-world implementation.
- August 24, 2023: From the urgent challenge of "machine unlearning" to overcoming the problem of critical learning periods in deep neural networks, Alessandro Achille is tackling fundamental issues on behalf of Amazon customers.
- August 22, 2023: Inverting generative adversarial networks to learn label assignments enables a high-quality labeled-image generator that’s trained on 50 images or fewer.