- 2025: Invoices and receipts submitted by employees are visually rich documents (VRDs) with textual, visual, and layout information. To protect against the risk of fraud and abuse, it is crucial for organizations to efficiently extract the desired information from submitted receipts. This helps in the assessment of key factors such as appropriateness of the expense claim and adherence to spending and transaction policies…
- 2025: Computing a comprehensive and robust visual representation of an arbitrary object or category of objects is a complex problem. The difficulty increases when one starts from a set of uncalibrated images obtained from different sources. We propose a self-supervised approach, Multi-Image Latent Embedding (MILE), which computes a single representation from such an image set. MILE operates incrementally, considering…
- 2024: We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing image-based…
- 2024: Visual-Language Alignment (VLA) has gained a lot of attention since CLIP’s groundbreaking work. Although CLIP performs well, the typical direct latent-feature alignment lacks clarity in its representation and similarity scores. On the other hand, a lexical representation, a vector whose elements represent the similarity between the sample and each word in the vocabulary, is a natural sparse representation… (a minimal illustrative sketch of this representation follows the list below).
- Amazon Technical Reports, 2024: We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents, and text…
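The lexical representation mentioned in the VLA entry above can be made concrete with a small sketch. The snippet below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes precomputed sample and vocabulary-word embeddings, and the function name, embedding dimension, and top-k sparsification are illustrative choices. Each entry of the output vector is the cosine similarity between the sample and one vocabulary word, with all but the strongest activations zeroed out.

```python
# Hypothetical sketch of a "lexical representation": a sparse vector over a word
# vocabulary whose entries are similarities between the sample and each word.
import numpy as np

def lexical_representation(sample_emb: np.ndarray,
                           vocab_embs: np.ndarray,
                           top_k: int = 32) -> np.ndarray:
    """Return a |vocab|-dim vector of cosine similarities, sparsified to top_k entries."""
    # Normalize so that dot products are cosine similarities.
    s = sample_emb / np.linalg.norm(sample_emb)
    v = vocab_embs / np.linalg.norm(vocab_embs, axis=1, keepdims=True)
    sims = v @ s                       # similarity of the sample to every vocabulary word
    out = np.zeros_like(sims)
    keep = np.argsort(sims)[-top_k:]   # indices of the strongest word activations
    out[keep] = sims[keep]             # all other entries stay zero -> sparse vector
    return out

# Usage with random stand-ins for image/word embeddings (assumed 512-dimensional).
rng = np.random.default_rng(0)
rep = lexical_representation(rng.normal(size=512), rng.normal(size=(10_000, 512)))
print(np.count_nonzero(rep))  # 32 nonzero entries out of 10,000
```

The sparsity comes from keeping only the top-k word activations, which keeps the vector interpretable: each nonzero entry names a vocabulary word the sample resembles.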
Related content
- October 17, 2023: Research award recipients named as part of the JHU + Amazon Initiative for Interactive AI (AI2AI), now in its second year.
- October 6, 2023: Leveraging a large vision-language foundation model enables state-of-the-art performance in remote-object grounding.
- September 29, 2023: From classic problems like image segmentation and object detection to theoretical topics like data representation and “machine unlearning”, Amazon researchers’ ICCV papers showcase the diversity of their work in computer vision.
- September 5, 2023: A benchmarking framework that includes a product-agnostic public dataset, guidelines for model selection, and an evaluation approach helps bridge the gap between research and real-world implementation.
- August 24, 2023: From the urgent challenge of "machine unlearning" to overcoming the problem of critical learning periods in deep neural networks, Alessandro Achille is tackling fundamental issues on behalf of Amazon customers.
- August 22, 2023: Inverting generative adversarial networks to learn label assignments enables a high-quality labeled-image generator that’s trained on 50 images or fewer.