Paper Digest: CVPR 2025 Papers & Highlights
Note: CVPR-2025 accepted more than 2,800 papers; this page includes only 500 of them, selected by our daily paper digest algorithm. Interested users can choose to read All 2,800 CVPR-2025 papers on a separate page.
To search for papers presented at CVPR-2025 on a specific topic, please use the search by venue (CVPR-2025) service. To summarize the latest research published at CVPR-2025 on a specific topic, you can use the review by venue (CVPR-2025) service. If you are interested in browsing papers by author, we have a comprehensive list of ~12,000 authors (CVPR-2025). Additionally, you may want to explore our “Best Paper” Digest (CVPR), which lists the most influential CVPR papers since 1988.
We’ve developed a service, the CVPR-2025 Research Report, that synthesizes the latest findings from CVPR 2025 into comprehensive reports. For instance, we’ve generated sample reports on Advances in 3D from Multi-View and Sensors: Insights from CVPR 2025 Papers and Advances in Image and Video Synthesis: Insights from CVPR 2025 Papers. We encourage interested users to use the service to create tailored reports on other emerging topics.
As a pioneer in the field since 2018, Paper Digest has curated thousands of such lists, drawing on years of accumulated data across decades of conferences and research topics. To ensure you never miss a breakthrough, our daily service sifts through tens of thousands of new papers, clinical trials, news articles, and community posts every day, delivering only what matters most to your specific interests. Beyond discovery, Paper Digest offers built-in research tools that help users read articles, write articles, get answers, conduct literature reviews, and generate research reports more efficiently.
Paper Digest Team
New York City, New York, 10017
TABLE 1: Paper Digest: CVPR 2025 Papers & Highlights
| # | Paper | Author(s) |
|---|---|---|
| 1 | Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. Highlight: Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. | Matt Deitke; Christopher Clark; Sangho Lee; Rohun Tripathi; Yue Yang; Jae Sung Park; Mohammadreza Salehi; Niklas Muennighoff; Kyle Lo; Luca Soldaini; Jiasen Lu; Taira Anderson; Erin Bransom; Kiana Ehsani; Huong Ngo; YenSung Chen; Ajay Patel; Mark Yatskar; Chris Callison-Burch; Andrew Head; Rose Hendrix; Favyen Bastani; Eli VanderBilt; Nathan Lambert; Yvonne Chou; Arnavi Chheda; Jenna Sparks; Sam Skjonsberg; Michael Schmitz; Aaron Sarnat; Byron Bischoff; Pete Walsh; Chris Newell; Piper Wolters; Tanmay Gupta; Kuo-Hao Zeng; Jon Borchardt; Dirk Groeneveld; Crystal Nam; Sophie Lebrecht; Caitlin Wittlif; Carissa Schoenick; Oscar Michel; Ranjay Krishna; Luca Weihs; Noah A. Smith; Hannaneh Hajishirzi; Ross Girshick; Ali Farhadi; Aniruddha Kembhavi |
| 2 | Scaling Inference Time Compute for Diffusion Models. Highlight: In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. | Nanye Ma; Shangyuan Tong; Haolin Jia; Hexiang Hu; Yu-Chuan Su; Mingda Zhang; Xuan Yang; Yandong Li; Tommi Jaakkola; Xuhui Jia; Saining Xie |
| 3 | Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. Highlight: In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. | Chaoyou Fu; Yuhan Dai; Yongdong Luo; Lei Li; Shuhuai Ren; Renrui Zhang; Zihan Wang; Chenyu Zhou; Yunhang Shen; Mengdan Zhang; Peixian Chen; Yanwei Li; Shaohui Lin; Sirui Zhao; Ke Li; Tong Xu; Xiawu Zheng; Enhong Chen; Caifeng Shan; Ran He; Xing Sun |
| 4 | OmniGen: Unified Image Generation. Highlight: In this work, we introduce OmniGen, a new diffusion model for unified image generation. | Shitao Xiao; Yueze Wang; Junjie Zhou; Huaying Yuan; Xingrun Xing; Ruiran Yan; Chaofan Li; Shuting Wang; Tiejun Huang; Zheng Liu |
| 5 | VisionArena: 230k Real World User-VLM Conversations with Preference Labels. Highlight: We introduce VisionArena, the largest existing dataset of crowdsourced real-world conversations between users and VLMs. | Christopher Chou; Lisa Dunlap; Koki Mashita; Krishna Mandal; Trevor Darrell; Ion Stoica; Joseph E. Gonzalez; Wei-Lin Chiang |
| 6 | Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces. Highlight: We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive, though subhuman, visual-spatial intelligence. | Jihan Yang; Shusheng Yang; Anjali W. Gupta; Rilyn Han; Li Fei-Fei; Saining Xie |
| 7 | Structured 3D Latents for Scalable and Versatile 3D Generation. Highlight: We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. | Jianfeng Xiang; Zelong Lv; Sicheng Xu; Yu Deng; Ruicheng Wang; Bowen Zhang; Dong Chen; Xin Tong; Jiaolong Yang |
| 8 | MambaVision: A Hybrid Mamba-Transformer Vision Backbone. Highlight: We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. | Ali Hatamizadeh; Jan Kautz |
| 9 | Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation. Highlight: While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage that is conditioned on this representation. | Guy Yariv; Yuval Kirstain; Amit Zohar; Shelly Sheynin; Yaniv Taigman; Yossi Adi; Sagie Benaim; Adam Polyak |
| 10 | LLaVA-Critic: Learning to Evaluate Multimodal Models. Highlight: We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. | Tianyi Xiong; Xiyao Wang; Dong Guo; Qinghao Ye; Haoqi Fan; Quanquan Gu; Heng Huang; Chunyuan Li |
| 11 | DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention. Highlight: In this paper, we aim to incorporate the sub-quadratic modeling capability of Gated Linear Attention (GLA) into the 2D diffusion backbone. | Lianghui Zhu; Zilong Huang; Bencheng Liao; Jun Hao Liew; Hanshu Yan; Jiashi Feng; Xinggang Wang |
| 12 | From Slow Bidirectional to Fast Autoregressive Video Diffusion Models. Highlight: To enable stable and high-quality distillation, we introduce a student initialization scheme based on the teacher’s ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. | Tianwei Yin; Qiang Zhang; Richard Zhang; William T. Freeman; Fredo Durand; Eli Shechtman; Xun Huang |
| 13 | StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text. Highlight: Current methods excel in generating short videos (up to 16s), but produce hard cuts when naively extended to long video synthesis. To overcome these limitations, we present StreamingT2V, an autoregressive method that generates long videos of up to 2 minutes or longer with seamless transitions. | Roberto Henschel; Levon Khachatryan; Hayk Poghosyan; Daniil Hayrapetyan; Vahram Tadevosyan; Zhangyang Wang; Shant Navasardyan; Humphrey Shi |
| 14 | MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision. Highlight: We present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. | Ruicheng Wang; Sicheng Xu; Cassie Dai; Jianfeng Xiang; Yu Deng; Xin Tong; Jiaolong Yang |
| 15 | Multi-subject Open-set Personalization in Video Generation. Highlight: We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. | Tsai-Shien Chen; Aliaksandr Siarohin; Willi Menapace; Yuwei Fang; Kwot Sin Lee; Ivan Skorokhodov; Kfir Aberman; Jun-Yan Zhu; Ming-Hsuan Yang; Sergey Tulyakov |
| 16 | Let’s Verify and Reinforce Image Generation Step By Step. Highlight: In this paper, we provide the first comprehensive investigation into the potential of CoT reasoning to enhance autoregressive image generation. | Renrui Zhang; Chengzhuo Tong; Zhizheng Zhao; Ziyu Guo; Haoquan Zhang; Manyuan Zhang; Jiaming Liu; Peng Gao; Hongsheng Li |
| 17 | MUSt3R: Multi-view Network for Stereo 3D Reconstruction. Highlight: In this paper, we propose an extension of DUSt3R from pairs to multiple views that addresses all aforementioned concerns. | Yohann Cabon; Lucas Stoffl; Leonid Antsfeld; Gabriela Csurka; Boris Chidlovskii; Jerome Revaud; Vincent Leroy |
| 18 | Multimodal Autoregressive Pre-training of Large Vision Encoders. Highlight: We introduce a novel method for pre-training of large-scale vision encoders. | Enrico Fini; Mustafa Shukor; Xiujun Li; Philipp Dufter; Michal Klein; David Haldimann; Sai Aitharaju; Victor G. Turrisi da Costa; Louis Béthune; Zhe Gan; Alexander Toshev; Marcin Eichner; Moin Nabi; Yinfei Yang; Joshua Susskind; Alaaeldin El-Nouby |
| 19 | Exploring The Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis. Highlight: This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis: specifically, the deep fusion of large language models (LLMs) with diffusion transformers (DiTs) for multimodal generation. | Bingda Tang; Boyang Zheng; Sayak Paul; Saining Xie |
| 20 | Breaking The Memory Barrier of Contrastive Loss Via Tile-Based Strategy. Highlight: However, the full instantiation of the similarity matrix demands substantial GPU memory, making large batch training highly resource-intensive. To address this, we propose a tile-based computation strategy that partitions the contrastive loss calculation into small blocks, avoiding full materialization of the similarity matrix. | Zesen Cheng; Hang Zhang; Kehan Li; Sicong Leng; Zhiqiang Hu; Fei Wu; Deli Zhao; Xin Li; Lidong Bing |
| 21 | Science-T2I: Addressing Scientific Illusions in Image Synthesis. Highlight: We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. | Jialuo Li; Wenhao Chai; Xingyu Fu; Haiyang Xu; Saining Xie |
| 22 | WonderWorld: Interactive 3D Scene Generation from A Single Image. Highlight: We present WonderWorld, a novel framework for interactive 3D scene generation that enables users to interactively specify scene contents and layout and see the created scenes with low latency. | Hong-Xing Yu; Haoyi Duan; Charles Herrmann; William T. Freeman; Jiajun Wu |
| 23 | MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models. Highlight: In this paper, we introduce MotionBench, a benchmark for evaluating and improving the fine-grained video motion understanding of vision language models. | Wenyi Hong; Yean Cheng; Zhuoyi Yang; Weihan Wang; Lefan Wang; Xiaotao Gu; Shiyu Huang; Yuxiao Dong; Jie Tang |
| 24 | Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation. Highlight: In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. | Yuying Ge; Yizhuo Li; Yixiao Ge; Ying Shan |
| 25 | UniK3D: Universal Camera Monocular 3D Estimation. Highlight: These limitations severely restrict their general applicability, causing poor performance in real-world scenarios with fisheye or panoramic images and resulting in substantial context loss. To address this, we present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera. | Luigi Piccinelli; Christos Sakaridis; Mattia Segu; Yung-Hsu Yang; Siyuan Li; Wim Abbeloos; Luc Van Gool |
| 26 | MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection. Highlight: To facilitate effective distillation, we introduce Monocular Teaching Assistant Knowledge Distillation (MonoTAKD), which proposes a camera-based teaching assistant (TA) model to transfer robust 3D visual knowledge to the student model, leveraging the smaller feature representation gap. | Hou-I Liu; Christine Wu; Jen-Hao Cheng; Wenhao Chai; Shian-Yun Wang; Gaowen Liu; Hugo Latapie; Jhih-Ciang Wu; Jenq-Neng Hwang; Hong-Han Shuai; Wen-Huang Cheng |
| 27 | DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving. Highlight: However, the numerous denoising steps in the robotic diffusion policy and the more dynamic, open-world nature of traffic scenes pose substantial challenges for generating diverse driving actions at a real-time speed. To address these challenges, we propose a novel truncated diffusion policy that incorporates prior multi-mode anchors and truncates the diffusion schedule, enabling the model to learn denoising from an anchored Gaussian distribution to the multi-mode driving action distribution. | Bencheng Liao; Shaoyu Chen; Haoran Yin; Bo Jiang; Cheng Wang; Sixu Yan; Xinbang Zhang; Xiangyu Li; Ying Zhang; Qian Zhang; Xinggang Wang |
| 28 | Motion Prompting: Controlling Video Generation with Motion Trajectories. Highlight: To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. | Daniel Geng; Charles Herrmann; Junhwa Hur; Forrester Cole; Serena Zhang; Tobias Pfaff; Tatiana Lopez-Guevara; Yusuf Aytar; Michael Rubinstein; Chen Sun; Oliver Wang; Andrew Owens; Deqing Sun |
| 29 | Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos. Highlight: We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. | Linyi Jin; Richard Tucker; Zhengqi Li; David Fouhey; Noah Snavely; Aleksander Holynski |
| 30 | LSNet: See Large, Focus Small. Highlight: In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a "See Large, Focus Small" strategy for lightweight vision network design. | Ao Wang; Hui Chen; Zijia Lin; Jungong Han; Guiguang Ding |
| 31 | Diffusion Model Is Effectively Its Own Teacher. Highlight: In this paper, we introduce a novel self-distillation paradigm for improving the performance of diffusion models. | Xinyin Ma; Runpeng Yu; Songhua Liu; Gongfan Fang; Xinchao Wang |
| 32 | FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views. Highlight: We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images (i.e., as few as 2-8 inputs), which is a challenging yet practical setting in real-world applications. | Shangzhan Zhang; Jianyuan Wang; Yinghao Xu; Nan Xue; Christian Rupprecht; Xiaowei Zhou; Yujun Shen; Gordon Wetzstein |
| 33 | CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models. Highlight: We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. | Rundi Wu; Ruiqi Gao; Ben Poole; Alex Trevithick; Changxi Zheng; Jonathan T. Barron; Aleksander Holynski |
| 34 | Apollo: An Exploration of Video Understanding in Large Multimodal Models. Highlight: The high computational cost of training and evaluating such models and limited open research hinder the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. | Orr Zohar; Xiaohan Wang; Yann Dubois; Nikhil Mehta; Tong Xiao; Philippe Hansen-Estruch; Licheng Yu; Xiaofang Wang; Felix Juefei-Xu; Ning Zhang; Serena Yeung-Levy; Xide Xia |
| 35 | AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers. Highlight: In this work, we analyze camera motion from a first-principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. | Sherwin Bahmani; Ivan Skorokhodov; Guocheng Qian; Aliaksandr Siarohin; Willi Menapace; Andrea Tagliasacchi; David B. Lindell; Sergey Tulyakov |
| 36 | SpatialLLM: A Compound 3D-Informed Design Towards Spatially-Intelligent Large Multimodal Models. Highlight: In this paper, we systematically study the impact of 3D-informed data, architecture, and training setups, introducing SpatialLLM, a large multi-modal model with advanced 3D spatial reasoning abilities. | Wufei Ma; Luoxin Ye; Celso M de Melo; Alan Yuille; Jieneng Chen |
| 37 | SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment. Highlight: Otherwise, the model’s answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. | Katrin Renz; Long Chen; Elahe Arani; Oleg Sinavski |
| 38 | Foveated Instance Segmentation. Highlight: In this paper, we present a foveated instance segmentation (FovealSeg) framework that leverages real-time user gaze data to perform instance segmentation exclusively on the instances of interest, resulting in substantial computational savings. | Hongyi Zeng; Wenxuan Liu; Tianhua Xia; Jinhui Chen; Ziyun Li; Sai Qian Zhang |
| 39 | MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors. Highlight: We introduce efficient methods for pointmap matching, camera tracking and local fusion, graph construction and loop closure, and second-order global optimisation. | Riku Murai; Eric Dexheimer; Andrew J. Davison |
| 40 | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation. Highlight: We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches. This work represents a step toward more efficient and versatile vision-language models. | Yiyang Ma; Xingchao Liu; Xiaokang Chen; Wen Liu; Chengyue Wu; Zhiyu Wu; Zizheng Pan; Zhenda Xie; Haowei Zhang; Xingkai Yu; Liang Zhao; Yisong Wang; Jiaying Liu; Chong Ruan |
| 41 | VISTA: Enhancing Long-Duration and High-Resolution Video Understanding By Video Spatiotemporal Augmentation. Highlight: Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective video spatiotemporal augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. | Weiming Ren; Huan Yang; Jie Min; Cong Wei; Wenhu Chen |
| 42 | Detect Any Mirrors: Boosting Learning Reliability on Large-Scale Unlabeled Data with An Iterative Data Engine. Highlight: To address this issue, we first collect a large-scale dataset of approximately 0.4 million mirror-related images from the internet, significantly expanding the data scale for mirror detection. To effectively exploit this unlabeled dataset, we propose the first semi-supervised framework (namely an iterative data engine) consisting of four steps: (1) mirror detection model training, (2) pseudo label prediction, (3) dual guidance scoring, and (4) selection of highly reliable pseudo labels. | Zhaohu Xing; Lihao Liu; Yijun Yang; Hongqiu Wang; Tian Ye; Sixiang Chen; Wenxue Li; Guang Liu; Lei Zhu |
| 43 | Arbitrary-steps Image Super-resolution Via Diffusion Inversion. Highlight: This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance. | Zongsheng Yue; Kang Liao; Chen Change Loy |
| 44 | DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment. Highlight: We propose several key ingredients to improve performance on both global and dense tasks, such as concatenating the [CLS] token with the patch average to train the alignment and curating data using both text and image modalities. | Cijo Jose; Théo Moutakanni; Dahyun Kang; Federico Baldassarre; Timothée Darcet; Hu Xu; Daniel Li; Marc Szafraniec; Michaël Ramamonjisoa; Maxime Oquab; Oriane Siméoni; Huy V. Vo; Patrick Labatut; Piotr Bojanowski |
| 45 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale. Highlight: In this paper, we explore large-scale training for Video LLM with cheap automatic speech recognition (ASR) transcripts. | Joya Chen; Ziyun Zeng; Yiqi Lin; Wei Li; Zejun Ma; Mike Zheng Shou |
| 46 | RADIOv2.5: Improved Baselines for Agglomerative Vision Foundation Models. Highlight: In this paper, we thoroughly analyze state-of-the-art agglomerative models, identifying critical challenges including resolution mode shifts, teacher imbalance, idiosyncratic teacher artifacts, and an excessive number of output tokens. To address these issues, we propose several novel solutions: multi-resolution training, mosaic augmentation, and improved balancing of teacher loss functions. | Greg Heinrich; Mike Ranzinger; Hongxu Yin; Yao Lu; Jan Kautz; Andrew Tao; Bryan Catanzaro; Pavlo Molchanov |
| 47 | DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models. Highlight: In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. | Jay Zhangjie Wu; Yuxuan Zhang; Haithem Turki; Xuanchi Ren; Jun Gao; Mike Zheng Shou; Sanja Fidler; Zan Gojcic; Huan Ling |
| 48 | Are Images Indistinguishable to Humans Also Indistinguishable to Classifiers? Highlight: However, through distribution classification tasks, we reveal that, from the perspective of neural network-based classifiers, even advanced diffusion models are still far from this goal. Specifically, classifiers are able to consistently and effortlessly distinguish real images from generated ones across various settings. | Zebin You; Xinyu Zhang; Hanzhong Guo; Jingdong Wang; Chongxuan Li |
| 49 | ScaMo: Exploring The Scaling Law in Autoregressive Motion Generation Model. Highlight: In this paper, we introduce a scalable motion generation framework that includes the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. | Shunlin Lu; Jingbo Wang; Zeyu Lu; Ling-Hao Chen; Wenxun Dai; Junting Dong; Zhiyang Dou; Bo Dai; Ruimao Zhang |
| 50 | DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval. Highlight: In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. | Leqi Shen; Guoqiang Gong; Tianxiang Hao; Tao He; Yifeng Zhang; Pengzhang Liu; Sicheng Zhao; Jungong Han; Guiguang Ding |
| 51 | One-Minute Video Generation with Test-Time Training. Abstract: Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle to … | Karan Dalal; Daniel Koceja; Jiarui Xu; Yue Zhao; Shihao Han; Ka Chun Cheung; Jan Kautz; Yejin Choi; Yu Sun; Xiaolong Wang |
| 52 | Simpler Diffusion: 1.5 FID on ImageNet512 with Pixel-space Diffusion. Highlight: We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions. | Emiel Hoogeboom; Thomas Mensink; Jonathan Heek; Kay Lamerigts; Ruiqi Gao; Tim Salimans |
| 53 | Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection. Highlight: However, these methods face several challenges: 1) handcrafted prompts require extensive expert knowledge and trial-and-error; 2) single-form learnable prompts struggle to capture complex anomaly semantics; and 3) an unconstrained prompt space limits generalization to unseen categories. To address these issues, we propose Bayesian Prompt Flow Learning (Bayes-PFL), which models the prompt space as a learnable probability distribution from a Bayesian perspective. | Zhen Qu; Xian Tao; Xinyi Gong; ShiChen Qu; Qiyu Chen; Zhengtao Zhang; Xingang Wang; Guiguang Ding |
| 54 | SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement. Highlight: We present SF3D, a novel method for rapid and high-quality textured object mesh reconstruction from a single image in just 0.5 seconds. | Mark Boss; Zixuan Huang; Aaryaman Vasishta; Varun Jampani |
| 55 | VisionZip: Longer Is Better But Not Necessary in Vision Language Models. Highlight: Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. | Senqiao Yang; Yukang Chen; Zhuotao Tian; Chengyao Wang; Jingyao Li; Bei Yu; Jiaya Jia |
| 56 | HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator. Highlight: Multimodal large language models (MLLMs) promise better comprehension and reasoning but face their own challenges: (1) difficulty in fine-grained defect localization due to the limitations in capturing tiny details, and (2) constraints in providing pixel-wise outputs necessary for precise heatmap generation. To address these challenges, we propose HEIE: a novel MLLM-Based Hierarchical Explainable Image Implausibility Evaluator. | Fan Yang; Ru Zhen; Jianing Wang; Yanhao Zhang; Haoxiang Chen; Haonan Lu; Sicheng Zhao; Guiguang Ding |
| 57 | Reconstruction Vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. Highlight: We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. | Jingfeng Yao; Bin Yang; Xinggang Wang |
| 58 | Diffusion Self-Distillation for Zero-Shot Customized Image Generation. Highlight: However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. | Shengqu Cai; Eric Ryan Chan; Yunzhi Zhang; Leonidas Guibas; Jiajun Wu; Gordon Wetzstein |
| 59 | TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation. Highlight: We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. | Liao Qu; Huichao Zhang; Yiheng Liu; Xu Wang; Yi Jiang; Yiming Gao; Hu Ye; Daniel K. Du; Zehuan Yuan; Xinglong Wu |
| 60 | Model Poisoning Attacks to Federated Learning Via Multi-Round Consistency. Highlight: In this work, we make a key observation that their suboptimal effectiveness arises from only leveraging model-update consistency among malicious clients within individual training rounds, making the attack effect self-cancel across training rounds. | Yueqi Xie; Minghong Fang; Neil Zhenqiang Gong |
| 61 | Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. Highlight: In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization to DUSt3R that achieves efficient and scalable 3D reconstruction by processing many views in parallel. | Jianing Yang; Alexander Sax; Kevin J. Liang; Mikael Henaff; Hao Tang; Ang Cao; Joyce Chai; Franziska Meier; Matt Feiszli |
| 62 | UniReal: Universal Image Generation and Editing Via Learning Real-world Dynamics. Highlight: We introduce UniReal, a unified framework designed to address various image generation and editing tasks. | Xi Chen; Zhifei Zhang; He Zhang; Yuqian Zhou; Soo Ye Kim; Qing Liu; Yijun Li; Jianming Zhang; Nanxuan Zhao; Yilin Wang; Hui Ding; Zhe Lin; Hengshuang Zhao |
| 63 | DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving. Highlight: Unlike the previous work, DriveGPT4-V1, which focused on open-loop tasks, this study explores the capabilities of LLMs in enhancing closed-loop autonomous driving. | Zhenhua Xu; Yan Bai; Yujia Zhang; Zhuoling Li; Fei Xia; Kwan-Yee K. Wong; Jianqiang Wang; Hengshuang Zhao |
| 64 | ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models. Highlight: We propose ATP-LLaVA, a novel approach that adaptively determines instance-specific token pruning ratios for each LLM layer. | Xubing Ye; Yukang Gan; Yixiao Ge; Xiao-Ping Zhang; Yansong Tang |
| 65 | VoCo-LLaMA: Towards Vision Compression with Large Language Models. Highlight: We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. | Xubing Ye; Yukang Gan; Xiaoke Huang; Yixiao Ge; Yansong Tang |
| 66 | Masking Meets Supervision: A Strong Learning Alliance. Highlight: In this paper, we propose a novel way to involve masking augmentations dubbed Masked Sub-branch (MaskSub). | Byeongho Heo; Taekyung Kim; Sangdoo Yun; Dongyoon Han |
| 67 | Task Singular Vectors: Reducing Task Interference in Model Merging. Highlight: In this paper, we study task vectors at the layer level, focusing on task layer matrices and their singular value decomposition. | Antonio Andrea Gargiulo; Donato Crisostomi; Maria Sofia Bucarelli; Simone Scardapane; Fabrizio Silvestri; Emanuele Rodolà |
| 68 | OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations. Highlight: Despite recent progress, current document parsing methods have not been fairly and comprehensively evaluated due to the narrow coverage of document types and the simplified, unrealistic evaluation procedures in existing benchmarks. To address these gaps, we introduce OmniDocBench, a novel benchmark featuring high-quality annotations across nine document sources, including academic papers, textbooks, and more challenging cases such as handwritten notes and densely typeset newspapers. | Linke Ouyang; Yuan Qu; Hongbin Zhou; Jiawei Zhu; Rui Zhang; Qunshu Lin; Bin Wang; Zhiyuan Zhao; Man Jiang; Xiaomeng Zhao; Jin Shi; Fan Wu; Pei Chu; Minghao Liu; Zhenxiang Li; Chao Xu; Bo Zhang; Botian Shi; Zhongying Tu; Conghui He |
| 69 | ATA: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Inpainting. Highlight: Image inpainting aims to fill the missing region of an image. Recently, there has been a surge of interest in foreground-conditioned background inpainting, a sub-task that fills the background of an image while the foreground subject and associated text prompt are provided. Existing background inpainting methods typically strictly preserve the subject’s original position from the source image, resulting in inconsistencies between the subject and the generated background. To address this challenge, we propose a new task, "Text-Guided Subject-Position Variable Background Inpainting", which aims to dynamically adjust the subject position to achieve a harmonious relationship between the subject and the inpainted background, and propose the Adaptive Transformation Agent (ATA) for this task. First, we design a PosAgent Block that adaptively predicts an appropriate displacement based on given features to achieve a variable subject position. | Yizhe Tang; Zhimin Sun; Yuzhen Du; Ran Yi; Guangben Lu; Teng Hu; Luying Li; Lizhuang Ma; Fangyuan Zou |
| 70 | Scaling Vision Pre-Training to 4K Resolution. Highlight: We introduce PS3, which scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. | Baifeng Shi; Boyi Li; Han Cai; Yao Lu; Sifei Liu; Marco Pavone; Jan Kautz; Song Han; Trevor Darrell; Pavlo Molchanov; Hongxu Yin |
| 71 | Estimating Body and Hand Motion in An Ego-sensed World. Highlight: We present EgoAllo, a system for human motion estimation from a head-mounted device. | Brent Yi; Vickie Ye; Maya Zheng; Yunqi Li; Lea Müller; Georgios Pavlakos; Yi Ma; Jitendra Malik; Angjoo Kanazawa |
| 72 | EgoLM: Multi-Modal Language Model of Egocentric Motions. Highlight: We introduce EgoLM, a versatile framework designed for egocentric motion understanding using multi-modal data. | Fangzhou Hong; Vladimir Guzov; Hyo Jin Kim; Yuting Ye; Richard Newcombe; Ziwei Liu; Lingni Ma |
| 73 | Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation. Highlight: To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. | Haotong Lin; Sida Peng; Jingxiao Chen; Songyou Peng; Jiaming Sun; Minghuan Liu; Hujun Bao; Jiashi Feng; Xiaowei Zhou; Bingyi Kang |
| 74 | Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation. Highlight: We introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. | Chengyue Wu; Xiaokang Chen; Zhiyu Wu; Yiyang Ma; Xingchao Liu; Zizheng Pan; Wen Liu; Zhenda Xie; Xingkai Yu; Chong Ruan; Ping Luo |
| 75 | Docopilot: Improving Multimodal Models for Document-Level Understanding. Highlight: In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. | Yuchen Duan; Zhe Chen; Yusong Hu; Weiyun Wang; Shenglong Ye; Botian Shi; Lewei Lu; Qibin Hou; Tong Lu; Hongsheng Li; Jifeng Dai; Wenhai Wang |
| 76 | Stable Flow: Vital Layers for Training-Free Image Editing. Highlight: However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. | Omri Avrahami; Or Patashnik; Ohad Fried; Egor Nemchinov; Kfir Aberman; Dani Lischinski; Daniel Cohen-Or |
| 77 | Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation. Highlight: However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask^2DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. | Tianhao Qi; Jianlong Yuan; Wanquan Feng; Shancheng Fang; Jiawei Liu; SiYu Zhou; Qian He; Hongtao Xie; Yongdong Zhang |
| 78 | HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation. Highlight: We introduce HunyuanPortrait, a diffusion-based condition control method that employs implicit representations for highly controllable and lifelike portrait animation. | Zunnan Xu; Zhentao Yu; Zixiang Zhou; Jun Zhou; Xiaoyu Jin; Fa-ting Hong; Xiaozhong Ji; Junwei Zhu; Chengfei Cai; Shiyu Tang; Qin Lin; Xiu Li; Qinglin Lu |
| 79 | PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models. Highlight: To this end, we extend each image into a "static" video and introduce a unified token compression strategy called Progressive Visual Token Compression (PVC), where the tokens of each frame are progressively encoded and adaptively compressed to supplement the information not extracted from previous frames. | Chenyu Yang; Xuan Dong; Xizhou Zhu; Weijie Su; Jiahao Wang; Hao Tian; Zhe Chen; Wenhai Wang; Lewei Lu; Jifeng Dai |
| 80 | GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control. Highlight: We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. | Xuanchi Ren; Tianchang Shen; Jiahui Huang; Huan Ling; Yifan Lu; Merlin Nimier-David; Thomas Müller; Alexander Keller; Sanja Fidler; Jun Gao |
| 81 | HoVLE: Unleashing The Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding. Highlight: Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. | Chenxin Tao; Shiqian Su; Xizhou Zhu; Chenyu Zhang; Zhe Chen; Jiawen Liu; Wenhai Wang; Lewei Lu; Gao Huang; Yu Qiao; Jifeng Dai |
| 82 | DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos. Highlight: We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. | Wenbo Hu; Xiangjun Gao; Xiaoyu Li; Sijie Zhao; Xiaodong Cun; Yong Zhang; Long Quan; Ying Shan |
| 83 | SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding. Highlight: In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. | Hao Li; Changyao Tian; Jie Shao; Xizhou Zhu; Zhaokai Wang; Jinguo Zhu; Wenhan Dou; Xiaogang Wang; Hongsheng Li; Lewei Lu; Jifeng Dai |
| 84 | GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding. Highlight: In this paper, we introduce GaussTR, a novel Gaussian-based Transformer framework that unifies sparse 3D modeling with foundation model alignment through Gaussian representations to advance 3D spatial understanding. | Haoyi Jiang; Liu Liu; Tianheng Cheng; Xinjie Wang; Tianwei Lin; Zhizhong Su; Wenyu Liu; Xinggang Wang |
| 85 | Mono-InternVL: Pushing The Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training. Highlight: In this paper, we focus on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. | Gen Luo; Xue Yang; Wenhan Dou; Zhaokai Wang; Jiawen Liu; Jifeng Dai; Yu Qiao; Xizhou Zhu |
| 86 | Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces. Highlight: We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images. | Chenyangguang Zhang; Alexandros Delitzas; Fangjinhua Wang; Ruida Zhang; Xiangyang Ji; Marc Pollefeys; Francis Engelmann |
| 87 | StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models. Highlight: This paper aims to tackle the problem of photorealistic view synthesis from vehicle sensor data. | Yunzhi Yan; Zhen Xu; Haotong Lin; Haian Jin; Haoyu Guo; Yida Wang; Kun Zhan; Xianpeng Lang; Hujun Bao; Xiaowei Zhou; Sida Peng |
| 88 | Speedy-Splat: Fast 3D Gaussian Splatting with Sparse Pixels and Sparse Primitives. Highlight: However, its rendering speed and model size still present bottlenecks, especially in resource-constrained settings. In this paper, we identify and address two key inefficiencies in 3D-GS to substantially improve rendering speed. These improvements also yield the ancillary benefits of reduced model size and training time. | Alex Hanson; Allen Tu; Geng Lin; Vasu Singla; Matthias Zwicker; Tom Goldstein |
| 89 | PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting. Highlight: In this paper, we propose a principled sensitivity pruning score that preserves visual fidelity and foreground details at significantly higher compression ratios than existing approaches. | Alex Hanson; Allen Tu; Vasu Singla; Mayuka Jayawardhana; Matthias Zwicker; Tom Goldstein |
| 90 | Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion. Highlight: We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. | Jiuhai Chen; Jianwei Yang; Haiping Wu; Dianqi Li; Jianfeng Gao; Tianyi Zhou; Bin Xiao |
| 91 | Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining. Highlight: Nevertheless, existing methods that depend on paired data struggle to generalize effectively to real-world scenarios, primarily due to the disparity between synthetic and authentic rain effects. To address these limitations, we propose a dual-branch spatio-temporal state-space model to enhance rain streak removal in video sequences. | Shangquan Sun; Wenqi Ren; Juxiang Zhou; Shu Wang; Jianhou Gan; Xiaochun Cao |
| 92 | Decentralized Diffusion Models. Highlight: We propose Decentralized Diffusion Models, a scalable framework to distribute diffusion model training across independent clusters or datacenters by eliminating the dependence on a centralized, high-bandwidth networking fabric. | David McAllister; Matthew Tancik; Jiaming Song; Angjoo Kanazawa |
| 93 | AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities. Highlight: We propose AnySat, a multimodal model based on joint embedding predictive architecture (JEPA) and scale-adaptive spatial encoders, allowing us to train a single model on highly heterogeneous data in a self-supervised manner. | Guillaume Astruc; Nicolas Gonthier; Clément Mallet; Loic Landrieu |
| 94 | Generative Image Layer Decomposition with Visual Effects. Highlight: We propose LayerDecomp, a generative framework for image layer decomposition which outputs photorealistic clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. | Jinrui Yang; Qing Liu; Yijun Li; Soo Ye Kim; Daniil Pakhomov; Mengwei Ren; Jianming Zhang; Zhe Lin; Cihang Xie; Yuyin Zhou |
| 95 | What’s in The Image? A Deep-Dive Into The Vision of Vision Language Models. Highlight: However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we conduct a thorough empirical analysis, focusing on the attention modules across layers, by which we reveal several key insights about how these models process visual data: (i) the internal representation of the query tokens (e.g., representations of "describe the image") is utilized by the model to store global image information; we demonstrate that the model generates surprisingly descriptive responses solely from these tokens, without direct access to image tokens. | Omri Kaduri; Shai Bagon; Tali Dekel |
| 96 | 4D-Fly: Fast 4D Reconstruction from A Single Monocular Video. Highlight: To address the time-consuming issue, we propose 4D-Fly, an efficient and effective framework for reconstructing the 4D scene from a monocular video (hundreds of frames within 6 minutes), more than 20× faster and even achieving higher quality than previous optimization methods. | Diankun Wu; Fangfu Liu; Yi-Hsin Hung; Yue Qian; Xiaohang Zhan; Yueqi Duan |
| 97 | Robust Multi-Object 4D Generation for In-the-wild Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To rigorously evaluate the quality of scene generation and the accuracy of the motion under multi-object occlusions, we introduce MOSE-PTS, a subset of the challenging MOSE benchmark, which we annotated with high-quality 2D point tracks. |
Wen-Hsuan Chu; Lei Ke; Jianmeng Liu; Mingxiao Huo; Pavel Tokmakov; Katerina Fragkiadaki; |
| 98 | Continuous 3D Perception Model with Persistent State Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a unified framework capable of solving a broad range of 3D tasks. |
Qianqian Wang; Yifei Zhang; Aleksander Holynski; Alexei A. Efros; Angjoo Kanazawa; |
| 99 | DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present DI-PCG, a novel and efficient method for Inverse PCG from general image conditions. |
Wang Zhao; Yan-Pei Cao; Jiale Xu; Yuejiang Dong; Ying Shan; |
| 100 | NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose NVComposer, a novel approach that eliminates the need for explicit external alignment. |
Lingen Li; Zhaoyang Zhang; Yaowei Li; Jiale Xu; Wenbo Hu; Xiaoyu Li; Weihao Cheng; Jinwei Gu; Tianfan Xue; Ying Shan; |
| 101 | MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose MIMO, a novel framework which can not only synthesize realistic character videos with controllable attributes (i.e., character, motion and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. |
Yifang Men; Yuan Yao; Miaomiao Cui; Liefeng Bo; |
| 102 | Conical Visual Concentration for Efficient Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address this challenge, we conduct an empirical study revealing that all visual tokens are necessary for LVLMs in the shallow layers, and token redundancy progressively increases in the deeper layers.To this end, we propose ViCo, a conical-style visual concentration strategy for LVLMs to boost their efficiency in both training and inference with neglectable performance loss. |
Long Xing; Qidong Huang; Xiaoyi Dong; Jiajie Lu; Pan Zhang; Yuhang Zang; Yuhang Cao; Conghui He; Jiaqi Wang; Feng Wu; Dahua Lin; |
| 103 | Mask-Adapter: The Devil Is in The Masks for Open-Vocabulary Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Recent open-vocabulary segmentation methods adopt mask generators to predict segmentation masks and leverage pre-trained vision-language models, *e.g.*, CLIP, to classify these masks via mask pooling.Although these approaches show promising results, it is counterintuitive that accurate masks often fail to yield accurate classification results through pooling CLIP image embeddings within the mask regions.In this paper, we reveal the performance limitations of mask pooling and introduce **Mask-Adapter**, a simple yet effective method to address these challenges in open-vocabulary segmentation.Compared to directly using proposal masks, our proposed Mask-Adapter extracts *semantic activation maps* from proposal masks, providing richer contextual information and ensuring alignment between masks and CLIP.Additionally, we propose a *mask consistency loss* that encourages proposal masks with similar IoUs to obtain similar CLIP embeddings to enhance models’ robustness to varying predicted masks.Mask-Adapter integrates seamlessly into open-vocabulary segmentation methods based on mask pooling in a plug-and-play manner, delivering more accurate classification results. |
Yongkang Li; Tianheng Cheng; Bin Feng; Wenyu Liu; Xinggang Wang; |
| 104 | GenFusion: Closing The Loop Between Reconstruction and Generation Via Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We found that the source of this phenomenon lies in the misalignment between 3D constraints and generative priors. To address this problem, we propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. |
Sibo Wu; Congrong Xu; Binbin Huang; Andreas Geiger; Anpei Chen; |
| 105 | InteractVLM: 3D Interaction Reasoning from 2D Foundational Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. |
Sai Kumar Dwivedi; Dimitrije Antić; Shashank Tripathi; Omid Taheri; Cordelia Schmid; Michael J. Black; Dimitrios Tzionas; |
| 106 | MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework (MMAudio). |
Ho Kei Cheng; Masato Ishii; Akio Hayakawa; Takashi Shibuya; Alexander Schwing; Yuki Mitsufuji; |
| 107 | MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes. |
Zhengqi Li; Richard Tucker; Forrester Cole; Qianqian Wang; Linyi Jin; Vickie Ye; Angjoo Kanazawa; Aleksander Holynski; Noah Snavely; |
| 108 | HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. |
Prithviraj Banerjee; Sindi Shkodrani; Pierre Moulon; Shreyas Hampali; Shangchen Han; Fan Zhang; Linguang Zhang; Jade Fountain; Edward Miller; Selen Basol; Richard Newcombe; Robert Wang; Jakob Julian Engel; Tomas Hodan; |
| 109 | Reconstructing People, Places, and Cameras Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present "Humans and Structure from Motion" (HSfM), a method for jointly reconstructing multiple human meshes, scene point clouds, and camera parameters in a metric world coordinate system from a sparse set of uncalibrated multi-view images featuring people. |
Lea Müller; Hongsuk Choi; Anthony Zhang; Brent Yi; Jitendra Malik; Angjoo Kanazawa; |
| 110 | Controllable Human Image Generation with Personalized Multi-Garments Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: We present BootControl, a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments. Here, the main … |
Yisol Choi; Sangkyung Kwak; Sihyun Yu; Hyungwon Choi; Jinwoo Shin; |
| 111 | CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a novel generative 3D modeling system, coined CraftsMan, which can generate high-fidelity 3D geometries with highly varied shapes, regular mesh topologies, and detailed surfaces, and, notably, allows for refining the geometry in an interactive manner. |
Weiyu Li; Jiarui Liu; Hongyu Yan; Rui Chen; Yixun Liang; Xuelin Chen; Ping Tan; Xiaoxiao Long; |
| 112 | Parallel Sequence Modeling Via Generalized Spatial Propagation Network Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. |
Hongjun Wang; Wonmin Byeon; Jiarui Xu; Jinwei Gu; Ka Chun Cheung; Xiaolong Wang; Kai Han; Jan Kautz; Sifei Liu; |
| 113 | All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including True/False, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. |
Ashmal Vayani; Dinura Dissanayake; Hasindri Watawana; Noor Ahsan; Nevasini Sasikumar; Omkar Thawakar; Henok Biadglign Ademtew; Yahya Hmaiti; Amandeep Kumar; Kartik Kukreja; Mykola Maslych; Wafa Al Ghallabi; Mihail Minkov Mihaylov; Chao Qin; Abdelrahman M. Shaker; Mike Zhang; Mahardika Krisna Ihsani; Amiel Gian Esplana; Monil Gokani; Shachar Mirkin; Harsh Singh; Ashay Srivastava; Endre Hamerlik; Fathinah Asma Izzati; Fadillah Adamsyah Maani; Sebastian Cavada; Jenny Chim; Rohit Gupta; Sanjay Manjunath; Kamila Zhumakhanova; Feno Heriniaina Rabevohitra; Azril Hafizi Amirudin; Muhammad Ridzuan; Daniya Najiha Abdul Kareem; Ketan Pravin More; Kunyang Li; Pramesh Shakya; Muhammad Saad; Amirpouya Ghasemaghaei; Amirbek Djanibekov; Dilshod Azizov; Branislava Jankovic; Naman Bhatia; Alvaro Cabrera; Johan Obando-Ceron; Olympiah Otieno; Febian Farestam; Muztoba Rabbani; Sanoojan Ballah; Santosh Sanjeev; Abduragim Shtanchaev; Maheen Fatima; Thao Nguyen; Amrin Kareem; Toluwani Aremu; Nathan Augusto Zacarias Xavier; Amit Bhatkal; Hawau Olamide Toyin; Aman Chadha; Hisham Cholakkal; Rao Muhammad Anwer; Michael Felsberg; Jorma Laaksonen; Thamar Solorio; Monojit Choudhury; Ivan Laptev; Mubarak Shah; Salman Khan; Fahad Shahbaz Khan; |
| 114 | SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions. |
Zixuan Huang; Mark Boss; Aaryaman Vasishta; James M. Rehg; Varun Jampani; |
| 115 | Goku: Flow Based Video Generative Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. |
Shoufa Chen; Chongjian Ge; Yuqi Zhang; Yida Zhang; Fengda Zhu; Hao Yang; Hongxiang Hao; Hui Wu; Zhichao Lai; Yifei Hu; Ting-Che Lin; Shilong Zhang; Fu Li; Chuan Li; Xing Wang; Yanghua Peng; Peize Sun; Ping Luo; Yi Jiang; Zehuan Yuan; Bingyue Peng; Xiaobing Liu; |
| 116 | Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To facilitate research in anomaly detection (AD) and reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for zero-shot anomaly detection (ZSAD) and reasoning. |
Jiacong Xu; Shao-Yuan Lo; Bardia Safaei; Vishal M. Patel; Isht Dwivedi; |
| 117 | VLog: Video-Language Models By Generative Retrieval of Narration Vocabulary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: **A Vocabulary Update Strategy**: leveraging generative models to extend the vocabulary for novel events encountered during inference. To validate our approach, we introduce **VidCab-Eval**, a development set requiring concise narrations with reasoning relationships (e.g., before and after). |
Kevin Qinghong Lin; Mike Zheng Shou; |
| 118 | ShowUI: One Vision-Language-Action Model for GUI Visual Agent Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we develop a vision-language-action model in the digital world, namely ShowUI, which features the following innovations: 1. **UI-Guided Visual Token Selection** to reduce computational costs by formulating screenshots as a UI-connected graph, adaptively identifying their redundant relationships and serving as the criteria for token selection during self-attention blocks. 2. **Interleaved Vision-Language-Action Streaming** that flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency. 3. **Small-Scale High-Quality GUI Instruction-Following Datasets** by careful data curation and employing a resampling strategy to address significant data type imbalances. |
Kevin Qinghong Lin; Linjie Li; Difei Gao; Zhengyuan Yang; Shiwei Wu; Zechen Bai; Stan Weixian Lei; Lijuan Wang; Mike Zheng Shou; |
| 119 | 3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose 3D Gaussian Unscented Transform (3DGUT), replacing the EWA splatting formulation with the Unscented Transform that approximates the particles through sigma points, which can be projected exactly under any nonlinear projection function. |
Qi Wu; Janick Martinez Esturo; Ashkan Mirzaei; Nicolas Moënne-Loccoz; Zan Gojcic; |
| 120 | Realistic Test-Time Adaptation of Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, previous works on transductive or test-time adaptation (TTA) often make strong assumptions about the data distribution, such as the presence of all classes. Our work challenges these favorable deployment scenarios and introduces a more realistic evaluation framework, including (i) a variable number of effective classes for adaptation within a single batch, and (ii) non-i.i.d. batches of test samples in online adaptation settings. |
Maxime Zanella; Clément Fuchs; Christophe De Vleeschouwer; Ismail Ben Ayed; |
| 121 | Re-thinking Temporal Search for Long-Form Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we revisit temporal search paradigms for long-form video understanding, studying a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). |
Jinhui Ye; Zihan Wang; Haosen Sun; Keshigeyan Chandrasegaran; Zane Durante; Cristobal Eyzaguirre; Yonatan Bisk; Juan Carlos Niebles; Ehsan Adeli; Li Fei-Fei; Jiajun Wu; Manling Li; |
| 122 | FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose FreeSim, a camera simulation method for driving scenes via 3D Gaussian Splatting and diffusion-based image generation. |
Lue Fan; Hao Zhang; Qitai Wang; Hongsheng Li; Zhaoxiang Zhang; |
| 123 | TinyFusion: Diffusion Transformers Learned Shallow Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. |
Gongfan Fang; Kunjun Li; Xinyin Ma; Xinchao Wang; |
| 124 | Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose DUal ConsolidaTion (Duct) to unify and consolidate historical knowledge at both the representation and classifier levels. |
Da-Wei Zhou; Zi-Wen Cai; Han-Jia Ye; Lijun Zhang; De-Chuan Zhan; |
| 125 | MambaOut: Do We Really Need Mamba for Vision? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. |
Weihao Yu; Xinchao Wang; |
| 126 | DepthSplat: Connecting Gaussian Splatting and Depth Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present DepthSplat to connect Gaussian splatting and depth estimation and study their interactions. |
Haofei Xu; Songyou Peng; Fangjinhua Wang; Hermann Blum; Daniel Barath; Andreas Geiger; Marc Pollefeys; |
| 127 | UniPhy: Learning A Unified Constitutive Model for Inverse Physics Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose UniPhy, a common latent-conditioned neural constitutive model that can encode the physical properties of diverse materials. |
Himangi Mittal; Peiye Zhuang; Hsin-Ying Lee; Shubham Tulsiani; |
| 128 | SoMA: Singular Value Decomposed Minor Components Adaptation for Domain Generalizable Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To gain insights into the distribution of generalizable components, we begin by analyzing the pre-trained weights through the lens of singular value decomposition. Building on these insights, we introduce Singular Value Decomposed Minor Components Adaptation (SoMA), an approach that selectively tunes minor singular components while keeping the residual parts frozen. |
Seokju Yun; Seunghye Chae; Dongheon Lee; Youngmin Ro; |
| 129 | MARBLE: Material Recomposition and Blending in CLIP-Space Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose MARBLE, a method for performing material blending and recomposing fine-grained material properties by finding material embeddings in CLIP-space and using that to control pre-trained text-to-image models. |
Ta Ying Cheng; Prafull Sharma; Mark Boss; Varun Jampani; |
| 130 | Navigation World Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. |
Amir Bar; Gaoyue Zhou; Danny Tran; Trevor Darrell; Yann LeCun; |
| 131 | PEACE: Empowering Geologic Map Holistic Understanding with MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To quantify this gap, we construct **GeoMap-Bench**, the first-ever benchmark for evaluating MLLMs in geologic map understanding, which assesses the full-scale abilities in extracting, referring, grounding, reasoning, and analyzing. To bridge this gap, we introduce **GeoMap-Agent**, the inaugural agent designed for geologic map understanding, which features three modules: Hierarchical Information Extraction (HIE), Domain Knowledge Injection (DKI), and Prompt-enhanced Question Answering (PEQA). |
Yangyu Huang; Tianyi Gao; Haoran Xu; Qihao Zhao; Yang Song; Zhipeng Gui; Tengchao Lv; Hao Chen; Lei Cui; Scarlett Li; Furu Wei; |
| 132 | Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. |
Yongshuo Zong; Qin Zhang; Dongsheng An; Zhihua Li; Xiang Xu; Linghan Xu; Zhuowen Tu; Yifan Xing; Onkar Dabeer; |
| 133 | Generating 3D-Consistent Videos from Unposed Internet Photos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We address the problem of generating videos from unposed internet photos. |
Gene Chou; Kai Zhang; Sai Bi; Hao Tan; Zexiang Xu; Fujun Luan; Bharath Hariharan; Noah Snavely; |
| 134 | InstanceCap: Improving Text-to-Video Generation Via Instance-aware Structured Caption Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video captioning for the first time. |
Tiehan Fan; Kepan Nan; Rui Xie; Penghao Zhou; Zhenheng Yang; Chaoyou Fu; Xiang Li; Jian Yang; Ying Tai; |
| 135 | Koala-36M: A Large-scale Video Dataset Improving Consistency Between Fine-grained Conditions and Video Content Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. |
Qiuheng Wang; Yukai Shi; Jiarong Ou; Rui Chen; Ke Lin; Jiahao Wang; Boyuan Jiang; Haotian Yang; Mingwu Zheng; Xin Tao; Fei Yang; Pengfei Wan; Di Zhang; |
| 136 | MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To handle more views, reduce errors, and improve inference time, we propose the fast single-stage feed-forward network MV-DUSt3R. |
Zhenggang Tang; Yuchen Fan; Dilin Wang; Hongyu Xu; Rakesh Ranjan; Alexander Schwing; Zhicheng Yan; |
| 137 | Generative Video Propagation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Large-scale video generation models have the inherent ability to realistically model natural scenes. In this paper, we demonstrate that through a careful design of a generative video propagation framework, various video tasks can be addressed in a unified way by leveraging the generative power of such models. |
Shaoteng Liu; Tianyu Wang; Jui-Hsien Wang; Qing Liu; Zhifei Zhang; Joon-Young Lee; Yijun Li; Bei Yu; Zhe Lin; Soo Ye Kim; Jiaya Jia; |
| 138 | Language-Guided Image Tokenization for Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). |
Kaiwen Zha; Lijun Yu; Alireza Fathi; David A. Ross; Cordelia Schmid; Dina Katabi; Xiuye Gu; |
| 139 | GG-SSMs: Graph-Generating State Space Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While methods like Mamba and VMamba introduce selective and flexible scanning strategies, they rely on predetermined paths, which fail to efficiently capture complex dependencies. We introduce Graph-Generating State Space Models (GG-SSMs), a novel framework that overcomes these limitations by dynamically constructing graphs based on feature relationships. |
Nikola Zubic; Davide Scaramuzza; |
| 140 | Doppelgangers++: Improved Visual Disambiguation with Geometric 3D Features Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present Doppelgangers++, a method to enhance doppelganger detection and improve 3D reconstruction accuracy. |
Yuanbo Xiangli; Ruojin Cai; Hanyu Chen; Jeffrey Byrne; Noah Snavely; |
| 141 | EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To this end, we propose a half-body human animation method, dubbed EchoMimicV2, that leverages a novel Audio-Pose Dynamic Harmonization strategy, including Pose Sampling and Audio Diffusion, to enhance half-body details and facial and gestural expressiveness while reducing condition redundancy. |
Rang Meng; Xingyu Zhang; Yuming Li; Chenguang Ma; |
| 142 | Video Depth Anything: Consistent Depth Estimation for Super-Long Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. |
Sili Chen; Hengkai Guo; Shengnan Zhu; Feihu Zhang; Zilong Huang; Jiashi Feng; Bingyi Kang; |
| 143 | Towards Universal Soccer Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions in this paper: (i) we introduce **SoccerReplay-1988**, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, with an automated annotation pipeline; (ii) we present the first visual-language foundation model in the soccer domain, **MatchVision**, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on action classification, commentary generation, and multi-view foul recognition, and demonstrate state-of-the-art performance on all of them, substantially outperforming existing models and demonstrating the superiority of our proposed data and model. |
Jiayuan Rao; Haoning Wu; Hao Jiang; Ya Zhang; Yanfeng Wang; Weidi Xie; |
| 144 | CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose the first manual-based appliance manipulation benchmark CheckManual. |
Yuxing Long; Jiyao Zhang; Mingjie Pan; Tianshu Wu; Taewhan Kim; Hao Dong; |
| 145 | Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, our analysis of the combination paradigm reveals that the current one-shot and independent selection mechanism induces an incompatibility issue between distilled and real images. To address this issue, we introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation. |
Yanda Chen; Gongwei Chen; Miao Zhang; Weili Guan; Liqiang Nie; |
| 146 | LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, these task-agnostic object features include much redundant information and miss details in the task-relevant areas. To tackle these problems, we propose LSceneLLM, an adaptive framework that automatically identifies task-relevant areas by leveraging LLM’s visual preference for different tasks, followed by a plug-and-play scene magnifier module to capture fine-grained details in focused areas. |
Hongyan Zhi; Peihao Chen; Junyan Li; Shuailei Ma; Xinyu Sun; Tianhang Xiang; Yinjie Lei; Mingkui Tan; Chuang Gan; |
| 147 | RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To enhance the robotic brain’s core capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. |
Yuheng Ji; Huajie Tan; Jiayu Shi; Xiaoshuai Hao; Yuan Zhang; Hengyuan Zhang; Pengwei Wang; Mengdi Zhao; Yao Mu; Pengju An; Xinda Xue; Qinghang Su; Huaihai Lyu; Xiaolong Zheng; Jiaming Liu; Zhongyuan Wang; Shanghang Zhang; |
| 148 | Make It Count: Text-to-Image Generation with An Accurate Number of Objects Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We fix the latter by training a model that predicts both the shape and location of a missing object, based on the layout of existing ones, and show how it can be used to guide denoising with correct object count. |
Lital Binyamin; Yoad Tewel; Hilit Segev; Eran Hirsch; Royi Rassin; Gal Chechik; |
| 149 | Magma: A Foundation Model for Multimodal AI Agents Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. |
Jianwei Yang; Reuben Tan; Qianhui Wu; Ruijie Zheng; Baolin Peng; Yongyuan Liang; Yu Gu; Mu Cai; Seonghyeon Ye; Joel Jang; Yuquan Deng; Jianfeng Gao; |
| 150 | GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Although 3D semantic Gaussian serves as an object-centric sparse alternative, most of the Gaussians still describe the empty region with low efficiency. To address this, we propose a probabilistic Gaussian superposition model which interprets each Gaussian as a probability distribution of its neighborhood being occupied and conforms to probabilistic multiplication to derive the overall geometry. |
Yuanhui Huang; Amonnut Thammatadatrakoon; Wenzhao Zheng; Yunpeng Zhang; Dalong Du; Jiwen Lu; |
| 151 | Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Buffer Anytime, a framework for estimation of depth and normal maps (which we call geometric buffers) from video that eliminates the need for paired video–depth and video–normal training data. |
Zhengfei Kuang; Tianyuan Zhang; Kai Zhang; Hao Tan; Sai Bi; Yiwei Hu; Zexiang Xu; Milos Hasan; Gordon Wetzstein; Fujun Luan; |
| 152 | StoryGPT-V: Large Language Models As Consistent Story Visualizers Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Yet, the emerging Large Language Model (LLM) showcases robust reasoning abilities to navigate through ambiguous references and process extensive sequences. Therefore, we introduce StoryGPT-V, which leverages the merits of latent diffusion models (LDM) and LLMs to produce images with consistent, high-quality characters grounded in the given story descriptions. |
Xiaoqian Shen; Mohamed Elhoseiny; |
| 153 | RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Traditional feedback learning for hallucination reduction relies on labor-intensive manual labeling or expensive proprietary models. This leaves the community without foundational knowledge about how to build high-quality feedback with open-source MLLMs. In this work, we introduce RLAIF-V, a novel framework that aligns MLLMs in a fully open-source paradigm. |
Tianyu Yu; Haoye Zhang; Qiming Li; Qixin Xu; Yuan Yao; Da Chen; Xiaoman Lu; Ganqu Cui; Yunkai Dang; Taiwen He; Xiaocheng Feng; Jun Song; Bo Zheng; Zhiyuan Liu; Tat-Seng Chua; Maosong Sun; |
| 154 | Continuous Locomotive Crowd Behavior Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce a novel method for automatically generating continuous, realistic crowd trajectories with heterogeneous behaviors and interactions among individuals. |
Inhwan Bae; Junoh Lee; Hae-Gon Jeon; |
| 155 | VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning Via Core Frame Selection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. |
Songhao Han; Wei Huang; Hairong Shi; Le Zhuo; Xiu Su; Shifeng Zhang; Xu Zhou; Xiaojuan Qi; Yue Liao; Si Liu; |
| 156 | ReCapture: Generative Video Camera Controls for User-Provided Videos Using Masked Video Fine-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. |
David Junhao Zhang; Roni Paiss; Shiran Zada; Nikhil Karnad; David E. Jacobs; Yael Pritch; Inbar Mosseri; Mike Zheng Shou; Neal Wadhwa; Nataniel Ruiz; |
| 157 | ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding Using Captions with Grounded Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although both research directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. |
Ali Athar; Xueqing Deng; Liang-Chieh Chen; |
| 158 | Pippo: High-Resolution Multi-View Humans from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Pippo, a generative model capable of producing 1K resolution dense turnaround videos of a person from a single casually clicked photo. |
Yash Kant; Ethan Weber; Jin Kyu Kim; Rawal Khirodkar; Su Zhaoen; Julieta Martinez; Igor Gilitschenski; Shunsuke Saito; Timur Bagautdinov; |
| 159 | OFER: Occluded Face Expression Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper we introduce OFER, a novel approach for single-image 3D face reconstruction that can generate plausible, diverse, and expressive 3D faces, even under strong occlusions. |
Pratheba Selvaraju; Victoria Fernandez Abrevaya; Timo Bolkart; Rick Akkerman; Tianyu Ding; Faezeh Amjadi; Ilya Zharkov; |
| 160 | FastVLM: Efficient Vision Encoding for Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we introduce FastVLM, which achieves an optimized trade-off between resolution, latency, and accuracy by incorporating FastViTHD, a new hybrid vision encoder that outputs fewer tokens and significantly reduces encoding time while processing high-resolution images. |
Pavan Kumar Anasosalu Vasu; Fartash Faghri; Chun-Liang Li; Cem Koc; Nate True; Albert Antony; Gokula Santhanam; James Gabriel; Peter Grasch; Oncel Tuzel; Hadi Pouransari; |
| 161 | Multiple Object Tracking As ID Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Therefore, we introduce a new perspective that treats Multiple Object Tracking as an in-context ID Prediction task, transforming the aforementioned object association into an end-to-end trainable task. Based on this, we propose a simple yet effective method termed MOTIP. |
Ruopeng Gao; Ji Qi; Limin Wang; |
| 162 | PhysGen3D: Crafting A Miniature Interactive World from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Envisioning physically plausible outcomes from a single image requires a deep understanding of the world’s dynamics. To address this, we introduce MiniTwin, a novel framework that transforms a single image into an amodal, camera-centric, interactive 3D scene. |
Boyuan Chen; Hanxiao Jiang; Shaowei Liu; Saurabh Gupta; Yunzhu Li; Hao Zhao; Shenlong Wang; |
| 163 | VladVA: Discriminative Fine-tuning of LVLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. |
Yassine Ouali; Adrian Bulat; Alexandros Xenos; Anestis Zaganidis; Ioannis Maniadis Metaxas; Brais Martinez; Georgios Tzimiropoulos; |
| 164 | Learning Temporally Consistent Video Depth from Video Diffusion Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context. |
Jiahao Shao; Yuanbo Yang; Hongyu Zhou; Youmin Zhang; Yujun Shen; Vitor Guizilini; Yue Wang; Matteo Poggi; Yiyi Liao; |
| 165 | LOGICZSL: Exploring Logic-induced Representation for Compositional Zero-shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose LOGICZSL, a novel logic-induced learning framework to explicitly model the semantic relationships. |
Peng Wu; Xiankai Lu; Hao Hu; Yongqin Xian; Jianbing Shen; Wenguan Wang; |
| 166 | Video Motion Transfer with Diffusion Transformers Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). |
Alexander Pondaven; Aliaksandr Siarohin; Sergey Tulyakov; Philip Torr; Fabio Pizzati; |
| 167 | XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing benchmarks usually adopt notably smaller image sizes than real-world RS scenarios, suffer from limited annotation quality, and consider insufficient dimensions of evaluation. To address these issues, we present XLRS-Bench: a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios. |
Fengxiang Wang; Hongzhen Wang; Zonghao Guo; Di Wang; Yulin Wang; Mingshuo Chen; Qiang Ma; Long Lan; Wenjing Yang; Jing Zhang; Zhiyuan Liu; Maosong Sun; |
| 168 | S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose S4-Driver, a scalable self-supervised motion planning algorithm with spatio-temporal visual representation, based on the popular PaLI multimodal large language model. |
Yichen Xie; Runsheng Xu; Tong He; Jyh-Jing Hwang; Katie Luo; Jingwei Ji; Hubert Lin; Letian Chen; Yiren Lu; Zhaoqi Leng; Dragomir Anguelov; Mingxing Tan; |
| 169 | DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To overcome the limitations above, we propose DAGSM, a novel pipeline that generates disentangled human bodies and garments from the given text prompts. |
Jingyu Zhuang; Di Kang; Linchao Bao; Liang Lin; Guanbin Li; |
| 170 | Scaling Properties of Diffusion Models For Perceptual Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. |
Rahul Ravishankar; Zeeshan Patel; Jathushan Rajasegaran; Jitendra Malik; |
| 171 | Perception Tokens Enhance Visual Reasoning in Multimodal Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose AURORA, a training method that augments MLMs with perception tokens for improved reasoning over visual inputs. |
Mahtab Bigverdi; Zelun Luo; Cheng-Yu Hsieh; Ethan Shen; Dongping Chen; Linda G. Shapiro; Ranjay Krishna; |
| 172 | UNIALIGN: Scaling Multimodal Alignment Within One Unified Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present UNIALIGN, a unified model to align an arbitrary number of modalities (e.g., image, text, audio, 3D point cloud, etc.) through one encoder and a single training phase. |
Bo Zhou; Liulei Li; Yujia Wang; Huafeng Liu; Yazhou Yao; Wenguan Wang; |
| 173 | RandAR: Decoder-only Autoregressive Visual Generation in Random Orders Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. |
Ziqi Pang; Tianyuan Zhang; Fujun Luan; Yunze Man; Hao Tan; Kai Zhang; William T. Freeman; Yu-Xiong Wang; |
| 174 | BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. |
Taha Koleilat; Hojat Asgariandehkordi; Hassan Rivaz; Yiming Xiao; |
| 175 | Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. |
Yunze Man; De-An Huang; Guilin Liu; Shiwei Sheng; Shilong Liu; Liang-Yan Gui; Jan Kautz; Yu-Xiong Wang; Zhiding Yu; |
| 176 | AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform poorly on rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e., AeroGen) tailored for RSIOD. |
Datao Tang; Xiangyong Cao; Xuan Wu; Jialin Li; Jing Yao; Xueru Bai; Dongsheng Jiang; Yin Li; Deyu Meng; |
| 177 | Joint Vision-Language Social Bias Removal for CLIP Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We then reveal that this performance degradation stems from the unbalanced debiasing in image and text embeddings. To address this issue, we propose a novel V-L debiasing framework to align image and text biases followed by removing them from both modalities. |
Haoyu Zhang; Yangyang Guo; Mohan Kankanhalli; |
| 178 | ArtiFade: Learning to Generate High-quality Subject from Blemished Images Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This is primarily attributed to the inadequate capability of current techniques in distinguishing subject-related features from disruptive artifacts. In this paper, we introduce ArtiFade to tackle this issue and successfully generate high-quality artifact-free images from blemished datasets. |
Shuya Yang; Shaozhe Hao; Yukang Cao; Kwan-Yee K. Wong; |
| 179 | DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these models often lack effective control over character appearances and interactions, particularly in multi-character scenes. To address these limitations, we propose a new task, customized manga generation, and introduce DiffSensei, an innovative framework specifically designed for generating manga with dynamic multi-character control. |
Jianzong Wu; Chao Tang; Jingbo Wang; Yanhong Zeng; Xiangtai Li; Yunhai Tong; |
| 180 | VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we systematically study music generation conditioned solely on the video. |
Zeyue Tian; Zhaoyang Liu; Ruibin Yuan; Jiahao Pan; Qifeng Liu; Xu Tan; Qifeng Chen; Wei Xue; Yike Guo; |
| 181 | PromptHMR: Promptable Human Mesh Recovery Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. |
Yufu Wang; Yu Sun; Priyanka Patel; Kostas Daniilidis; Michael J. Black; Muhammed Kocabas; |
| 182 | Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. |
Zaijing Li; Yuquan Xie; Rui Shao; Gongwei Chen; Dongmei Jiang; Liqiang Nie; |
| 183 | Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Palmprint recognition is significantly limited by the lack of large-scale publicly available datasets. Previous methods have adopted Bezier curves to simulate the palm creases, which then serve as input for conditional GANs to generate realistic palmprints. However, without employing real data fine-tuning, the performance of the recognition model trained on these synthetic datasets would drastically decline, indicating a large gap between generated and real palmprints. This is primarily due to the utilization of an inaccurate palm crease representation and challenges in balancing intra-class variation with identity consistency. To address this, we introduce a polynomial-based palm crease representation that provides a new palm crease generation mechanism more closely aligned with the real distribution. |
Jianlong Jin; Chenglong Zhao; Ruixin Zhang; Sheng Shang; Jianqing Xu; Jingyun Zhang; ShaoMing Wang; Yang Zhao; Shouhong Ding; Wei Jia; Yunsheng Wu; |
| 184 | Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose ViewpointRosetta, an approach that unlocks large-scale unpaired ego and exo video data to learn clip-level viewpoint-invariant video representations. |
Mi Luo; Zihui Xue; Alex Dimakis; Kristen Grauman; |
| 185 | OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. |
Shufan Li; Konstantinos Kallidromitis; Akash Gokul; Zichun Liao; Yusuke Kato; Kazuki Kozuka; Aditya Grover; |
| 186 | Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline to enhance the reasoning capabilities of multi-modal large language models (MLLMs). |
Yuhao Dong; Zuyan Liu; Hai-Long Sun; Jingkang Yang; Winston Hu; Yongming Rao; Ziwei Liu; |
| 187 | GS-DiT: Advancing Video Generation with Dynamic 3D Gaussian Fields Through Efficient Dense 3D Point Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, we propose a novel framework that constructs dynamic 3D Gaussian fields with dense 3D point tracking and renders the Gaussian field for all video frames. |
Weikang Bian; Zhaoyang Huang; Xiaoyu Shi; Yijin Li; Fu-Yun Wang; Hongsheng Li; |
| 188 | AIpparel: A Multimodal Foundation Model for Digital Garments Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Yet, the creation of garments remains a time-consuming process, largely due to the manual work involved in designing them. To simplify this process, we introduce AIpparel, a multimodal foundation model for generating and editing sewing patterns. |
Kiyohiro Nakayama; Jan Ackermann; Timur Levent Kesdogan; Yang Zheng; Maria Korosteleva; Olga Sorkine-Hornung; Leonidas J. Guibas; Guandao Yang; Gordon Wetzstein; |
| 189 | FoundationStereo: Zero-Shot Stereo Matching Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. |
Bowen Wen; Matthew Trepte; Joseph Aribido; Jan Kautz; Orazio Gallo; Stan Birchfield; |
| 190 | Video Depth Without Video Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. |
Bingxin Ke; Dominik Narnhofer; Shengyu Huang; Lei Ke; Torben Peters; Katerina Fragkiadaki; Anton Obukhov; Konrad Schindler; |
| 191 | MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. |
Zehuan Huang; Yuan-Chen Guo; Xingqiao An; Yunhan Yang; Yangguang Li; Zi-Xin Zou; Ding Liang; Xihui Liu; Yan-Pei Cao; Lu Sheng; |
| 192 | 3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose 3D-Mem, a novel 3D scene memory framework for embodied agents. |
Yuncong Yang; Han Yang; Jiachen Zhou; Peihao Chen; Hongxin Zhang; Yilun Du; Chuang Gan; |
| 193 | BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose BadToken, the first token-level backdoor attack to MLLMs. |
Zenghui Yuan; Jiawen Shi; Pan Zhou; Neil Zhenqiang Gong; Lichao Sun; |
| 194 | WF-VAE: Enhancing Video VAE By Wavelet-Driven Energy Flow for Latent Video Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Since the wavelet transform can decompose videos into multiple frequency-domain components and significantly improve efficiency, we propose Wavelet Flow VAE (WF-VAE), an autoencoder that leverages multi-level wavelet transforms to facilitate low-frequency energy flow into the latent representation. |
Zongjian Li; Bin Lin; Yang Ye; Liuhan Chen; Xinhua Cheng; Shenghai Yuan; Li Yuan; |
| 195 | FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate how to best work with MLLMs in an object placement task. |
Ian Huang; Yanan Bao; Karen Truong; Howard Zhou; Cordelia Schmid; Leonidas Guibas; Alireza Fathi; |
| 196 | Sonata: Self-Supervised Learning of Reliable Point Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we question whether we have a reliable self-supervised point cloud model that can be used for diverse 3D tasks via simple linear probing, even with limited data and minimal computation. |
Xiaoyang Wu; Daniel DeTone; Duncan Frost; Tianwei Shen; Chris Xie; Nan Yang; Jakob Engel; Richard Newcombe; Hengshuang Zhao; Julian Straub; |
| 197 | VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We design a discretized voting space that accommodates all possible translations and then identify the translation shared by nearby points via differentiable voting. |
Yancong Lin; Shiming Wang; Liangliang Nan; Julian Kooij; Holger Caesar; |
| 198 | Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the vision-language model (VLM) for both open-set reactive and proactive failure detection. |
Enshen Zhou; Qi Su; Cheng Chi; Zhizheng Zhang; Zhongyuan Wang; Tiejun Huang; Lu Sheng; He Wang; |
| 199 | AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). |
Khiem Vuong; Anurag Ghosh; Deva Ramanan; Srinivasa Narasimhan; Shubham Tulsiani; |
| 200 | VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. |
Shehan Munasinghe; Hanan Gani; Wenqi Zhu; Jiale Cao; Eric Xing; Fahad Shahbaz Khan; Salman Khan; |
| 201 | SEAL: Semantic Attention Learning for Long Video Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces **SE**mantic **A**ttention **L**earning (SEAL), a novel unified representation for long videos. |
Lan Wang; Yujia Chen; Du Tran; Vishnu Naresh Boddeti; Wen-Sheng Chu; |
| 202 | SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce an effective method, SemGeoMo, for dynamic contextual human motion generation, which fully leverages text-affordance-joint multi-level semantic and geometric guidance in the generation process, improving the semantic rationality and geometric correctness of the generated motions. |
Peishan Cong; Ziyi Wang; Yuexin Ma; Xiangyu Yue; |
| 203 | Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Using localization heads, we introduce a straightforward and effective training-free visual grounding framework that utilizes their text-to-image attention maps to identify the target objects. |
Seil Kang; Jinyeong Kim; Junhyeok Kim; Seong Jae Hwang; |
| 204 | IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper aims to safeguard portrait photos from unauthorized encoder-based customization. |
Yiren Song; Pei Yang; Hai Ci; Mike Zheng Shou; |
| 205 | DoF-Gaussian: Controllable Depth-of-Field for 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce DoF-Gaussian, a controllable depth-of-field method for 3D-GS. |
Liao Shen; Tianqi Liu; Huiqiang Sun; Jiaqi Li; Zhiguo Cao; Wei Li; Chen Change Loy; |
| 206 | GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce GROVE, a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations. |
Jieming Cui; Tengyu Liu; Ziyu Meng; Jiale Yu; Ran Song; Wei Zhang; Yixin Zhu; Siyuan Huang; |
| 207 | Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we move a step forward and design an approach that allows for multimodal queries – composed of both an image and a text – and can search within collections of multimodal documents, where images and text are interleaved. |
Davide Caffagni; Sara Sarto; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara; |
| 208 | RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. |
Yao Mu; Tianxing Chen; Zanxin Chen; Shijia Peng; Zhiqian Lan; Zeyu Gao; Zhixuan Liang; Qiaojun Yu; Yude Zou; Mingkun Xu; Lunkai Lin; Zhiqiang Xie; Mingyu Ding; Ping Luo; |
| 209 | PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present PatchDEMUX, a certifiably robust framework for multi-label classifiers against adversarial patches. |
Dennis Jacob; Chong Xiang; Prateek Mittal; |
| 210 | Panorama Generation From NFoV Image Done Right Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To address the phenomenon, we propose PanoDecouple, a decoupled diffusion model framework, which decouples the panorama generation into distortion guidance and content completion, aiming to generate panoramas with both accurate distortion and visual appeal. |
Dian Zheng; Cheng Zhang; Xiao-Ming Wu; Cao Li; Chengfei Lv; Jian-Fang Hu; Wei-Shi Zheng; |
| 211 | A Distractor-Aware Memory for Visual Object Tracking with SAM2 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We argue that a more sophisticated memory model is required, and propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly addresses the segmentation accuracy as well as tracking robustness. |
Jovana Videnovic; Alan Lukezic; Matej Kristan; |
| 212 | Rethinking Reconstruction and Denoising in The Dark: New Perspective, General Architecture and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work introduces a novel approach by rethinking denoising and reconstruction from a "backbone-head" perspective, leveraging the stronger shared parameter space offered by the backbone, compared to the encoder used in existing works. |
Tengyu Ma; Long Ma; Ziye Li; Yuetong Wang; Jinyuan Liu; Chengpei Xu; Risheng Liu; |
| 213 | DRAWER: Digital Reconstruction and Articulation With Environment Realism Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present DRAWER, a novel framework that converts a video of a static indoor scene into a photorealistic and interactive digital environment. |
Hongchi Xia; Entong Su; Marius Memmel; Arhan Jain; Raymond Yu; Numfor Mbiziwo-Tiapo; Ali Farhadi; Abhishek Gupta; Shenlong Wang; Wei-Chiu Ma; |
| 214 | Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? |
Yiming Dou; Wonseok Oh; Yuqing Luo; Antonio Loquercio; Andrew Owens; |
| 215 | Scaling Mesh Generation Via Compressive Tokenization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose a compressive yet effective mesh tokenization, Blocked and Patchified Tokenization (BPT), facilitating the generation of meshes exceeding 8k faces. |
Haohan Weng; Zibo Zhao; Biwen Lei; Xianghui Yang; Jian Liu; Zeqiang Lai; Zhuo Chen; Yuhong Liu; Jie Jiang; Chunchao Guo; Tong Zhang; Shenghua Gao; C.L. Philip Chen; |
| 216 | VideoDPO: Omni-Preference Alignment for Video Diffusion Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore and discover that re-weighting these pairs based on the score significantly impacts overall preference alignment. |
Runtao Liu; Haoyu Wu; Ziqiang Zheng; Chen Wei; Yingqing He; Renjie Pi; Qifeng Chen; |
| 217 | Classifier-Free Guidance Inside The Attraction Basin May Cause Memorization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present a novel perspective on the memorization phenomenon and propose a simple yet effective approach to mitigate it. |
Anubhav Jain; Yuya Kobayashi; Takashi Shibuya; Yuhta Takida; Nasir Memon; Julian Togelius; Yuki Mitsufuji; |
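The phenomenon studied here sits on top of the standard classifier-free guidance (CFG) update, which extrapolates the conditional noise prediction away from the unconditional one by a guidance scale w. A minimal sketch of that standard update, as background only; the paper's attraction-basin analysis and mitigation are not reproduced here:

```python
import torch

def classifier_free_guidance(eps_uncond: torch.Tensor,
                             eps_cond: torch.Tensor,
                             guidance_scale: float) -> torch.Tensor:
    """Standard CFG: push the prediction along the conditional direction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with two noise predictions from the same denoiser.
eps_u = torch.randn(1, 4, 64, 64)  # prediction given the empty prompt
eps_c = torch.randn(1, 4, 64, 64)  # prediction given the text prompt
eps = classifier_free_guidance(eps_u, eps_c, guidance_scale=7.5)
```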
| 218 | Cubify Anything: Scaling Indoor 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: As a result, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. |
Justin Lazarow; David Griffiths; Gefen Kohavi; Francisco Crespo; Afshin Dehghan; |
| 219 | From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). |
Andrew Szot; Bogdan Mazoure; Omar Attia; Aleksei Timofeev; Harsh Agrawal; Devon Hjelm; Zhe Gan; Zsolt Kira; Alexander Toshev; |
| 220 | DarkIR: Robust Low-Light Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present an efficient and robust neural network for multi-task low-light image restoration. |
Daniel Feijoo; Juan C. Benito; Alvaro Garcia; Marcos V. Conde; |
| 221 | SceneDiffuser++: City-Scale Traffic Simulation Via A Generative World Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose SceneDiffuser++, the first end-to-end generative world model trained on a single loss function capable of point A-to-B simulation on a city scale integrating all the requirements above. |
Shuhan Tan; John Lambert; Hong Jeon; Sakshum Kulshrestha; Yijing Bai; Jing Luo; Dragomir Anguelov; Mingxing Tan; Chiyu Max Jiang; |
| 222 | Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper attempts to tackle the three challenges jointly. First, inspired by the notable generality of using image-level blending data for image forgery detection, we investigate whether and how video-level blending can be effective in video. We then perform a thorough analysis and identify a previously underexplored temporal forgery artifact: Facial Feature Drift (FFD), which commonly exists across different forgeries. |
Zhiyuan Yan; Yandan Zhao; Shen Chen; Mingyi Guo; Xinghe Fu; Taiping Yao; Shouhong Ding; Yunsheng Wu; Li Yuan; |
| 223 | HVI: A New Color Space for Low-light Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While converting the images using Hue, Saturation and Value (HSV) color space helps resolve the brightness issue, it introduces significant red and black noise artifacts. To address this issue, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by polarized HS maps and learnable intensity. |
Qingsen Yan; Yixu Feng; Cheng Zhang; Guansong Pang; Kangbiao Shi; Peng Wu; Wei Dong; Jinqiu Sun; Yanning Zhang; |
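The HSV failure mode cited above is easy to reproduce: near black, a one-count change in an 8-bit channel can swing the hue across the color wheel, which becomes amplified noise after enhancement. A toy demonstration with Python's standard colorsys module; the HVI space itself is the paper's contribution and is not reproduced here:

```python
import colorsys

# Two dark pixels that differ by a single 8-bit count.
p1 = (2 / 255, 1 / 255, 1 / 255)  # faintly red
p2 = (1 / 255, 2 / 255, 1 / 255)  # faintly green

h1, _, _ = colorsys.rgb_to_hsv(*p1)
h2, _, _ = colorsys.rgb_to_hsv(*p2)

# An imperceptible intensity change flips hue by a third of the wheel.
print(h1 * 360, h2 * 360)  # 0.0 vs 120.0 degrees
```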
| 224 | Floating No More: Object-Ground Reconstruction from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This limitation significantly affects 3D-aware image editing applications like shadow rendering and object pose manipulation. To address this issue, we introduce ORG (Object Reconstruction with Ground), a novel task aimed at reconstructing 3D object geometry in conjunction with the ground surface. |
Yunze Man; Yichen Sheng; Jianming Zhang; Liang-Yan Gui; Yu-Xiong Wang; |
| 225 | MeshArt: Generating Articulated Meshes with Structure-Guided Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce MeshArt, a hierarchical transformer-based approach to generate articulated 3D meshes with clean, compact geometry, reminiscent of human-crafted 3D models. |
Daoyi Gao; Yawar Siddiqui; Lei Li; Angela Dai; |
| 226 | Omnia De EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only a few frames or commonsense reasoning, without necessarily being grounded in the actual video. |
Chiara Plizzari; Alessio Tonioni; Yongqin Xian; Achin Kulshrestha; Federico Tombari; |
| 227 | SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. |
Jierun Chen; Dongting Hu; Xijie Huang; Huseyin Coskun; Arpit Sahni; Aarush Gupta; Anujraaj Goyal; Dishani Lahiri; Rajesh Singh; Yerlan Idelbayev; Junli Cao; Yanyu Li; Kwang-Ting Cheng; S.-H. Gary Chan; Mingming Gong; Sergey Tulyakov; Anil Kag; Yanwu Xu; Jian Ren; |
| 228 | Don’t Shake The Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: End-to-end autonomous driving frameworks enable seamless integration of perception and planning but often rely on one-shot trajectory prediction, which may lead to unstable control and vulnerability to occlusions in single-frame perception. To address this, we propose the Momentum-Aware Driving (MomAD) framework, which introduces trajectory momentum and perception momentum to stabilize and refine trajectory predictions. |
Ziying Song; Caiyan Jia; Lin Liu; Hongyu Pan; Yongchang Zhang; Junming Wang; Xingyu Zhang; Shaoqing Xu; Lei Yang; Yadan Luo; |
| 229 | Horizon-GS: Unified 3D Gaussian Splatting for Large-Scale Aerial-to-Ground Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Horizon-GS, a novel approach built upon Gaussian Splatting techniques that tackles the unified reconstruction and rendering of aerial and street views. |
Lihan Jiang; Kerui Ren; Mulin Yu; Linning Xu; Junting Dong; Tao Lu; Feng Zhao; Dahua Lin; Bo Dai; |
| 230 | Lift3D Policy: Lifting 2D Foundation Models for Robust 3D Robotic Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent research has increasingly focused on the explicit extraction of 3D features, while still facing challenges such as the lack of large-scale robotic 3D data and the potential loss of spatial geometry. To address these limitations, we propose the Lift3D framework, which progressively enhances 2D foundation models with implicit and explicit 3D robotic representations to construct a robust 3D manipulation policy. |
Yueru Jia; Jiaming Liu; Sixiang Chen; Chenyang Gu; Zhilve Wang; Longzan Luo; Xiaoqi Li; Pengwei Wang; Zhongyuan Wang; Renrui Zhang; Shanghang Zhang; |
| 231 | DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, we utilize the world model as a data machine to synthesize novel trajectory videos, where structured conditions are explicitly leveraged to control the spatial-temporal consistency of traffic elements. |
Guosheng Zhao; Chaojun Ni; Xiaofeng Wang; Zheng Zhu; Xueyang Zhang; Yida Wang; Guan Huang; Xinze Chen; Boyuan Wang; Youyi Zhang; Wenjun Mei; Xingang Wang; |
| 232 | Enhancing Video-LLM Reasoning Via Agent-of-Thoughts Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Agent-of-Thoughts Distillation (AoTD), a method that enhances models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. |
Yudi Shi; Shangzhe Di; Qirui Chen; Weidi Xie; |
| 233 | Stretching Each Dollar: Diffusion Training from Scratch on A Micro-Budget Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: As scaling laws in generative AI push performance, they simultaneously concentrate the development of these models among actors with large computational resources. With a focus on text-to-image (T2I) generative models, we aim to unlock this bottleneck by demonstrating very low-cost training of large-scale T2I diffusion transformer models. |
Vikash Sehwag; Xianghao Kong; Jingtao Li; Michael Spranger; Lingjuan Lyu; |
| 234 | A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we investigate the effectiveness of using modern decoder-only LLMs as text encoders for text-to-image diffusion models. |
Andrew Z. Wang; Songwei Ge; Tero Karras; Ming-Yu Liu; Yogesh Balaji; |
| 235 | LamRA: Large Multimodal Model As Your Advanced Retrieval Assistant Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. |
Yikun Liu; Yajie Zhang; Jiayin Cai; Xiaolong Jiang; Yao Hu; Jiangchao Yao; Yanfeng Wang; Weidi Xie; |
| 236 | From Zero to Detail: Deconstructing Ultra-High-Definition Image Restoration from Progressive Spectral Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To cope with these challenges, we analyze the restoration process in depth through a progressive spectral perspective, and deconstruct the complex UHD restoration problem into three progressive stages: zero-frequency enhancement, low-frequency restoration, and high-frequency refinement. Building on this insight, we propose a novel framework, ERR, which comprises three collaborative sub-networks: the zero-frequency enhancer (ZFE), the low-frequency restorer (LFR), and the high-frequency refiner (HFR). |
Chen Zhao; Zhizhou Chen; Yunzhe Xu; Enxuan Gu; Jian Li; Zili Yi; Qian Wang; Jian Yang; Ying Tai; |
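The zero/low/high-frequency staging corresponds to a standard spectral partition of an image: the DC term carries global brightness, a low-frequency band carries coarse structure, and the remainder carries detail. A minimal NumPy sketch of such a partition, assuming an arbitrary band radius; the paper's ZFE/LFR/HFR sub-networks are not reproduced:

```python
import numpy as np

def spectral_split(img: np.ndarray, low_radius: int = 8):
    """Split a grayscale image into zero-, low-, and high-frequency parts."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - cy, xx - cx)

    zero = np.zeros_like(f)
    zero[cy, cx] = f[cy, cx]                  # DC term: global brightness
    low = np.where(dist <= low_radius, f, 0)
    low[cy, cx] = 0                           # coarse structure, DC removed
    high = np.where(dist > low_radius, f, 0)  # fine detail

    back = lambda s: np.fft.ifft2(np.fft.ifftshift(s)).real
    return back(zero), back(low), back(high)

img = np.random.rand(64, 64)
z, lo, hi = spectral_split(img)
assert np.allclose(z + lo + hi, img)  # the three bands sum back to the input
```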
| 237 | Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TiKZ, to build highly structured document representations and deliver contextually-grounded responses. |
Han Xiao; Yina Xie; Guanxin Tan; Yinghao Chen; Rui Hu; Ke Wang; Aojun Zhou; Hao Li; Hao Shao; Xudong Lu; Peng Gao; Yafei Wen; Xiaoxin Chen; Shuai Ren; Hongsheng Li; |
| 238 | VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Besides, the lack of high-quality object-level video instruction data and a comprehensive benchmark further hinders their advancements. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any objects throughout the video. |
Yuqian Yuan; Hang Zhang; Wentong Li; Zesen Cheng; Boqiang Zhang; Long Li; Xin Li; Deli Zhao; Wenqiao Zhang; Yueting Zhuang; Jianke Zhu; Lidong Bing; |
| 239 | T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we conduct the first systematic study on compositional text-to-video generation. |
Kaiyue Sun; Kaiyi Huang; Xian Liu; Yue Wu; Zihan Xu; Zhenguo Li; Xihui Liu; |
| 240 | MoSca: Dynamic Gaussian Fusion from Casual Videos Via 4D Motion Scaffolds Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce 4D Motion Scaffolds (MoSca), a modern 4D reconstruction system designed to reconstruct and synthesize novel views of dynamic scenes from monocular videos captured casually in the wild. |
Jiahui Lei; Yijia Weng; Adam W. Harley; Leonidas Guibas; Kostas Daniilidis; |
| 241 | Visual Persona: Foundation Model for Full-Body Human Customization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. |
Jisu Nam; Soowon Son; Zhan Xu; Jing Shi; Difan Liu; Feng Liu; Seungryong Kim; Yang Zhou; |
| 242 | Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a generative technique to edit 3D shapes, represented as meshes, NeRFs, or Gaussian Splats, in ~3 seconds, without the need for running an SDS type of optimization. Our key insight is to cast 3D editing as a multiview image inpainting problem, as this representation is generic and can be mapped back to any 3D representation using the bank of available Large Reconstruction Models. |
Amir Barda; Matheus Gadelha; Vladimir G. Kim; Noam Aigerman; Amit H. Bermano; Thibault Groueix; |
| 243 | ROICtrl: Boosting Instance Control for Visual Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by ROI-Align in object detection, we introduce a complementary operation called ROI-Unpool. |
Yuchao Gu; Yipin Zhou; Yunfan Ye; Yixin Nie; Licheng Yu; Pingchuan Ma; Kevin Qinghong Lin; Mike Zheng Shou; |
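ROI-Unpool is the paper's new operation; for orientation, the ROI-Align it is described as complementing is available in torchvision and samples a fixed-size feature crop per region:

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 32, 32)  # NCHW feature map
# One region in (batch_index, x1, y1, x2, y2) format, feature-map coordinates.
boxes = torch.tensor([[0.0, 4.0, 4.0, 20.0, 28.0]])
crop = roi_align(features, boxes, output_size=(7, 7),
                 spatial_scale=1.0, sampling_ratio=2, aligned=True)
print(crop.shape)  # torch.Size([1, 256, 7, 7])
```

A "complementary operation" suggests inverting this sampling direction, writing region features back into a spatial feature map; that inverse is the paper's contribution and is not shown here.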
| 244 | TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While methods such as SinSR and OSEDiff have emerged to condense inference steps via distillation, their performance in image restoration and detail recovery remains unsatisfactory. To address this, we propose TSD-SR, a novel distillation framework specifically designed for real-world image super-resolution, aiming to construct an efficient and effective one-step model. |
Linwei Dong; Qingnan Fan; Yihong Guo; Zhonghao Wang; Qi Zhang; Jinwei Chen; Yawei Luo; Changqing Zou; |
| 245 | Uni4D: Unifying Visual Foundation Models for 4D Modeling from A Single Video Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper presents a unified approach to understanding dynamic scenes from casual videos. |
David Yifan Yao; Albert J. Zhai; Shenlong Wang; |
| 246 | ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. |
Yifan Pu; Yiming Zhao; Zhicong Tang; Ruihong Yin; Haoxing Ye; Yuhui Yuan; Dong Chen; Jianmin Bao; Sirui Zhang; Yanbin Wang; Lin Liang; Lijuan Wang; Ji Li; Xiu Li; Zhouhui Lian; Gao Huang; Baining Guo; |
| 247 | BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present BlueLM-V-3B, an algorithm and system co-design approach specifically tailored for the efficient deployment of MLLMs on mobile platforms. |
Xudong Lu; Yinghao Chen; Cheng Chen; Hui Tan; Boheng Chen; Yina Xie; Rui Hu; Guanxin Tan; Renshou Wu; Yan Hu; Yi Zeng; Lei Wu; Liuyang Bian; Zhaoxiong Wang; Long Liu; Yanzhou Yang; Han Xiao; Aojun Zhou; Yafei Wen; Xiaoxin Chen; Shuai Ren; Hongsheng Li; |
| 248 | Unbiasing Through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. |
Nina Shvetsova; Arsha Nagrani; Bernt Schiele; Hilde Kuehne; Christian Rupprecht; |
| 249 | RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: For example, datasets frequently do not capture reference frame comprehension, yet effective spatial reasoning requires understanding whether to reason from ego-, world-, or object-centric perspectives. To address this issue, we introduce RoboSpatial, a large-scale dataset for spatial understanding in robotics. |
Chan Hee Song; Valts Blukis; Jonathan Tremblay; Stephen Tyree; Yu Su; Stan Birchfield; |
| 250 | Semantic and Expressive Variations in Image Captions Across Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: By analyzing captions across seven languages (English, French, German, Russian, Chinese, Japanese, Korean) in high-quality image captioning datasets (Crossmodal and Visual Genome), we find that multilingual caption sets tend to provide richer visual descriptions than monolingual (including English-only) ones; multilingual sets contain 46.0% more objects, 66.1% more relationships, and 66.8% more attributes. |
Andre Ye; Sebastin Santy; Jena D. Hwang; Amy X. Zhang; Ranjay Krishna; |
| 251 | F-LMM: Grounding Frozen Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing drastic performance drops that indicate vanishing general knowledge comprehension and weakened instruction following ability. |
Size Wu; Sheng Jin; Wenwei Zhang; Lumin Xu; Wentao Liu; Wei Li; Chen Change Loy; |
| 252 | Identity-Preserving Text-to-Video Generation By Frequency Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human-identity consistent in the generated video. |
Shenghai Yuan; Jinfa Huang; Xianyi He; Yunyang Ge; Yujun Shi; Liuhan Chen; Jiebo Luo; Li Yuan; |
| 253 | Complexity Experts Are Task-Discriminative Learners for Any Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This hinders leveraging MoEs’ computational benefits by bypassing irrelevant experts during inference. We attribute this undesired behavior to the uniform and rigid architecture of traditional MoEs. To address this, we introduce "complexity experts" — flexible expert blocks with varying computational complexity and receptive fields. |
Eduard Zamfir; Zongwei Wu; Nancy Mehta; Yuedong Tan; Danda Pani Paudel; Yulun Zhang; Radu Timofte; |
| 254 | StyleMaster: Stylize Your Video with Artistic Generation and Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Style control has been popular in video generation models. Existing methods often generate videos far from the given style, cause content leakage, and struggle to transfer one … |
Zixuan Ye; Huijuan Huang; Xintao Wang; Pengfei Wan; Di Zhang; Wenhan Luo; |
| 255 | Universal Scene Graph Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we introduce Universal SG (USG), a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes. |
Shengqiong Wu; Hao Fei; Tat-seng Chua; |
| 256 | Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, effectively compensating for data scarcity in 4D-PSG. |
Shengqiong Wu; Hao Fei; Jingkang Yang; Xiangtai Li; Juncheng Li; Hanwang Zhang; Tat-seng Chua; |
| 257 | Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs’ spatial-temporal reasoning with 2D images as input, without modifying the architecture or requiring task-specific fine-tuning. |
Benlin Liu; Yuhao Dong; Yiqin Wang; Zixian Ma; Yansong Tang; Luming Tang; Yongming Rao; Wei-Chiu Ma; Ranjay Krishna; |
| 258 | Rethinking Diffusion for Text-Driven Human Motion Generation: Redundant Representations, Evaluation, and Masked Autoregression Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we systematically investigate why current VQ-based methods perform well and explore the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. |
Zichong Meng; Yiming Xie; Xiaogang Peng; Zeyu Han; Huaizu Jiang; |
| 259 | Collaborative Tree Search for Enhancing Embodied Multi-Agent Collaboration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Previous approaches with simple communication patterns carry erroneous or incoherent agent actions, which can lead to additional risks. To address these problems, we propose Cooperative Tree Search (CoTS), a framework designed to significantly improve collaborative planning and task execution efficiency among embodied agents. |
Lizheng Zu; Lin Lin; Song Fu; Na Zhao; Pan Zhou; |
| 260 | MVGenMaster: Scaling Multi-View Generation from Any Image Via 3D Priors Enhanced Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. |
Chenjie Cao; Chaohui Yu; Shang Liu; Fan Wang; Xiangyang Xue; Yanwei Fu; |
| 261 | DreamOmni: Unified Image Generation and Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, in computer vision, while text-to-image (T2I) models have significantly improved generation quality through scaling up, their framework design did not initially consider how to unify with downstream tasks, such as various types of editing. To address this, we introduce DreamOmni, a unified model for image generation and editing. |
Bin Xia; Yuechen Zhang; Jingyao Li; Chengyao Wang; Yitong Wang; Xinglong Wu; Bei Yu; Jiaya Jia; |
| 262 | CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the scarcity of real-world CAD data poses challenges in directly training such models. To tackle these challenges, we propose CADCrafter, an image-to-parametric CAD model generation framework that trains solely on synthetic textureless CAD data while testing on real-world images. |
Cheng Chen; Jiacheng Wei; Tianrun Chen; Chi Zhang; Xiaofeng Yang; Shangzhan Zhang; Bingchen Yang; Chuan-Sheng Foo; Guosheng Lin; Qixing Huang; Fayao Liu; |
| 263 | Words or Vision: Do Vision-Language Models Have Blind Faith in Text? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Vision-Language Models (VLMs) excel in integrating visual and textual information for vision-centric tasks, but their handling of inconsistencies between modalities is underexplored. We investigate VLMs’ modality preferences when faced with visual data and varied textual inputs in vision-centered settings. By introducing textual variations to four vision-centric tasks and evaluating ten VLMs, we discover a "blind faith in text" phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise, leading to significant performance drops under corrupted text and raising safety concerns. We analyze factors influencing this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. |
Ailin Deng; Tri Cao; Zhirui Chen; Bryan Hooi; |
| 264 | TexGaussian: Generating High-quality PBR Material Via Octree-based 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper presents TexGaussian, a novel method that uses octant-aligned 3D Gaussian Splatting for rapid PBR material generation. |
Bojun Xiong; Jialun Liu; Jiakui Hu; Chenming Wu; Jinbo Wu; Xing Liu; Chen Zhao; Errui Ding; Zhouhui Lian; |
| 265 | STEP: Enhancing Video-LLMs’ Compositional Reasoning By Spatio-Temporal Graph-guided Self-Training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from raw videos and improve themselves. |
Haiyi Qiu; Minghe Gao; Long Qian; Kaihang Pan; Qifan Yu; Juncheng Li; Wenjie Wang; Siliang Tang; Yueting Zhuang; Tat-Seng Chua; |
| 266 | OSV: One Step Is Enough for High-Quality Image to Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: As a result, these methods frequently fall short in terms of both performance and training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with adversarial training to address these challenges. |
Xiaofeng Mao; Zhengkai Jiang; Fu-yun Wang; Jiangning Zhang; Hao Chen; Mingmin Chi; Yabiao Wang; Wenhan Luo; |
| 267 | Generative Gaussian Splatting for Unbounded 3D City Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose GaussianCity, a generative Gaussian splatting framework dedicated to efficiently synthesizing unbounded 3D cities with a single feed-forward pass. |
Haozhe Xie; Zhaoxi Chen; Fangzhou Hong; Ziwei Liu; |
| 268 | FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, compositionality–the ability to understand and generate novel combinations of known visual and textual components–is critical for facilitating coherent reasoning and understanding across modalities in VLMs. To address this issue, we propose OpenCompositionCap, a new dataset for multi-grained region compositional image captioning that distinguishes itself from prior works by introducing the new task of compositional aspect-aware regional image captioning. |
Hang Hua; Qing Liu; Lingzhi Zhang; Jing Shi; Soo Ye Kim; Zhifei Zhang; Yilin Wang; Jianming Zhang; Zhe Lin; Jiebo Luo; |
| 269 | Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Yet, the inherently coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to prohibitive memory consumption and computational redundancies. To overcome these bottlenecks, we propose *Collaborative Decoding* (CoDe), a novel decoding strategy tailored to the VAR framework. |
Zigeng Chen; Xinyin Ma; Gongfan Fang; Xinchao Wang; |
| 270 | VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a new framework, VILA-M3, for medical VLMs that utilizes domain knowledge via expert models. |
Vishwesh Nath; Wenqi Li; Dong Yang; Andriy Myronenko; Mingxin Zheng; Yao Lu; Zhijian Liu; Hongxu Yin; Yee Man Law; Yucheng Tang; Pengfei Guo; Can Zhao; Ziyue Xu; Yufan He; Stephanie Harmon; Benjamin Simon; Greg Heinrich; Stephen Aylward; Marc Edgar; Michael Zephyr; Pavlo Molchanov; Baris Turkbey; Holger Roth; Daguang Xu; |
| 271 | StableAnimator: High-Quality Identity-Preserving Human Image Animation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. |
Shuyuan Tu; Zhen Xing; Xintong Han; Zhi-Qi Cheng; Qi Dai; Chong Luo; Zuxuan Wu; |
| 272 | DreamRelation: Bridging Customization and Relation Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, we introduce DreamRelation, a framework that disentangles identity and relation learning using a carefully curated dataset. |
Qingyu Shi; Lu Qi; Jianzong Wu; Jinbin Bai; Jingbo Wang; Yunhai Tong; Xiangtai Li; |
| 273 | EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. |
Yang Yue; Yulin Wang; Haojun Jiang; Pan Liu; Shiji Song; Gao Huang; |
| 274 | CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present CheXWorld, the first effort towards a self-supervised world model for radiographic images. |
Yang Yue; Yulin Wang; Chenxin Tao; Pan Liu; Shiji Song; Gao Huang; |
| 275 | DiffusionSfM: Predicting Structure and Motion Via Ray Origin and Endpoint Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. |
Qitao Zhao; Amy Lin; Jeff Tan; Jason Y. Zhang; Deva Ramanan; Shubham Tulsiani; |
| 276 | UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, even within the same domain, current VAD approaches often require large amounts of normal samples to train class-specific models, resulting in poor generalizability and hindering unified evaluation across domains. To address this issue, we propose a generalized few-shot VAD method, UniVAD, capable of detecting anomalies across various domains, with a training-free unified model. |
Zhaopeng Gu; Bingke Zhu; Guibo Zhu; Yingying Chen; Ming Tang; Jinqiao Wang; |
| 277 | FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose a simple and effective method, named FaithDiff, to fully harness the impressive power of latent diffusion models (LDMs) for faithful image SR. |
Junyang Chen; Jinshan Pan; Jiangxin Dong; |
| 278 | VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose a unified framework VideoMage for video customization over both multiple subjects and their interactive motions. |
Chi-Pin Huang; Yen-Siang Wu; Hung-Kai Chung; Kai-Po Chang; Fu-En Yang; Yu-Chiang Frank Wang; |
| 279 | A Bias-Free Training Paradigm for More General AI-generated Image Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose B-Free, a bias-free training paradigm, where fake images are generated from real ones using the conditioning procedure of stable diffusion models. |
Fabrizio Guillaro; Giada Zingarini; Ben Usman; Avneesh Sud; Davide Cozzolino; Luisa Verdoliva; |
| 280 | Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing Vision-Language Navigation (VLN) methods primarily focus on single-stage navigation, limiting their effectiveness in multi-stage and long-horizon tasks within complex and dynamic environments. To address these limitations, we propose a novel VLN task, named Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. |
Xinshuai Song; Weixing Chen; Yang Liu; Weikai Chen; Guanbin Li; Liang Lin; |
| 281 | GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose a world-model-based framework to exploit the scene evolution for perception. |
Sicheng Zuo; Wenzhao Zheng; Yuanhui Huang; Jie Zhou; Jiwen Lu; |
| 282 | FLAIR: VLM with Fine-grained Language-informed Image Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. |
Rui Xiao; Sanghwan Kim; Mariana-Iuliana Georgescu; Zeynep Akata; Stephan Alaniz; |
| 283 | Wonderland: Navigating 3D Scenes from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper addresses a challenging question: how can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality, and distorted reconstructions for unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splatting, even from a single-condition image, in a feed-forward manner. |
Hanwen Liang; Junli Cao; Vidit Goel; Guocheng Qian; Sergei Korolev; Demetri Terzopoulos; Konstantinos N. Plataniotis; Sergey Tulyakov; Jian Ren; |
| 284 | UniRestore: Unified Perceptual and Task-Oriented Image Restoration Model Using Diffusion Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces UniRestore, a unified image restoration model that bridges the gap between PIR and TIR by using a diffusion prior. |
I-Hsiang Chen; Wei-Ting Chen; Yu-Wei Liu; Yuan-Chun Chiang; Sy-Yen Kuo; Ming-Hsuan Yang; |
| 285 | Mamba-Reg: Vision Mamba Also Needs Registers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper identifies artifacts in the feature maps of Vision Mamba similar to those previously observed in Vision Transformers. |
Feng Wang; Jiahao Wang; Sucheng Ren; Guoyizhe Wei; Jieru Mei; Wei Shao; Yuyin Zhou; Alan Yuille; Cihang Xie; |
| 286 | Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. |
Feng Wang; Timing Yang; Yaodong Yu; Sucheng Ren; Guoyizhe Wei; Angtian Wang; Wei Shao; Yuyin Zhou; Alan Yuille; Cihang Xie; |
| 287 | Weakly Supervised Semantic Segmentation Via Progressive Confidence Region Expansion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the prevalent methods relying on vision transformers (ViT) encounter an "over-expansion" issue, i.e., CAM incorrectly expands high activation values from the target object to background regions, as it is difficult to learn pixel-level local intrinsic inductive bias in ViT from weak supervision. To solve this problem, we propose a Progressive Confidence Region Expansion (PCRE) framework for WSSS, which gradually learns a faithful mask over the target region and utilizes this mask to correct the confusion in CAM. |
Xiangfeng Xu; Pinyi Zhang; Wenxuan Huang; Yunhang Shen; Haosheng Chen; Jingzhong Lin; Wei Li; Gaoqi He; Jiao Xie; Shaohui Lin; |
| 288 | CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images By AI Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Current efforts to distinguish between real and AI-generated images may lack generalization, being effective for only certain types of generative models and susceptible to post-processing techniques like JPEG compression. To overcome these limitations, we propose a novel framework, CO-SPY, that first enhances existing semantic features (e.g., the number of fingers in a hand) and artifact features (e.g., pixel value differences), and then adaptively integrates them to achieve more general and robust synthetic image detection. |
Siyuan Cheng; Lingjuan Lyu; Zhenting Wang; Xiangyu Zhang; Vikash Sehwag; |
| 289 | Vision-Language Models Do Not Understand Negation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. |
Kumail Alhamoud; Shaden Alshammari; Yonglong Tian; Guohao Li; Philip H.S. Torr; Yoon Kim; Marzyeh Ghassemi; |
| 290 | Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present Pixel-level and Semantic-level Adjustable SR (PiSA-SR), which learns two LoRA modules upon the pre-trained stable-diffusion (SD) model to achieve improved and adjustable SR results. |
Lingchen Sun; Rongyuan Wu; Zhiyuan Ma; Shuaizheng Liu; Qiaosi Yi; Lei Zhang; |
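The "two LoRA modules on a frozen pre-trained model" setup can be sketched generically as a frozen layer plus two low-rank branches whose strengths are dialed independently at inference. The scale arguments and the pixel/semantic branch assignment below are illustrative assumptions, not PiSA-SR's actual training recipe:

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """A frozen linear layer plus two LoRA branches whose strengths can be
    tuned independently at inference (hypothetical names s1/s2)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)  # pre-trained weights stay frozen
        d_out, d_in = base.weight.shape
        self.a1 = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.b1 = nn.Parameter(torch.zeros(d_out, rank))
        self.a2 = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.b2 = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x, s1: float = 1.0, s2: float = 1.0):
        delta1 = x @ self.a1.T @ self.b1.T  # branch 1 (e.g., pixel-level)
        delta2 = x @ self.a2.T @ self.b2.T  # branch 2 (e.g., semantic-level)
        return self.base(x) + s1 * delta1 + s2 * delta2

layer = DualLoRALinear(nn.Linear(64, 64))
y = layer(torch.randn(2, 64), s1=1.0, s2=0.5)  # dial semantics down at test time
```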
| 291 | SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. |
Xuesong Chen; Linjiang Huang; Tao Ma; Rongyao Fang; Shaoshuai Shi; Hongsheng Li; |
| 292 | SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce SLAM3R, a novel and effective system for real-time, high-quality, dense 3D reconstruction using RGB videos. |
Yuzheng Liu; Siyan Dong; Shuzhe Wang; Yingda Yin; Yanchao Yang; Qingnan Fan; Baoquan Chen; |
| 293 | RayFlow: Instance-Aware Diffusion Acceleration Via Adaptive Flow Trajectories Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing acceleration methods, while aiming to reduce steps, often compromise sample quality, controllability, or introduce training complexities. Therefore, we propose RayFlow, a novel diffusion framework that addresses these limitations. |
Huiyang Shao; Xin Xia; Yuhong Yang; Yuxi Ren; Xing Wang; Xuefeng Xiao; |
| 294 | FlexDrive: Toward Trajectory Flexibility in Driving Scene Gaussian Splatting Reconstruction and Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Driving scene reconstruction and rendering have advanced significantly using 3D Gaussian Splatting. However, most prior research has focused on rendering quality along a pre-recorded vehicle path and struggles to generalize to out-of-path viewpoints, which is caused by the lack of high-quality supervision in those views. To address this issue, we introduce an Inverse View Warping technique to create compact and high-quality images as supervision for the reconstruction of out-of-path views, enabling high-quality rendering results for those views. For accurate and robust inverse view warping, a depth bootstrap strategy is proposed to obtain on-the-fly dense depth maps during the optimization process, overcoming the sparsity and incompleteness of LiDAR depth data. Our method achieves superior in-path and out-of-path reconstruction and rendering performance on the widely used Waymo Open dataset. In addition, a simulator-based benchmark is proposed to obtain the out-of-path ground truth and quantitatively evaluate out-of-path rendering, where our method outperforms previous methods by a significant margin. |
Jingqiu Zhou; Lue Fan; Linjiang Huang; Xiaoyu Shi; Si Liu; Zhaoxiang Zhang; Hongsheng Li; |
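Inverse view warping rests on standard depth-based reprojection: lift a pixel with its depth into 3D, apply the relative camera pose, and project into the other view. A minimal pinhole sketch with made-up intrinsics and pose; the paper's depth bootstrap strategy is not shown:

```python
import numpy as np

def warp_pixels(uv, depth, K, T_src_to_tgt):
    """Lift pixels (u, v) with depth into 3D, apply a rigid transform,
    and project into the target view (pinhole model)."""
    ones = np.ones((uv.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([uv, ones]).T  # 3 x N unit-plane rays
    pts = rays * depth                                 # scale each ray by its depth
    pts_h = np.vstack([pts, ones.T])                   # homogeneous, 4 x N
    pts_tgt = (T_src_to_tgt @ pts_h)[:3]               # 3 x N in the target frame
    proj = K @ pts_tgt
    return (proj[:2] / proj[2]).T                      # N x 2 target-view pixels

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])                        # assumed intrinsics
T = np.eye(4)
T[0, 3] = 0.1                                          # assumed 10 cm lateral shift
uv = np.array([[320.0, 240.0], [100.0, 50.0]])
depth = np.array([2.0, 4.0])
print(warp_pixels(uv, depth, K, T))
```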
| 295 | CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent generative models produce convincing avatars from a single reference image, but visual fidelity lags behind multi-view techniques. Here, we present CAP4D: an approach that uses a morphable multi-view diffusion model to reconstruct photoreal 4D (dynamic 3D) portrait avatars from any number of reference images (i.e., one to 100) and animate and render them in real time. |
Felix Taubner; Ruihang Zhang; Mathieu Tuli; David B. Lindell; |
| 296 | Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In these flat areas, applying slight noise is more advantageous for the reconstruction. We associate this characteristic with uncertainty and propose applying uncertainty estimates to guide region-specific noise-level control, a technique we refer to as Uncertainty-guided Noise Weighting. |
Leheng Zhang; Weiyi You; Kexuan Shi; Shuhang Gu; |
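One crude way to realize "slight noise in flat areas, stronger noise elsewhere" is to take local variance as an uncertainty proxy and scale the injected noise per region. A toy sketch under that assumption; the paper's actual uncertainty estimator and its integration into the diffusion model are not reproduced:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def region_weighted_noise(img: np.ndarray, sigma_flat=0.05, sigma_texture=0.2):
    """Add weaker noise in flat regions and stronger noise in textured ones,
    using local variance as a crude uncertainty proxy."""
    mean = uniform_filter(img, size=7)
    var = uniform_filter(img ** 2, size=7) - mean ** 2
    u = np.clip(var / (var.max() + 1e-8), 0.0, 1.0)  # 0 = flat, 1 = textured
    sigma = sigma_flat + (sigma_texture - sigma_flat) * u
    return img + sigma * np.random.randn(*img.shape)

noisy = region_weighted_noise(np.random.rand(64, 64))
```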
| 297 | A Stitch in Time Saves Nine: Small VLM Is A Precise Guidance for Accelerating Large VLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, it requires a full inference pass, which increases computational load and is therefore impractical in existing methods; and (iii) the global attention map aggregated from a small VLM closely resembles that of a large VLM, suggesting an efficient alternative. Based on these findings, we introduce Small VLM Guidance for Large VLMs (SGL). |
Wangbo Zhao; Yizeng Han; Jiasheng Tang; Zhikai Li; Yibing Song; Kai Wang; Zhangyang Wang; Yang You; |
| 298 | MonSter: Marry Monodepth to Stereo Unleashes Power Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Existing methods struggle to handle ill-posed regions with limited matching cues, such as occlusions and textureless areas. To address this, we propose MonSter, a novel method that leverages the complementary strengths of monocular depth estimation and stereo matching. |
Junda Cheng; Longliang Liu; Gangwei Xu; Xianqi Wang; Zhaoxing Zhang; Yong Deng; Jinliang Zang; Yurui Chen; Zhipeng Cai; Xin Yang; |
| 299 | Synchronized Video-to-Audio Generation Via Mel Quantization-Continuum Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control the mature text-to-audio generative diffusion models, this paper presents how to balance the representation of mel-spectrograms in terms of completeness and complexity through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). |
Juncheng Wang; Chao Xu; Cheng Yu; Lei Shang; Zhe Hu; Shujun Wang; Liefeng Bo; |
| 300 | ECBench: Can Multi-modal Foundation Models Understand The Egocentric World? A Holistic Embodied Cognition Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. |
Ronghao Dang; Yuqian Yuan; Wenqi Zhang; Yifei Xin; Boqiang Zhang; Long Li; Liuyi Wang; Qinyang Zeng; Xin Li; Lidong Bing; |
| 301 | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we introduce BIMBA, an efficient state-space model to handle long-form videos. |
Md Mohaiminul Islam; Tushar Nagarajan; Huiyu Wang; Gedas Bertasius; Lorenzo Torresani; |
| 302 | PartGen: Part-level 3D Generation and Reconstruction with Multi-view Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast, most applications and creative workflows require 3D assets to be composed of distinct, meaningful parts that can be independently manipulated. To bridge this gap, we introduce PartGen, a novel approach for generating, from text, images, or unstructured 3D objects, 3D objects composed of meaningful parts. |
Minghao Chen; Roman Shapovalov; Iro Laina; Tom Monnier; Jianyuan Wang; David Novotny; Andrea Vedaldi; |
| 303 | BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose to decompose videos into visual primitives — blob video representation, a general representation for controllable video generation. |
Weixi Feng; Chao Liu; Sifei Liu; William Yang Wang; Arash Vahdat; Weili Nie; |
| 304 | Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. |
Kaihang Pan; Wang Lin; Zhongqi Yue; Tenglong Ao; Liyu Jia; Wei Zhao; Juncheng Li; Siliang Tang; Hanwang Zhang; |
| 305 | One Diffusion to Generate Them All Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce OneDiffusion, a single large-scale diffusion model designed to tackle a wide range of image synthesis and understanding tasks. |
Duong H. Le; Tuan Pham; Sangho Lee; Christopher Clark; Aniruddha Kembhavi; Stephan Mandt; Ranjay Krishna; Jiasen Lu; |
| 306 | You See It, You Got It: Learning 3D Creation on Pose-Free Videos at Scale Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. |
Baorui Ma; Huachen Gao; Haoge Deng; Zhengxiong Luo; Tiejun Huang; Lulu Tang; Xinlong Wang; |
| 307 | DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This enormous data requirement poses significant challenges for training large DMs due to high data acquisition costs and storage expenses. To alleviate this data burden, we propose a novel scenario: using existing DMs as data sources to train new DMs with any architecture. |
Qianlong Xiang; Miao Zhang; Yuzhang Shang; Jianlong Wu; Yan Yan; Liqiang Nie; |
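The proposed scenario, a pretrained diffusion model standing in for the training set of a student with any architecture, has a simple generic shape: synthesize noisy inputs, query the teacher, and regress the student onto the teacher's predictions. A toy stand-in with miniature denoisers, not DKDM's actual objective or data pacing:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in for a diffusion model's noise predictor."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 4, 3, padding=1)

    def forward(self, x_t, t):
        return self.net(x_t)  # a real model would also condition on t

teacher = TinyDenoiser().eval()  # pretrained DM, acts as the "dataset"
student = TinyDenoiser()         # new DM, any architecture
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for step in range(10):
    x_t = torch.randn(8, 4, 32, 32)      # synthetic noisy latents, no real data
    t = torch.randint(0, 1000, (8,))
    with torch.no_grad():
        target = teacher(x_t, t)         # teacher prediction replaces real labels
    loss = F.mse_loss(student(x_t, t), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```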
| 308 | DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose DashGaussian, a scheduling scheme over the optimization complexity of 3DGS that strips redundant complexity to accelerate 3DGS optimization. |
Youyu Chen; Junjun Jiang; Kui Jiang; Xiao Tang; Zhihao Li; Xianming Liu; Yinyu Nie; |
| 309 | VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. |
Enric Corona; Andrei Zanfir; Eduard Gabriel Bazavan; Nikos Kolotouros; Thiemo Alldieck; Cristian Sminchisescu; |
| 310 | Light3R-SfM: Towards Feed-forward Structure-from-Motion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Light3R-SfM, a feed-forward, end-to-end learnable framework for efficient large-scale Structure-from-Motion (SfM) from unconstrained image collections. |
Sven Elflein; Qunjie Zhou; Laura Leal-Taixé; |
| 311 | Large-Scale Text-to-Image Model with Inpainting Is A Zero-Shot Subject-Driven Image Generator Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven image generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. |
Chaehun Shin; Jooyoung Choi; Heeseung Kim; Sungroh Yoon; |
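The diptych idea can be illustrated with an off-the-shelf inpainting pipeline: place the reference subject in the left panel, mask the right panel, and describe both panels in the prompt. The checkpoint below is just a stand-in, and the paper's attention-based subject alignment is omitted:

```python
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Left panel: the reference subject. Right panel: masked, to be generated.
ref = Image.open("subject.png").convert("RGB").resize((512, 512))
diptych = Image.new("RGB", (1024, 512))
diptych.paste(ref, (0, 0))

mask = Image.new("L", (1024, 512), 0)
mask.paste(255, (512, 0, 1024, 512))  # inpaint only the right half

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting")  # stand-in checkpoint
out = pipe(prompt="a diptych; left: a photo of a toy robot; "
                  "right: the same toy robot surfing on a wave",
           image=diptych, mask_image=mask,
           width=1024, height=512).images[0]
out.save("diptych_result.png")
```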
| 312 | Frequency Dynamic Convolution for Dense Image Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism, the frequency response of these weights tends to exhibit high similarity, resulting in high parameter costs but limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. |
Linwei Chen; Lin Gu; Liang Li; Chenggang Yan; Ying Fu; |
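A toy version of "learning convolution weights in the Fourier domain" parameterizes each kernel by spectral coefficients and recovers spatial weights with an inverse FFT on every forward pass. This sketches only the parameterization; FDConv's fixed-parameter-budget sharing across frequency responses is not reproduced:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralConv2d(nn.Module):
    """Toy conv whose kernel lives in the Fourier domain: learn complex
    spectral coefficients, recover the spatial kernel via inverse FFT."""
    def __init__(self, cin, cout, k=3):
        super().__init__()
        self.k = k
        # rfft2 of a (k, k) kernel has shape (k, k // 2 + 1), complex.
        self.spec = nn.Parameter(torch.randn(cout, cin, k, k // 2 + 1, 2) * 0.1)

    def forward(self, x):
        weight = torch.fft.irfft2(torch.view_as_complex(self.spec),
                                  s=(self.k, self.k))
        return F.conv2d(x, weight, padding=self.k // 2)

layer = SpectralConv2d(8, 16)
out = layer(torch.randn(1, 8, 32, 32))  # -> (1, 16, 32, 32)
```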
| 313 | WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose WeakMCN, a novel multi-task collaborative network that effectively combines WREC and WRES with a dual-branch architecture. |
Silin Cheng; Yang Liu; Xinwei He; Sebastien Ourselin; Lei Tan; Gen Luo; |
| 314 | Generative Omnimatte: Learning to Decompose Video Into Layers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a novel generative layered video decomposition framework to address the omnimatte problem. |
Yao-Chih Lee; Erika Lu; Sarah Rumbley; Michal Geyer; Jia-Bin Huang; Tali Dekel; Forrester Cole; |
| 315 | Cross-Modal 3D Representation with Multi-View Images and Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces OpenView, a novel representation method that integrates both point clouds and multi-view images to form a unified 3D representation. |
Ziyang Zhou; Pinghui Wang; Zi Liang; Haitao Bai; Ruofei Zhang; |
| 316 | Seurat: From Moving Points to Depth Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, humans often perceive relative depth intuitively by observing variations in the size and spacing of objects as they move. Inspired by this, we propose a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of tracked 2D trajectories. |
Seokju Cho; Jiahui Huang; Seungryong Kim; Joon-Young Lee; |
| 317 | EditAR: Unified Conditional Generation with Autoregressive Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks, e.g., image editing, depth-to-image, edge-to-image, segmentation-to-image. |
Jiteng Mu; Nuno Vasconcelos; Xiaolong Wang; |
| 318 | Exploring Timeline Control for Facial Motion Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a new control signal for facial motion generation: timeline control. |
Yifeng Ma; Jinwei Qi; Chaonan Ji; Peng Zhang; Bang Zhang; Zhidong Deng; Liefeng Bo; |
| 319 | LT3SD: Latent Trees for 3D Scene Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present LT3SD, a novel latent diffusion model for large-scale 3D scene generation. |
Quan Meng; Lei Li; Matthias Nießner; Angela Dai; |
| 320 | Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we venture into an orthogonal direction and explore self-correction in VLMs focusing on semantic grounding. |
Yuan-Hong Liao; Rafid Mahmood; Sanja Fidler; David Acuna; |
| 321 | EgoLife: Towards Egocentric Life Assistant Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce EgoLife, a project to develop an egocentric life assistant that accompanies users and enhances personal efficiency through AI-powered wearable glasses. |
Jingkang Yang; Shuai Liu; Hongming Guo; Yuhao Dong; Xiamengwei Zhang; Sicheng Zhang; Pengyun Wang; Zitang Zhou; Binzhu Xie; Ziyue Wang; Bei Ouyang; Zhengyu Lin; Marco Cominelli; Zhongang Cai; Bo Li; Yuanhan Zhang; Peiyuan Zhang; Fangzhou Hong; Joerg Widmer; Francesco Gringoli; Lei Yang; Ziwei Liu; |
| 322 | MNE-SLAM: Multi-Agent Neural SLAM for Mobile Robots Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To this end, we propose the first distributed multi-agent collaborative SLAM framework with distributed mapping and camera tracking, joint scene representation, intra-to-inter loop closure, and multi-submap fusion. |
Tianchen Deng; Guole Shen; Chen Xun; Shenghai Yuan; Tongxin Jin; Hongming Shen; Yanbo Wang; Jingchuan Wang; Hesheng Wang; Danwei Wang; Weidong Chen; |
| 323 | DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present DirectTriGS, a novel framework designed for 3D object generation with Gaussian Splatting (GS). |
Xiaoliang Ju; Hongsheng Li; |
| 324 | DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we develop an analogous concept for a very different problem, namely, the reconstruction of the 3D shape and pose of deformable objects. |
Ben Kaye; Tomas Jakab; Shangzhe Wu; Christian Rupprecht; Andrea Vedaldi; |
| 325 | ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Proxy Transformation, suitable for multimodal tasks, to efficiently improve the point cloud manifold. |
Qihang Peng; Henry Zheng; Gao Huang; |
| 326 | FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present FloVD, a novel video diffusion model for camera-controllable video generation. |
Wonjoon Jin; Qi Dai; Chong Luo; Seung-Hwan Baek; Sunghyun Cho; |
| 327 | Your ViT Is Secretly An Image Segmentation Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. |
Tommie Kerssies; Niccolò Cavagnero; Alexander Hermans; Narges Norouzi; Giuseppe Averta; Bastian Leibe; Gijs Dubbelman; Daan de Geus; |
| 328 | Point Clouds Meets Physics: Dynamic Acoustic Field Fitting Network for Point Cloud Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The limited representational capacity of pure point cloud models continues to constrain the potential of cross-modal fusion methods and performance across various tasks. To address this challenge, we propose a Dynamic Acoustic Field Fitting Network (DAF-Net), inspired by physical acoustic principles. |
Changshuo Wang; Shuting He; Xiang Fang; Jiawei Han; Zhonghang Liu; Xin Ning; Weijun Li; Prayag Tiwari; |
| 329 | SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce a novel self-supervised stereo video synthesis paradigm via a video diffusion model, termed SpatialDreamer, which meets the challenges head-on. |
Zhen Lv; Yangqi Long; Congzhentao Huang; Cao Li; Chengfei Lv; Hao Ren; Dian Zheng; |
| 330 | Generative Sparse-View Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Generative Sparse-view Gaussian Splatting (GS-GS), a general pipeline designed to enhance the rendering quality of 3D/4D Gaussian Splatting (GS) when training views are sparse. |
Hanyang Kong; Xingyi Yang; Xinchao Wang; |
| 331 | Is Your World Simulator A Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models’ abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models’ story-completion capabilities. |
Yiping Wang; Xuehai He; Kuan Wang; Luyao Ma; Jianwei Yang; Shuohang Wang; Simon Shaolei Du; Yelong Shen; |
| 332 | RealEdit: Reddit Edits As A Large-scale Empirical Dataset for Image Transformations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: So, we introduce 48K training examples, with which we train our REALEDIT model. |
Peter Sushko; Ayana Bharadwaj; Zhi Yang Lim; Vasily Ilin; Ben Caffee; Dongping Chen; Mohammadreza Salehi; Cheng-Yu Hsieh; Ranjay Krishna; |
| 333 | Synthetic Visual Genome Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework where GPT-4o further refines ROBIN’s predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. |
Jae Sung Park; Zixian Ma; Linjie Li; Chenhao Zheng; Cheng-Yu Hsieh; Ximing Lu; Khyathi Chandu; Quan Kong; Norimasa Kobori; Ali Farhadi; Yejin Choi; Ranjay Krishna; |
| 334 | Mind The Time: Temporally-Controlled Multi-Event Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. |
Ziyi Wu; Aliaksandr Siarohin; Willi Menapace; Ivan Skorokhodov; Yuwei Fang; Varnith Chordia; Igor Gilitschenski; Sergey Tulyakov; |
| 335 | SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models. |
Yunxiang Fu; Meng Lou; Yizhou Yu; |
| 336 | Boltzmann Attention Sampling for Image Analysis with Small Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing sparse attention mechanisms rely on rigid hierarchical structures, which are poorly suited for detecting small, variable, and uncertain object locations. In this paper, we propose BoltzFormer, a novel transformer-based architecture designed to address these challenges through dynamic sparse attention. |
Theodore Zhao; Sid Kiblawi; Naoto Usuyama; Ho Hin Lee; Sam Preston; Hoifung Poon; Mu Wei; |
| 337 | Gen3DEval: Using VLLMs for Automatic Evaluation of Generated 3D Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Rapid advancements in text-to-3D generation require robust and scalable evaluation metrics that align closely with human judgment, a need unmet by current metrics such as PSNR and CLIP, which require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. |
Shalini Maiti; Lourdes Agapito; Filippos Kokkinos; |
| 338 | Advancing Semantic Future Prediction Through Multimodal Visual Sequence Transformers Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. |
Efstathios Karypidis; Ioannis Kakogeorgiou; Spyros Gidaris; Nikos Komodakis; |
| 339 | Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent approaches attempt task-specific design but rarely achieve "The Best of Both Worlds" due to inconsistent optimization goals. To address these issues, we propose a novel method that leverages the semantic knowledge from the Segment Anything Model (SAM) to improve the quality of fusion results and enable downstream task adaptability, namely SAGE. |
Guanyao Wu; Haoyu Liu; Hongming Fu; Yichuan Peng; Jinyuan Liu; Xin Fan; Risheng Liu; |
| 340 | UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce UniGraspTransformer, a universal Transformer-based network for dexterous robotic grasping that simplifies training while enhancing scalability and performance. |
Wenbo Wang; Fangyun Wei; Lei Zhou; Xi Chen; Lin Luo; Xiaohan Yi; Yizhong Zhang; Yaobo Liang; Chang Xu; Yan Lu; Jiaolong Yang; Baining Guo; |
| 341 | DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Specifically, we propose DoraCycle, which integrates two multimodal cycles: text-to-image-to-text and image-to-text-to-image. |
Rui Zhao; Weijia Mao; Mike Zheng Shou; |
| 342 | FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, real-time rendering with 3DGS remains a challenging problem, particularly in large-scale, high-resolution scenes due to the presence of numerous anisotropic Gaussian representations, and it has not been extensively explored. To address this challenge, we introduce FlashGS, an open-source CUDA library with Python bindings, featuring comprehensive algorithm design and optimizations, including redundancy elimination, adaptive scheduling, and efficient pipelining. |
Guofeng Feng; Siyan Chen; Rong Fu; Zimu Liao; Yi Wang; Tao Liu; Boni Hu; Linning Xu; Zhilin Pei; Hengjie Li; Xiuhong Li; Ninghui Sun; Xingcheng Zhang; Bo Dai; |
| 343 | MatAnyone: Stable Video Matting with Consistent Memory Propagation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To tackle this, we propose MatAnyone, a practical framework designed for target-assigned video matting. |
Peiqing Yang; Shangchen Zhou; Jixin Zhao; Qingyi Tao; Chen Change Loy; |
| 344 | Universal Actions for Enhanced Embodied Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce UniAct, a new embodied foundation modeling framework operating in the Universal Action Space. |
Jinliang Zheng; Jianxiong Li; Dongxiu Liu; Yinan Zheng; Zhihao Wang; Zhonghong Ou; Yu Liu; Jingjing Liu; Ya-Qin Zhang; Xianyuan Zhan; |
| 345 | TreeMeshGPT: Artistic Mesh Generation with Autoregressive Tree Sequencing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce TreeMeshGPT, an autoregressive Transformer designed to generate high-quality artistic meshes aligned with input point clouds. |
Stefan Lionar; Jiabin Liang; Gim Hee Lee; |
| 346 | AutoPresent: Designing Structured Visuals from Scratch Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. |
Jiaxin Ge; Zora Zhiruo Wang; Xuhui Zhou; Yi-Hao Peng; Sanjay Subramanian; Qinyue Tan; Maarten Sap; Alane Suhr; Daniel Fried; Graham Neubig; Trevor Darrell; |
| 347 | AudCast: Audio-Driven Human Video Generation By Cascaded Diffusion Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose AudCast, a generalized audio-driven human video generation framework adopting a cascaded Diffusion Transformers (DiTs) paradigm, which synthesizes holistic human videos based on a reference image and a given audio clip. |
Jiazhi Guan; Kaisiyuan Wang; Zhiliang Xu; Quanwei Yang; Yasheng Sun; Shengyi He; Borong Liang; Yukang Cao; Yingying Li; Haocheng Feng; Errui Ding; Jingdong Wang; Youjian Zhao; Hang Zhou; Ziwei Liu; |
| 348 | Iterative Predictor-Critic Code Decoding for Real-World Image Dehazing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel Iterative Predictor-Critic Code Decoding framework for real-world image dehazing, abbreviated as IPC-Dehaze, which leverages the high-quality codebook prior encapsulated in a pre-trained VQGAN. |
Jiayi Fu; Siyu Liu; Zikun Liu; Chun-Le Guo; Hyunhee Park; Ruiqi Wu; Guoqing Wang; Chongyi Li; |
| 349 | Event-based Video Super-Resolution Via State Space Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce MamEVSR, a Mamba-based network for event-based VSR that leverages the selective state space model, Mamba. |
Zeyu Xiao; Xinchao Wang; |
| 350 | ReconDreamer: Crafting World Models for Driving Scene Reconstruction Via Online Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Closed-loop simulation is crucial for end-to-end autonomous driving. Existing sensor simulation methods (e.g., NeRF and 3DGS) reconstruct driving scenes based on conditions that … |
Chaojun Ni; Guosheng Zhao; Xiaofeng Wang; Zheng Zhu; Wenkang Qin; Guan Huang; Chen Liu; Yuyin Chen; Yida Wang; Xueyang Zhang; Yifei Zhan; Kun Zhan; Peng Jia; Xianpeng Lang; Xingang Wang; Wenjun Mei; |
| 351 | PMA: Towards Parameter-Efficient Point Cloud Understanding Via Point Mamba Adapter Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: It neglects the rich complementary information in the intermediate layer, thereby failing to fully unlock the potential of pre-trained models. To overcome this limitation, we propose an orthogonal solution: Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model and leverages Mamba to fuse all complementary semantics, thereby promoting comprehensive point cloud understanding. |
Yaohua Zha; Yanzi Wang; Hang Guo; Jinpeng Wang; Tao Dai; Bin Chen; Zhihao Ouyang; Xue Yuerong; Ke Chen; Shu-Tao Xia; |
| 352 | DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method under MM-DiT architectures. |
Minghong Cai; Xiaodong Cun; Xiaoyu Li; Wenze Liu; Zhaoyang Zhang; Yong Zhang; Ying Shan; Xiangyu Yue; |
| 353 | Playing The Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, despite the significance of safety alignment, its vulnerabilities remain largely underexplored. In this paper, we investigate an unexplored vulnerability of safety alignment, examining its ability to consistently provide safety guarantees for out-of-distribution (OOD)-ifying harmful inputs that may fall outside the aligned data distribution. |
Joonhyun Jeong; Seyun Bae; Yeonsung Jung; Jaeryong Hwang; Eunho Yang; |
| 354 | Everything to The Synthetic: Diffusion-driven Test-time Adaptation Via Synthetic-Domain Alignment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, in this paper, we reveal that although the synthetic data in diffusion-driven TTA seems indistinguishable from the source data, it is unaligned with, or even markedly different from the latter for deep networks. To address this issue, we propose a Synthetic-Domain Alignment (SDA) framework. |
Jiayi Guo; Junhao Zhao; Chaoqun Du; Yulin Wang; Chunjiang Ge; Zanlin Ni; Shiji Song; Humphrey Shi; Gao Huang; |
| 355 | MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Furthermore, there is currently no publicly accessible dataset specifically designed for analyzing, evaluating, and training models for long video generation. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges by providing unique contributions: (1) character consistency across scenes, (2) long videos with rich and coherent storylines, and (3) multi-scene narratives. |
Weijia Wu; Mingyu Liu; Zeyu Zhu; Xi Xia; Haoen Feng; Wen Wang; Kevin Qinghong Lin; Chunhua Shen; Mike Zheng Shou; |
| 356 | Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present Change3D, a framework that reconceptualizes the change detection and captioning tasks through video modeling. |
Duowang Zhu; Xiaohu Huang; Haiyan Huang; Hao Zhou; Zhenfeng Shao; |
| 357 | AniGS: Animatable Gaussian Avatar from A Single Image with Inconsistent Gaussian Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing 3D reconstruction methods often struggle to capture fine details in animatable models, while generative approaches for controllable animation, though avoiding explicit 3D modeling, suffer from viewpoint inconsistencies in extreme poses and computational inefficiencies. In this paper, we address these challenges by leveraging the power of generative models to produce detailed multi-view canonical pose images, which help resolve ambiguities in animatable human reconstruction. |
Lingteng Qiu; Shenhao Zhu; Qi Zuo; Xiaodong Gu; Yuan Dong; Junfei Zhang; Chao Xu; Zhe Li; Weihao Yuan; Liefeng Bo; Guanying Chen; Zilong Dong; |
| 358 | DrVideo: Document Retrieval Based Long Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Thus, we propose DrVideo, a document-retrieval-based system designed for long video understanding. |
Ziyu Ma; Chenhui Gou; Hengcan Shi; Bin Sun; Shutao Li; Hamid Rezatofighi; Jianfei Cai; |
| 359 | DiC: Rethinking Conv3x3 Designs in Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Based on the architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. |
Yuchuan Tian; Jing Han; Chengcheng Wang; Yuchen Liang; Chao Xu; Hanting Chen; |
| 360 | Embracing Collaboration Over Competition: Condensing Multiple Prompts for Visual In-Context Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Multiple suitable prompts may exist, but individually they often fall short, leading to difficulties in selection and the exclusion of useful context. To address this, we propose a new perspective: prompt condensation. |
Jinpeng Wang; Tianci Luo; Yaohua Zha; Yan Feng; Ruisheng Luo; Bin Chen; Tao Dai; Long Chen; Yaowei Wang; Shu-Tao Xia; |
| 361 | Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Learning from multiple domains is a primary factor that influences the generalization of a single unified robot system. In this paper, we aim to learn the trajectory prediction model by using broad out-of-domain data to improve its performance and generalization ability. |
Jiange Yang; Haoyi Zhu; Yating Wang; Gangshan Wu; Tong He; Limin Wang; |
| 362 | METASCENES: Towards Automated Replica Creation for Real-world 3D Scans Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Embodied AI (EAI) research requires high-quality, diverse 3D scenes to effectively support skill acquisition, sim-to-real transfer, and generalization. Achieving these quality … |
Huangyue Yu; Baoxiong Jia; Yixin Chen; Yandan Yang; Puhao Li; Rongpeng Su; Jiaxin Li; Qing Li; Wei Liang; Song-Chun Zhu; Tengyu Liu; Siyuan Huang; |
| 363 | SegAgent: Exploring Pixel Understanding Capabilities in MLLMs By Imitating Human Annotator Trajectories Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This approach disrupts the MLLM’s text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model’s intrinsic pixel-level understanding. Thus, we introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. |
Muzhi Zhu; Yuzhuo Tian; Hao Chen; Chunluan Zhou; Qingpei Guo; Yang Liu; Ming Yang; Chunhua Shen; |
| 364 | Temporally Consistent Object-Centric Learning By Contrasting Slots Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce a novel object-level temporal contrastive loss for video object-centric models that explicitly promotes temporal consistency. |
Anna Manasyan; Maximilian Seitzer; Filip Radovic; Georg Martius; Andrii Zadaianchuk; |
| 365 | Bridging The Gap Between Gaussian Diffusion Models and Universal Quantization for Image Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a novel quantization-based forward diffusion process with theoretical foundations that tackles all three aforementioned gaps. |
Lucas Relic; Roberto Azevedo; Yang Zhang; Markus Gross; Christopher Schroers; |
| 366 | Pose Priors from Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Language is often used to describe physical interaction, yet most 3D human pose estimation methods overlook this rich source of information. We bridge this gap by leveraging large multimodal models (LMMs) as priors for reconstructing contact poses, offering a scalable alternative to traditional methods that rely on human annotations or motion capture data. |
Sanjay Subramanian; Evonne Ng; Lea Müller; Dan Klein; Shiry Ginosar; Trevor Darrell; |
| 367 | Disco4D: Disentangled 4D Human Generation and Animation from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Disco4D, a novel Gaussian Splatting framework for 4D human generation and animation from a single image. |
Hui En Pang; Shuai Liu; Zhongang Cai; Lei Yang; Tianwei Zhang; Ziwei Liu; |
| 368 | Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, directly distilling 4D avatars from video diffusion often leads to over-smooth results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that synthesizes a spatial and temporal consistency dataset for 4D avatar reconstruction using the video diffusion model. |
Zhenglin Zhou; Fan Ma; Hehe Fan; Tat-Seng Chua; |
| 369 | MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called "mesh attention" to enable training at 1024×1024 resolution. |
Yuhan Wang; Fangzhou Hong; Shuai Yang; Liming Jiang; Wayne Wu; Chen Change Loy; |
| 370 | VGGT: Visual Geometry Grounded Transformer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. |
Jianyuan Wang; Minghao Chen; Nikita Karaev; Andrea Vedaldi; Christian Rupprecht; David Novotny; |
| 371 | Parallelized Autoregressive Visual Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a simple yet effective approach for parallelized autoregressive visual generation that improves generation efficiency while preserving the advantages of autoregressive modeling. |
Yuqing Wang; Shuhuai Ren; Zhijie Lin; Yujin Han; Haoyuan Guo; Zhenheng Yang; Difan Zou; Jiashi Feng; Xihui Liu; |
| 372 | Co-op: Correspondence-based Novel Object Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Co-op, a novel method for accurately and robustly estimating the 6DoF pose of objects unseen during training from a single RGB image. |
Sungphill Moon; Hyeontae Son; Dongcheol Hur; Sangwook Kim; |
| 373 | PanSplat: 4K Panorama Synthesis with Feed-Forward Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present PanSplat, a generalizable, feed-forward approach that efficiently supports resolution up to 4K (2048×4096). |
Cheng Zhang; Haofei Xu; Qianyi Wu; Camilo Cruz Gambardella; Dinh Phung; Jianfei Cai; |
| 374 | DTOS: Dynamic Time Object Sensing with Large Multimodal Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To address these, we propose a novel framework, Dynamic Time Object Sensing (DTOS), specifically designed for referring video object segmentation (RVOS). |
Jirui Tian; Jinrong Zhang; Shenglan Liu; Luhao Xu; Zhixiong Huang; Gao Huang; |
| 375 | Hierarchical Features Matter: A Deep Exploration of Progressive Parameterization Method for Dataset Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they limit themselves to a fixed optimization space for distillation, neglecting the diverse guidance across different informative latent spaces. To overcome this limitation, we propose a novel parameterization method dubbed Hierarchical Parameterization Distillation (H-PD) to systematically explore hierarchical features within the provided feature space (e.g., layers within pre-trained generative adversarial networks). |
Xinhao Zhong; Hao Fang; Bin Chen; Xulin Gu; Meikang Qiu; Shuhan Qi; Shu-Tao Xia; |
| 376 | Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our method introduces a gradual masking process in which a small set of candidate patches is first pre-selected as potential mask regions. |
Gensheng Pei; Tao Chen; Yujia Wang; Xinhao Cai; Xiangbo Shu; Tianfei Zhou; Yazhou Yao; |
| 377 | MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Because capturing images that avoid these pitfalls is challenging, this severely limits the wider use of SfM, especially by non-expert users. We overcome these limitations by augmenting the classical SfM paradigm with monocular depth and normal priors inferred by deep neural networks. |
Zador Pataki; Paul-Edouard Sarlin; Johannes L. Schönberger; Marc Pollefeys; |
| 378 | AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. |
Qifan Yu; Wei Chow; Zhongqi Yue; Kaihang Pan; Yang Wu; Xiaoyang Wan; Juncheng Li; Siliang Tang; Hanwang Zhang; Yueting Zhuang; |
| 379 | Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Traditional methods, such as temporal attention mechanisms and 3D convolutions, often struggle with significant object movements and fail to capture long-range temporal dependencies in dynamic scenes. To address these limitations, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks — sequences of corresponding points across frames. |
Zihang Lai; Andrea Vedaldi; |
| 380 | 3D-MVP: 3D Multiview Pretraining for Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose 3D-MVP, a novel approach for 3D Multi-View Pretraining using masked autoencoders. |
Shengyi Qian; Kaichun Mo; Valts Blukis; David F. Fouhey; Dieter Fox; Ankit Goyal; |
| 381 | FineVQ: Fine-Grained User Generated Content Video Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the challenges and promote the development of UGC videos, we establish the first large-scale Fine-grained Video quality assessment Database, termed FineVD, which comprises 6104 UGC videos with fine-grained quality scores and descriptions across multiple dimensions. Based on this database, we propose a Fine-grained Video Quality assessment (FineVQ) model to learn the fine-grained quality of UGC videos, with the capabilities of quality rating, quality scoring, and quality attribution. |
Huiyu Duan; Qiang Hu; Jiarui Wang; Liu Yang; Zitong Xu; Lu Liu; Xiongkuo Min; Chunlei Cai; Tianxiao Ye; Xiaoyun Zhang; Guangtao Zhai; |
| 382 | Search and Detect: Training-Free Long Tail Object Detection Via Web-Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce SearchDet, a training-free long-tail object detection framework that significantly enhances open-vocabulary object detection performance. |
Mankeerat Sidhu; Hetarth Chopra; Ansel Blume; Jeonghwan Kim; Revanth Gangi Reddy; Heng Ji; |
| 383 | QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on The Edge Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To mitigate the performance degradation, we introduce an activation polishing and compensation algorithm applied before and after activation quantization, as well as a weight reconstruction method for minimizing errors in weight quantization. |
Xuan Shen; Weize Ma; Jing Liu; Changdi Yang; Rui Ding; Quanyi Wang; Henghui Ding; Wei Niu; Yanzhi Wang; Pu Zhao; Jun Lin; Jiuxiang Gu; |
| 384 | Accelerating Diffusion Transformer Via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose increment-calibrated caching, a training-free method for DiT acceleration, where the calibration parameters are generated from the pre-trained model itself with low-rank approximation. |
Zhiyuan Chen; Keyi Li; Yifan Jia; Le Ye; Yufei Ma; |
| 385 | MVSAnywhere: Zero-Shot Multi-View Stereo Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Training a general-purpose multi-view stereo model is challenging and raises several questions, e.g. how to best make use of transformer-based architectures, how to incorporate additional metadata when there is a variable number of input views, and how to estimate the range of valid depths which can vary considerably across different scenes and is typically not known a priori? To address these issues, we introduce MVSA, a novel and versatile Multi-View Stereo architecture that aims to work Anywhere by generalizing across diverse domains and depth ranges. |
Sergio Izquierdo; Mohamed Sayed; Michael Firman; Guillermo Garcia-Hernando; Daniyar Turmukhambetov; Javier Civera; Oisin Mac Aodha; Gabriel Brostow; Jamie Watson; |
| 386 | WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present WildGS-SLAM, a robust and efficient monocular RGB SLAM system designed to handle dynamic environments by leveraging uncertainty-aware geometric mapping. |
Jianhao Zheng; Zihan Zhu; Valentin Bieri; Marc Pollefeys; Songyou Peng; Iro Armeni; |
| 387 | CoMBO: Conflict Mitigation Via Branched Optimization for Class Incremental Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This inherent conflict often leads to a back-and-forth trade-off, turning the objective into balancing the performance of previous (old) and incremental (new) classes. To address this conflict, we introduce a novel approach, Conflict Mitigation via Branched Optimization (CoMBO). |
Kai Fang; Anqi Zhang; Guangyu Gao; Jianbo Jiao; Chi Harold Liu; Yunchao Wei; |
| 388 | G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present G3Flow, a novel framework that constructs real-time semantic flow, a dynamic, object-centric 3D semantic representation by leveraging foundation models. |
Tianxing Chen; Yao Mu; Zhixuan Liang; Zanxin Chen; Shijia Peng; Qiangyu Chen; Mingkun Xu; Ruizhen Hu; Hongyuan Zhang; Xuelong Li; Ping Luo; |
| 389 | Generative Multiview Relighting for 3D Reconstruction Under Extreme Illumination Variation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present an approach that reconstructs objects from images taken under different illuminations by first relighting the images under a single reference illumination with a multiview relighting diffusion model and then reconstructing the object’s geometry and appearance with a radiance field architecture that is robust to the small remaining inconsistencies among the relit images. |
Hadi Alzayer; Philipp Henzler; Jonathan T. Barron; Jia-Bin Huang; Pratul P. Srinivasan; Dor Verbin; |
| 390 | Efficient Transfer Learning for Video-language Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose a parameter-efficient Multi-modal Spatio-Temporal Adapter (MSTA) to enhance the alignment between textual and visual representations, achieving a balance between generalizable knowledge and task-specific adaptation. |
Haoxing Chen; Zizheng Huang; Yan Hong; Yanshuo Wang; Zhongcai Lyu; Zhuoer Xu; Jun Lan; Zhangxuan Gu; |
| 391 | SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token to prompt and align SAM2 for achieving Ref-AVS in the LAVS. |
Yuji Wang; Haoran Xu; Yong Liu; Jiaze Li; Yansong Tang; |
| 392 | Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This limitation not only hinders comprehensive representations of 3D scene, but also compromises training and inference efficiency. To address these challenges, we propose a unified Instance-aware 3D Large Multi-modal Model (Inst3D-LMM) to deal with multiple 3D scene understanding tasks simultaneously. |
Hanxun Yu; Wentong Li; Song Wang; Junbo Chen; Jianke Zhu; |
| 393 | Classic Video Denoising in A Machine Learning World: Robust, Fast, and Controllable Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they require manually tuning parameters for each input video, which is not only tedious but also requires skill. We bridge the gap between these two paradigms by proposing a differentiable denoising pipeline based on traditional methods. |
Xin Jin; Simon Niklaus; Zhoutong Zhang; Zhihao Xia; Chunle Guo; Yuting Yang; Jiawen Chen; Chongyi Li; |
| 394 | CoSER: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce CoSER, a novel consistent dense Multiview Text-to-Image Generator for Text-to-3D, achieving both efficiency and quality by meticulously learning neighbor-view coherence and further alleviating ambiguity through the swift traversal of all views. |
Bonan Li; Zicheng Zhang; Xingyi Yang; Xinchao Wang; |
| 395 | 3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we present a novel 3D enhancement pipeline, dubbed 3DEnhancer, which employs a multi-view latent diffusion model to enhance coarse 3D inputs while preserving multi-view consistency. |
Yihang Luo; Shangchen Zhou; Yushi Lan; Xingang Pan; Chen Change Loy; |
| 396 | ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose ChainHOI, a novel approach for text-driven human-object interaction (HOI) generation that explicitly models interactions at both the joint and kinematic chain levels. |
Ling-An Zeng; Guohong Huang; Yi-Lin Wei; Shengbo Gu; Yu-Ming Tang; Jingke Meng; Wei-Shi Zheng; |
| 397 | Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this study, we discover that many layers within lightweight ViT-based trackers tend to learn relatively redundant and repetitive target representations. |
Chaocan Xue; Bineng Zhong; Qihua Liang; Yaozong Zheng; Ning Li; Yuanliang Xue; Shuxiang Song; |
| 398 | Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Also, recent transformer-based 4D backbones commonly suffer from large computational costs due to their quadratic complexity, particularly for long video sequences. To address these challenges, we propose a novel point cloud video understanding backbone purely based on the State Space Models (SSMs). |
Jiuming Liu; Jinru Han; Lihao Liu; Angelica I. Aviles-Rivero; Chaokang Jiang; Zhe Liu; Hesheng Wang; |
| 399 | Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper proposes a general solution to enable point cloud recognition models to handle distribution shifts at test time. |
Hongyu Sun; Qiuhong Ke; Ming Cheng; Yongcai Wang; Deying Li; Chenhui Gou; Jianfei Cai; |
| 400 | Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation. |
Reza Qorbani; Gianluca Villani; Theodoros Panagiotakopoulos; Marc Botet Colomer; Linus Härenstam-Nielsen; Mattia Segu; Pier Luigi Dovesi; Jussi Karlgren; Daniel Cremers; Federico Tombari; Matteo Poggi; |
| 401 | Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations Via Attention Lens Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, most studies primarily focus on the language aspect rather than the visual. In this paper, we address how LVLMs process visual information and whether this process causes hallucination. |
Zhangqi Jiang; Junkai Chen; Beier Zhu; Tingjin Luo; Yankun Shen; Xu Yang; |
| 402 | MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose scaling up 3D scene reconstruction by training with synthesized data. |
Hanwen Jiang; Zexiang Xu; Desai Xie; Ziwen Chen; Haian Jin; Fujun Luan; Zhixin Shu; Kai Zhang; Sai Bi; Xin Sun; Jiuxiang Gu; Qixing Huang; Georgios Pavlakos; Hao Tan; |
| 403 | GLASS: Guided Latent Slot Diffusion for Object-Centric Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Yet, object-centric learning struggles on real-world datasets, which contain multiple objects of complex textures and shapes in natural everyday scenes. To address this, we introduce Guided Latent Slot Diffusion (GLASS), a novel slot attention model that learns in the space of generated images and uses semantic and instance guidance modules to learn better slot embeddings for various downstream tasks. |
Krishnakant Singh; Simone Schaub-Meyer; Stefan Roth; |
| 404 | Poly-Autoregressive Prediction for Modeling Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a simple framework for predicting the behavior of an agent in multi-agent settings. |
Neerja Thakkar; Tara Sadjadpour; Jathushan Rajasegeran; Shiry Ginosar; Jitendra Malik; |
| 405 | Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates hallucinations by assembling global features for response generation and local features for visual discrimination simultaneously. |
Wenbin An; Feng Tian; Sicong Leng; Jiahao Nie; Haonan Lin; Qianying Wang; Ping Chen; Xiaoqin Zhang; Shijian Lu; |
| 406 | Volumetric Surfaces: Representing Fuzzy Geometries with Layered Meshes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a novel representation for real-time view synthesis where the (P1) number of sampling locations is small and bounded, (P2) sampling locations are efficiently found via rasterization, and (P3) rendering is sorting-free. |
Stefano Esposito; Anpei Chen; Christian Reiser; Samuel Rota Bulò; Lorenzo Porzi; Katja Schwarz; Christian Richardt; Michael Zollhöfer; Peter Kontschieder; Andreas Geiger; |
| 407 | Stacking Brick By Brick: Aligned Feature Isolation for Incremental Face Forgery Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: The rapid advancement of face forgery techniques has introduced a growing variety of forgeries. Incremental Face Forgery Detection (IFFD), which gradually adds new forgery data to fine-tune the previously trained model, has been introduced as a promising strategy to deal with evolving forgery methods. However, a naively trained IFFD model is prone to catastrophic forgetting when new forgeries are integrated, as treating all forgeries as a single "Fake" class in the Real/Fake classification can cause different forgery types to override one another, resulting in the forgetting of unique characteristics from earlier tasks and limiting the model’s effectiveness in learning forgery specificity and generality. In this paper, we propose to stack the latent feature distributions of previous and new tasks brick by brick, i.e., achieving aligned feature isolation. |
Jikang Cheng; Zhiyuan Yan; Ying Zhang; Li Hao; Jiaxin Ai; Qin Zou; Chen Li; Zhongyuan Wang; |
| 408 | GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose GoalFlow, an end-to-end autonomous driving method for generating high-quality multimodal trajectories. |
Zebin Xing; Xingyu Zhang; Yang Hu; Bo Jiang; Tong He; Qian Zhang; Xiaoxiao Long; Wei Yin; |
| 409 | How to Merge Your Multimodal Models Over Time? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Should different strategies be used for the training initialization and deployment phases? To tackle these questions, we propose a unified framework called TIME—Temporal Integration of Model Expertise—that defines temporal model merging across three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. |
Sebastian Dziadzio; Vishaal Udandarao; Karsten Roth; Ameya Prabhu; Zeynep Akata; Samuel Albanie; Matthias Bethge; |
| 410 | Pathways on The Image Manifold: Image Editing Via Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. |
Noam Rotstein; Gal Yona; Daniel Silver; Roy Velich; David Bensaid; Ron Kimmel; |
| 411 | MC^2: Multi-concept Guidance for Customized Multi-concept Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose MC^2, a novel approach for multi-concept customization that enhances flexibility and fidelity through inference-time optimization. |
Jiaxiu Jiang; Yabo Zhang; Kailai Feng; Xiaohe Wu; Wenbo Li; Renjing Pei; Fan Li; Wangmeng Zuo; |
| 412 | SpiritSight Agent: Advanced GUI Agent with One Look Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While they generally meet the requirements of compatibility and low latency, these vision-based GUI agents tend to have low accuracy due to their limitations in element grounding. To address this issue, we propose SpiritSight, a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms. |
Zhiyuan Huang; Ziming Cheng; Junting Pan; Zhaohui Hou; Mingjie Zhan; |
| 413 | 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. |
Jianing Yang; Xuweiyi Chen; Nikhil Madaan; Madhavan Iyengar; Shengyi Qian; David F. Fouhey; Joyce Chai; |
| 414 | DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Taking into account that depth can be regarded as a geometry supplement for RGB images, a straightforward question arises: Do we really need to explicitly encode depth information with neural networks as done for RGB images? Based on this insight, in this paper, we investigate a new way to learn RGBD feature representations and present DFormerv2, a strong RGBD encoder that explicitly uses depth maps as geometry priors rather than encoding depth information with neural networks. |
Bo-Wen Yin; Jiao-Long Cao; Ming-Ming Cheng; Qibin Hou; |
| 415 | HRAvatar: High-Quality and Relightable Gaussian Head Avatar Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These methods also fail to produce realistic effects under novel lighting conditions. To address these issues, we propose HRAvatar, a 3DGS-based method that reconstructs high-fidelity, relightable 3D head avatars. |
Dongbin Zhang; Yunfei Liu; Lijian Lin; Ye Zhu; Kangjie Chen; Minghan Qin; Yu Li; Haoqian Wang; |
| 416 | Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To safeguard personal videos from unauthorized use, we propose two series of protective video watermarks with imperceptible adversarial perturbations, named Ramblings and Mutes. |
Haitong Liu; Kuofeng Gao; Yang Bai; Jinmin Li; Jinxiao Shan; Tao Dai; Shu-Tao Xia; |
| 417 | MobileMamba: Lightweight Multi-Receptive Visual Mamba Network Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose the MobileMamba framework, which balances efficiency and performance. |
Haoyang He; Jiangning Zhang; Yuxuan Cai; Hongxu Chen; Xiaobin Hu; Zhenye Gan; Yabiao Wang; Chengjie Wang; Yunsheng Wu; Lei Xie; |
| 418 | Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Track4Gen, a spatially aware video generator that combines video diffusion loss with point tracking across frames, providing enhanced spatial supervision on the diffusion features. |
Hyeonho Jeong; Chun-Hao P. Huang; Jong Chul Ye; Niloy J. Mitra; Duygu Ceylan; |
| 419 | Omni-ID: Holistic Identity Representation Designed for Generative Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Omni-ID, a novel facial representation designed specifically for generative tasks. |
Guocheng Qian; Kuan-Chieh Wang; Or Patashnik; Negin Heravi; Daniil Ostashev; Sergey Tulyakov; Daniel Cohen-Or; Kfir Aberman; |
| 420 | Efficient Visual State Space Model for Image Deblurring Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose a simple yet effective visual state space model (EVSSM) for image deblurring, leveraging the benefits of state space models (SSMs) for visual data. |
Lingshun Kong; Jiangxin Dong; Jinhui Tang; Ming-Hsuan Yang; Jinshan Pan; |
| 421 | DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: The increased input length for LLM significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, are proposed. |
Saeed Ranjbar Alvar; Gursimran Singh; Mohammad Akbari; Yong Zhang; |
| 422 | Personalized Preference Fine-tuning of Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This lack of personalization limits the efficacy of these models. To bridge this gap, we introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. |
Meihua Dang; Anikait Singh; Linqi Zhou; Stefano Ermon; Jiaming Song; |
| 423 | ILIAS: Instance-Level Image Retrieval At Scale Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work introduces ILIAS, a new test dataset for Instance-Level Image retrieval At Scale. |
Giorgos Kordopatis-Zilos; Vladan Stojnić; Anna Manko; Pavel Suma; Nikolaos-Antonios Ypsilantis; Nikos Efthymiadis; Zakaria Laskar; Jiri Matas; Ondrej Chum; Giorgos Tolias; |
| 424 | Hash3D: Training-free Acceleration for 3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce Hash3D, a universal acceleration for 3D score distillation sampling (SDS) without model training. Central to Hash3D is the observation that images rendered from similar camera positions and diffusion time-steps often have redundant feature maps. |
Xingyi Yang; Songhua Liu; Xinchao Wang; |
| 425 | TFCustom: Customized Image Generation with Time-Aware Frequency Feature Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we revisit the extraction of reference features and propose TFCustom, a model framework designed to focus on reference image features at different temporal steps and frequency levels. |
Mushui Liu; Dong She; Jingxuan Pang; Qihan Huang; Jiacheng Ying; Wanggui He; Yuanlei Hou; Siming Fu; |
| 426 | Few-shot Implicit Function Generation Via Equivariance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This is challenging because even for the same signal, the optimal INRs can vary significantly depending on their initializations. To tackle this, we propose EquiGen, a framework that can generate new INRs from limited data. |
Suizhi Huang; Xingyi Yang; Hongtao Lu; Xinchao Wang; |
| 427 | Understanding Multi-Task Activities from Single-Task Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Extensive experiments demonstrate that our framework effectively bridges the gap between single-task training and multi-task testing, advancing temporal action segmentation with state-of-the-art performance in complex environments. |
Yuhan Shen; Ehsan Elhamifar; |
| 428 | LIRM: Large Inverse Rendering Model for Progressive Reconstruction of Shape, Materials and View-dependent Radiance Fields Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Large Inverse Rendering Model (LIRM), a transformer architecture that jointly reconstructs high-quality shape, materials, and radiance fields with view-dependent effects in less than a second. |
Zhengqin Li; Dilin Wang; Ka Chen; Zhaoyang Lv; Thu Nguyen-Phuoc; Milim Lee; Jia-Bin Huang; Lei Xiao; Yufeng Zhu; Carl S. Marshall; Yuheng Ren; Richard Newcombe; Zhao Dong; |
| 429 | VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis Through User Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, most evaluations rely on traditional methods like multiple-choice question answering in benchmarks such as VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and given the prohibitive cost and slow pace of human annotation for video tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena’s framework, designed to automatically assess LMMs’ video analysis abilities. |
Ziyang Luo; Haoning Wu; Dongxu Li; Jing Ma; Mohan Kankanhalli; Junnan Li; |
| 430 | HD-EPIC: A Highly-Detailed Egocentric Video Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. |
Toby Perrett; Ahmad Darkhalil; Saptarshi Sinha; Omar Emara; Sam Pollard; Kranti Kumar Parida; Kaiting Liu; Prajwal Gatti; Siddhant Bansal; Kevin Flanagan; Jacob Chalk; Zhifan Zhu; Rhodri Guerrier; Fahd Abdelazim; Bin Zhu; Davide Moltisanti; Michael Wray; Hazel Doughty; Dima Damen; |
| 431 | Unboxed: Geometrically and Temporally Consistent Video Outpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite promising results, this strategy currently remains limited in quality. In this work, we address this problem using two key ideas: 3D-supported outpainting for the static regions of the images, and leveraging a pre-trained video diffusion model to ensure realistic and temporally coherent results, particularly for the dynamic parts. |
Zhongrui Yu; Martina Megaro-Boldini; Robert W. Sumner; Abdelaziz Djelouah; |
| 432 | MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Although these models can produce high-fidelity video sequences with an advanced diffusion-based generator, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we explore solving this problem by combining pixel-level generation loss with MAE-style feature-level context learning. |
Jingcheng Ni; Yuxin Guo; Yichen Liu; Rui Chen; Lewei Lu; Zehuan Wu; |
| 433 | Do We Always Need The Simplicity Bias? Looking for Optimal Inductive Biases in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper explores the limits of this assumption. Building on recent work that showed that activation functions are the origin of the simplicity bias (Teney, 2024), we introduce a method to meta-learn activation functions to modulate this bias. |
Damien Teney; Liangze Jiang; Florin Gogianu; Ehsan Abbasnejad; |
| 434 | VideoWorld: Exploring Knowledge Learning from Unlabeled Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. |
Zhongwei Ren; Yunchao Wei; Xun Guo; Yao Zhao; Bingyi Kang; Jiashi Feng; Xiaojie Jin; |
| 435 | Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Generative Densification, an efficient and generalizable densification strategy specifically tailored for feed-forward models. |
Seungtae Nam; Xiangyu Sun; Gyeongjin Kang; Younggeun Lee; Seungjun Oh; Eunbyung Park; |
| 436 | Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Traditional approaches that simply concatenate isolated signs often result in abrupt transitions, disrupting video coherence. To address this, we propose a novel framework, Sign-D2C, that employs a conditional diffusion model to synthesize contextually smooth transition frames, enabling the seamless construction of continuous sign language sequences. |
Shengeng Tang; Jiayi He; Lechao Cheng; Jingjing Wu; Dan Guo; Richang Hong; |
| 437 | ShiftwiseConv: Small Convolutional Kernel with Large Kernel Effect Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we reveal that the key hidden factors of large kernels can be summarized as two separate components: extracting features at a certain granularity and fusing features by multiple pathways. |
Dachong Li; Li Li; Zhuangzhuang Chen; Jianqiang Li; |
| 438 | Generative Zero-Shot Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a novel generative method that augments Composed Image Retrieval by Composed Image Generation (CIG) to provide pseudo-target images. |
Lan Wang; Wei Ao; Vishnu Naresh Boddeti; Ser-Nam Lim; |
| 439 | 3DTopia-XL: Scaling High-quality 3D Asset Generation Via Primitive Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Despite recent advancements in 3D generative models, existing methods still face challenges with optimization speed, geometric fidelity, and the lack of assets for physically based rendering (PBR). In this paper, we introduce 3DTopia-XL, a scalable native 3D generative model designed to overcome these limitations. |
Zhaoxi Chen; Jiaxiang Tang; Yuhao Dong; Ziang Cao; Fangzhou Hong; Yushi Lan; Tengfei Wang; Haozhe Xie; Tong Wu; Shunsuke Saito; Liang Pan; Dahua Lin; Ziwei Liu; |
| 440 | Evaluating Model Perception of Color Illusions in Photorealistic Scenes Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose an automated framework for generating color illusion images, resulting in RCID (Realistic Color Illusion Dataset), a dataset of 19,000 realistic illusion images. |
Lingjun Mao; Zineng Tang; Alane Suhr; |
| 441 | DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present DeSiRe-GS, a self-supervised Gaussian splatting representation, enabling effective static-dynamic decomposition and high-fidelity surface reconstruction in complex driving scenarios. |
Chensheng Peng; Chengwei Zhang; Yixiao Wang; Chenfeng Xu; Yichen Xie; Wenzhao Zheng; Kurt Keutzer; Masayoshi Tomizuka; Wei Zhan; |
| 442 | StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Text-driven style transfer aims to merge the style of a reference image with content described by a text prompt. Recent advancements in text-to-image models have improved the nuance of style transformations, yet significant challenges remain, particularly overfitting to reference styles, limited stylistic control, and misalignment with textual content. In this paper, we propose three complementary strategies to address these issues. |
Mingkun Lei; Xue Song; Beier Zhu; Hao Wang; Chi Zhang; |
| 443 | Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We argue that this problem can be solved through explicit cooperation among tasks. To achieve this goal, we propose a unified learning method that achieves explicit inter-task cooperation from both the data and model perspectives. |
Henghui Du; Guangyao Li; Chang Zhou; Chunjie Zhang; Alan Zhao; Di Hu; |
| 444 | CoLLM: A Large Language Model for Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. |
Chuong Huynh; Jinyu Yang; Ashish Tawari; Mubarak Shah; Son Tran; Raffay Hamid; Trishul Chilimbi; Abhinav Shrivastava; |
| 445 | Are Spatial-Temporal Graph Convolution Networks for Human Action Recognition Over-Parameterized? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, despite the development of numerous models, their recognition performance does not differ significantly after aligning the input settings. With this observation, we hypothesize that ST-GCNs are over-parameterized for HAR, a conjecture subsequently confirmed through experiments employing the lottery ticket hypothesis. |
Jianyang Xie; Yitian Zhao; Yanda Meng; He Zhao; Anh Nguyen; Yalin Zheng; |
| 446 | FlexGS: Train Once, Deploy Everywhere with Many-in-One Flexible 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present an elastic inference method for 3DGS. |
Hengyu Liu; Yuehao Wang; Chenxin Li; Ruisi Cai; Kevin Wang; Wuyang Li; Pavlo Molchanov; Peihao Wang; Zhangyang Wang; |
| 447 | PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose PanoGS, a novel and efficient 3D panoptic open vocabulary scene understanding approach. |
Hongjia Zhai; Hai Li; Zhenzhe Li; Xiaokun Pan; Yijia He; Guofeng Zhang; |
| 448 | FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. |
Sotiris Anagnostidis; Gregor Bachmann; Yeongmin Kim; Jonas Kohler; Markos Georgopoulos; Artsiom Sanakoyeu; Yuming Du; Albert Pumarola; Ali Thabet; Edgar Schönfeld; |
| 449 | KVQ: Boosting Video Quality Assessment Via Saliency-guided Local Perception Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Inspired by the Human Visual System (HVS) that links global quality to the local texture of different regions and their visual saliency, we propose a Kaleidoscope Video Quality Assessment (KVQ) framework, which aims to effectively assess both saliency and local texture, thereby facilitating the assessment of global quality. |
Yunpeng Qu; Kun Yuan; Qizhi Xie; Ming Sun; Chao Zhou; Jian Wang; |
| 450 | Inference-Scale Complexity in ANN-SNN Conversion for High-Performance and Low-Power Applications Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Even efficient ANN-SNN conversion methods necessitate quantized training of ANNs to enhance the effectiveness of the conversion, incurring additional training costs. To address these challenges, we propose an efficient ANN-SNN conversion framework with only inference-scale complexity. |
Tong Bu; Maohua Li; Zhaofei Yu; |
| 451 | Can Text-to-Video Generation Help Video-Language Alignment? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While a solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help to overcome this issue. |
Luca Zanella; Massimiliano Mancini; Willi Menapace; Sergey Tulyakov; Yiming Wang; Elisa Ricci; |
| 452 | Teaching Large Language Models to Regress Accurate Image Quality Scores Using Score Distribution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we aim to leverage MLLMs to regress accurate quality scores. |
Zhiyuan You; Xin Cai; Jinjin Gu; Tianfan Xue; Chao Dong; |
| 453 | PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. |
Song Wang; Xiaolu Liu; Lingdong Kong; Jianyun Xu; Chunyong Hu; Gongfan Fang; Wentong Li; Jianke Zhu; Xinchao Wang; |
| 454 | LayoutVLM: Differentiable Optimization of 3D Layout Via Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. |
Fan-Yun Sun; Weiyu Liu; Siyi Gu; Dylan Lim; Goutam Bhat; Federico Tombari; Manling Li; Nick Haber; Jiajun Wu; |
| 455 | Adapting Pre-trained 3D Models for Point Cloud Video Understanding Via Cross-frame Spatio-temporal Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate the potential of transferring pre-trained static 3D point cloud models to the 4D domain, identifying the limitations of static models that capture only spatial information while neglecting temporal dynamics. |
Baixuan Lv; Yaohua Zha; Tao Dai; Xue Yuerong; Ke Chen; Shu-Tao Xia; |
| 456 | SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less Than 0.2% Training Cost Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce SAM-I2V, an effective image-to-video upgradation method for cultivating a promptable video segmentation (PVS) model. |
Haiyang Mei; Pengyu Zhang; Mike Zheng Shou; |
| 457 | Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. |
Federico Cocchi; Nicholas Moratelli; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara; |
| 458 | FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To this end, we introduce the knowledge-enhanced adaptive visual compression framework, dubbed FOCUS, which uniquely combines pathology FMs with language prior knowledge to enable a focused analysis of diagnostically relevant regions by prioritizing discriminative WSI patches. |
Zhengrui Guo; Conghao Xiong; Jiabo Ma; Qichen Sun; Lishuang Feng; Jinzhuo Wang; Hao Chen; |
| 459 | SoundVista: Novel-View Ambient Sound Synthesis Via Visual-Acoustic Binding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce SoundVista, a method to generate the ambient sound of an arbitrary scene at novel viewpoints. |
Mingfei Chen; Israel D. Gebru; Ishwarya Ananthabhotla; Christian Richardt; Dejan Markovic; Jake Sandakly; Steven Krenn; Todd Keebler; Eli Shlizerman; Alexander Richard; |
| 460 | One-shot 3D Object Canonicalization Based on Geometric and Semantic Consistency Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce a novel joint energy function to enforce geometric and semantic consistency, aligning object orientations precisely despite significant shape variations. |
Li Jin; Yujie Wang; Wenzheng Chen; Qiyu Dai; Qingzhe Gao; Xueying Qin; Baoquan Chen; |
| 461 | Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Most existing VIT datasets rely heavily on human annotations or paid services like the GPT API, which limits users with constrained resources from creating VIT datasets for custom applications. To address this, we introduce Pre-Instruction Data Selection (PreSel), a more practical data selection paradigm that directly selects the most beneficial unlabeled images and generates instructions only for the selected images. |
Bardia Safaei; Faizan Siddiqui; Jiacong Xu; Vishal M. Patel; Shao-Yuan Lo; |
| 462 | BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present BASKET, a large-scale basketball video dataset for fine-grained skill estimation. |
Yulu Pan; Ce Zhang; Gedas Bertasius; |
| 463 | SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they exhibit poor performance when confronted with sparse inputs, primarily due to the sparse distribution of Gaussian points and insufficient view supervision. To address these challenges, we propose SPC-GS, leveraging Scene-layout-based Gaussian Initialization (SGI) and Semantic-Prompt Consistency (SPC) Regularization for open-world free-view synthesis with sparse inputs. |
Guibiao Liao; Qing Li; Zhenyu Bao; Guoping Qiu; Kanglin Liu; |
| 464 | DPFlow: Adaptive Optical Flow Estimation with A Dual-Pyramid Framework Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. |
Henrique Morimitsu; Xiaobin Zhu; Roberto M. Cesar; Xiangyang Ji; Xu-Cheng Yin; |
| 465 | SketchAgent: Language-Driven Sequential Sketch Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions. |
Yael Vinker; Tamar Rott Shaham; Kristine Zheng; Alex Zhao; Judith E Fan; Antonio Torralba; |
| 466 | ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. |
Tanveer Hannan; Md Mohaiminul Islam; Jindong Gu; Thomas Seidl; Gedas Bertasius; |
| 467 | PillarHist: A Quantization-aware Pillar Feature Encoder Based on Height-aware Histogram Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the above issue, we first unveil the importance of different input information during PFE and identify the height dimension as a key factor in enhancing 3D detection performance. Motivated by this observation, we propose a height-aware pillar feature encoder, called PillarHist. |
Sifan Zhou; Zhihang Yuan; Dawei Yang; Xing Hu; Jian Qian; Ziyu Zhao; |
| 468 | EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce EarthDial, a conversational assistant specifically designed for Earth Observation (EO) data, transforming complex, multi-sensory Earth observations into interactive, natural language dialogues. |
Sagar Soni; Akshay Dudhane; Hiyam Debary; Mustansar Fiaz; Muhammad Akhtar Munir; Muhammad Sohail Danish; Paolo Fraccaro; Campbell D Watson; Levente J Klein; Fahad Shahbaz Khan; Salman Khan; |
| 469 | DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, dynamic distractors in wild environments challenge the static scene assumption in radiance fields, while limited view constraints hinder the accurate capture of underlying scene geometry. To address these challenges, we introduce DroneSplat, a novel framework designed for robust 3D reconstruction from in-the-wild drone imagery. |
Jiadong Tang; Yu Gao; Dianyi Yang; Liqi Yan; Yufeng Yue; Yi Yang; |
| 470 | Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In fact, as shown in the recent analysis conducted by EPIC Fields, 3D techniques are ineffective when it comes to studying dynamic phenomena, and, in particular, when segmenting moving objects. In this paper, we look into this issue in more detail. |
Vadim Tschernezki; Diane Larlus; Iro Laina; Andrea Vedaldi; |
| 471 | EffiDec3D: An Optimized Decoder for High-Performance and Efficient 3D Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose EffiDec3D, an optimized 3D decoder that employs a channel reduction strategy across all decoder stages, which sets the number of channels to the minimum needed for accurate feature representation. |
Md Mostafijur Rahman; Radu Marculescu; |
| 472 | Two By Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Leveraging the 2BY2 dataset, we propose a two-step SE(3) pose estimation method with equivariant features for assembly constraints. |
Yu Qi; Yuanchen Ju; Tianming Wei; Chi Chu; Lawson L.S. Wong; Huazhe Xu; |
| 473 | Image Is All You Need to Empower Large-scale Diffusion Models for In-Domain Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Early research typically required training specialized generators for each unique task and domain, often relying on fully-labeled data. Motivated by the powerful generative capabilities and broad applications of diffusion models, we explore leveraging label-free data to empower these models for in-domain generation. Fine-tuning a pre-trained generative model on domain data is an intuitive but challenging approach that often requires complex manual hyper-parameter adjustments, since the limited diversity of the training data can easily disrupt the model’s original generative capabilities. To address this challenge, we propose a guidance-decoupled prior preservation mechanism that achieves high generative quality and controllability with image-only data, inspired by preserving the pre-trained model from a denoising guidance perspective. We decouple domain-related guidance from the conditional guidance used in classifier-free guidance mechanisms to preserve open-world control guidance and unconditional guidance from the pre-trained model. |
Pu Cao; Feng Zhou; Lu Yang; Tianrui Huang; Qing Song; |
| 474 | GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Thus, they lag behind the growing interest in instruction-following foundation models like LLMs, whose adaptability is crucial yet remains underexplored in fair comparisons. To bridge this gap, we introduce GenManip, a realistic tabletop simulation platform tailored for policy generalization studies. |
Ning Gao; Yilun Chen; Shuai Yang; Xinyi Chen; Yang Tian; Hao Li; Haifeng Huang; Hanqing Wang; Tai Wang; Jiangmiao Pang; |
| 475 | Task-Agnostic Guided Feature Expansion for Class-Incremental Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To promote the learning and transferring of diverse features across tasks, we propose a framework called Task-Agnostic Guided Feature Expansion (TagFex). |
Bowen Zheng; Da-Wei Zhou; Han-Jia Ye; De-Chuan Zhan; |
| 476 | Around The World in 80 Timesteps: A Generative Approach to Global Visual Geolocation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. |
Nicolas Dufour; Vicky Kalogeiton; David Picard; Loic Landrieu; |
| 477 | MambaIRv2: Attentive State Space Restoration Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose MambaIRv2, which equips Mamba with the non-causal modeling ability similar to ViTs to reach the attentive state space restoration model. |
Hang Guo; Yong Guo; Yaohua Zha; Yulun Zhang; Wenbo Li; Tao Dai; Shu-Tao Xia; Yawei Li; |
| 478 | Calibrated Multi-Preference Optimization for Aligning Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Furthermore, they lack generalization to multi-preference scenarios and struggle to handle inconsistencies between rewards. To address this, we present Calibrated Preference Optimization (CaPO), a novel method to align T2I diffusion models by incorporating the general preference from multiple reward models without human annotated data. |
Kyungmin Lee; Xiahong Li; Qifei Wang; Junfeng He; Junjie Ke; Ming-Hsuan Yang; Irfan Essa; Jinwoo Shin; Feng Yang; Yinxiao Li; |
| 479 | 3D-GSW: 3D Gaussian Splatting for Robust Watermarking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: As 3D Gaussian Splatting (3D-GS) gains significant attention and its commercial usage increases, the need for watermarking technologies to prevent unauthorized use of the 3D-GS models and rendered images has become increasingly important. In this paper, we introduce a robust watermarking method for 3D-GS that secures copyright of both the model and its rendered images. |
Youngdong Jang; Hyunje Park; Feng Yang; Heeju Ko; Euijin Choo; Sangpil Kim; |
| 480 | POSTA: A Go-to Framework for Customized Artistic Poster Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Prior work has explored automatic poster design using deep learning techniques, but these approaches lack text accuracy, user customization, and aesthetic appeal, limiting their applicability in artistic domains such as movies and exhibitions, where both clear content delivery and visual impact are essential. To address these limitations, we present POSTA: a modular framework powered by diffusion models and multimodal large language models (MLLMs) for customized artistic poster generation. |
Haoyu Chen; Xiaojie Xu; Wenbo Li; Jingjing Ren; Tian Ye; Songhua Liu; Ying-Cong Chen; Lei Zhu; Xinchao Wang; |
| 481 | ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce ManiVideo, a novel method for generating consistent and temporally coherent bimanual hand-object manipulation videos from given motion sequences of hands and objects. |
Youxin Pang; Ruizhi Shao; Jiajun Zhang; Hanzhang Tu; Yun Liu; Boyao Zhou; Hongwen Zhang; Yebin Liu; |
| 482 | Schedule On The Fly: Diffusion Time Prediction for Faster and Better Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce the Time Prediction Diffusion Model (TPDM) for this purpose. |
Zilyu Ye; Zhiyang Chen; Tiancheng Li; Zemin Huang; Weijian Luo; Guo-Jun Qi; |
| 483 | Visual Agentic AI for Spatial Reasoning with A Dynamic API Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. |
Damiano Marsili; Rohun Agrawal; Yisong Yue; Georgia Gkioxari; |
| 484 | Degradation-Aware Feature Perturbation for All-in-One Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, the feature perturbations primarily include channel-wise perturbations and attention-wise perturbations. |
Xiangpeng Tian; Xiangyu Liao; Xiao Liu; Meng Li; Chao Ren; |
| 485 | Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and A Hybrid Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce Lux Post Facto, a novel portrait video relighting method that produces both photorealistic and temporally consistent lighting effects. |
Yiqun Mei; Mingming He; Li Ma; Julien Philip; Wenqi Xian; David M George; Xueming Yu; Gabriel Dedic; Ahmet Levent Taşel; Ning Yu; Vishal M. Patel; Paul Debevec; |
| 486 | GBC-Splat: Generalizable Gaussian-Based Clothed Human Digitalization Under Sparse RGB Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present an efficient approach for generalizable clothed human digitalization, termed GBC-Splat. |
Hanzhang Tu; Zhanfeng Liao; Boyao Zhou; Shunyuan Zheng; Xilong Zhou; Liuxin Zhang; QianYing Wang; Yebin Liu; |
| 487 | Video-Guided Foley Sound Generation with Multimodal Controls Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce *MultiFoley*, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. |
Ziyang Chen; Prem Seetharaman; Bryan Russell; Oriol Nieto; David Bourgin; Andrew Owens; Justin Salamon; |
| 488 | Less Is More: Efficient Model Merging with Binary Task Switch Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing merging methods face challenges of redundant parameter conflicts and the excessive storage burden of fine-tuned parameters. In this work, through controlled experiments, we reveal that for fine-tuned task vectors, only those parameters with magnitudes above a certain threshold contribute positively to the task, exhibiting a pulse-like characteristic. |
Biqing Qi; Fangyuan Li; Zhen Wang; Junqi Gao; Dong Li; Peng Ye; Bowen Zhou; |
| 489 | Towards Practical Real-Time Neural Video Compression Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce a practical real-time neural video codec (NVC) designed to deliver high compression ratio, low latency and broad versatility. |
Zhaoyang Jia; Bin Li; Jiahao Li; Wenxuan Xie; Linfeng Qi; Houqiang Li; Yan Lu; |
| 490 | Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To this end, we propose a Character Detection Matching (CDM) metric, ensuring evaluation objectivity by designing an image-level rather than LaTeX-level metric score. |
Bin Wang; Fan Wu; Linke Ouyang; Zhuangcheng Gu; Rui Zhang; Renqiu Xia; Botian Shi; Bo Zhang; Conghui He; |
| 491 | Seeing Is Not Believing: Adversarial Natural Object Optimization for Hard-Label 3D Scene Attacks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, this paper attempts to address the challenging hard-label 3D scene attack with access only to the input/output of the 3D models. To make the attack effective and stealthy, we propose to generate universal adversarial objects, which will mislead scene-aware 3D models to predict attacker-chosen labels whenever these objects are placed on any scene input. |
Daizong Liu; Wei Hu; |
| 492 | Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Classic physically-based rendering (PBR) accurately simulates light transport but relies on precise scene representations (explicit 3D geometry, high-quality material properties, and lighting conditions) that are often impractical to obtain in real-world scenarios. Therefore, we introduce Diffusion Renderer, a neural approach that addresses the dual problem of inverse and forward rendering within a holistic framework. |
Ruofan Liang; Zan Gojcic; Huan Ling; Jacob Munkberg; Jon Hasselgren; Chih-Hao Lin; Jun Gao; Alexander Keller; Nandita Vijaykumar; Sanja Fidler; Zian Wang; |
| 493 | Neural Hierarchical Decomposition for Single Image Plant Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present a novel method for generating realistic 3D plant models from single-view photographs. |
Zhihao Liu; Zhanglin Cheng; Naoto Yokoya; |
| 494 | SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present SpectroMotion, a novel approach that combines 3D Gaussian Splatting (3DGS) with physically-based rendering (PBR) and deformation fields to reconstruct dynamic specular scenes. |
Cheng-De Fan; Chen-Wei Chang; Yi-Ruei Liu; Jie-Ying Lee; Jiun-Long Huang; Yu-Chee Tseng; Yu-Lun Liu; |
| 495 | GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The rapid development of Multimodal Large Language Models (MLLMs) has enabled the integration of multiple modalities, including texts and images, within the large language model (LLM) framework. However, texts and images are usually interconnected, forming a multimodal attributed graph (MMAG). It is underexplored how MLLMs can incorporate the relational information (i.e., graph structure) and semantic information (i.e., texts and images) on such graphs for multimodal comprehension and generation. In this paper, we propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs. We first comprehensively study linearization variants to transform semantic and structural information as input for MLLMs. Then, we propose a hierarchical aligner that enables deep graph encoding, bridging the gap between MMAGs and MLLMs. Finally, we explore the inference choices, adapting MLLMs to interleaved text and image generation in graph scenarios. |
Yi Fang; Bowen Jin; Jiacheng Shen; Sirui Ding; Qiaoyu Tan; Jiawei Han; |
| 496 | K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we argue that the intrinsic properties of LoRA can effectively guide diffusion models in merging learned subject and style. |
Ziheng Ouyang; Zhen Li; Qibin Hou; |
| 497 | TransPixeler: Advancing Text-to-Video Generation with Transparency Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. |
Luozhou Wang; Yijun Li; Zhifei Chen; Jui-Hsien Wang; Zhifei Zhang; He Zhang; Zhe Lin; Ying-Cong Chen; |
| 498 | Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Spatial457 supports a cascading evaluation structure, offering 7 question types across 5 difficulty levels that progress from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on Spatial457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. |
Xingrui Wang; Wufei Ma; Tiezheng Zhang; Celso M de Melo; Jieneng Chen; Alan Yuille; |
| 499 | Geometry Field Splatting with Gaussian Surfels Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our first contribution is to derive an efficient and almost exact differentiable rendering algorithm for geometry fields parameterized by Gaussian surfels, while removing current approximations involving Taylor series and no self-attenuation. |
Kaiwen Jiang; Venkataram Sivaram; Cheng Peng; Ravi Ramamoorthi; |
| 500 | Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they still often generate inaccurate or irrelevant responses due to issues like hallucinated image understandings or unrefined reasoning paths. To address these challenges, we introduce Critic-V, a novel framework inspired by the Actor-Critic paradigm to boost the reasoning capability of VLMs. |
Di Zhang; Jingdi Lei; Junxian Li; Xunzhi Wang; Yujie Liu; Zonglin Yang; Jiatong Li; Weida Wang; Suorong Yang; Jianbo Wu; Peng Ye; Wanli Ouyang; Dongzhan Zhou; |
This table only includes 500 papers selected by our daily digest algorithm. To continue with the full list (~2,800 papers), please visit Paper Digest: CVPR-2025 (Full List).