Paper Digest: ICCV 2025 Papers & Highlights
Note: ICCV-2025 accepted more than 2,700 papers; this page includes only 500 of them, selected by our daily paper digest algorithm. Interested users can choose to read all 2,700 ICCV-2025 papers on a separate page.
To search for papers presented at ICCV-2025 on a specific topic, please use the search by venue (ICCV-2025) service. To summarize the latest research published at ICCV-2025 on a specific topic, you can use the review by venue (ICCV-2025) service. If you are interested in browsing papers by author, we have a comprehensive list of ~11,000 authors (ICCV-2025). Using data from 2023 and 2025, our system also generates a report on computer vision trends. Additionally, you may want to explore our “Best Paper” Digest (ICCV), which lists the most influential ICCV papers since 1988.
This curated list is created by the Paper Digest Team. Experience the cutting-edge capabilities of Paper Digest, an innovative AI-powered research platform that delivers personalized, comprehensive daily paper digests on the latest research in your field. It also empowers you to read articles, write articles, get answers, conduct literature reviews, and generate research reports.
Experience the full potential of our services today!
TABLE 1: Paper Digest: ICCV 2025 Papers & Highlights
| # | Paper | Author(s) |
|---|---|---|
| 1 | MetaMorph: Multimodal Understanding and Generation Via Instruction Tuning<br>Highlight: In this work, we propose Visual-Predictive Instruction Tuning (VPiT) – a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. | Shengbang Tong; David Fan; Jiachen Li; Yunyang Xiong; Xinlei Chen; Koustuv Sinha; Michael Rabbat; Yann LeCun; Saining Xie; Zhuang Liu; |
| 2 | CoTracker3: Simpler and Better Point Tracking By Pseudo-Labelling Real Videos (code available)<br>Highlight: We introduce CoTracker3, a new state-of-the-art point tracker. | Nikita Karaev; Yuri Makarov; Jianyuan Wang; Natalia Neverova; Andrea Vedaldi; Christian Rupprecht; |
| 3 | CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy<br>Highlight: To this end, we introduce CC-OCR, a comprehensive benchmark that possesses a diverse range of scenarios, tasks, and challenges. | Zhibo Yang; Jun Tang; Zhaohai Li; Pengfei Wang; Jianqiang Wan; Humen Zhong; Xuejing Liu; Mingkun Yang; Peng Wang; Shuai Bai; Lianwen Jin; Junyang Lin; |
| 4 | Scaling Language-Free Visual Representation Learning<br>Highlight: In this work, we ask the question: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?" | David Fan; Shengbang Tong; Jiachen Zhu; Koustuv Sinha; Zhuang Liu; Xinlei Chen; Michael Rabbat; Nicolas Ballas; Yann LeCun; Amir Bar; Saining Xie; |
| 5 | MIEB: Massive Image Embedding Benchmark (code available)<br>Highlight: We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. | Chenghao Xiao; Isaac Chung; Imene Kerboua; Jamie Stirling; Xin Zhang; Márton Kardos; Roman Solomatin; Noura Al Moubayed; Kenneth Enevoldsen; Niklas Muennighoff; |
| 6 | Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding<br>Highlight: We introduce the Video Turing Test (Video-TT), a benchmark designed to assess whether video LLMs can interpret real-world videos as effectively as humans. Video-TT 1) differentiates between errors due to inadequate frame sampling and genuine gaps in understanding complex visual narratives, and 2) evaluates robustness against natural adversarial questions. | Yuanhan Zhang; Yunice Chew; Yuhao Dong; Aria Leo; Bo Hu; Ziwei Liu; |
| 7 | Learning 4D Embodied World Models<br>Highlight: This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent’s actions, providing both spatial and temporal consistency. | Haoyu Zhen; Qiao Sun; Hongxin Zhang; Junyan Li; Siyuan Zhou; Yilun Du; Chuang Gan; |
| 8 | EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images<br>Highlight: However, its training is heavily reliant on high-quality images and precise camera poses. Meeting these criteria can be challenging in non-ideal real-world conditions, where motion-blurred images frequently occur due to high-speed camera movements. To address these challenges, we introduce Event Stream Assisted Gaussian Splatting (EvaGaussians), a novel approach that harnesses event streams captured by event cameras to facilitate the learning of high-quality 3D-GS from motion-blurred images. | Wangbo Yu; Chaoran Feng; Jianing Li; Jiye Tang; Jiashu Yang; Zhenyu Tang; Meng Cao; Xu Jia; Yuchao Yang; Li Yuan; Yonghong Tian; |
| 9 | Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats<br>Highlight: We propose Long-LRM, a feed-forward 3D Gaussian reconstruction model for instant, high-resolution, 360° wide-coverage, scene-level reconstruction. | Chen Ziwen; Hao Tan; Kai Zhang; Sai Bi; Fujun Luan; Yicong Hong; Li Fuxin; Zexiang Xu; |
| 10 | RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints<br>Highlight: To this end, we propose the concept of compositional constraints for embodied multi-agent systems, addressing the challenges arising from collaboration among embodied agents. | Yiran Qin; Li Kang; Xiufeng Song; Zhenfei Yin; Xiaohong Liu; Xihui Liu; Ruimao Zhang; Lei Bai; |
| 11 | Temporal-aware Query Routing for Real-time Video Instance Segmentation<br>Highlight: Further analysis of the similarities between the outputs from adjacent frames at each transformer decoder layer reveals significant redundant computations within the transformer decoder. To address this issue, we introduce the Temporal-Aware query Routing (TAR) mechanism. | Zesen Cheng; Kehan Li; Yian Zhao; Hang Zhang; Chang Liu; Jie Chen; |
| 12 | Magic Insert: Style-Aware Drag-and-Drop<br>Highlight: We present Magic Insert, a method to drag-and-drop subjects from a user-provided image into a target image of a different style in a plausible manner while matching the style of the target image. | Nataniel Ruiz; Yuanzhen Li; Neal Wadhwa; Yael Pritch; Michael Rubinstein; David E. Jacobs; Shlomi Fruchter; |
| 13 | Cycle-Consistent Learning for Joint Layout-to-Image Generation and Object Detection<br>Highlight: In this paper, we propose a generation-detection cycle consistent (GDCC) learning framework that jointly optimizes both layout-to-image (L2I) generation and object detection (OD) tasks in an end-to-end manner. | Xinhao Cai; Qiuxia Lai; Gensheng Pei; Xiangbo Shu; Yazhou Yao; Wenguan Wang; |
| 14 | Scaling Laws for Native Multimodal Models<br>Highlight: In this work, we revisit the architectural design of native multimodal models (NMMs), those trained from the ground up on all modalities, and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. | Mustafa Shukor; Enrico Fini; Victor Guilherme Turrisi da Costa; Matthieu Cord; Joshua Susskind; Alaaeldin El-Nouby; |
| 15 | DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer<br>Highlight: We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. | Yecheng Wu; Han Cai; Junyu Chen; Zhuoyang Zhang; Enze Xie; Jincheng Yu; Junsong Chen; Jinyi Hu; Yao Lu; Song Han; |
| 16 | VACE: All-in-One Video Creation and Editing (code available)<br>Highlight: We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. | Zeyinzi Jiang; Zhen Han; Chaojie Mao; Jingfeng Zhang; Yulin Pan; Yu Liu; |
| 17 | ViLLa: Video Reasoning Segmentation with Large Language Model (code available)<br>Highlight: However, they struggle to discriminate and deduce objects from user queries in more realistic scenes characterized by long durations, multiple objects, rapid motion, and heavy occlusions. In this work, we analyze the underlying causes of these limitations and present **ViLLa**: **Vi**deo reasoning segmentation with **L**arge **La**nguage Model. | Rongkun Zheng; Lu Qi; Xi Chen; Yi Wang; Kun Wang; Hengshuang Zhao; |
| 18 | Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers<br>Highlight: In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. | Weiming Ren; Wentao Ma; Huan Yang; Cong Wei; Ge Zhang; Wenhu Chen; |
| 19 | SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation (code available)<br>Highlight: This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. | Junsong Chen; Shuchen Xue; Yuyang Zhao; Jincheng Yu; Sayak Paul; Junyu Chen; Han Cai; Song Han; Enze Xie; |
| 20 | YOLOE: Real-Time Seeing Anything (code available)<br>Highlight: In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. | Ao Wang; Lihao Liu; Hui Chen; Zijia Lin; Jungong Han; Guiguang Ding; |
| 21 | VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation<br>Highlight: In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. | Shoubin Yu; Difan Liu; Ziqiao Ma; Yicong Hong; Yang Zhou; Hao Tan; Joyce Chai; Mohit Bansal; |
| 22 | StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth<br>Highlight: We introduce StableDepth, a scene-consistent and scale-invariant depth estimation method achieving scene-level 3D consistency. | Zheng Zhang; Lihe Yang; Tianyu Yang; Chaohui Yu; Xiaoyang Guo; Yixing Lao; Hengshuang Zhao; |
| 23 | MOVE: Motion-Guided Few-Shot Video Object Segmentation<br>Highlight: Existing FSVOS datasets and methods typically focus on object categories, which are static attributes that ignore the rich temporal dynamics in videos, limiting their application in scenarios requiring motion understanding. To fill this gap, we introduce MOVE, a large-scale dataset specifically designed for motion-guided FSVOS. | Kaining Ying; Hengrui Hu; Henghui Ding; |
| 24 | DiffDoctor: Diagnosing Image Diffusion Models Before Treating<br>Highlight: In this work, we argue that problem-solving starts with identification: the model should be aware of not just the presence of defects in an image, but also their specific locations. | Yiyang Wang; Xi Chen; Xiaogang Xu; Sihui Ji; Yu Liu; Yujun Shen; Hengshuang Zhao; |
| 25 | Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation<br>Highlight: Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,104 videos and 61,095 multimodal referring expressions. | Kaining Ying; Henghui Ding; Guangquan Jie; Yu-Gang Jiang; |
| 26 | VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE<br>Highlight: In this paper, we present a powerful video VAE named VideoVAE+ that effectively reconstructs videos with large motion. | Yazhou Xing; Yang Fei; Yingqing He; Jingye Chen; Jiaxin Xie; Xiaowei Chi; Qifeng Chen; |
| 27 | DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs<br>Highlight: Drawing inspiration from resampler structures, we introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. | Jiahe Zhao; Rongkun Zheng; Yi Wang; Helin Wang; Hengshuang Zhao; |
| 28 | MRGen: Segmentation Data Engine For Underrepresented MRI Modalities<br>Highlight: Concretely, our contributions are threefold: (i) we introduce MRGen-DB, a large-scale radiology image-text dataset comprising extensive samples with rich metadata, including modality labels, attributes, regions, and organ information, with a subset featuring pixel-wise mask annotations; (ii) we present MRGen, a diffusion-based data engine for controllable medical image synthesis, conditioned on text prompts and segmentation masks. | Haoning Wu; Ziheng Zhao; Ya Zhang; Yanfeng Wang; Weidi Xie; |
| 29 | StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation (code available)<br>Highlight: We introduce StreamDiffusion, a real-time diffusion pipeline designed for streaming image generation. | Akio Kodaira; Chenfeng Xu; Toshiki Hazama; Takanori Yoshimoto; Kohei Ohno; Shogo Mitsuhori; Soichi Sugano; Hanying Cho; Zhijian Liu; Masayoshi Tomizuka; Kurt Keutzer; |
| 30 | ARGUS: Hallucination and Omission Evaluation in Video-LLMs<br>Highlight: Unfortunately, VideoLLMs hallucinate far more aggressively on freeform text generation tasks like video captioning than they do on multiple choice verification tasks. To address this weakness, we propose ARGUS, a VideoLLM benchmark that measures freeform video captioning performance. | Ruchit Rawal; Reza Shirkavand; Heng Huang; Gowthami Somepalli; Tom Goldstein; |
| 31 | Bolt3D: Generating 3D Scenes in Seconds<br>Highlight: We present a latent diffusion model for fast feed-forward 3D scene generation. | Stanislaw Szymanowicz; Jason Y. Zhang; Pratul Srinivasan; Ruiqi Gao; Arthur Brussee; Aleksander Holynski; Ricardo Martin-Brualla; Jonathan T. Barron; Philipp Henzler; |
| 32 | LLaVA-3D: A Simple Yet Effective Pathway to Empowering LMMs with 3D Capabilities<br>Highlight: In this paper, we introduce a simple yet effective framework called LLaVA-3D. | Chenming Zhu; Tai Wang; Wenwei Zhang; Jiangmiao Pang; Xihui Liu; |
| 33 | FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction<br>Highlight: We introduce FreeSplatter, a scalable feed-forward framework that generates high-quality 3D Gaussians from uncalibrated sparse-view images while estimating camera parameters within seconds. | Jiale Xu; Shenghua Gao; Ying Shan; |
| 34 | StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting<br>Highlight: For example, and counter-intuitively, rendering a lower-resolution image is not necessarily faster. In this work, we address the above limitations by combining 3D Gaussian splatting with stochastic rasterization. | Shakiba Kheradmand; Delio Vicini; George Kopanas; Dmitry Lagun; Kwang Moo Yi; Mark Matthews; Andrea Tagliasacchi; |
| 35 | OminiControl: Minimal and Universal Control for Diffusion Transformer (code available)<br>Highlight: We present OminiControl, a novel approach that rethinks how image conditions are integrated into Diffusion Transformer (DiT) architectures. | Zhenxiong Tan; Songhua Liu; Xingyi Yang; Qiaochu Xue; Xinchao Wang; |
| 36 | ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling<br>Highlight: This approach introduces problematic dependencies between internal skeleton and outer soft tissue, limiting direct control over body height and bone lengths. To address these issues, we present ATLAS, a high-fidelity body model learned from 600k high-resolution scans captured using 240 synchronized cameras. | Jinhyung Park; Javier Romero; Shunsuke Saito; Fabian Prada; Takaaki Shiratori; Yichen Xu; Federica Bogo; Shoou-I Yu; Kris Kitani; Rawal Khirodkar; |
| 37 | Medical World Model<br>Highlight: Providing effective treatment and making informed decisions are essential goals of modern medicine and clinical care. We are interested in simulating disease dynamics for clinical decision-making, leveraging recent advances in large generative models. To this end, we introduce the Medical World Model (MeWM), the first world model in medicine that predicts future disease states based on clinical decisions. | Yijun Yang; Zhao-Yang Wang; Qiuping Liu; Shuwen Sun; Kang Wang; Rama Chellappa; Zongwei Zhou; Alan Yuille; Lei Zhu; Yu-Dong Zhang; Jieneng Chen; |
| 38 | MonoFusion: Sparse-View 4D Reconstruction Via Monocular Fusion<br>Highlight: In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). | Zihan Wang; Jeff Tan; Tarasha Khurana; Neehar Peri; Deva Ramanan; |
| 39 | From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models Via Reflection Tuning<br>Highlight: Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. | Le Zhuo; Liangbing Zhao; Sayak Paul; Yue Liao; Renrui Zhang; Yi Xin; Peng Gao; Mohamed Elhoseiny; Hongsheng Li; |
| 40 | Hi3DGen: High-fidelity 3D Geometry Generation from Images Via Normal Bridging<br>Highlight: With the growing demand for high-fidelity 3D models from 2D images, existing methods still face significant challenges in accurately reproducing fine-grained geometric details due to limitations in domain gaps and inherent ambiguities in RGB images. To address these issues, we propose Hi3DGen, a novel framework for generating high-fidelity 3D geometry from images via normal bridging. | Chongjie Ye; Yushuang Wu; Ziteng Lu; Jiahao Chang; Xiaoyang Guo; Jiaqing Zhou; Hao Zhao; Xiaoguang Han; |
| 41 | Randomized Autoregressive Visual Generation (code available)<br>Highlight: This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. | Qihang Yu; Ju He; Xueqing Deng; Xiaohui Shen; Liang-Chieh Chen; |
| 42 | Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting<br>Highlight: While existing T2I-based image inpainting methods can be applied to this task, they suffer from issues of subject shape expansion, distortion, or impaired ability to align with the text description, resulting in inconsistencies between the visual elements and the text description. To address these challenges, we propose Pinco, a plug-and-play foreground-conditioned inpainting adapter that generates high-quality backgrounds with good text alignment while effectively preserving the shape of the foreground subject. | Guangben Lu; Yuzhen Du; Yizhe Tang; Zhimin Sun; Ran Yi; Yifan Qi; Tianyi Wang; Lizhuang Ma; Fangyuan Zou; |
| 43 | CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization<br>Highlight: Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks. We will release our source code and the synthetic and real-world datasets we created to support further research in this area. | Jan Ackermann; Jonas Kulhanek; Shengqu Cai; Haofei Xu; Marc Pollefeys; Gordon Wetzstein; Leonidas J. Guibas; Songyou Peng; |
| 44 | DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation Via Dictionary Lookup<br>Highlight: In this paper, we propose a novel framework, namely DictAS, which enables a unified model to detect visual anomalies in unseen object categories without any retraining on the target data, only employing a few normal reference images as visual prompts. | Zhen Qu; Xian Tao; Xinyi Gong; ShiChen Qu; Xiaopei Zhang; Xingang Wang; Fei Shen; Zhengtao Zhang; Mukesh Prasad; Guiguang Ding; |
| 45 | Stable Virtual Camera: Generative View Synthesis with Diffusion Models<br>Highlight: We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through simple model design, optimized training recipe, and flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings. | Jensen Zhou; Hang Gao; Vikram Voleti; Aaryaman Vasishta; Chun-Han Yao; Mark Boss; Philip Torr; Christian Rupprecht; Varun Jampani; |
| 46 | GenHancer: Imperfect Generative Models Are Secretly Strong Vision-Centric Enhancers<br>Highlight: In this work, we empirically found that visually perfect generations are not always optimal for representation enhancement. | Shijie Ma; Yuying Ge; Teng Wang; Yuxin Guo; Yixiao Ge; Ying Shan; |
| 47 | AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction (code available)<br>Highlight: In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. | Junhao Cheng; Yuying Ge; Yixiao Ge; Jing Liao; Ying Shan; |
| 48 | SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation (code available)<br>Highlight: We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. | Chun-Han Yao; Yiming Xie; Vikram Voleti; Huaizu Jiang; Varun Jampani; |
| 49 | LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion Via Distillation to Learnable Look-Up Tables<br>Highlight: In this paper, we propose a novel approach toward extremely fast fusion via distillation to learnable look-up tables specifically designed for image fusion, termed LUT-Fuse. | Xunpeng Yi; Yibing Zhang; Xinyu Xiang; Qinglong Yan; Han Xu; Jiayi Ma; |
| 50 | MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI<br>Highlight: Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce **MMReason**, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). | Huanjin Yao; Jiaxing Huang; Yawen Qiu; Michael K. Chen; Wenzheng Liu; Wei Zhang; Wenjie Zeng; Xikun Zhang; Jingyi Zhang; YuXin Song; Wenhao Wu; Dacheng Tao; |
| 51 | VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models (code available)<br>Highlight: However, existing benchmarks for vision-language RMs (VLRMs) typically assess only a single aspect of their capabilities (e.g., distinguishing between two answers), thus limiting the all-round evaluation and restricting the development of RMs in the visual-language domain. To address this gap, we propose a comprehensive and challenging benchmark, dubbed VLRMBench, encompassing 12,634 questions. | Jiacheng Ruan; Wenzhen Yuan; Xian Gao; Ye Guo; Daoxin Zhang; Zhe Xu; Yao Hu; Ting Liu; Yuzhuo Fu; |
| 52 | Leveraging Panoptic Scene Graph for Evaluating Fine-Grained Text-to-Image Generation<br>Highlight: Human assessments are costly, and existing automated metrics lack accurate compositional understanding. To address these limitations, we introduce PSG-Bench, a novel benchmark featuring 5K text prompts designed to evaluate the capabilities of advanced T2I models. | Xueqing Deng; Linjie Yang; Qihang Yu; Chenglin Yang; Liang-Chieh Chen; |
| 53 | Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians<br>Highlight: In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which effectively captures both semantic and spatial information of the inputs. | Quankai Gao; Iliyan Georgiev; Tuanfeng Y. Wang; Krishna Kumar Singh; Ulrich Neumann; Jae Shin Yoon; |
| 54 | FlowChef: Steering of Rectified Flow Models for Controlled Generations<br>Highlight: In this paper, we present FlowChef, a novel training-, inversion-, and gradient-free inference-time steering strategy for RFMs that deterministically guides the denoising process. | Maitreya Patel; Song Wen; Dimitris N. Metaxas; Yezhou Yang; |
| 55 | GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding (code available)<br>Highlight: However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. | Rui Hu; Lianghui Zhu; Yuxuan Zhang; Tianheng Cheng; Lei Liu; Heng Liu; Longjin Ran; Xiaoxin Chen; Wenyu Liu; Xinggang Wang; |
| 56 | Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation (code available)<br>Highlight: In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a k×k grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. | Sucheng Ren; Qihang Yu; Ju He; Xiaohui Shen; Alan Yuille; Liang-Chieh Chen; |
| 57 | ZipVL: Accelerating Vision-Language Models Through Dynamic Token Sparsity<br>Highlight: In this paper, we present ZipVL, an efficient inference framework designed for LVLMs through a dynamic ratio allocation strategy of important tokens. | Yefei He; Feng Chen; Jing Liu; Wenqi Shao; Hong Zhou; Kaipeng Zhang; Bohan Zhuang; |
| 58 | Neighboring Autoregressive Modeling for Efficient Visual Generation (code available)<br>Highlight: In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far "next-neighbor prediction" mechanism. | Yefei He; Yuanyu He; Shaoxuan He; Feng Chen; Hong Zhou; Kaipeng Zhang; Bohan Zhuang; |
| 59 | Rethinking DPO-style Diffusion Aligning Frameworks<br>Highlight: However, we identify two potential risks for existing DPO algorithms. First, current DPO methods for estimating the rewards of step-wise intermediate samples are biased, leading to inaccurate preference ordering for step-wise optimization. Second, existing DPO methods may inadvertently increase the sampling probabilities of dispreferred samples, potentially introducing application risks. To address these issues, we propose Revised Direct Preference Optimization (RDPO), a simple but effective step-wise DPO-based text-to-image diffusion model alignment method. | Xun Wu; Shaohan Huang; Lingjie Jiang; Furu Wei; |
| 60 | Beyond Simple Edits: Composed Video Retrieval with Dense Modifications<br>Highlight: Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text, around seven times more than its existing counterpart. | Omkar Thawakar; Dmitry Demidov; Ritesh Thawkar; Rao Muhammad Anwer; Mubarak Shah; Fahad Shahbaz Khan; Salman Khan; |
| 61 | GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors<br>Highlight: We propose a novel point map Variational Autoencoder (VAE) for encoding and decoding unbounded point maps. | Tian-Xing Xu; Xiangjun Gao; Wenbo Hu; Xiaoyu Li; Song-Hai Zhang; Ying Shan; |
| 62 | TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos Via Diffusion Models<br>Highlight: We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. | Mark Yu; Wenbo Hu; Jinbo Xing; Ying Shan; |
| 63 | Flow4Agent: Long-form Video Understanding Via Motion Prior from Optical Flow<br>Highlight: In this paper, we propose Flow4Agent, a novel framework that pioneeringly incorporates motion priors from optical flow to facilitate LLM-based long video understanding. | Ruyang Liu; Shangkun Sun; Haoran Tang; Wei Gao; Ge Li; |
| 64 | MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling (code available)<br>Highlight: In this work, we present MaTVLM, a method for distilling pre-trained vision-language models (VLMs) into an efficient Mamba-Transformer hybrid architecture. | Yingyue Li; Bencheng Liao; Wenyu Liu; Xinggang Wang; |
| 65 | SViM3D: Stable Video Material Diffusion for Single Image 3D Generation<br>Highlight: We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. | Andreas Engelhardt; Mark Boss; Vikram Voleti; Chun-Han Yao; Hendrik P. A. Lensch; Varun Jampani; |
| 66 | Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition (code available)<br>Highlight: We introduce Lyra, an efficient MLLM that enhances multi-modal abilities, including advanced long speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. | Zhisheng Zhong; Chengyao Wang; Yuqi Liu; Senqiao Yang; Longxiang Tang; Yuechen Zhang; Jingyao Li; Tianyuan Qu; Yanwei Li; Yukang Chen; Shaozuo Yu; Sitong Wu; Eric Lo; Shu Liu; Jiaya Jia; |
| 67 | GWM: Towards Scalable Gaussian World Models for Robotic Manipulation<br>Highlight: To this end, we propose a novel branch of world model named Gaussian World Model (GWM) for robotic manipulation, which reconstructs the future state by inferring the propagation of Gaussian primitives under the effect of robot actions. | Guanxing Lu; Baoxiong Jia; Puhao Li; Yixin Chen; Ziwei Wang; Yansong Tang; Siyuan Huang; |
| 68 | Are They The Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs (code available)<br>Highlight: In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly benchmark over 30 different MLLMs. | Yikang Zhou; Tao Zhang; Shilin Xu; Shihao Chen; Qianyu Zhou; Yunhai Tong; Shunping Ji; Jiangning Zhang; Lu Qi; Xiangtai Li; |
| 69 | Adaptive Caching for Faster Video Generation with Diffusion Transformers<br>Highlight: In this paper, we introduce a method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that ‘not all videos are created equal’: some videos require fewer denoising steps than others to attain reasonable quality. | Kumara Kahatapitiya; Haozhe Liu; Sen He; Ding Liu; Menglin Jia; Chenyang Zhang; Michael S. Ryoo; Tian Xie; |
| 70 | St4RTrack: Simultaneous 4D Reconstruction and Tracking in The World<br>Highlight: We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. | Haiwen Feng; Junyi Zhang; Qianqian Wang; Yufei Ye; Pengcheng Yu; Michael J. Black; Trevor Darrell; Angjoo Kanazawa; |
| 71 | Shape of Motion: 4D Reconstruction from A Single Video<br>Highlight: We introduce a method for reconstructing generic dynamic scenes, featuring explicit, persistent 3D motion trajectories in the world coordinate frame, from casually captured monocular videos. | Qianqian Wang; Vickie Ye; Hang Gao; Weijia Zeng; Jake Austin; Zhengqi Li; Angjoo Kanazawa; |
| 72 | PUMA: Empowering Unified MLLM with Multi-granular Visual Generation (code available)<br>Highlight: In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation, a novel paradigm that tackles the diversity-controllability trade-off. | Rongyao Fang; Chengqi Duan; Kun Wang; Hao Li; Linjiang Huang; Hao Tian; Xingyu Zeng; Rui Zhao; Jifeng Dai; Hongsheng Li; Xihui Liu; |
| 73 | External Knowledge Injection for CLIP-Based Class-Incremental Learning (code available)<br>Highlight: To enhance knowledge transfer from outside the dataset, we propose a dual-branch injection tuning framework that encodes informative knowledge from both visual and textual modalities. | Da-Wei Zhou; Kai-Wen Li; Jingyi Ning; Han-Jia Ye; Lijun Zhang; De-Chuan Zhan; |
| 74 | Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding<br>Highlight: We introduce Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle to achieve better 4D object understanding with increasing observation time. | Mingxuan Wu; Huang Huang; Justin Kerr; Chung Min Kim; Anthony Zhang; Brent Yi; Angjoo Kanazawa; |
| 75 | PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos (code available)<br>Highlight: In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects in interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering, and (2) a novel multi-stage optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. | Hanxiao Jiang; Hao-Yu Hsu; Kaifeng Zhang; Hsin-Ni Yu; Shenlong Wang; Yunzhu Li; |
| 76 | Flash-VStream: Efficient Real-Time Understanding for Long Video Streams<br>Highlight: Most existing work treats long videos in the same way as short videos, which is inefficient for real-world applications and hard to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. | Haoji Zhang; Yiqin Wang; Yansong Tang; Yong Liu; Jiashi Feng; Xiaojie Jin; |
| 77 | Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers (code available)<br>Highlight: We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. | Zhengyao Lv; Tianlin Pan; Chenyang Si; Zhaoxi Chen; Wangmeng Zuo; Ziwei Liu; Kwan-Yee K. Wong; |
| 78 | MaskControl: Spatio-Temporal Control for Masked Motion Synthesis<br>Highlight: However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. | Ekkasit Pinyoanuntapong; Muhammad Saleem; Korrawe Karunratanakul; Pu Wang; Hongfei Xue; Chen Chen; Chuan Guo; Junli Cao; Jian Ren; Sergey Tulyakov; |
| 79 | UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation<br>Highlight: In this work, we present the UK Biobank Organs and Bones (UKBOB), the largest labeled dataset of body organs, comprising 51,761 3D MRI samples (17.9M 2D images) and a total of more than 1.37 billion 2D segmentation masks of 72 organs, based on the UK Biobank MRI dataset. | Emmanuelle Bourigault; Amir Jamaludin; Abdullah Hamdi; |
| 80 | Training-Free Industrial Defect Generation with Diffusion Models<br>Highlight: However, existing training-based methods fail to handle complex anomalies and multiple defects simultaneously, especially when only a single anomaly sample is available per defect type. To address this issue, we propose TF-IDG, a novel training-free defect generation framework capable of generating diverse anomaly samples in a one-shot setting. | Ruyi Xu; Yen-Tzu Chiu; Tai-I Chen; Oscar Chew; Yung-Yu Chuang; Wen-Huang Cheng; |
| 81 | FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases<br>Highlight: We present FlowSeek, a novel framework for optical flow requiring minimal hardware resources for training. | Matteo Poggi; Fabio Tosi; |
| 82 | Gradient Short-Circuit: Efficient Out-of-Distribution Detection Via Feature Intervention<br>Highlight: During inference on a model trained exclusively with In-Distribution (ID) data, we observe a salient gradient phenomenon: around an ID sample, the local gradient directions for "enhancing" that sample’s predicted class remain relatively consistent, whereas OOD samples, unseen in training, exhibit disorganized or conflicting gradient directions in the same neighborhood. Motivated by this observation, we propose an inference-stage technique to short-circuit those feature coordinates that spurious gradients exploit to inflate OOD confidence, while leaving ID classification largely intact. | Jiawei Gu; Ziyue Qiao; Zechao Li; |
| 83 | SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency (code available)<br>Highlight: However, this static nature renders them unable to dynamically track the data utility throughout pre-training, leading to subpar pre-trained models. To address this challenge, our paper introduces a novel dynamic bootstrapping dataset pruning method. | Yangyang Guo; Mohan Kankanhalli; |
| 84 | Mixture-of-Scores: Robust Image-Text Data Valuation Via Three Lines of Code<br>Highlight: This complicates the selection of scoring models. In this paper, we analyze these disparities and propose a method called Mixture-of-Scores (MoS). | Sitong Wu; Haoru Tan; Yukang Chen; Shaofeng Zhang; Jingyao Li; Bei Yu; Xiaojuan Qi; Jiaya Jia; |
| 85 | Enrich and Detect: Video Temporal Grounding with Multimodal LLMs<br>Highlight: We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. | Shraman Pramanick; Effrosyni Mavroudi; Yale Song; Rama Chellappa; Lorenzo Torresani; Triantafyllos Afouras; |
| 86 | UniVerse: Unleashing The Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction<br>Highlight: To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. | Jin Cao; Hongrui Wu; Ziyong Feng; Hujun Bao; Xiaowei Zhou; Sida Peng; |
| 87 | DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving (code available)<br>Highlight: This paper introduces DriveArena, the first high-fidelity closed-loop simulation system designed for driving agents navigating real-world scenarios. | Xuemeng Yang; Licheng Wen; Tiantian Wei; Yukai Ma; Jianbiao Mei; Xin Li; Wenjie Lei; Daocheng Fu; Pinlong Cai; Min Dou; Liang He; Yong Liu; Botian Shi; Yu Qiao; |
| 88 | FlowTok: Flowing Seamlessly Across Text and Image Tokens (code available)<br>Highlight: This requires projecting both modalities into a shared latent space, which poses a significant challenge due to their inherently different representations: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. | Ju He; Qihang Yu; Qihao Liu; Liang-Chieh Chen; |
| 89 | DIMO: Diverse 3D Motion Generation for Arbitrary Objects<br>Highlight: We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. | Linzhan Mou; Jiahui Lei; Chen Wang; Lingjie Liu; Kostas Daniilidis; |
| 90 | OmniHuman-1: Rethinking The Scaling-Up of One-Stage Conditioned Human Animation Models<br>Highlight: In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. | Gaojie Lin; Jianwen Jiang; Jiaqi Yang; Zerong Zheng; Chao Liang; Yuan Zhang; Jingtuo Liu; |
| 91 | FaceCraft4D: Animated 3D Facial Avatar Generation from A Single Image<br>Highlight: We present a novel framework for generating a high-quality, animatable 4D avatar from a single image. | Fei Yin; Mallikarjun B R; Chun-Han Yao; Rafal K. Mantiuk; Varun Jampani; |
| 92 | LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models (code available)<br>Highlight: However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. | Yu Cheng; Fajie Yuan; |
| 93 | Token-Efficient VLM: High-Resolution Image Understanding Via Dynamic Region Proposal<br>Highlight: However, they struggle with high-resolution images and zoomed-in regions due to the computational burden and token redundancy of uniform patch-based processing, often leading to the loss of critical details. To address these challenges, we propose Token-Efficient Vision Language Model (TEVA), a novel framework that detects key regions and applies dynamic patch sampling to efficiently capture fine-grained details while preserving global context. | Yitong Jiang; Jinwei Gu; Tianfan Xue; Ka Chun Cheung; Pavlo Molchanov; Hongxu Yin; Sifei Liu; |
| 94 | MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs<br>Highlight: This paper presents **MedSegFactory**, a versatile medical synthesis framework that generates high-quality paired medical images and segmentation masks across modalities and tasks. | Jiawei Mao; Yuhan Wang; Yucheng Tang; Daguang Xu; Kang Wang; Yang Yang; Zongwei Zhou; Yuyin Zhou; |
| 95 | UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing<br>Highlight: In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. | Tsu-Jui Fu; Yusu Qian; Chen Chen; Wenze Hu; Zhe Gan; Yinfei Yang; |
| 96 | EVEv2: Improved Baselines for Encoder-Free Vision-Language Models (code available)<br>Highlight: We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. | Haiwen Diao; Xiaotong Li; Yufeng Cui; Yueze Wang; Haoge Deng; Ting Pan; Wenxuan Wang; Huchuan Lu; Xinlong Wang; |
| 97 | A Conditional Probability Framework for Compositional Zero-shot Learning<br>Highlight: Thus, capturing attribute-object interdependence remains a fundamental yet long-ignored challenge in CZSL. In this paper, we adopt a Conditional Probability Framework (CPF) to explicitly model attribute-object dependencies. | Peng Wu; Qiuxia Lai; Hao Fang; Guo-Sen Xie; Yilong Yin; Xiankai Lu; Wenguan Wang; |
| 98 | SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts (code available)<br>Highlight: This paper consolidates diverse navigation tasks into a unified and generic framework: we investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation and propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions based on different-granularity language and dynamic observations. | Gengze Zhou; Yicong Hong; Zun Wang; Chongyang Zhao; Mohit Bansal; Qi Wu; |
| 99 | Improved Noise Schedule for Diffusion Training<br>Highlight: Diffusion models have emerged as the de facto choice for generating high-quality visual signals across various domains. However, training a single model to predict noise across various levels poses significant challenges, necessitating numerous iterations and incurring significant computational costs. Various approaches, such as loss weighting strategy design and architectural refinements, have been introduced to expedite convergence and improve model performance. In this study, we propose a novel approach to design the noise schedule for enhancing the training of diffusion models. | Tiankai Hang; Shuyang Gu; Jianmin Bao; Fangyun Wei; Dong Chen; Xin Geng; Baining Guo; |
| 100 | Decouple and Track: Benchmarking and Improving Video Diffusion Transformers For Motion Transfer (code available)<br>Highlight: In this paper, we propose DeT, a method that adapts DiT models to improve motion transfer ability. | Qingyu Shi; Jianzong Wu; Jinbin Bai; Jiangning Zhang; Lu Qi; Yunhai Tong; Xiangtai Li; |
| 101 | AlignGuard: Scalable Safety Alignment for Text-to-Image Generation<br>Highlight: In this work, we introduce AlignGuard, a method for safety alignment of T2I models. | Runtao Liu; I Chieh Chen; Jindong Gu; Jipeng Zhang; Renjie Pi; Qifeng Chen; Philip Torr; Ashkan Khakzar; Fabio Pizzati; |
| 102 | REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers (code available)<br>Highlight: In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" | Xingjian Leng; Jaskirat Singh; Yunzhong Hou; Zhenchang Xing; Saining Xie; Liang Zheng; |
| 103 | The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation (code available)<br>Highlight: In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. | Aoxiong Yin; Xu Tan; Kai Shen; Yichong Leng; Xinyu Zhou; Juncheng Li; Siliang Tang; |
| 104 | Deeply Supervised Flow-Based Generative Models<br>Highlight: However, we observe that training velocity solely from the final layer’s output under-utilizes the rich inter-layer representations, potentially impeding model convergence. To address this limitation, we introduce DeepFlow, a novel framework that enhances velocity representation through inter-layer communication. | Inkyu Shin; Chenglin Yang; Liang-Chieh Chen; |
| 105 | Does Your Vision-Language Model Get Lost in The Long Video Sampling Dilemma? (code available)<br>Highlight: To tackle the challenges posed by high-NSD questions, we propose a novel Reasoning-Driven Hierarchical Sampling (RHS) framework, which combines global localization of question-relevant cues with local dense sampling for precise inference. | Tianyuan Qu; Longxiang Tang; Bohao Peng; Senqiao Yang; Bei Yu; Jiaya Jia; |
| 106 | Auto-Controlled Image Perception in MLLMs Via Visual Perception Tokens Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: For example, they cannot selectively re-encode specific regions of an image or focus on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower an MLLM with a mechanism to control its visual perception processes. |
Runpeng Yu; Xinyin Ma; Xinchao Wang; |
| 107 | ReTracker: Exploring Image Matching for Robust Online Any Point Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper aims to establish correspondences for a set of 2D query points across a video sequence in an online manner. |
Dongli Tan; Xingyi He; Sida Peng; Yiqing Gong; Xing Zhu; Jiaming Sun; Ruizhen Hu; Yujun Shen; Hujun Bao; Xiaowei Zhou; |
| 108 | GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce GaussianOcc, a systematic method that investigates Gaussian splatting for fully self-supervised and efficient 3D occupancy estimation in surround views. |
Wanshui Gan; Fang Liu; Hongbin Xu; Ningkai Mo; Naoto Yokoya; |
| 109 | Controllable Weather Synthesis and Removal with Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Physics-based weather simulation requires precise reconstructions that are hard to scale to in-the-wild videos, while current video editing often lacks realism and control. In this work, we introduce WeatherWeaver, a video diffusion model that synthesizes diverse weather effects—including rain, snow, fog, and clouds—directly into any input video without the need for 3D modeling. Our model provides precise control over weather effect intensity and supports blending various weather types, ensuring both realism and adaptability. To overcome the scarcity of paired training data, we propose a novel data strategy combining synthetic videos, generative image editing, and auto-labeled real-world videos. |
Chih-Hao Lin; Zian Wang; Ruofan Liang; Yuxuan Zhang; Sanja Fidler; Shenlong Wang; Zan Gojcic; |
| 110 | Efficient Track Anything Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The high computational complexity of the image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight end-to-end track anything models that produce high-quality results with low latency and small model size. |
Yunyang Xiong; Chong Zhou; Xiaoyu Xiang; Lemeng Wu; Chenchen Zhu; Zechun Liu; Saksham Suri; Balakrishnan Varadarajan; Ramya Akula; Forrest Iandola; Raghuraman Krishnamoorthi; Bilge Soran; Vikas Chandra; |
| 111 | Unraveling The Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, a theoretical understanding of the Lipschitz continuity and second momentum properties of the diffusion process is still lacking. In this paper, we bridge this gap by providing a detailed examination of these smoothness properties for the case where the target data distribution is a mixture of Gaussians, which serves as a universal approximator for smooth densities such as image data. |
Yingyu Liang; Zhizhou Sha; Zhenmei Shi; Zhao Song; Mingda Wan; Yufa Zhou; |
| 112 | Learning Streaming Video Representation Via Multitask Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike offline video processing, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. |
Yibin Yan; Jilan Xu; Shangzhe Di; Yikun Liu; Yudi Shi; Qirui Chen; Zeqian Li; Yifei Huang; Weidi Xie; |
| 113 | GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, prior GUI agents were often trained on datasets comprising tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we present GUIOdyssey, a comprehensive dataset for cross-app mobile GUI navigation. |
Quanfeng Lu; Wenqi Shao; Zitao Liu; Lingxiao Du; Fanqing Meng; Boxuan Li; Botong Chen; Siyuan Huang; Kaipeng Zhang; Ping Luo; |
| 114 | LEGION: Learning to Ground and Explain for Synthetic Image Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. |
Hengrui Kang; Siwei Wen; Zichen Wen; Junyan Ye; Weijia Li; Peilin Feng; Baichuan Zhou; Bin Wang; Dahua Lin; Linfeng Zhang; Conghui He; |
| 115 | Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce **T**ext-**A**ware **T**ransformer-based 1-D**i**mensional **Tok**enizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. |
Dongwon Kim; Ju He; Qihang Yu; Chenglin Yang; Xiaohui Shen; Suha Kwak; Liang-Chieh Chen; |
| 116 | 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. |
Wenqi Zhang; Hang Zhang; Xin Li; Jiashuo Sun; Yongliang Shen; Weiming Lu; Deli Zhao; Yueting Zhuang; Lidong Bing; |
| 117 | CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover, the simple binary object-existence identification across all referent scenarios fails to specify their inherent differences, incurring ambiguity in object understanding. To tackle the above issues, we propose a **Co**unting-Aware **H**ierarchical **D**ecoding framework (CoHD) for GRES. |
Zhuoyan Luo; Yinghao Wu; Tianheng Cheng; Yong Liu; Yicheng Xiao; Hongfa Wang; Xiao-Ping Zhang; Yujiu Yang; |
| 118 | Video-T1: Test-time Scaling for Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we reinterpret the test-time scaling of video generation as a searching problem to sample better trajectories from Gaussian noise space to the target video distribution. |
Fangfu Liu; Hanyang Wang; Yimo Cai; Kaiyan Zhang; Xiaohang Zhan; Yueqi Duan; |
| 119 | LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. |
Fangfu Liu; Hao Li; Jiawei Chi; Hanyang Wang; Minghui Yang; Fudong Wang; Yueqi Duan; |
| 120 | Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers Via In-Context Reflection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose Reflect-DiT, a method that enables Diffusion Transformers to refine their generations using in-context examples of previously generated images alongside textual feedback describing necessary improvements. |
Shufan Li; Konstantinos Kallidromitis; Akash Gokul; Arsh Koneru; Yusuke Kato; Kazuki Kozuka; Aditya Grover; |
| 121 | 3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While existing works equip agents with various scene representations to enhance spatial awareness, they often neglect the complex 3D geometry and rich semantics in VLN scenarios, limiting the ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. |
Jianzhe Gao; Rui Liu; Wenguan Wang; |
| 122 | Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Traditionally, creating photo-realistic 3D head avatars requires a studio-level multi-view capture setup and expensive optimization during test-time, limiting the use of digital human doubles to the VFX industry or offline renderings. To address this shortcoming, we present Avat3r, which regresses a high-quality and animatable 3D head avatar from just a few input images, vastly reducing compute requirements during inference. |
Tobias Kirschstein; Javier Romero; Artem Sevastopolsky; Matthias Nießner; Shunsuke Saito; |
| 123 | SpectralAR: Spectral Autoregressive Visual Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. |
Yuanhui Huang; Weiliang Chen; Wenzhao Zheng; Yueqi Duan; Jie Zhou; Jiwen Lu; |
| 124 | FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present FreeMorph, the first tuning-free method for image morphing that accommodates inputs with varying semantics or layouts. |
Yukang Cao; Chenyang Si; Jinghao Wang; Ziwei Liu; |
| 125 | C4D: 4D Made from 3D Through Dual Correspondences Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend existing 3D reconstruction formulation to 4D. |
Shizun Wang; Zhenxiang Jiang; Xingyi Yang; Xinchao Wang; |
| 126 | ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. |
Leonard Bruns; Axel Barroso-Laguna; Tommaso Cavallari; Aron Monszpart; Sowmya Munukutla; Victor Adrian Prisacariu; Eric Brachmann; |
| 127 | DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose DepR, a depth-guided single-view scene reconstruction framework that integrates instance-level diffusion within a compositional paradigm. |
Qingcheng Zhao; Xiang Zhang; Haiyang Xu; Zeyuan Chen; Jianwen Xie; Yuan Gao; Zhuowen Tu; |
| 128 | OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, to the best of our knowledge, all these solutions are still not fully open, e.g., their training data remains proprietary and/or their training frameworks are unreleased. In this paper, we address this challenge by introducing a family of fully open vision encoders that match, or even surpass, OpenAI’s CLIP in building multimodal foundation models like LLaVA. |
Xianhang Li; Yanqing Liu; Haoqin Tu; Cihang Xie; |
| 129 | GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality, a challenge not adequately addressed in the existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. |
Tianwei Xiong; Jun Hao Liew; Zilong Huang; Jiashi Feng; Xihui Liu; |
| 130 | AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a plug-and-play method named AnyBimanual, which transfers pretrained unimanual policy to general bimanual manipulation policy with few bimanual demonstrations. |
Guanxing Lu; Tengbo Yu; Haoyuan Deng; Season Si Chen; Yansong Tang; Ziwei Wang; |
| 131 | MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel pixel-aligned Motion Map (MoMap) representation for 3D scene motion, which can be generated from existing generative image models to facilitate efficient and effective motion prediction. |
Jiahui Lei; Kyle Genova; George Kopanas; Noah Snavely; Leonidas Guibas; |
| 132 | Are VLMs Ready for Autonomous Driving? An Empirical Study from The Reliability, Data and Metric Perspectives Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Given the challenges and inspired by the inherent corruption awareness, we propose Robust Agentic Utilization (RAU), leveraging VLMs’ corruption awareness and agentic planning with external tools to enhance perception reliability for downstream tasks. |
Shaoyuan Xie; Lingdong Kong; Yuhao Dong; Chonghao Sima; Wenwei Zhang; Qi Alfred Chen; Ziwei Liu; Liang Pan; |
| 133 | SAM2Long: Enhancing SAM 2 for Long Video Segmentation with A Training-Free Memory Tree Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we introduce SAM2Long, an improved training-free video object segmentation strategy, which considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways in a constrained tree search manner. |
Shuangrui Ding; Rui Qian; Xiaoyi Dong; Pan Zhang; Yuhang Zang; Yuhang Cao; Yuwei Guo; Dahua Lin; Jiaqi Wang; |
| 134 | ReCamMaster: Camera-Controlled Generative Rendering from A Single Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It is non-trivial due to the extra constraints of maintaining multiple-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. |
Jianhong Bai; Menghan Xia; Xiao Fu; Xintao Wang; Lianrui Mu; Jinwen Cao; Zuozhu Liu; Haoji Hu; Xiang Bai; Pengfei Wan; Di Zhang; |
| 135 | LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In response, we propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs. |
Yuzhang Shang; Mu Cai; Bingxin Xu; Yong Jae Lee; Yan Yan; |
| 136 | LHM: Large Animatable Human Reconstruction Model for Single Image to 3D in Seconds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. |
Lingteng Qiu; Xiaodong Gu; Peihao Li; Qi Zuo; Weichao Shen; Junfei Zhang; Kejie Qiu; Weihao Yuan; Guanying Chen; Zilong Dong; Liefeng Bo; |
| 137 | An Empirical Study of Autoregressive Pre-training from Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. |
Jathushan Rajasegaran; Ilija Radosavovic; Rahul Ravishankar; Yossi Gandelsman; Christoph Feichtenhofer; Jitendra Malik; |
| 138 | Zero-Shot Vision Encoder Grafting Via LLM Surrogates Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, making the decoder the primary computational burden during training. To reduce costs, a promising strategy is to first train the vision encoder using a small language model before transferring it to the large one. We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be directly transferred to the larger model, a process we call zero-shot grafting — when plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by 45% when using Llama-70B as the decoder. |
Kaiyu Yue; Vasu Singla; Menglin Jia; John Kirchenbauer; Rifaa Qadri; Zikui Cai; Abhinav Bhatele; Furong Huang; Tom Goldstein; |
| 139 | Large Multi-modal Models Can Interpret Features in Large Multi-modal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this question by presenting a versatile framework to identify and interpret the semantics within LMMs. |
Kaichen Zhang; Yifei Shen; Bo Li; Ziwei Liu; |
| 140 | Scene Coordinate Reconstruction Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a probabilistic reinterpretation of training SCR models, which allows us to infuse high-level reconstruction priors. |
Wenjing Bian; Axel Barroso-Laguna; Tommaso Cavallari; Victor Adrian Prisacariu; Eric Brachmann; |
| 141 | General Compression Framework for Efficient Transformer Object Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce model size while preserving tracking accuracy. |
Lingyi Hong; Jinglun Li; Xinyu Zhou; Shilin Yan; Pinxue Guo; Kaixun Jiang; Zhaoyu Chen; Shuyong Gao; Runze Li; Xingdong Sheng; Wei Zhang; Hong Lu; Wenqiang Zhang; |
| 142 | ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. |
Zifu Wan; Ce Zhang; Silong Yong; Martin Q. Ma; Simon Stepputtis; Louis-Philippe Morency; Deva Ramanan; Katia Sycara; Yaqi Xie; |
| 143 | Cycle Consistency As Reward: Learning Image-Text Alignment Without Human Preferences Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose an alternative approach that leverages cycle consistency as a supervisory signal. |
Hyojin Bahng; Caroline Chan; Fredo Durand; Phillip Isola; |
| 144 | SynCity: Training-Free Generation of 3D Worlds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose SynCity, a method for generating explorable 3D worlds from textual descriptions. |
Paul Engstler; Aleksandar Shtedritski; Iro Laina; Christian Rupprecht; Andrea Vedaldi; |
| 145 | BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a generalizable RGB-based approach for object pose estimation, specifically designed to address challenges in sparse-view settings. |
Yuanhong Yu; Xingyi He; Chen Zhao; Junhao Yu; Jiaqi Yang; Ruizhen Hu; Yujun Shen; Xing Zhu; Xiaowei Zhou; Sida Peng; |
| 146 | The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This gap between training and testing leads to subpar performance. To bridge this gap, we propose conditional optimal transport (C2OT) that adds a conditional weighting term in the cost matrix when computing the optimal transport assignment. |
Ho Kei Cheng; Alexander Schwing; |
| 147 | Scalable Ranked Preference Optimization for Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training. |
Shyamgopal Karthik; Huseyin Coskun; Zeynep Akata; Sergey Tulyakov; Jian Ren; Anil Kag; |
| 148 | EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by how humans learn through the perception-action loop, we propose EgoAgent, a unified agent model that simultaneously learns to represent, predict, and act within a single transformer. |
Lu Chen; Yizhou Wang; Shixiang Tang; Qianhong Ma; Tong He; Wanli Ouyang; Xiaowei Zhou; Hujun Bao; Sida Peng; |
| 149 | GaussianReg: Rapid 2D/3D Registration for Emergency Surgery Via Explicit 3D Modeling with Gaussian Primitives Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present GaussianReg, a novel registration framework that achieves clinically acceptable accuracy within minutes of preprocessing. |
Weihao Yu; Xiaoqing Guo; Xinyu Liu; Yifan Liu; Hao Zheng; Yawen Huang; Yixuan Yuan; |
| 150 | X2-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose X^2-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. |
Weihao Yu; Yuanhao Cai; Ruyi Zha; Zhiwen Fan; Chenxin Li; Yixuan Yuan; |
| 151 | TAB: Transformer Attention Bottlenecks Enable User Intervention and Debugging in Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. |
Pooyan Rahmanzadehgervi; Hung Huy Nguyen; Rosanne Liu; Long Mai; Anh Totti Nguyen; |
| 152 | Hierarchy UGP: Hierarchy Unified Gaussian Primitive for Large-Scale Dynamic Scene Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods often struggle to scale to large scenes or accurately model arbitrary dynamics. To address these limitations, we propose Hierarchy UGP, which constructs a hierarchical structure consisting of a root level, a sub-scene level, and a primitive level, using Unified Gaussian Primitive (UGP) defined in 4D space as the representation. |
Hongyang Sun; Qinglin Yang; Jiawei Wang; Zhen Xu; Chen Liu; Yida Wang; Kun Zhan; Hujun Bao; Xiaowei Zhou; Sida Peng; |
| 153 | ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a tuning-free method for both object insertion and subject-driven generation. |
Daniel Winter; Asaf Shul; Matan Cohen; Dana Berman; Yael Pritch; Alex Rav-Acha; Yedid Hoshen; |
| 154 | Aether: Geometric-Aware Unified World Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. |
Haoyi Zhu; Yifan Wang; Jianjun Zhou; Wenzheng Chang; Yang Zhou; Zizun Li; Junyi Chen; Chunhua Shen; Jiangmiao Pang; Tong He; |
| 155 | ShortFT: Diffusion Model Alignment Via Shortcut-based Fine-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Shortcut-based Fine-Tuning (ShortFT), an efficient fine-tuning strategy that utilizes the shorter denoising chain. |
Xiefan Guo; Miaomiao Cui; Liefeng Bo; Di Huang; |
| 156 | MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present MagicMirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. |
Yuechen Zhang; Yaoyang Liu; Bin Xia; Bohao Peng; Zexin Yan; Eric Lo; Jiaya Jia; |
| 157 | Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. |
Zeren Jiang; Chuanxia Zheng; Iro Laina; Diane Larlus; Andrea Vedaldi; |
| 158 | Dual-Expert Consistency Model for Efficient and High-Quality Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflict in the learning dynamics during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. |
Zhengyao Lv; Chenyang Si; Tianlin Pan; Zhaoxi Chen; Kwan-Yee K. Wong; Yu Qiao; Ziwei Liu; |
| 159 | LightSwitch: Multi-view Relighting with Material-guided Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose LightSwitch, a novel finetuned material-relighting diffusion framework that efficiently relights an arbitrary number of input images to a target lighting condition while incorporating cues from inferred intrinsic properties. |
Yehonathan Litman; Fernando De la Torre; Shubham Tulsiani; |
| 160 | How Far Are AI-generated Videos from Simulating The 3D Visual World: A Learned 3D Evaluation Approach Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Learned 3D Evaluation (L3DE), an objective, quantifiable, and interpretable method for assessing AI-generated videos’ ability to simulate the real world in terms of 3D visual qualities and consistencies, without requiring manually labeled defects or quality annotations. |
Chirui Chang; Jiahui Liu; Zhengzhe Liu; Xiaoyang Lyu; Yi-Hua Huang; Xin Tao; Pengfei Wan; Di Zhang; Xiaojuan Qi; |
| 161 | V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Our findings reveal that directly applying the positional encoding mechanism used for textual tokens to visual tokens is suboptimal, and VLM performance degrades sharply when the position encoding exceeds the model’s context window. To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens, enabling more efficient management of long multimodal sequences. |
Junqi Ge; Ziyi Chen; Jintao Lin; Jinguo Zhu; Xihui Liu; Jifeng Dai; Xizhou Zhu; |
| 162 | ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, a significant gap remains between the generated data and real-world sensor observations, particularly in terms of fidelity for structured elements, such as the ground surface. To address these challenges, we propose ReconDreamer++, an enhanced framework that significantly improves the overall rendering quality by mitigating the domain gap and refining the representation of the ground surface. Specifically, ReconDreamer++ introduces the Novel Trajectory Deformable Network (NTDNet), which leverages learnable spatial deformation mechanisms to bridge the domain gap between synthesized novel views and original sensor observations. |
Guosheng Zhao; Xiaofeng Wang; Chaojun Ni; Zheng Zhu; Wenkang Qin; Guan Huang; Xingang Wang; |
| 163 | VSP: Diagnosing The Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce VSP, a benchmark that 1) evaluates the spatial planning capability in MLLMs in general, and 2) diagnoses this capability via finer-grained sub-tasks, including perception and reasoning, and measures the capabilities of models through these sub-tasks. |
Qiucheng Wu; Handong Zhao; Michael Saxon; Trung Bui; William Yang Wang; Yang Zhang; Shiyu Chang; |
| 164 | WonderPlay: Dynamic 3D Scene Generation from A Single Image and Actions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: WonderPlay is a novel framework integrating physics simulation with video generation for generating action-conditioned dynamic 3D scenes from a single image. |
Zizhang Li; Hong-Xing Yu; Wei Liu; Yin Yang; Charles Herrmann; Gordon Wetzstein; Jiajun Wu; |
| 165 | Rethinking The Upsampling Process in Light Field Super-Resolution with Spatial-Epipolar Implicit Image Function Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Besides, given that the line structure in the epipolar plane image integrates the spatial-angular correlation of the light field, we present an oriented line sampling strategy to precisely aggregate inter-view information. |
Ruixuan Cong; Yu Wang; Mingyuan Zhao; Da Yang; Rongshan Chen; Hao Sheng; |
| 166 | Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. |
Xinyu Fang; Zhijian Chen; Kai Lan; Lixin Ma; Shengyuan Ding; Yingji Liang; Xiangyu Zhao; Farong Wen; Zicheng Zhang; Guofeng Zhang; Haodong Duan; Kai Chen; Dahua Lin; |
| 167 | ERNet: Efficient Non-Rigid Registration Network for Point Sequences Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The key difficulties stem from two factors: (i) the presence of local minima due to the non-convexity of registration objectives, especially under noisy or partial inputs, which hinders accurate and robust deformation estimation, and (ii) error accumulation over long sequences, leading to tracking failures. To address these challenges, we adopt a scalable data-driven approach and propose ERNet, an efficient feed-forward model trained on large deformation datasets. It is designed to handle noisy and partial inputs while effectively leveraging temporal information for accurate and consistent sequential registration. |
Guangzhao He; Yuxi Xiao; Zhen Xu; Xiaowei Zhou; Sida Peng; |
| 168 | Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: One key challenge in utilizing LMMs for these tasks is the extraction of useful features from generative LMMs. To overcome this, we propose an approach that leverages multimodal feature extraction from the LMM’s latent space. |
Chancharik Mitra; Brandon Huang; Tianning Chai; Zhiqiu Lin; Assaf Arbelle; Rogerio Feris; Leonid Karlinsky; Trevor Darrell; Deva Ramanan; Roei Herzig; |
| 169 | ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce ConsistentCity, a two-stage framework with a novel Semantic Flow-guided Diffusion Transformer (SF-DiT) that converts sequential BEV semantic maps into temporally consistent driving videos. |
Benjin Zhu; Xiaogang Wang; Hongsheng Li; |
| 170 | CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS. |
Siyu Jiao; Haoye Dong; Yuyang Yin; Zequn Jie; Yinlong Qian; Yao Zhao; Humphrey Shi; Yunchao Wei; |
| 171 | InfoBridge: Balanced Multimodal Integration Through Conditional Dependency Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While existing methods attempt to enhance fusion through cross-modal alignment or interaction mechanisms, they often struggle to balance effective integration with preserving modality-specific information. We introduce InfoBridge, a novel framework grounded in conditional information maximization principles that addresses these limitations. |
Chenxin Li; Yifan Liu; Panwang Pan; Hengyu Liu; Xinyu Liu; Wuyang Li; Cheng Wang; Weihao Yu; Yiyang Lin; Yixuan Yuan; |
| 172 | RAGD: Regional-Aware Diffusion Model for Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, previous methods either introduce additional trainable modules, thus only applicable to specific models, or manipulate score maps within attention layers using attention masks, resulting in limited control strength when the number of regions increases. To handle these limitations, we present RAGD, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. |
Zhennan Chen; Yajie Li; Haofan Wang; Zhibo Chen; Zhengkai Jiang; Jun Li; Qian Wang; Jian Yang; Ying Tai; |
| 173 | Decoupled Diffusion Sparks Adaptive Scene Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome these, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinal and challenging scenarios from fine-grained tokens with independent noise states. |
Yunsong Zhou; Naisheng Ye; William Ljungbergh; Tianyu Li; Jiazhi Yang; Zetong Yang; Hongzi Zhu; Christoffer Petersson; Hongyang Li; |
| 174 | Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, when applied in parallelized block-wise training, two critical issues arise: reconstruction accuracy deteriorates due to reduced data diversity when each block is trained independently, and parallel training restricts the number of divided blocks to the number of available GPUs. To address these issues, we propose Momentum-GS, a novel approach that leverages momentum-based self-distillation to promote consistency and accuracy across the blocks while decoupling the number of blocks from the physical GPU count. |
Jixuan Fan; Wanhua Li; Yifei Han; Tianru Dai; Yansong Tang; |
| 175 | Unified Adversarial Augmentation for Improving Palmprint Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing augmentation methods struggle to generate palmprint-specific variations while preserving identity consistency, leading to suboptimal performance. To address these problems, we propose a unified adversarial augmentation framework. |
Jianlong Jin; Chenglong Zhao; Ruixin Zhang; Sheng Shang; Yang Zhao; Jun Wang; Jingyun Zhang; Shouhong Ding; Wei Jia; Yunsheng Wu; |
| 176 | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. |
Guowei Xu; Peng Jin; Ziang Wu; Hao Li; Yibing Song; Lichao Sun; Li Yuan; |
| 177 | AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we explore a new task, Open-set Language-guided Dexterous Grasp, and find that the main challenge is the huge gap between high-level human language semantics and low-level robot action. |
Yi-Lin Wei; Mu Lin; Yuhao Lin; Jian-Jian Jiang; Xiao-Ming Wu; Ling-An Zeng; Wei-Shi Zheng; |
| 178 | Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these agents face significant challenges in visual perception, particularly when handling high-resolution, visually complex digital environments. This paper introduces Iris, a foundational visual agent that addresses these challenges through two key innovations: Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). |
Zhiqi Ge; Juncheng Li; Xinglei Pang; Minghe Gao; Kaihang Pan; Wang Lin; Hao Fei; Wenqiao Zhang; Siliang Tang; Yueting Zhuang; |
| 179 | Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. |
Bowen Zhang; Sicheng Xu; Chuxin Wang; Jiaolong Yang; Feng Zhao; Dong Chen; Baining Guo; |
| 180 | Efficient Autoregressive Shape Generation Via Octree-Based Adaptive Tokenization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This leads to inefficient latent representations that can compromise downstream generation. We address this challenge by introducing Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity. |
Kangle Deng; Hsueh-Ti Derek Liu; Yiheng Zhu; Xiaoxia Sun; Chong Shang; Kiran S. Bhat; Deva Ramanan; Jun-Yan Zhu; Maneesh Agrawala; Tinghui Zhou; |
| 181 | One Trajectory, One Token: Grounded Video Tokenization Via Panoptic Sub-object Trajectory Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. |
Chenhao Zheng; Jieyu Zhang; Mohammadreza Salehi; Ziqi Gao; Vishnu Iyengar; Norimasa Kobori; Quan Kong; Ranjay Krishna; |
| 182 | Representation Shift: Unifying Token Compression with FlashAttention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token’s representation. |
Joonmyung Choi; Sanghyeok Lee; Byungoh Ko; Eunseo Kim; Jihyung Kil; Hyunwoo J. Kim; |
| 183 | Authentic 4D Driving Simulation with A Video Generation Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite progress in generating driving scenes, challenges in transforming views and modeling the dynamics of space and time remain. To tackle these issues, we propose a fresh methodology that reconstructs real-world driving environments and utilizes a generative network to enable 4D simulation. |
Lening Wang; Wenzhao Zheng; Dalong Du; Yunpeng Zhang; Yilong Ren; Han Jiang; Zhiyong Cui; Haiyang Yu; Jie Zhou; Shanghang Zhang; |
| 184 | Easi3R: Estimating Disentangled Motion from DUSt3R Without Training Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. |
Xingyu Chen; Yue Chen; Yuliang Xiu; Andreas Geiger; Anpei Chen; |
| 185 | CameraCtrl II: Dynamic Scene Exploration Via Camera-controlled Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces CameraCtrl II, a framework that enables continuous and dynamic scene exploration through a camera-controlled video diffusion model. |
Hao He; Ceyuan Yang; Shanchuan Lin; Yinghao Xu; Meng Wei; Liangke Gui; Qi Zhao; Gordon Wetzstein; Lu Jiang; Hongsheng Li; |
| 186 | Learning to Inference Adaptively for Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite recent efforts to improve the efficiency of MLLMs, prior solutions fall short in responding to varying runtime conditions, in particular changing resource availability (e.g., contention due to the execution of other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. |
Zhuoyan Xu; Khoi Duc Nguyen; Preeti Mukherjee; Saurabh Bagchi; Somali Chaterji; Yingyu Liang; Yin Li; |
| 187 | Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While this formulation is elegant and powerful, it is limited to static scenes. To overcome this limitation, we introduce the concept of Dynamic Point Maps (DPM), which extends standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. |
Edgar Sucar; Zihang Lai; Eldar Insafutdinov; Andrea Vedaldi; |
| 188 | DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We construct a dataset of 3D objects labeled with stability scores obtained from the physics simulator. This dataset enables fine-tuning of the 3D generator using the stability score as an alignment metric, via direct preference optimization (DPO) or direct reward optimization (DRO)—a novel objective we introduce to align diffusion models without requiring pairwise preferences. |
Ruining Li; Chuanxia Zheng; Christian Rupprecht; Andrea Vedaldi; |
| 189 | Puppet-Master: Scaling Interactive Video Generation As A Motion Prior for Part-Level Dynamics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Puppet-Master, an interactive video generator that captures the internal, part-level motion of objects, serving as a proxy for modeling object dynamics universally. |
Ruining Li; Chuanxia Zheng; Christian Rupprecht; Andrea Vedaldi; |
| 190 | Generate, Transduct, Adapt: Iterative Transduction with VLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose GTA-CLIP, a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. |
Oindrila Saha; Logan Lawrence; Grant Van Horn; Subhransu Maji; |
| 191 | NeuralSVG: An Implicit Representation for Text-to-Vector Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To encourage a layered structure in the generated SVG, we introduce a dropout-based regularization technique that strengthens the standalone meaning of each shape. |
Sagi Polaczek; Yuval Alaluf; Elad Richardson; Yael Vinker; Daniel Cohen-Or; |
| 192 | StyleKeeper: Prevent Content Leakage Using Negative Visual Query Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods often suffer from content leakage, where undesired elements of the visual style prompt are transferred along with the intended style. To address this issue, we 1) extend classifier-free guidance (CFG) to utilize swapping self-attention and 2) propose negative visual query guidance (NVQG) to reduce the transfer of unwanted content. |
Jaeseok Jeong; Junho Kim; Gayoung Lee; Yunjey Choi; Youngjung Uh; |
| 193 | GenieBlue: Integrating Both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these issues, we systematically analyze methods to maintain pure language capabilities during the training of MLLMs, focusing on both training data and model architecture aspects. Based on these analyses, we propose GenieBlue, an efficient MLLM structural design that integrates both linguistic and multimodal capabilities for LLMs on mobile devices. |
Xudong Lu; Yinghao Chen; Renshou Wu; Haohao Gao; Xi Chen; Xue Yang; Xiangyu Zhao; Aojun Zhou; Fangyuan Li; Yafei Wen; Xiaoxin Chen; Shuai Ren; Hongsheng Li; |
| 194 | TruthPrInt: Mitigating Large Vision-Language Models Object Hallucination Via Latent Truthful-Guided Pre-Intervention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, (2) different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist "generic truthful directions" shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inference-time intervention during LVLM decoding. |
Jinhao Duan; Fei Kong; Hao Cheng; James Diffenderfer; Bhavya Kailkhura; Lichao Sun; Xiaofeng Zhu; Xiaoshuang Shi; Kaidi Xu; |
| 195 | Online Dense Point Tracking with Streaming Memory Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To account for temporal consistency and enable efficient information propagation, we present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing. |
Qiaole Dong; Yanwei Fu; |
| 196 | DiffSim: Taming Diffusion Models for Evaluating Visual Similarity Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper is the first to discover that pretrained diffusion models can be utilized for measuring visual similarity and introduces the DiffSim method, addressing the limitations of traditional metrics in capturing perceptual consistency in custom generation tasks. |
Yiren Song; Xiaokang Liu; Mike Zheng Shou; |
| 197 | LayerTracer: Cognitive-Aligned Layered SVG Synthesis Via Diffusion Transformer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Generating cognitive-aligned layered SVGs remains challenging due to existing methods’ tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a DiT based framework that bridges this gap by learning designers’ layered SVG creation processes from a novel dataset of sequential design operations. |
Yiren Song; Danze Chen; Mike Zheng Shou; |
| 198 | Towards Fine-grained Interactive Segmentation in Images and Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a SAM2Refiner framework built upon the SAM2 backbone. |
Yuan Yao; Qiushi Yang; Miaomiao Cui; Liefeng Bo; |
| 199 | VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing benchmarks do not adequately meet the needs of VLAs and related algorithms. To better define such general-purpose tasks in the context of LLMs and advance the research in VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. |
Shiduo Zhang; Zhe Xu; Peiju Liu; Xiaopeng Yu; Yuan Li; Qinghui Gao; Zhaoye Fei; Zhangyue Yin; Zuxuan Wu; Yu-Gang Jiang; Xipeng Qiu; |
| 200 | MotionFollower: Editing Video Motion Via Score-Guided Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose MotionFollower, a score-guided diffusion model for video motion editing. |
Shuyuan Tu; Qi Dai; Zihao Zhang; Sicheng Xie; Zhi-Qi Cheng; Chong Luo; Xintong Han; Zuxuan Wu; Yu-Gang Jiang; |
| 201 | Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address XBT challenges, we propose an efficient solution: a projection module that maps the new model’s embeddings to those of the old model. |
Young Kyun Jang; Ser-nam Lim; |
| 202 | STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. |
Rui Xie; Yinhong Liu; Penghao Zhou; Chen Zhao; Jun Zhou; Kai Zhang; Zhenyu Zhang; Jian Yang; Zhenheng Yang; Ying Tai; |
| 203 | Unleashing Vecset Diffusion Model for Fast Shape Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Challenges arise not only from difficulties in accelerating diffusion sampling but also from VAE decoding in VDM, both areas under-explored in previous works. To address these challenges, we present FlashVDM, a systematic framework for accelerating both VAE and DiT in VDM. |
Zeqiang Lai; Yunfei Zhao; Zibo Zhao; Haolin Liu; Fuyun Wang; Huiwen Shi; Xianghui Yang; Qingxiang Lin; Jingwei Huang; Yuhong Liu; Jie Jiang; Chunchao Guo; Xiangyu Yue; |
| 204 | PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose PoseSyn, a novel data synthesis framework that transforms abundant in-the-wild 2D pose datasets into diverse 3D pose-image pairs. |
ChangHee Yang; Hyeonseop Song; Seokhun Choi; Seungwoo Lee; Jaechul Kim; Hoseok Do; |
| 205 | DreamDance: Animating Human Images By Enriching 3D Geometry Cues from 2D Poses Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present DreamDance, a novel method for animating human images using only skeleton pose sequences as conditional inputs. |
Yatian Pang; Bin Zhu; Bin Lin; Mingzhe Zheng; Francis E. H. Tay; Ser-Nam Lim; Harry Yang; Li Yuan; |
| 206 | Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present Vision Value Model (VisVM) that can guide VLM inference-time search to generate responses with better visual comprehension. |
Xiyao Wang; Zhengyuan Yang; Linjie Li; Hongjin Lu; Yuancheng Xu; Chung-Ching Lin; Kevin Lin; Furong Huang; Lijuan Wang; |
| 207 | Free-Form Motion Control: Controlling The 6D Poses of Camera and Objects in Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Due to the lack of datasets with comprehensive 6D pose annotations, existing text-to-video methods cannot simultaneously control the motions of both camera and objects in a 3D-aware manner, resulting in limited controllability over the generated content. To address this issue and facilitate the research in this field, we introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). |
Xincheng Shuai; Henghui Ding; Zhenyuan Qin; Hao Luo; Xingjun Ma; Dacheng Tao; |
| 208 | PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present PERSONA, a framework that combines the strengths of both approaches to obtain a personalized 3D human avatar with pose-driven deformations from a single image. |
Geonhee Sim; Gyeongsik Moon; |
| 209 | MUSE-VL: Modeling Unified VLM Through Semantic Discrete Encoding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. |
Rongchang Xie; Chen Du; Ping Song; Chang Liu; |
| 210 | HPSv3: Towards Wide-Spectrum Human Preference Score Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these challenges, we introduce Human Preference Score v3 (HPSv3). (1) We release HPDv3, the first wide-spectrum human preference dataset integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons from state-of-the-art generative models and low- to high-quality real-world images. (2) We introduce a VLM-based preference model trained using an uncertainty-aware ranking loss for fine-grained ranking. |
Yuhang Ma; Xiaoshi Wu; Keqiang Sun; Hongsheng Li; |
| 211 | Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, its reliance on the reverse Kullback-Leibler (KL) divergence minimization potentially induces mode collapse (or mode-seeking) in certain applications. To circumvent this inherent drawback, we propose Adversarial Distribution Matching (ADM), a novel framework that leverages diffusion-based discriminators to align the latent predictions between real and fake score estimators for score distillation in an adversarial manner. |
Yanzuo Lu; Yuxi Ren; Xin Xia; Shanchuan Lin; Xing Wang; Xuefeng Xiao; Andy J. Ma; Xiaohua Xie; Jian-Huang Lai; |
| 212 | EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Exact Volumetric Ellipsoid Rendering (EVER), a method for real-time 3D reconstruction. EVER accurately blends an unlimited number of overlapping primitives together in 3D space, eliminating the popping artifacts that 3D Gaussian Splatting (3DGS) and other related methods exhibit. EVER represents a radiance field as a set of constant-density volumetric ellipsoids, which are raytraced by intersecting each primitive twice (once upon ray entrance and again upon ray exit) and accumulating the derivatives of the densities and colors along the ray. Because EVER is built around ray tracing, it also enables effects such as defocus blur and fish-eye camera distortion, while still achieving frame rates of 30 FPS at 720p on an NVIDIA RTX 4090. |
Alexander Mai; Peter Hedman; George Kopanas; Dor Verbin; David Futschik; Qiangeng Xu; Falko Kuester; Jonathan T. Barron; Yinda Zhang; |
| 213 | BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce OmniDepth, a unified framework that bridges both through iterative bidirectional alignment of their latent representations. |
Tongfan Guan; Jiaxin Guo; Chen Wang; Yun-Hui Liu; |
| 214 | CODA: Repurposing Continuous VAEs for Discrete Tokenization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization. |
Zeyu Liu; Zanlin Ni; Yeguo Hua; Xin Deng; Xiao Ma; Cheng Zhong; Gao Huang; |
| 215 | Multi-Modal Few-Shot Temporal Action Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose the first MMF-TAS framework by designing a Prototype Graph Network (PGNet). |
Zijia Lu; Ehsan Elhamifar; |
| 216 | GameFactory: Creating New Games with Generative Interactive Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present GameFactory, a framework for action-controlled scene-generalizable game video generation. |
Jiwen Yu; Yiran Qin; Xintao Wang; Pengfei Wan; Di Zhang; Xihui Liu; |
| 217 | LVBench: An Extreme Long Video Understanding Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. |
Weihan Wang; Zehai He; Wenyi Hong; Yean Cheng; Xiaohan Zhang; Ji Qi; Ming Ding; Xiaotao Gu; Shiyu Huang; Bin Xu; Yuxiao Dong; Jie Tang; |
| 218 | VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. |
Kim Sung-Bin; Jeongsoo Choi; Puyuan Peng; Joon Son Chung; Tae-Hyun Oh; David Harwath; |
| 219 | Harmonizing Visual Representations for Unified Multimodal Understanding and Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM) that learns rich semantics via a mask-and-reconstruct pre-training and its successful extension to masked autoregressive (MAR) image generation. |
Size Wu; Wenwei Zhang; Lumin Xu; Sheng Jin; Zhonghua Wu; Qingyi Tao; Wentao Liu; Wei Li; Chen Change Loy; |
| 220 | What If: Understanding Motion Through Sparse Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on sparse interactions termed "pokes". |
Stefan Andreas Baumann; Nick Stracke; Timy Phan; Björn Ommer; |
| 221 | VCA: Video Curious Agent for Long Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a curiosity-driven video agent with self-exploration capability, dubbed "VCA". |
Zeyuan Yang; Delin Chen; Xueyang Yu; Maohao Shen; Chuang Gan; |
| 222 | EEdit: Rethinking The Spatial and Temporal Redundancy for Efficient Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we observe that redundancy in inversion-based image editing exists in both the spatial and temporal dimensions, such as unnecessary computation in unedited regions and redundancy in the inversion process. To tackle these challenges, we propose an Efficient Editing framework, named EEdit, to achieve efficient image editing. |
Zexuan Yan; Yue Ma; Chang Zou; Wenteng Chen; Qifeng Chen; Linfeng Zhang; |
| 223 | Multi-Schema Proximity Network for Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite significant advances in CIR methods, two unresolved problems remain: 1) existing methods overlook multi-schema interaction due to the lack of fine-grained explicit visual supervision, which hinders the capture of complex correspondences, and 2) existing methods overlook noisy negative pairs formed by potential corresponding query-target pairs, which increases confusion. To address these problems, we propose a Multi-schemA Proximity Network (MAPNet) for CIR, consisting of two key components: Multi-Schema Interaction (MSI) and Relaxed Proximity Loss (RPLoss). |
Jiangming Shi; Xiangbo Yin; Yeyun Chen; Yachao Zhang; Zhizhong Zhang; Yuan Xie; Yanyun Qu; |
| 224 | Scene Graph Guided Generation: Enable Accurate Relations Generation in Text-to-Image Models Via Textural Rectification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce the Scene Graph Adapter (SG-Adapter), leveraging the structured representation of scene graphs to rectify inaccuracies in the original text embeddings. |
Guibao Shen; Luozhou Wang; Jiantao Lin; Wenhang Ge; Chaozhe Zhang; Xin Tao; Di Zhang; Pengfei Wan; Guangyong Chen; Yijun Li; Ying-cong Chen; |
| 225 | ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we identify a critical observation: MLLMs exhibit weak grounding in chart elements and proportional relationships, as evidenced by their inability to localize key positions to match their reasoning. |
Zhengzhuo Xu; SiNan Du; Yiyan Qi; Siwen Lu; Chengjin Xu; Chun Yuan; Jian Guo; |
| 226 | Flow to The Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose FlowMo, a transformer-based diffusion autoencoder. |
Kyle Sargent; Kyle Hsu; Justin Johnson; Li Fei-Fei; Jiajun Wu; |
| 227 | DiffVSR: Revealing An Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We identify that existing diffusion-based VSR methods struggle primarily because they face an overwhelming learning burden: simultaneously modeling complex degradation distributions, content representations, and temporal relationships with limited high-quality training data. To address this fundamental challenge, we present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes this learning burden through staged training, enabling superior performance on complex degradations. |
Xiaohui Li; Yihao Liu; Shuo Cao; Ziyan Chen; Shaobin Zhuang; Xiangyu Chen; Yinan He; Yi Wang; Yu Qiao; |
| 228 | UniversalBooth: Model-Agnostic Personalized Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we instead propose a model-agnostic personalized method termed UniversalBooth. |
Songhua Liu; Ruonan Yu; Xinchao Wang; |
| 229 | RayZer: A Self-supervised Large View Synthesis Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present RayZer, a self-supervised multi-view 3D Vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. |
Hanwen Jiang; Hao Tan; Peng Wang; Haian Jin; Yue Zhao; Sai Bi; Kai Zhang; Fujun Luan; Kalyan Sunkavalli; Qixing Huang; Georgios Pavlakos; |
| 230 | Real3D: Towards Scaling Large Reconstruction Models with Real Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these limitations, we introduce Real3D, the first LRM that uses single-view real images for training, benefiting from their scalability and capturing the real-world shape distribution. Real3D introduces a novel self-training framework, including unsupervised losses at the pixel- and semantic-level, enabling LRMs to learn from these single-view images without multi-view supervision. |
Hanwen Jiang; Qixing Huang; Georgios Pavlakos; |
| 231 | MV-Adapter: Multi-View Consistent Image Generation Made Easy Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. |
Zehuan Huang; Yuan-Chen Guo; Haoran Wang; Ran Yi; Lizhuang Ma; Yan-Pei Cao; Lu Sheng; |
| 232 | Controllable and Expressive One-Shot Video Head Swapping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel diffusion-based multi-condition controllable framework for video head swapping, which seamlessly transplants a human head from a static image into a dynamic video, preserving the original body and background of the target video while further allowing head expressions and movements to be tweaked during swapping as needed. |
Chaonan Ji; Jinwei Qi; Peng Zhang; Bang Zhang; Liefeng Bo; |
| 233 | SpatialTrackerV2: Advancing 3D Point Tracking with Explicit Camera Motion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. |
Yuxi Xiao; Jianyuan Wang; Nan Xue; Nikita Karaev; Yuri Makarov; Bingyi Kang; Xing Zhu; Hujun Bao; Yujun Shen; Xiaowei Zhou; |
| 234 | DreamCube: RGB-D Panorama Generation Via Multi-plane Synchronization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we demonstrate that by applying multi-plane synchronization to the operators from 2D foundation models, their capabilities can be seamlessly extended to the omnidirectional domain. |
Yukun Huang; Yanning Zhou; Jianan Wang; Kaiyi Huang; Xihui Liu; |
| 235 | AdsQA: Towards Advertisement Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs to perceive beyond the objective physical content of the common visual domain. |
Xinwei Long; Kai Tian; Peng Xu; Guoli Jia; Jingxuan Li; Sa Yang; Yihua Shao; Kaiyan Zhang; Che Jiang; Hao Xu; Yang Liu; Jiaheng Ma; Bowen Zhou; |
| 236 | Function-centric Bayesian Network for Zero-Shot Object Goal Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the Function-centric Bayesian Network (FBN) for the zero-shot ObjectNav task. |
Sixian Zhang; Xinyao Yu; Xinhang Song; Yiyao Wang; Shuqiang Jiang; |
| 237 | Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our method, Marigold-DC, builds on a pretrained latent diffusion model (LDM) for depth estimation and injects the depth observations as test-time guidance, via an optimization scheme that runs in tandem with the iterative inference of denoising diffusion. |
Massimiliano Viola; Kevin Qu; Nando Metzger; Bingxin Ke; Alexander Becker; Konrad Schindler; Anton Obukhov; |
| 238 | Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these approaches fail to produce reasonable associations between characters and their environments. To address this limitation, we introduce Animate Anyone 2, aiming to animate characters with environment affordance. |
Li Hu; Guangyuan Wang; Zhen Shen; Xin Gao; Dechao Meng; Lian Zhuo; Peng Zhang; Bang Zhang; Liefeng Bo; |
| 239 | FreeScale: Unleashing The Resolution of Diffusion Models Via Tuning-Free Scale Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. To tackle this challenge, we propose FreeScale, a tuning-free inference paradigm to enable higher-resolution visual generation via scale fusion. |
Haonan Qiu; Shiwei Zhang; Yujie Wei; Ruihang Chu; Hangjie Yuan; Xiang Wang; Yingya Zhang; Ziwei Liu; |
| 240 | From Panels to Prose: Generating Literary Narratives from Comics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the visual nature of comics presents a significant barrier for visually impaired readers, limiting their access to these engaging stories. In this work, we provide a pragmatic solution to this accessibility challenge by developing an automated system that generates text-based literary narratives from manga comics. |
Ragav Sachdeva; Andrew Zisserman; |
| 241 | MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. |
Zijian Dong; Longteng Duan; Jie Song; Michael J. Black; Andreas Geiger; |
| 242 | GenDoP: Auto-regressive Camera Trajectory Generation As A Director of Photography Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. |
Mengchen Zhang; Tong Wu; Jing Tan; Ziwei Liu; Gordon Wetzstein; Dahua Lin; |
| 243 | UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents UniPortrait, an innovative human image personalization framework that unifies single- and multi-ID customization with high face fidelity, extensive facial editability, free-form input description, and diverse layout generation. |
Junjie He; Yifeng Geng; Liefeng Bo; |
| 244 | Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street view images, utilizing a diffusion model and the Bird’s-Eye View (BEV) paradigm. |
Junyan Ye; Jun He; Weijia Li; Zhutao Lv; Yi Lin; Jinhua Yu; Haote Yang; Conghui He; |
| 245 | Prompt-A-Video: Prompt Your Video Diffusion Model Via Preference-Aligned LLM Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware when applied to text-to-video diffusion models. To address these problems, we introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to specific video diffusion models. |
Yatai Ji; Jiacheng Zhang; Jie Wu; Shilong Zhang; Shoufa Chen; Chongjian Ge; Peize Sun; Weifeng Chen; Wenqi Shao; Xuefeng Xiao; Weilin Huang; Ping Luo; |
| 246 | Balanced Image Stylization with Style Matching Score Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. |
Yuxin Jiang; Liming Jiang; Shuai Yang; Jia-Wei Liu; Ivor W. Tsang; Mike Zheng Shou; |
| 247 | Advancing Visual Large Language Model for Multi-granular Versatile Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Notably, existing research often focuses solely on a limited subset of these potential combinations, which constrains its applicability and versatility across various contexts. In response to this challenge, we present MVL-LM, a Multi-granular and Versatile Perception framework incorporating a Visual Large Language Model. |
Wentao Xiang; Haoxian Tan; Yujie Zhong; Cong Wei; Dengjie Li; Yujiu Yang; |
| 248 | Less-to-More Generalization: Unlocking More Controllability By In-Context Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: For the second, most recent methods center on single-subject generation, making them hard to apply in multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle these challenges. |
Shaojin Wu; Mengqi Huang; Wenxu Wu; Yufeng Cheng; Fei Ding; Qian He; |
| 249 | Describe, Don’t Dictate: Semantic Image Editing with Natural Language Intent Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models mainly suffer from limited dataset quality and scale. To address these problems, we propose a descriptive-prompt-based editing framework, named DescriptiveEdit. |
En Ci; Shanyan Guan; Yanhao Ge; Yilin Zhang; Wei Li; Zhenyu Zhang; Jian Yang; Ying Tai; |
| 250 | IRASim: A Fine-Grained World Model for Robot Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details, conditioned on historical observations and robot action trajectories. |
Fangqi Zhu; Hongtao Wu; Song Guo; Yuxiao Liu; Chilam Cheang; Tao Kong; |
| 251 | Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This occurs partly because certain benign concepts (e.g., "skin") retained in DMs are related to the unlearned ones (e.g., "nudity"), facilitating their relearning via finetuning. To address this, we propose meta-unlearning on DMs. |
Hongcheng Gao; Tianyu Pang; Chao Du; Taihang Hu; Zhijie Deng; Min Lin; |
| 252 | Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce 3D-aware motion representation and propose an image animation framework, called Perception-as-Control, to achieve fine-grained collaborative motion control. |
Yingjie Chen; Yifang Men; Yuan Yao; Miaomiao Cui; Liefeng Bo; |
| 253 | A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution. |
Rongtao Xu; Jian Zhang; Minghao Guo; Youpeng Wen; Haoting Yang; Min Lin; Jianzheng Huang; Zhe Li; Kaidong Zhang; Liqiong Wang; Yuxuan Kuang; Meng Cao; Feng Zheng; Xiaodan Liang; |
| 254 | STIV: Scalable Text and Image Conditioned Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a simple and scalable text and image conditioned video generation method. |
Zongyu Lin; Wei Liu; Chen Chen; Jiasen Lu; Wenze Hu; Tsu-Jui Fu; Jesse Allardice; Zhengfeng Lai; Liangchen Song; Bowen Zhang; Cha Chen; Yiran Fei; Lezhi Li; Yinfei Yang; Yizhou Sun; Kai-Wei Chang; |
| 255 | MM-IFEngine: Towards Multimodal Instruction Following Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and do it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). |
Shengyuan Ding; Shenxi Wu; Xiangyu Zhao; Yuhang Zang; Haodong Duan; Xiaoyi Dong; Pan Zhang; Yuhang Cao; Dahua Lin; Jiaqi Wang; |
| 256 | RogSplat: Robust Gaussian Splatting Via Generative Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In real-world scenarios, violations of this assumption, such as occlusions, dynamic objects, or camera blur, often lead to reconstruction artifacts and rendering inaccuracies. To address these challenges, we introduce RogSplat, a robust framework that leverages generative models to enhance the reliability of 3DGS. |
Hanyang Kong; Xingyi Yang; Xinchao Wang; |
| 257 | Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Based on this analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in LVLMs. |
Qizhe Zhang; Aosong Cheng; Ming Lu; Renrui Zhang; Zhiyong Zhuo; Jiajun Cao; Shaobo Guo; Qi She; Shanghang Zhang; |
| 258 | OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Two major concerns for this application include 1) the inevitable distortion and object deformation brought by the large FoV disparity between domains, and 2) the lack of pixel-level semantic understanding that the original SAM2 cannot provide. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 to panoramic semantic segmentation. |
Ding Zhong; Xu Zheng; Chenfei Liao; Yuanhuiyi Lyu; Jialei Chen; Shengyang Wu; Linfeng Zhang; Xuming Hu; |
| 259 | Textured 3D Regenerative Morphing with 3D Diffusion Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This restriction leads to labor-intensive preprocessing and poor generalization. To overcome these challenges, we propose a method for 3D regenerative morphing using a 3D diffusion prior. |
Songlin Yang; Yushi Lan; Honghua Chen; Xingang Pan; |
| 260 | ReCoT: Reflective Self-Correction Training for Mitigating Confirmation Bias in Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This problem is more common in smaller-scale LVLMs, as they are usually fine-tuned with training data that is mostly positive, focusing on generating coherent dialogue. To address this issue, we introduce ReCoT, a method designed to mitigate confirmation bias in smaller-scale LVLMs through Reflective Self-Correction Training. The method follows a two-stage SFT-DPO paradigm: the first SFT stage aims to cultivate the model’s reflective correction abilities, while the DPO stage focuses on enhancing the consistency between answers and reflections. |
Mengxue Qu; Yibo Hu; Kunyang Han; Yunchao Wei; Yao Zhao; |
| 261 | MOERL: When Mixture-of-Experts Meet Reinforcement Learning for Adverse Weather Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose MOERL, a Mixture-of-Experts (MoE) model optimized with reinforcement learning (RL) to enhance image restoration across diverse weather conditions. |
Tao Wang; Peiwen Xia; Bo Li; Peng-Tao Jiang; Zhe Kong; Kaihao Zhang; Tong Lu; Wenhan Luo; |
| 262 | Visual Test-time Scaling for GUI Agent Grounding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. |
Tiange Luo; Lajanugen Logeswaran; Justin Johnson; Honglak Lee; |
| 263 | OmniPaint: Mastering Object-Oriented Editing Via Disentangled Insertion-Removal Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce OmniPaint, a unified framework that re-conceptualizes object removal and insertion as interdependent processes rather than isolated tasks. |
Yongsheng Yu; Ziyun Zeng; Haitian Zheng; Jiebo Luo; |
| 264 | Describe Anything: Detailed Localized Image and Video Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). |
Long Lian; Yifan Ding; Yunhao Ge; Sifei Liu; Hanzi Mao; Boyi Li; Marco Pavone; Ming-Yu Liu; Trevor Darrell; Adam Yala; Yin Cui; |
| 265 | Can Knowledge Be Transferred from Unimodal to Multimodal? Investigating The Transitivity of Multimodal Knowledge Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in practical applications, it is desirable for knowledge to be transferable across different modalities, which can enhance the robustness of knowledge editing and potentially allow for cost-effective editing of multimodal knowledge using textual information. To address this, we introduce the concept of Transitivity of Multimodal Knowledge Editing (TMKE) and design corresponding evaluation criteria. |
Lingyong Fang; Xinzhong Wang; Depeng Wang; Zongru Wu; Ya Guo; Huijia Zhu; Zhuosheng Zhang; Gongshen Liu; |
| 266 | USP: Unified Self-Supervised Pretraining for Image Generation and Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. |
Xiangxiang Chu; Renda Li; Yong Wang; |
| 267 | T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. |
Chieh-Yun Chen; Min Shi; Gong Zhang; Humphrey Shi; |
| 268 | Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our objective is the automatic generation of Audio Descriptions (ADs) for edited video material, such as movies and TV series. To achieve this, we propose a two-stage framework that leverages "shots" as the fundamental units of video understanding. |
Junyu Xie; Tengda Han; Max Bain; Arsha Nagrani; Eshika Khandelwal; Gül Varol; Weidi Xie; Andrew Zisserman; |
| 269 | Moto: Latent Motion Token As The Bridging Language for Learning Robot Manipulation from Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. |
Yi Chen; Yuying Ge; Weiliang Tang; Yizhuo Li; Yixiao Ge; Mingyu Ding; Ying Shan; Xihui Liu; |
| 270 | Dual-Process Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs. |
Grace Luo; Jonathan Granskog; Aleksander Holynski; Trevor Darrell; |
| 271 | ZeroStereo: Zero-shot Stereo Matching from Single Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. |
Xianqi Wang; Hao Yang; Gangwei Xu; Junda Cheng; Min Lin; Yong Deng; Jinliang Zang; Yurui Chen; Xin Yang; |
| 272 | Rethinking The Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. |
Liuyi Wang; Xinyuan Xia; Hui Zhao; Hanqing Wang; Tai Wang; Yilun Chen; Chengju Liu; Qijun Chen; Jiangmiao Pang; |
| 273 | UPRE: Zero-Shot Domain Adaptation for Object Detection Via Unified Prompt and Representation Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. |
Xiao Zhang; Fei Wei; Yong Wang; Wenda Zhao; Feiyi Li; Xiangxiang Chu; |
| 274 | ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our analysis reveals that the 3D-GS densification operation lacks adaptiveness and faces a dilemma between geometry coverage and detail recovery. To address this, we introduce a novel densification operation, residual split, which adds a downscaled Gaussian as a residual. |
Yanzhe Lyu; Kai Cheng; Xin Kang; Xuejin Chen; |
| 275 | GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent’s reasoning at each RL step. |
Tong Wei; Yijun Yang; Junliang Xing; Yuanchun Shi; Zongqing Lu; Deheng Ye; |
| 276 | MaterialMVP: Illumination-Invariant Material Generation Via Multi-view PBR Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present MaterialMVP, a novel end-to-end model for generating PBR textures from 3D meshes and image prompts, addressing key challenges in multi-view material synthesis. |
Zebin He; Mingxin Yang; Shuhui Yang; Yixuan Tang; Tao Wang; Kaihao Zhang; Guanying Chen; Yuhong Liu; Jie Jiang; Chunchao Guo; Wenhan Luo; |
| 277 | SuperDec: 3D Scene Decomposition with Superquadrics Primitives Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present SuperDec, an approach for compact 3D scene representations based on geometric primitives, namely superquadrics. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. |
Elisabetta Fedele; Boyang Sun; Leonidas Guibas; Marc Pollefeys; Francis Engelmann; |
| 278 | NormalLoc: Visual Localization on Textureless 3D Models Using Surface Normals Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose NormalLoc, a novel visual localization method for estimating the 6-DoF pose of a camera using textureless 3D models. |
Jiro Abe; Gaku Nakano; Kazumine Ogura; |
| 279 | AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. |
Zijie Wu; Chaohui Yu; Fan Wang; Xiang Bai; |
| 280 | Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. |
Yuqing Wang; Zhijie Lin; Yao Teng; Yuanzhi Zhu; Shuhuai Ren; Jiashi Feng; Xihui Liu; |
| 281 | Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast to most existing data-driven approaches that directly predict future trajectories, we rethink this task from a planning perspective, advocating a "First Reasoning, Then Forecasting" strategy that explicitly incorporates behavior intentions as spatial guidance for trajectory prediction. To achieve this, we introduce an interpretable, reward-driven intention reasoner grounded in a novel query-centric Inverse Reinforcement Learning (IRL) scheme. |
Muleilan Pei; Shaoshuai Shi; Xuesong Chen; Xu Liu; Shaojie Shen; |
| 282 | PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce PathFinder, a multi-modal, multi-agent framework that emulates the decision-making process of expert pathologists. |
Fatemeh Ghezloo; Mehmet Saygin Seyfioglu; Rustin Soraki; Wisdom O. Ikezogwo; Beibin Li; Tejoram Vivekanandan; Joann G. Elmore; Ranjay Krishna; Linda Shapiro; |
| 283 | Dynamic Multi-Layer Null Space Projection for Vision-Language Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose a Dynamic Multi-layer Null Space Projection (DMNSP) strategy and apply it only to the visual modality branch, while optimizing the language branch according to the original optimizer. |
Borui Kang; Lei Wang; Zhiping Wu; Tao Feng; Yawen Li; Yang Gao; Wenbin Li; |
| 284 | DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Even state-of-the-art models like FLUX and 3DIS face challenges, such as attribute leakage between instances, which limits user control. To address these issues, we introduce DreamRenderer, a training-free approach built upon the FLUX model. |
Dewei Zhou; Mingwei Li; Zongxin Yang; Yi Yang; |
| 285 | RapVerse: Coherent Vocals and Whole-Body Motion Generation from Text Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. |
Jiaben Chen; Xin Yan; Yihang Chen; Siyuan Cen; Zixin Wang; Qinwei Ma; Haoyu Zhen; Kaizhi Qian; Lie Lu; Chuang Gan; |
| 286 | Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. |
Tianqi Liu; Zihao Huang; Zhaoxi Chen; Guangcong Wang; Shoukang Hu; Liao Shen; Huiqiang Sun; Zhiguo Cao; Wei Li; Ziwei Liu; |
| 287 | AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. |
Moayed Haji-Ali; Willi Menapace; Aliaksandr Siarohin; Ivan Skorokhodov; Alper Canberk; Kwot Sin Lee; Vicente Ordonez; Sergey Tulyakov; |
| 288 | KV-Edit: Training-Free Image Editing for Precise Background Preservation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Here, we propose KV-Edit, a training-free approach that uses KV cache in DiTs to maintain background consistency, where background tokens are preserved rather than regenerated, eliminating the need for complex mechanisms or expensive training, ultimately generating new content that seamlessly integrates with the background within user-provided regions. |
Tianrui Zhu; Shiyi Zhang; Jiawei Shao; Yansong Tang; |
| 289 | Multimodal LLM Guided Exploration and Active Mapping Using Fisher Information Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: By leveraging high-quality view synthesis from our 3DGS representation, our method employs a multimodal LLM as a zero-shot planner for long-horizon exploration goals from the semantic perspective. |
Wen Jiang; Boshu Lei; Katrina Ashton; Kostas Daniilidis; |
| 290 | Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we share an interesting finding that training an MLLM with chain-of-thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. |
Jiaer Xia; Bingkui Tong; Yuhang Zang; Rui Shao; Kaiyang Zhou; |
| 291 | Frequency-Dynamic Attention Modulation For Dense Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. |
Linwei Chen; Lin Gu; Ying Fu; |
| 292 | DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose our DiT4SR, one of the pioneering works to tame the large-scale DiT model for Real-ISR. |
Zheng-Peng Duan; Jiawei Zhang; Xin Jin; Ziheng Zhang; Zheng Xiong; Dongqing Zou; Jimmy S. Ren; Chunle Guo; Chongyi Li; |
| 293 | SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to 1024^3 directly from rendering losses. |
Xianglong He; Zi-Xin Zou; Chia-Hao Chen; Yuan-Chen Guo; Ding Liang; Chun Yuan; Wanli Ouyang; Yan-Pei Cao; Yangguang Li; |
| 294 | V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we focus on the spatio-temporal fusion in V2X scenarios and design one-step and multi-step communication strategies (when to transmit) as well as examine their integration with three fusion strategies – early, late, and intermediate (what to transmit), providing comprehensive benchmarks with 11 fusion models (how to fuse). |
Zewei Zhou; Hao Xiang; Zhaoliang Zheng; Seth Z. Zhao; Mingyue Lei; Yun Zhang; Tianhui Cai; Xinyi Liu; Johnson Liu; Maheswari Bajji; Xin Xia; Zhiyu Huang; Bolei Zhou; Jiaqi Ma; |
| 295 | TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. |
Zewei Zhou; Seth Z. Zhao; Tianhui Cai; Zhiyu Huang; Bolei Zhou; Jiaqi Ma; |
| 296 | AllGCD: Leveraging All Unlabeled Data for Generalized Category Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current methods employ supervised contrastive learning on labeled data to capture known category structures but neglect unlabeled data, limiting their effectiveness in classifying novel classes, especially in fine-grained open-set detection where subtle class differences are crucial. To address this issue, we propose a novel learning approach, AllGCD, which seamlessly integrates all unlabeled data into contrastive learning to enhance the discrimination of novel classes. |
Xinzi Cao; Ke Chen; Feidiao Yang; Xiawu Zheng; Yonghong Tian; Yutong Lu; |
| 297 | Exploring The Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In particular, we introduce two untargeted attack objectives that leverage spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory. |
Taowen Wang; Cheng Han; James Liang; Wenhao Yang; Dongfang Liu; Luna Xinyu Zhang; Qifan Wang; Jiebo Luo; Ruixiang Tang; |
| 298 | Social Debiasing for Fair Multi-modal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper addresses the issue of social biases in MLLMs by i) introducing a comprehensive counterfactual dataset with multiple social concepts (CMSC), which complements existing datasets by providing 18 diverse and balanced social concepts; and ii) proposing a counter-stereotype debiasing (CSD) strategy that mitigates social biases in MLLMs by leveraging the opposites of prevalent stereotypes. |
Harry Cheng; Yangyang Guo; Qingpei Guo; Ming Yang; Tian Gan; Weili Guan; Liqiang Nie; |
| 299 | A Unified Framework for Motion Reasoning and Generation in Human Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, understanding and generating interactive human-like motion, especially involving coordinated interactive motion, remains a challenging problem due to its inherent complexity. To address this, we present MoLaM, the Interactive Motion-LAnguage Model, a unified architecture that jointly processes language and motion modalities for understanding, generating, and controlling interactive motions in multi-turn conversational settings. |
Jeongeun Park; Sungjoon Choi; Sangdoo Yun; |
| 300 | 3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Physical adversarial attack methods expose the vulnerabilities of deep neural networks and pose a significant threat to safety-critical scenarios such as autonomous driving. |
Tianrui Lou; Xiaojun Jia; Siyuan Liang; Jiawei Liang; Ming Zhang; Yanjun Xiao; Xiaochun Cao; |
| 301 | Seeing The Unseen: A Semantic Alignment and Context-Aware Prompt Framework for Open-Vocabulary Camouflaged Object Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although existing open-vocabulary methods exhibit strong segmentation capabilities, they still have a major limitation in camouflaged scenarios: semantic confusion, which leads to incomplete segmentation and class shift in the model. To mitigate this limitation, we propose a framework for OVCOS, named SuCLIP. |
Peng Ren; Tian Bai; Jing Sun; Fuming Sun; |
| 302 | Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which is proven critical for enhancing the granularity of completion results. To address this, we propose Disentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. |
Enyu Liu; En Yu; Sijia Chen; Wenbing Tao; |
| 303 | IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. To address this Instance Feature Generation (IFG) task, we introduce the Instance Feature Adapter (IFAdapter). |
Yinwei Wu; Xianpan Zhou; Bing Ma; Xuefeng Su; Kai Ma; Xinchao Wang; |
| 304 | RCTDistill: Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose RCTDistill, a novel cross-modal KD method based on temporal fusion, comprising three key modules: Range-Azimuth Knowledge Distillation (RAKD), Temporal Knowledge Distillation (TKD), and Region-Decoupled Knowledge Distillation (RDKD). |
Geonho Bang; Minjae Seong; Jisong Kim; Geunju Baek; Daye Oh; Junhyung Kim; Junho Koh; Jun Won Choi; |
| 305 | SA-LUT: Spatial Adaptive 4D Look-Up Table for Photorealistic Style Transfer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Photorealistic style transfer (PST) enables real-world color grading by adapting reference image colors while preserving content structure. Existing methods mainly follow one of two approaches: generation-based methods, which prioritize stylistic fidelity at the cost of content integrity and efficiency, or global color transformation methods such as LUTs, which preserve structure but lack local adaptability. To bridge this gap, we propose Spatial Adaptive 4D Look-Up Table (SA-LUT), combining LUT efficiency with neural network adaptability. |
Zerui Gong; Zhonghua Wu; Qingyi Tao; Qinyue Li; Chen Change Loy; |
| 306 | AllTracker: Efficient Dense Point Tracking at High Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce AllTracker: a model that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. |
Adam W. Harley; Yang You; Xinglong Sun; Yang Zheng; Nikhil Raghuraman; Yunqi Gu; Sheldon Liang; Wen-Hsuan Chu; Achal Dave; Suya You; Rares Ambrus; Katerina Fragkiadaki; Leonidas Guibas; |
| 307 | Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Part-Aware Point Grounded Description (PaPGD), a challenging task aimed at advancing 3D multimodal learning for fine-grained, part-aware segmentation grounding and detailed explanation of 3D objects. |
Mahmoud Ahmed; Junjie Fei; Jian Ding; Eslam Mohamed Bakr; Mohamed Elhoseiny; |
| 308 | Continuous-Time Human Motion Field from Event Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we predict a continuous-time human motion field from events caused by human motion. |
Ziyun Wang; Ruijun Zhang; Zi-Yan Liu; Yufu Wang; Kostas Daniilidis; |
| 309 | WorldScore: A Unified Evaluation Benchmark for World Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the WorldScore benchmark, the first unified benchmark for world generation. |
Haoyi Duan; Hong-Xing Yu; Sirui Chen; Li Fei-Fei; Jiajun Wu; |
| 310 | MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods support trajectory control in only a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. |
Quanhao Li; Zhen Xing; Rui Wang; Hui Zhang; Qi Dai; Zuxuan Wu; |
| 311 | Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we present a new benchmark named OpenBench that differs significantly from the training semantics. |
Yong Liu; Song-Li Wu; Sule Bai; Jiahao Wang; Yitong Wang; Yansong Tang; |
| 312 | LD-RPS: Zero-Shot Unified Image Restoration Via Latent Diffusion Recurrent Posterior Sampling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. |
Huaqiu Li; Yong Wang; Tongwen Huang; Hailang Huang; Haoqian Wang; Xiangxiang Chu; |
| 313 | Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this task poses significant challenges, including the accurate modeling of complex style patterns, encompassing both intra- and inter-word relationships, and maintaining content accuracy across numerous characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. |
Gang Dai; Yifan Zhang; Yutao Qin; Qiangya Guo; Shuangping Huang; Shuicheng Yan; |
| 314 | MultiModal Action Conditioned Video Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. |
Yichen Li; Antonio Torralba; |
| 315 | Stable Diffusion Models Are Secretly Good at Visual In-Context Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we show that off-the-shelf Stable Diffusion models can be repurposed for visual in-context learning (V-ICL). |
Trevine Oorloff; Vishwanath Sindagi; Wele Gedara Chaminda Bandara; Ali Shafahi; Amin Ghiasi; Charan Prakash; Reza Ardekani; |
| 316 | MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although DiT with 3D VAE has become a standard framework for video generation, it introduces challenges in controllable driving video generation, especially for frame-wise geometric control, rendering existing methods ineffective. To address these issues, we propose MagicDrive-V2, a novel approach that integrates the MVDiT block and spatial-temporal conditional encoding to enable multi-view video generation and precise geometric control. |
Ruiyuan Gao; Kai Chen; Bo Xiao; Lanqing Hong; Zhenguo Li; Qiang Xu; |
| 317 | Growing A Twig to Accelerate Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the limitations above, we present TwigVLM, a simple and general architecture built by "growing" a lightweight twig upon an early layer of the base VLM. |
Zhenwei Shao; Mingyang Wang; Zhou Yu; Wenwen Pan; Yan Yang; Tao Wei; Hongyuan Zhang; Ning Mao; Wei Chen; Jun Yu; |
| 318 | Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we show that we only need a single parameter ω to effectively control granularity in diffusion-based synthesis. |
Xinyu Hou; Zongsheng Yue; Xiaoming Li; Chen Change Loy; |
| 319 | Chimera: Improving Generalist Model with Domain-Specific Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Directly integrating expert models tailored for those tasks is also challenging due to representational gaps and imbalanced optimization. To address these challenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline designed to boost the ability of existing LMMs with domain-specific experts. |
Tianshuo Peng; Mingsheng Li; Jiakang Yuan; Hongbin Zhou; Renqiu Xia; Renrui Zhang; Lei Bai; Song Mao; Bin Wang; Aojun Zhou; Botian Shi; Tao Chen; Bo Zhang; Xiangyu Yue; |
| 320 | Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation for Semi-Supervised Lifelong Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we pioneer the investigation of Semi-LReID, introducing a novel Self-Reinforcing PRototype Evolution with Dual-Knowledge Cooperation framework (SPRED). |
Kunlun Xu; Fan Zhuo; Jiangmeng Li; Xu Zou; Jiahuan Zhou; |
| 321 | Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. |
Luca Barsellotti; Lorenzo Bianchi; Nicola Messina; Fabio Carrara; Marcella Cornia; Lorenzo Baraldi; Fabrizio Falchi; Rita Cucchiara; |
| 322 | CAPTURe: Evaluating Spatial Reasoning in Vision Language Models Via Occluded Object Counting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To test models’ ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). |
Atin Pothiraj; Elias Stengel-Eskin; Jaemin Cho; Mohit Bansal; |
| 323 | Heavy Labels Out! Dataset Distillation with Label Space Lightening Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, the required storage can be comparable even to original datasets, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO that aims at effective image-to-label projectors, with which synthetic labels can be directly generated online from synthetic images. |
Ruonan Yu; Songhua Liu; Zigeng Chen; Jingwen Ye; Xinchao Wang; |
| 324 | GUAVA: Generalizable Upper Body 3D Gaussian Avatar Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these challenges, we first introduce an expressive human model (EHM) to enhance facial expression capabilities and develop an accurate tracking method. Based on this template model, we propose GUAVA, the first framework for fast animatable upper-body 3D Gaussian avatar reconstruction. |
Dongbin Zhang; Yunfei Liu; Lijian Lin; Ye Zhu; Yang Li; Minghan Qin; Yu Li; Haoqian Wang; |
| 325 | "Principal Components" Enable A New Language of Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. |
Xin Wen; Bingchen Zhao; Ismail Elezi; Jiankang Deng; Xiaojuan Qi; |
| 326 | Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce Implicit Structure Locking (*ISLock*), the first training-free editing strategy for AR visual models. |
Taihang Hu; Linxuan Li; Kai Wang; Yaxing Wang; Jian Yang; Ming-Ming Cheng; |
| 327 | PrimHOI: Compositional Human-Object Interaction Via Reusable Primitives Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here we show that PrimHOI generates complex HOI motions through spatial and temporal composition of generalizable interaction primitives defined by relative geometry. |
Kai Jia; Tengyu Liu; Mingtao Pei; Yixin Zhu; Siyuan Huang; |
| 328 | Doppler-Aware LiDAR-RADAR Fusion for Weather-Robust 3D Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods often handle Doppler in ways that are not well-suited for multi-modal settings or lack tailored encoding strategies, hindering effective feature fusion and performance. To address these shortcomings, we propose a novel Doppler-aware LiDAR-4D RADAR fusion (DLR-Fusion) framework for robust 3D object detection. |
Yujeong Chae; Heejun Park; Hyeonseong Kim; Kuk-Jin Yoon; |
| 329 | I2VControl: Disentangled and Unified Video Motion Synthesis Control Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a disentangled and unified framework, namely I2VControl, to overcome the logical conflicts. |
Wanquan Feng; Tianhao Qi; Jiawei Liu; Mingzhen Sun; Pengqi Tu; Tianxiang Ma; Fei Dai; Songtao Zhao; Siyu Zhou; Qian He; |
| 330 | End-to-End Driving with Online Trajectory Evaluation Via BEV World Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This goal can be achieved by employing a world model to capture environmental dynamics and predict future states. Therefore, we propose an end-to-end driving framework **WoTE**, which leverages a BEV **Wo**rld model to predict future BEV states for **T**rajectory **E**valuation. |
Yingyan Li; Yuqi Wang; Yang Liu; Jiawei He; Lue Fan; Zhaoxiang Zhang; |
| 331 | ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction Via Score-Guided Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce ScoreHOI, an effective diffusion-based optimizer that incorporates diffusion priors for the precise recovery of human-object interactions. |
Ao Li; Jinpeng Liu; Yixuan Zhu; Yansong Tang; |
| 332 | AV-Flow: Transforming Text to Audio-Visual Human-like Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. |
Aggelina Chatziagapi; Louis-Philippe Morency; Hongyu Gong; Michael Zollhöfer; Dimitris Samaras; Alexander Richard; |
| 333 | Boosting MLLM Reasoning with Text-Debiased Hint-GRPO Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we reveal two problems that impede the performance of GRPO on the MLLM: Low data utilization and Text-bias. |
Qihan Huang; Weilong Dai; Jinlong Liu; Wanggui He; Hao Jiang; Mingli Song; Jingyuan Chen; Chang Yao; Jie Song; |
| 334 | From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, we extend the model to generate smooth and consistent attribute transitions by introducing frame-wise guidance for the video latent during the denoising process. |
Ling Lo; Kelvin C.K. Chan; Wen-Huang Cheng; Ming-Hsuan Yang; |
| 335 | Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a collaborative framework, DataTailor, which leverages three key principles–informativeness, uniqueness, and representativeness–for effective data selection. |
Qifan Yu; Zhebei Shen; Zhongqi Yue; Yang Wu; Bosheng Qin; Wenqiao Zhang; Yunfei Li; Juncheng Li; Siliang Tang; Yueting Zhuang; |
| 336 | VLR-Driver: Large Vision-Language-Reasoning Models for Embodied Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose VLR-Driver, a novel multi-modal Vision-Language-Reasoning (VLR) framework based on Chain of Thought (CoT) for embodied autonomous driving. |
Fanjie Kong; Yitong Li; Weihuang Chen; Chen Min; Yizhe Li; Zhiqiang Gao; Haoyang Li; Zhongyu Guo; Hongbin Sun; |
| 337 | Diffusion Epistemic Uncertainty with Asymmetric Learning for Diffusion-Generated Image Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel framework, Diffusion Epistemic Uncertainty with Asymmetric Learning (DEUA), for detecting diffusion-generated images. |
Yingsong Huang; Hui Guo; Jing Huang; Bing Bai; Qi Xiong; |
| 338 | FairGen: Enhancing Fairness in Text-to-Image Diffusion Models Via Self-Discovering Latent Directions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods for debiasing DMs usually require model re-training with a human-crafted reference dataset or additional classifiers, which suffer from two major limitations: (1) collecting reference datasets incurs expensive annotation costs; (2) the debiasing performance is heavily constrained by the quality of the reference dataset or the additional classifier. To address the above limitations, we propose FairGen, a plug-and-play method that learns attribute latent directions in a self-discovering manner, thus eliminating the reliance on such reference datasets. |
Yilei Jiang; Wei-Hong Li; Yiyuan Zhang; Minghong Cai; Xiangyu Yue; |
| 339 | After The Party: Navigating The Mapping From Color to Ambient Lighting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods often oversimplify this challenge by assuming a single light source or uniform, white-balanced lighting, leaving many of these complexities unaddressed. In this paper, we introduce CL3AN, the first large-scale, high-resolution dataset of its kind designed to facilitate the restoration of images captured under multiple Colored Light sources to their Ambient-Normalized counterparts. |
Florin-Alexandru Vasluianu; Tim Seizinger; Zongwei Wu; Radu Timofte; |
| 340 | P-AVAS: Can Physics-Integrated Audio-Visual Modeling Boost Neural Acoustic Synthesis? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Physics-Integrated Audio-Visual Acoustic Synthesis (PI-AVAS or π-AVAS), a novel framework designed with two key objectives. |
Susan Liang; Chao Huang; Yunlong Tang; Zeliang Zhang; Chenliang Xu; |
| 341 | WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce WikiAutoGen, a novel system for automated multimodal Wikipedia-style article generation. |
Zhongyu Yang; Jun Chen; Dannong Xu; Junjie Fei; Xiaoqian Shen; Liangbing Zhao; Chun-Mei Feng; Mohamed Elhoseiny; |
| 342 | DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While auto-regressive methods generate structured meshes by predicting discrete vertex tokens, they are often constrained by limited face counts and mesh incompleteness. To address these challenges, we propose DeepMesh, a framework that optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy incorporating a novel tokenization algorithm, along with improvements in data curation and processing, and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation to achieve human preference alignment via Direct Preference Optimization (DPO). |
Ruowen Zhao; Junliang Ye; Zhengyi Wang; Guangce Liu; Yiwen Chen; Yikai Wang; Jun Zhu; |
| 343 | CharaConsist: Fine-Grained Consistent Character Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Furthermore, when the foreground character undergoes large motion variations, inconsistencies in identity and clothing details become evident. To address these problems, we propose CharaConsist, which employs point-tracking attention and adaptive token merge along with decoupled control of the foreground and background. CharaConsist enables fine-grained consistency for both foreground and background, supporting the generation of one character in continuous shots within a fixed scene or in discrete shots across different scenes. Moreover, CharaConsist is the first consistent generation method tailored for text-to-image DiT models. |
Mengyu Wang; Henghui Ding; Jianing Peng; Yao Zhao; Yunpeng Chen; Yunchao Wei; |
| 344 | Edicho: Consistent Image Editing in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Edicho steps in with a training-free solution based on diffusion models, featuring a fundamental design principle of using explicit image correspondence to direct editing. |
Qingyan Bai; Hao Ouyang; Yinghao Xu; Qiuyu Wang; Ceyuan Yang; Ka Leong Cheng; Yujun Shen; Qifeng Chen; |
| 345 | Find Any Part in 3D Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: By training on our annotated data with a simple contrastive objective, we obtain an open-world model that generalizes to any part in any object based on any text query. |
Ziqi Ma; Yisong Yue; Georgia Gkioxari; |
| 346 | InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). |
Cong Wei; Yujie Zhong; Haoxian Tan; Yingsen Zeng; Yong Liu; Hongfa Wang; Yujiu Yang; |
| 347 | 4D Visual Pre-training for Robot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Instead, we seek a general visual pre-training framework that could improve all 3D representations as an alternative. |
Chengkai Hou; Yanjie Ze; Yankai Fu; Zeyu Gao; Songbo Hu; Yue Yu; Shanghang Zhang; Huazhe Xu; |
| 348 | TokensGen: Harnessing Condensed Tokens for Long Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. |
Wenqi Ouyang; Zeqi Xiao; Danni Yang; Yifan Zhou; Shuai Yang; Lei Yang; Jianlou Si; Xingang Pan; |
| 349 | Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our approach introduces a novel tokenization algorithm that preserves face proximity relationships and compresses sequence length through locally shared vertices and edges, enabling the generation of meshes with an unprecedented scale of up to 5,000 faces. |
Yuxuan Wang; Xuanyu Yi; Haohan Weng; Qingshan Xu; Xiaokang Wei; Xianghui Yang; Chunchao Guo; Long Chen; Hanwang Zhang; |
| 350 | MOBIUS: Big-to-Mobile Universal Instance Segmentation Via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. |
Mattia Segu; Marta Tintore Gazulla; Yongqin Xian; Luc Van Gool; Federico Tombari; |
| 351 | Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This pipeline, however, struggles to preserve transparency consistency in edited regions, and matting can introduce jagged edges along transparency boundaries. To address these challenges, we propose Trans-Adapter, a plug-and-play adapter that enables diffusion-based inpainting models to process transparent images directly. |
Yuekun Dai; Haitian Li; Shangchen Zhou; Chen Change Loy; |
| 352 | HADES: Human Avatar with Dynamic Explicit Hair Strands Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce HADES, the first framework to seamlessly integrate dynamic hair into human avatars. |
Zhanfeng Liao; Hanzhang Tu; Cheng Peng; Hongwen Zhang; Boyao Zhou; Yebin Liu; |
| 353 | Contrastive Flow Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. |
George Stoica; Vivek Ramanujan; Xiang Fan; Ali Farhadi; Ranjay Krishna; Judy Hoffman; |
| 354 | Phantom: Subject-Consistent Video Generation Via Cross-Modal Alignment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. |
Lijie Liu; Tianxiang Ma; Bingchuan Li; Zhuowei Chen; Jiawei Liu; Gen Li; Siyu Zhou; Qian He; Xinglong Wu; |
| 355 | World4Drive: End-to-End Autonomous Driving Via Intention-aware Physical Latent World Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. |
Yupeng Zheng; Pengxuan Yang; Zebin Xing; Qichao Zhang; Yuhang Zheng; Yinfeng Gao; Pengfei Li; Teng Zhang; Zhongpu Xia; Peng Jia; XianPeng Lang; Dongbin Zhao; |
| 356 | Towards Performance Consistency in Multi-Level Model Collaboration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To verify whether our findings are practical, we introduce a validation framework termed Neural Ligand (NeuLig). |
Qi Li; Runpeng Yu; Xinchao Wang; |
| 357 | Demeter: A Parametric Model of Crop Plant Morphology from The Real World Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present Demeter, a data-driven parametric model that encodes key factors of a plant morphology, including topology, shape, articulation, and deformation into a compact learned representation. |
Tianhang Cheng; Albert J. Zhai; Evan Z. Chen; Rui Zhou; Yawen Deng; Zitong Li; Kejie Zhao; Janice Shiu; Qianyu Zhao; Yide Xu; Xinlei Wang; Yuan Shen; Sheng Wang; Lisa Ainsworth; Kaiyu Guan; Shenlong Wang; |
| 358 | Multi-scenario Overlapping Text Segmentation with Depth Awareness Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While existing research has primarily addressed the overlapping problem in documents, its applicability to other scenes remains limited. To bridge this gap, we propose a new task of multi-scenario overlapping text segmentation and introduce a corresponding real dataset in both English and Chinese, spanning various contexts such as printed text, bills, artistic designs, and house numbers. |
Yang Liu; Xudong Xie; Yuliang Liu; Xiang Bai; |
| 359 | Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This is due to the increased complexity of decision boundaries as the number of categories grows, which complicates the optimization process. To mitigate this, we propose the Mixture of Feature Experts (MoFE) module, which partitions features into subspaces, effectively capturing complex data distributions and refining decision boundaries. |
Shizhen Zhao; Jiahui Liu; Xin Wen; Haoru Tan; Xiaojuan Qi; |
| 360 | DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. |
Jiazhe Guo; Yikang Ding; Xiwu Chen; Shuo Chen; Bohan Li; Yingshuang Zou; Xiaoyang Lyu; Feiyang Tan; Xiaojuan Qi; Zhiheng Li; Hao Zhao; |
| 361 | VMBench: A Benchmark for Perception-Aligned Video Motion Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, there are two key issues: 1) current motion metrics do not fully align with human perceptions; 2) the existing motion prompts are limited. Based on these findings, we introduce VMBench, a comprehensive Video Motion Benchmark that offers perception-aligned motion metrics and features the most diverse types of motion. |
Xinran Ling; Chen Zhu; Meiqi Wu; Hangyu Li; Xiaokun Feng; Cundian Yang; Aiming Hao; Jiashu Zhu; Jiahong Wu; Xiangxiang Chu; |
| 362 | ObjectGS: Object-aware Scene Reconstruction and Scene Understanding Via Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose ObjectGS, an object-aware framework that unifies 3D scene reconstruction with semantic understanding. |
Ruijie Zhu; Mulin Yu; Linning Xu; Lihan Jiang; Yixuan Li; Tianzhu Zhang; Jiangmiao Pang; Bo Dai; |
| 363 | Feature Extraction and Representation of Pre-training Point Cloud Based on Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The pretrain-finetune paradigm of pre-training a model on large amounts of image and text data and then fine-tuning it for a specific task has led to significant progress in many 2D image and natural language processing tasks. Similarly, the use of pre-training methods on point cloud data can also enhance the performance and generalization ability of the model. Therefore, in this paper, we propose a pre-training framework based on a diffusion model called PreDifPoint. |
Chang Qiu; Feipeng Da; Zilei Zhang; |
| 364 | TACO: Taming Diffusion for In-the-wild Video Amodal Completion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. |
Ruijie Lu; Yixin Chen; Yu Liu; Jiaxiang Tang; Junfeng Ni; Diwen Wan; Gang Zeng; Siyuan Huang; |
| 365 | Holistic Tokenizer for Autoregressive Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, because most visual tokenizers map local image patches into latent tokens, global information is limited. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. |
Anlin Zheng; Haochen Wang; Yucheng Zhao; Weipeng Deng; Tiancai Wang; Xiangyu Zhang; Xiaojuan Qi; |
| 366 | Toward Fair and Accurate Cross-Domain Medical Image Segmentation: A VLM-Driven Active Domain Adaptation Paradigm Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Emerging Active Domain Adaptation (ADA) approaches offer more effective enhancements, but all ignore fairness issues. Therefore, in this work, we propose the first fairness-aware ADA paradigm that simultaneously achieves both enhanced fairness and superior overall performance. |
Hongqiu Wang; Wu Chen; Xiangde Luo; Zhaohu Xing; Lihao Liu; Jing Qin; Shaozhi Wu; Lei Zhu; |
| 367 | ORION: A Holistic End-to-End Autonomous Driving Framework By Vision-Language Instructed Action Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, it remains an open problem that few VLM-based E2E methods perform well in closed-loop evaluation, owing to the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precise trajectory prediction. |
Haoyu Fu; Diankun Zhang; Zongchuang Zhao; Jianfeng Cui; Dingkang Liang; Chong Zhang; Dingyuan Zhang; Hongwei Xie; Bing Wang; Xiang Bai; |
| 368 | HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose Hyperbolic Diffusion Autoencoders (HypDAE), a novel approach that operates in hyperbolic space to capture hierarchical relationships among images from seen categories. |
Lingxiao Li; Kaixuan Fan; Boqing Gong; Xiangyu Yue; |
| 369 | Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs By Learning Language-Agnostic Speech Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. |
Jeong Hun Yeo; Minsu Kim; Chae Won Kim; Stavros Petridis; Yong Man Ro; |
| 370 | GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. |
Zhenghao He; Sanchit Sinha; Guangzhi Xiong; Aidong Zhang; |
| 371 | LiT: Delving Into A Simple Linear Diffusion Transformer for Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, given its simplicity, parallelism, and efficiency for image generation. |
Jiahao Wang; Ning Kang; Lewei Yao; Mengzhao Chen; Chengyue Wu; Songyang Zhang; Shuchen Xue; Yong Liu; Taiqiang Wu; Xihui Liu; Kaipeng Zhang; Shifeng Zhang; Wenqi Shao; Zhenguo Li; Ping Luo; |
| 372 | CA2C: A Prior-Knowledge-Free Approach for Robust Label Noise Learning Via Asymmetric Co-learning and Co-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a novel LNL approach, termed CA2C (Combined Asymmetric Co-learning and Co-training), which alleviates the reliance on prior knowledge through an integration of complementary learning paradigms. |
Mengmeng Sheng; Zeren Sun; Tianfei Zhou; Xiangbo Shu; Jinshan Pan; Yazhou Yao; |
| 373 | PixTalk: Controlling Photorealistic Image Processing and Editing with Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose the first approach that introduces language and explicit control into the image processing and editing pipeline. |
Marcos V. Conde; Zihao Lu; Radu Timofte; |
| 374 | Vivid4D: Improving 4D Reconstruction from Monocular Video By Video Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views — synthesizing multi-view videos from a monocular input. |
Jiaxin Huang; Sheng Miao; Bangbang Yang; Yuewen Ma; Yiyi Liao; |
| 375 | Unsupervised Visual Chain-of-Thought Reasoning Via Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. |
Kesen Zhao; Beier Zhu; Qianru Sun; Hanwang Zhang; |
| 376 | Mobile Video Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces the first mobile-optimized image-to-video diffusion model. |
Haitam Ben Yahia; Denis Korzhenkov; Ioannis Lelekas; Amir Ghodrati; Amirhossein Habibian; |
| 377 | Refer to Any Segmentation Mask Group With Vision-Language Prompts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. |
Shengcao Cao; Zijun Wei; Jason Kuen; Kangning Liu; Lingzhi Zhang; Jiuxiang Gu; HyunJoon Jung; Liang-Yan Gui; Yu-Xiong Wang; |
| 378 | GeoSplatting: Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they usually suffer from inaccuracies in normal estimation that subsequently degrade light transport, resulting in noisy material decomposition and flawed relighting results. To address this, we propose GeoSplatting, a novel approach that augments 3DGS with explicit geometry guidance for precise light transport modeling. |
Kai Ye; Chong Gao; Guanbin Li; Wenzheng Chen; Baoquan Chen; |
| 379 | NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we explore the task of generating expansive outdoor scenes, ranging from castles to high-rises. |
Han-Hung Lee; Qinghong Han; Angel X. Chang; |
| 380 | Aligning Global Semantics and Local Textures in Generative Video Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, solely relying on the knowledge embedded in the pre-trained video diffusion models might limit the generalization ability of local details (e.g., texture). In this paper, we address this issue by exploring the visual cues from a high-quality (HQ) image reference to facilitate visual details generation in video enhancement. |
Zhikai Chen; Fuchen Long; Zhaofan Qiu; Ting Yao; Wengang Zhou; Jiebo Luo; Tao Mei; |
| 381 | Towards Higher Effective Rank in Parameter-Efficient Fine-tuning Using Khatri-Rao Product Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We further show that full-rank methods can reduce LoRA’s approximation error on these matrix types for an equal parameter count. Our evaluation then extends beyond synthetic tasks, where we observe that LoRA’s restricted work subspace can produce high-norm updates, leading to over-fitting and poor out-of-distribution generalization. We address these limits by introducing KRAdapter, a novel PEFT algorithm that uses properties of the Khatri-Rao matrix product to produce weight matrices of higher effective rank and lower norm than related PEFT algorithms. We show the performance improvements of KRAdapter on vision-language models of up to 1B parameters and on 8B LLMs, where we report 20 to 25 points of accuracy improvement over LoRA when reasoning on commonsense tasks unseen during training. |
Paul Albert; Frederic Z. Zhang; Hemanth Saratchandran; Anton van den Hengel; Ehsan Abbasnejad; |
| 382 | Reangle-A-Video: 4D Video Generation As Video-to-Video Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. |
Hyeonho Jeong; Suhyeon Lee; Jong Chul Ye; |
| 383 | PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Conventional deep learning methods assume perfect pixel-wise alignment and rely on per-pixel reconstruction losses, leading to spectral distortion, double edges, and blurring when misalignment is present. To address this, we propose PAN-Crafter, a modality-consistent alignment framework that explicitly mitigates the misalignment gap between PAN and MS modalities. |
Jeonghyeok Do; Sungpyo Kim; Geunhyuk Youk; Jaehyup Lee; Munchurl Kim; |
| 384 | Bridging The Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, TDSM aligns skeleton features with text prompts by incorporating text features into the reverse diffusion process, where skeleton features are denoised under text guidance, forming a unified skeleton-text latent space for robust matching. To enhance discriminative power, we introduce a triplet diffusion (TD) loss that encourages our TDSM to correct skeleton-text matches while pushing them apart for different action classes. |
Jeonghyeok Do; Munchurl Kim; |
| 385 | Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model Causal-VidSyn for synthesizing egocentric traffic accident videos. |
Lei-Lei Li; Jianwu Fang; Junbin Xiao; Shanmin Pang; Hongkai Yu; Chen Lv; Jianru Xue; Tat-Seng Chua; |
| 386 | SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, they struggle with abrupt appearance variations, leading to inconsistent visual quality. To address these problems, we propose SEGS-SLAM, a structure-enhanced 3D Gaussian Splatting SLAM, which achieves high-quality photorealistic mapping. |
Tianci Wen; Zhiang Liu; Yongchun Fang; |
| 387 | CityNav: A Large-Scale Dataset for Real-World Aerial Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While recent cross-modal training approaches have significantly improved navigation performance in both indoor and outdoor scenarios, aerial navigation over real-world cities remains underexplored primarily due to limited datasets and the difficulty of integrating visual and geographic information. To fill this gap, we introduce CityNav, the first large-scale real-world dataset for aerial VLN. |
Jungdae Lee; Taiki Miyanishi; Shuhei Kurita; Koya Sakamoto; Daichi Azuma; Yutaka Matsuo; Nakamasa Inoue; |
| 388 | HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present a unified Driving World Model named HERMES. |
Xin Zhou; Dingkang Liang; Sifan Tu; Xiwu Chen; Yikang Ding; Dingyuan Zhang; Feiyang Tan; Hengshuang Zhao; Xiang Bai; |
| 389 | From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, to prevent the early, low-performance model from wrongly selecting hard samples, we propose a model pre-start concept, which focuses on automatically selecting a portion of easy samples and helping the model acquire basic task-specific learning capabilities. |
Chuang Yu; Jinmiao Zhao; Yunpeng Liu; Sicheng Zhao; Yimian Dai; Xiangyu Yue; |
| 390 | UIP2P: Unsupervised Instruction-based Image Editing Via Edit Reversibility Constraint Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose an unsupervised instruction-based image editing approach that removes the need for ground-truth edited images during training. |
Enis Simsar; Alessio Tonioni; Yongqin Xian; Thomas Hofmann; Federico Tombari; |
| 391 | Generative Active Learning for Long-tail Trajectory Prediction Via Controllable Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Generative Active Learning for Trajectory prediction (GALTraj), the first method to successfully deploy generative active learning into trajectory prediction. |
Daehee Park; Monu Surana; Pranav Desai; Ashish Mehta; Reuben MV John; Kuk-Jin Yoon; |
| 392 | IMG: Calibrating Diffusion Models Via Implicit Multimodal Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. |
Jiayi Guo; Chuanhao Yan; Xingqian Xu; Yulin Wang; Kai Wang; Gao Huang; Humphrey Shi; |
| 393 | CARP: Visuomotor Policy Learning Via Coarse-to-Fine Autoregressive Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce **C**oarse-to-Fine **A**uto**R**egressive **P**olicy (**CARP**), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. |
Zhefei Gong; Pengxiang Ding; Shangke Lyu; Siteng Huang; Mingyang Sun; Wei Zhao; Zhaoxin Fan; Donglin Wang; |
| 394 | Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Motivated by these lessons, we created AbdomenAtlas 2.0—a dataset of 10,134 CT scans with a total of 13,223 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 6,511 control scans. |
Qi Chen; Xinze Zhou; Chen Liu; Hao Chen; Wenxuan Li; Zekun Jiang; Ziyan Huang; Yuxuan Zhao; Dexin Yu; Junjun He; Yefeng Zheng; Ling Shao; Alan Yuille; Zongwei Zhou; |
| 395 | Deep Adaptive Unfolded Network Via Spatial Morphology Stripping and Spectral Filtration for Pan-sharpening Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Besides, validating pan-sharpening performance on high-level semantic tasks is intractable due to the absence of datasets. To tackle these issues, we propose a deep adaptive unfolded network via spatial morphology stripping and spectral filtration for pan-sharpening, which is conceptualized as a linear inverse problem regularized by spatial and spectral priors. |
Hebaixu Wang; Jiayi Ma; |
| 396 | 4D Gaussian Splatting SLAM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Instead of removing dynamic objects as distractors and reconstructing only static environments, this paper proposes an efficient architecture that incrementally tracks camera poses and establishes the 4D Gaussian radiance fields in unknown scenarios by using a sequence of RGB-D images. |
Yanyan Li; Youxu Fang; Zunjie Zhu; Kunyi Li; Yong Ding; Federico Tombari; |
| 397 | EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents that need to gradually perceive the scene through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it. |
Yuqi Wu; Wenzhao Zheng; Sicheng Zuo; Yuanhui Huang; Jie Zhou; Jiwen Lu; |
| 398 | AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distils structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. |
Sanjoy Chowdhury; Hanan Gani; Nishit Anand; Sayan Nag; Ruohan Gao; Mohamed Elhoseiny; Salman Khan; Dinesh Manocha; |
| 399 | AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning over 9 meticulously crafted tasks, evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial Attack, Compositional Reasoning, and Modality-specific Dependency. |
Sanjoy Chowdhury; Sayan Nag; Subhrajyoti Dasgupta; Yaoting Wang; Mohamed Elhoseiny; Ruohan Gao; Dinesh Manocha; |
| 400 | DLFR-Gen: Diffusion-based Video Generation with Dynamic Latent Frame Rate Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we exploit the inherent temporal non-uniformity of real-world videos, and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. |
Zhihang Yuan; Rui Xie; Yuzhang Shang; Hanling Zhang; Siyuan Wang; Shengen Yan; Guohao Dai; Yu Wang; |
| 401 | GestureLSM: Latent Shortcut Based Co-Speech Gesture Generation with Spatial-Temporal Modeling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Additionally, their autoregressive/diffusion-based pipelines show slow generation speed due to dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for Co-Speech Gesture Generation with spatial-temporal modeling. |
Pinxin Liu; Luchuan Song; Junhua Huang; Haiyang Liu; Chenliang Xu; |
| 402 | Visual-RFT: Visual Reinforcement Fine-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications where fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is possibly one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. |
Ziyu Liu; Zeyi Sun; Yuhang Zang; Xiaoyi Dong; Yuhang Cao; Haodong Duan; Dahua Lin; Jiaqi Wang; |
| 403 | MINERVA: Evaluating Complex Video Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. |
Arsha Nagrani; Sachit Menon; Ahmet Iscen; Shyamal Buch; Ramin Mehran; Nilpa Jha; Anja Hauth; Yukun Zhu; Carl Vondrick ; Mikhail Sirotenko; Cordelia Schmid; Tobias Weyand; |
| 404 | LeGrad: An Explainability Method for Vision Transformers Via Feature Formation Sensitivity Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, because of their modeling of long-range dependencies through self-attention mechanisms, the explainability of these models remains a challenge. To address this, we propose LeGrad, an explainability method specifically designed for ViTs. |
Walid Bousselham; Angie Boggust; Sofian Chaybouti; Hendrik Strobelt; Hilde Kuehne; |
| 405 | ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on the emerging Ego-Exo object correspondence task, which aims to understand object relations across ego-exo perspectives through segmentation. |
Yuqian Fu; Runze Wang; Bin Ren; Guolei Sun; Biao Gong; Yanwei Fu; Danda Pani Paudel; Xuanjing Huang; Luc Van Gool; |
| 406 | PS-Mamba: Spatial-Temporal Graph Mamba for Pose Sequence Refinement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose PS-Mamba, a novel framework that refines human pose sequences by integrating spatial-temporal graph learning with state space modeling. |
Haoye Dong; Gim Hee Lee; |
| 407 | FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Through an in-depth analysis of previous approaches, we identify two key insights: (1) IL arises from identity information embedded within motion features, and (2) this identity information can be leveraged to address RA. Building on these findings, this paper introduces FixTalk, a novel framework designed to simultaneously resolve both issues for high-quality talking head generation. |
Shuai Tan; Bill Gong; Bin Ji; Ye Pan; |
| 408 | VideoAds for Fast-Paced Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. |
Zheyuan Zhang; Wanying Dou; Linkai Peng; Hongyi Pan; Ulas Bagci; Boqing Gong; |
| 409 | PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these, we propose Primitive-Aware Semantic Grounding (PASG), a closed-loop framework that introduces: (1) Automatic primitive extraction through geometric feature aggregation, enabling cross-category detection of keypoints and axes; (2) VLM-driven semantic anchoring that dynamically couples geometric primitives with functional affordances and task-relevant description; (3) A spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA). |
Zhihao Zhu; Yifan Zheng; Siyu Pan; Yaohui Jin; Yao Mu; |
| 410 | IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. |
Wenxuan Guo; Xiuwei Xu; Hang Yin; Ziwei Wang; Jianjiang Feng; Jie Zhou; Jiwen Lu; |
| 411 | MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, due to the physical differences of metalenses, there is a large gap in data acquisition and algorithm research. In light of this, we aim to bridge this unexplored gap, advancing novel metalens endoscopy. |
Wuyang Li; Wentao Pan; Xiaoyuan Liu; Zhendong Luo; Chenxin Li; Hengyu Liu; Din Ping Tsai; Mu Ku Chen; Yixuan Yuan; |
| 412 | LLaVA-KD: A Framework of Distilling Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, we introduce Multimodal Distillation (MDist) to transfer teacher model’s robust representations across both visual and linguistic modalities, and Relation Distillation (RDist) to transfer teacher model’s ability to capture visual token relationships. Additionally, we propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy: 1) Distilled Pre-Training to strengthen the alignment between visual-linguistic representations in s-MLLMs, 2) Supervised Fine-Tuning to equip the s-MLLMs with multimodal understanding capacity, and 3) Distilled Fine-Tuning to refine s-MLLM’s knowledge. Our approach significantly improves s-MLLMs performance without altering the model architecture. |
Yuxuan Cai; Jiangning Zhang; Haoyang He; Xinwei He; Ao Tong; Zhenye Gan; Chengjie Wang; Zhucun Xue; Yong Liu; Xiang Bai; |
| 413 | HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, progress in this area has been significantly limited due to the lack of publicly available, high-quality datasets, especially for full-body human captures. To address this critical gap, we introduce the HumanOLAT dataset, the first publicly accessible large-scale dataset providing multi-view One-Light-at-a-Time (OLAT) captures of full-body humans. |
Timo Teufel; Pulkit Gera; Xilong Zhou; Umar Iqbal; Pramod Rao; Jan Kautz; Vladislav Golyanik; Christian Theobalt; |
| 414 | Benchmarking Multimodal CoT Reward Model Stepwise By Visual Program Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, significant challenges exist when transitioning reward signal to the multimodal domain, including labor-intensive annotations, over-reliance on one-step rewards, and inadequate evaluation. To address these issues, we propose SVIP, a novel approach to train a step-level multi-dimensional Chain-of-Thought (CoT) reward model automatically. |
Minghe Gao; Xuqi Liu; Zhongqi Yue; Yang Wu; Shuang Chen; Juncheng Li; Siliang Tang; Fei Wu; Tat-Seng Chua; Yueting Zhuang; |
| 415 | JailbreakDiffBench: A Comprehensive Benchmark for Jailbreaking Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the lack of standardized evaluation makes it difficult to assess the robustness of diffusion model systems. To address this, we introduce JailbreakDiffBench, a comprehensive benchmark for systematically evaluating the safety of diffusion models against various attacks and under different defenses. |
Xiaolong Jin; Zixuan Weng; Hanxi Guo; Chenlong Yin; Siyuan Cheng; Guangyu Shen; Xiangyu Zhang; |
| 416 | Uncertainty-Driven Expert Control: Enhancing The Reliability of Medical Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these training-dependent strategies are costly and still lack sufficient alignment with clinical expertise. To address these issues, we propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training. |
Xiao Liang; Di Wang; Zhicheng Jiao; Ronghan Li; Pengfei Yang; Quan Wang; Tat-Seng Chua; |
| 417 | InvRGB+L: Inverse Rendering of Complex Scenes with Unified Color and LiDAR Reflectance Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present InvRGB+L, a novel inverse rendering model that reconstructs large, relightable, and dynamic scenes from a single RGB+LiDAR sequence. |
Xiaoxue Chen; Bhargav Chandaka; Chih-Hao Lin; Ya-Qin Zhang; David Forsyth; Hao Zhao; Shenlong Wang; |
| 418 | Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Amodal3R, a conditional image-to-3D model designed to reconstruct plausible 3D geometry and appearance from partial observations. |
Tianhao Wu; Chuanxia Zheng; Frank Guan; Andrea Vedaldi; Tat-Jen Cham; |
| 419 | CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While LLM-based approaches offer generalized reasoning capabilities, their challenges in spatial planning and unstable inference latency hinder their direct application in cooperative driving. To address these limitations, we propose CoLMDriver, the first full-pipeline LLM-based cooperative driving system, enabling effective language-based negotiation and real-time driving control. |
Changxing Liu; Genjia Liu; Zijun Wang; Jinchang Yang; Siheng Chen; |
| 420 | HORT: Monocular Hand-held Objects Reconstruction with Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While more recent methods directly reconstruct point clouds with diffusion models, the multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. |
Zerui Chen; Rolandos Alexandros Potamias; Shizhe Chen; Cordelia Schmid; |
| 421 | VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. |
Sihan Yang; Runsen Xu; Chenhang Cui; Tai Wang; Dahua Lin; Jiangmiao Pang; |
| 422 | Long-Context State-Space Video World Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. |
Ryan Po; Yotam Nitzan; Richard Zhang; Berlin Chen; Tri Dao; Eli Shechtman; Gordon Wetzstein; Xun Huang; |
| 423 | Dynamic Multimodal Prototype Learning in Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These ambiguities lead to textual prototypes that are insufficient to capture visual concepts, resulting in limited performance. To address this issue, we introduce **ProtoMM**, a training-free framework that constructs multimodal prototypes to adapt VLMs during the test time. |
Xingyu Zhu; Shuo Wang; Beier Zhu; Miaoge Li; Yunfan Li; Junfeng Fang; Zhicai Wang; Dongsheng Wang; Hanwang Zhang; |
| 424 | TrackVerse: A Large-Scale Object-Centric Video Dataset for Image-Level Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To explore unsupervised object representation learning grounded in object dynamics–beyond static appearance–we introduce TrackVerse, a large-scale video dataset of 31.9 million object tracks spanning over 1,000 categories, each capturing the motion, appearance, and evolving states of an object over time. |
Yibing Wei; Samuel Church; Victor Suciu; Jinhong Lin; Cheng-En Wu; Pedro Morgado; |
| 425 | DAViD: Modeling Dynamic Affordance of 3D Objects Using Pre-trained Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel framework for learning Dynamic Affordance across various target object categories. |
Hyeonwoo Kim; Sangwon Baik; Hanbyul Joo; |
| 426 | The Source Image Is The Best Attention for Infrared and Visible Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper reveals, for the first time, the intrinsic "attention properties" of infrared images, which arise directly from their physical characteristics (i.e., heat distribution) and link naturally to attention mechanisms, as observed in gradient-weighted class activation mapping (Grad-CAM) visualizations of image classification models. To incorporate this property into infrared and visible image fusion (IVF), we propose the source infrared cross attention (I-SCA) and further extend it to the visible modality, introducing the source visible cross attention (V-SCA). |
Song Wang; Xie Han; Liqun Kuang; Boying Wang; Zhongyu Chen; Zherui Qiao; Fan Yang; Xiaoxia Liu; Bingyu Zhang; Zhixun Wang; |
| 427 | FastVAR: Linear Visual Autoregressive Modeling Via Cached Token Pruning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, existing VAR paradigms process the entire token map at each scale step, causing complexity and runtime to scale dramatically with image resolution. To address this challenge, we propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. |
Hang Guo; Yawei Li; Taolin Zhang; Jiangshan Wang; Tao Dai; Shu-Tao Xia; Luca Benini; |
| 428 | DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. |
Wenwen Yu; Zhibo Yang; Yuliang Liu; Xiang Bai; |
| 429 | Trace3D: Consistent Segmentation Lifting Via Gaussian Instance Tracing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods often suffer from inconsistent 2D masks across viewpoints and produce noisy segmentation boundaries, as they neglect semantic cues that could refine the learned Gaussians. To overcome this, we introduce Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views. |
Hongyu Shen; Junfeng Ni; Yixin Chen; Weishuo Li; Mingtao Pei; Siyuan Huang; |
| 430 | Rethinking Layered Graphic Design Generation with A Top-Down Approach Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite this, non-layered designs still inspire human designers, influencing their choices in layouts and text styles and ultimately guiding the creation of layered designs. Motivated by this observation, we propose Accordion, a graphic design generation framework making the first attempt to convert AI-generated designs into editable layered designs, while refining nonsensical AI-generated text with meaningful alternatives guided by user prompts. |
Jingye Chen; Zhaowen Wang; Nanxuan Zhao; Li Zhang; Difan Liu; Jimei Yang; Qifeng Chen; |
| 431 | SITE: Towards Spatial Intelligence Thorough Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models’ spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). |
Wenqi Wang; Reuben Tan; Pengyue Zhu; Jianwei Yang; Zhengyuan Yang; Lijuan Wang; Andrey Kolobov; Jianfeng Gao; Boqing Gong; |
| 432 | Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. |
Zhi Hou; Tianyi Zhang; Yuwen Xiong; Haonan Duan; Hengjun Pu; Ronglei Tong; Chengyang Zhao; Xizhou Zhu; Yu Qiao; Jifeng Dai; Yuntao Chen; |
| 433 | FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models Via Visual Registers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing high-resolution MLLMs rely on a cropping-based approach to process images, which leads to fragmented visual encoding and a sharp increase in redundant tokens. To tackle these issues, we propose the FALCON model. |
Renshan Zhang; Rui Shao; Gongwei Chen; Miao Zhang; Kaiwen Zhou; Weili Guan; Liqiang Nie; |
| 434 | Unified Open-World Segmentation with Multi-Modal Prompts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present COSINE, a unified open-world segmentation model that Consolidates Open-vocabulary Segmentation and IN-context sEgmentation with multi-modal prompts (e.g., text and image). |
Yang Liu; Yufei Yin; Chenchen Jing; Muzhi Zhu; Hao Chen; Yuling Xi; Bo Feng; Hao Wang; Shiyu Li; Chunhua Shen; |
| 435 | OuroMamba: A Data-Free Quantization Framework for Vision Mamba Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). |
Akshat Ramachandran; Mingyu Lee; Huan Xu; Souvik Kundu; Tushar Krishna; |
| 436 | SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. |
Jiahui Wang; Zuyan Liu; Yongming Rao; Jiwen Lu; |
| 437 | Augmented and Softened Matching for Unsupervised Visible-Infrared Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While existing UVI-ReID methods have made substantial efforts during the optimization phase to enhance the model’s robustness to color variations, they often overlook the impact of color variations on the acquisition of pseudo-labels. To address this, we focus in this paper on improving the robustness of pseudo-labels to color variations through data augmentation, and propose an augmented and softened matching (ASM) method. |
Zhiqi Pang; Chunyu Wang; Lingling Zhao; Junjie Wang; |
| 438 | FlexGen: Flexible Multi-View Generation from Text and Image Inputs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce FlexGen, a flexible framework designed to generate controllable and consistent multi-view images, conditioned on a single-view image, a text prompt, or both. |
Xinli Xu; Wenhang Ge; Jiantao Lin; Jiawei Feng; Lie Xu; Hanfeng Zhao; Shunsi Zhang; Ying-Cong Chen; |
| 439 | GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this area remains under-explored due to the inherent ambiguities in physical property estimation. To address these challenges, we introduce GaussianProperty, a training-free framework that assigns physical properties of materials to 3D Gaussians. |
Xinli Xu; Wenhang Ge; Dicong Qiu; ZhiFei Chen; Dongyu Yan; Zhuoyun Liu; Haoyu Zhao; Hanfeng Zhao; Shunsi Zhang; Junwei Liang; Ying-Cong Chen; |
| 440 | Towards Comprehensive Lecture Slides Understanding: Large-scale Dataset and Effective Method Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Furthermore, complex and flexible text relations can hinder the understanding of the internal logic of slides. To address this challenge, we propose a novel method, named SlideParser, which includes an auxiliary branch to predict text relations within slides and enhance attention between related texts, thereby improving slide understanding. |
Enming Zhang; Yuzhe Li; Yuliang Liu; Yingying Zhu; Xiang Bai; |
| 441 | CoA-VLA: Improving Vision-Language-Action Models Via Visual-Text Chain-of-Affordance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: OpenAI’s recent model, O1, showcased impressive capabilities in solving complex problems by utilizing extensive reasoning chains. This prompts an important question: can robot models achieve better performance in multi-task, complex environments by reviewing prior observations and then providing task-specific reasoning to guide action prediction? In this paper, we introduce Chain-of-Affordance (CoA-VLA), a novel approach to scaling robot models by incorporating reasoning in the form of sequential robot affordances to facilitate task completion. |
Jinming Li; Yichen Zhu; Zhibin Tang; Junjie Wen; Minjie Zhu; Xiaoyu Liu; Chengmeng Li; Ran Cheng; Yaxin Peng; Yan Peng; Feifei Feng; |
| 442 | ViSpeak: Visual Instruction Feedback in Streaming Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-modal and interactive characteristics. In this work, we aim to extend the streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback in which models should be aware of visual contents and learn to extract instructions from them. |
Shenghao Fu; Qize Yang; Yuan-Ming Li; Yi-Xing Peng; Kun-Yu Lin; Xihan Wei; Jian-Fang Hu; Xiaohua Xie; Wei-Shi Zheng; |
| 443 | WalkVLM: Aid Visually Impaired People Walking By Vision Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce the first large-scale dataset dedicated to walking assistance, comprising 12,000 video-annotation pairs, to provide a unified benchmark for training and evaluating systems that help visually impaired individuals walk. |
Zhiqiang Yuan; Ting Zhang; Yeshuang Zhu; Jiapei Zhang; Ying Deng; Zexi Jia; Peixiang Luo; Xiaoyue Duan; Jie Zhou; Jinchao Zhang; |
| 444 | CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. |
Jiaqi Han; Haotian Ye; Puheng Li; Minkai Xu; James Zou; Stefano Ermon; |
| 445 | Graph Domain Adaptation with Dual-branch Encoder and Two-level Alignment for Whole Slide Image-based Survival Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These differences generally result in large distribution gaps between different WSI domains; thus, survival analysis models trained on one domain may fail to transfer to another. To address this issue, we propose a Dual-branch Encoder and Two-level Alignment (DETA) framework to explore both feature- and category-level alignment between different WSI domains. |
Yuntao Shou; Xiangyong Cao; Peiqiang Yan; Qiao Hui; Qian Zhao; Deyu Meng; |
| 446 | Structured Policy Optimization: Enhance Large Vision-Language Model Via Self-referenced Dialogue Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: For large vision-language models (LVLMs), direct preference optimization (DPO) can over-emphasize linguistic nuances while overlooking visual context. To address this challenge, we introduce structured policy optimization (SPO) — a novel preference optimization method that simultaneously aligns preference instructions, responses, and dialogue interactions to improve multi-modal understanding and reasoning capabilities. |
Guohao Sun; Can Qin; Yihao Feng; Zeyuan Chen; Ran Xu; Sohail Dianat; Majid Rabbani; Raghuveer Rao; Zhiqiang Tao; |
| 447 | WonderTurbo: Generating Interactive 3D World in 0.72 Seconds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework capable of generating novel perspectives of 3D scenes within 0.72 seconds. |
Chaojun Ni; Xiaofeng Wang; Zheng Zhu; Weijie Wang; Haoyun Li; Guosheng Zhao; Jie Li; Wenkang Qin; Guan Huang; Wenjun Mei; |
| 448 | TAPNext: Tracking Any Point (TAP) As Next Token Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. |
Artem Zholus; Carl Doersch; Yi Yang; Skanda Koppula; Viorica Patraucean; Xu Owen He; Ignacio Rocco; Mehdi S. M. Sajjadi; Sarath Chandar; Ross Goroshin; |
| 449 | 3D Mesh Editing Using Masked LRMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a novel approach to mesh shape editing, building on recent progress in 3D reconstruction from multi-view images. |
Will Gao; Dilin Wang; Yuchen Fan; Aljaz Bozic; Tuur Stuyck; Zhengqin Li; Zhao Dong; Rakesh Ranjan; Nikolaos Sarafianos; |
| 450 | Inverse Image-Based Rendering for Light Field Generation from Single Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite their effectiveness for light flow computation, light fields are difficult to obtain, requiring either high computational cost or specialized devices such as a bulky camera setup with a microlens array. To broaden their benefit and applicability, we propose in this paper a novel view synthesis method, named inverse image-based rendering, that generates light fields from single images only. |
Hyunjun Jung; Hae-Gon Jeon; |
| 451 | EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these approaches do not effectively model the motions of dynamic objects (e.g., the motion speed of pedestrians is clearly different from that of vehicles), resulting in suboptimal scene decomposition. To address this, we propose Explicit Motion Decomposition (EMD), which models the motions of dynamic objects by introducing learnable motion embeddings to the Gaussians, enhancing the decomposition in street scenes. |
Xiaobao Wei; Qingpo Wuwu; Zhongyu Zhao; Zhuangzhe Wu; Nan Huang; Ming Lu; Ningning Ma; Shanghang Zhang; |
| 452 | Ultra-Precision 6DoF Pose Estimation Using 2-D Interpolated Discrete Fourier Transform Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel two-dimensional interpolated Discrete Fourier Transform (2D-IpDFT) method for robust 6DoF pose estimation using periodic patterns. |
Guowei Shi; Zian Mao; Peisen Huang; |
| 453 | DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which limits their expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, HERA, with hybrid guidance to overcome these limitations. |
Yuxuan Luo; Zhengkun Rong; Lizhen Wang; Longhao Zhang; Tianshu Hu; |
| 454 | From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, as task diversity and complexity increase, EVIT faces significant challenges in resolving data conflicts. To address this limitation, we propose the Dual Low-Rank Adaptation (Dual-LoRA), a holistic-to-local framework that enhances the adapter’s capacity to address data conflict through dual structural optimization. |
Pengkun Jiao; Bin Zhu; Jingjing Chen; Chong-Wah Ngo; Yu-Gang Jiang; |
| 455 | AnyI2V: Animating Any Conditional Image with Motion Control Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional images with user-defined motion trajectories. |
Ziye Li; Hao Luo; Xincheng Shuai; Henghui Ding; |
| 456 | Fair Generation Without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While existing bias mitigation methods demonstrate effectiveness, they often encounter attribute entanglement, where adjustments to attributes relevant to the bias (i.e., target attributes) unintentionally alter attributes unassociated with the bias (i.e., non-target attributes), causing undesirable distribution shifts. To address this challenge, we introduce Entanglement-Free Attention (EFA), a method that accurately incorporates target attributes (e.g., White, Black, and Asian) while preserving non-target attributes (e.g., background) during bias mitigation. |
Jeonghoon Park; Juyoung Lee; Chaeyeon Chung; Jaeseong Lee; Jaegul Choo; Jindong Gu; |
| 457 | DPoser-X: Diffusion Model As Robust 3D Whole-body Human Pose Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To enhance performance on downstream applications, we introduce a novel truncated timestep scheduling method specifically designed for pose data characteristics. |
Junzhe Lu; Jing Lin; Hongkun Dou; Ailing Zeng; Yue Deng; Xian Liu; Zhongang Cai; Lei Yang; Yulun Zhang; Haoqian Wang; Ziwei Liu; |
| 458 | Learning Efficient and Generalizable Human Representation with Human Gaussian Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While conventional methods require per-instance optimization, recent feed-forward methods have been proposed to generate 3D Gaussians with a learnable network. However, these methods predict independent Gaussians for each frame without fully capturing the relations among Gaussians from different frames, making the results hard to animate with novel poses. To address this, we propose the Human Gaussian Graph (HGG) to generate generalizable and animatable Gaussian representations. |
Yifan Liu; Shengjun Zhang; Chensheng Dai; Yang Chen; Hao Liu; Chen Li; Yueqi Duan; |
| 459 | PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. |
Haotian Wang; Aoran Xiao; Xiaoqin Zhang; Meng Yang; Shijian Lu; |
| 460 | Self-Supervised Monocular 4D Scene Reconstruction for Egocentric Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the lack of high-quality labeled datasets in this field has hindered the effectiveness of current supervised learning methods. In this work, we aim to address this issue by exploring a self-supervised dynamic scene reconstruction approach. |
Chengbo Yuan; Geng Chen; Li Yi; Yang Gao; |
| 461 | RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce **`RefEdit-Bench`**, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce **`RefEdit`** — an instruction-based editing model trained on our scalable synthetic data generation pipeline. |
Bimsara Pathiraja; Maitreya Patel; Shivam Singh; Yezhou Yang; Chitta Baral; |
| 462 | PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We formulate the motor system of an interactive avatar as a generative motion model that can drive the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. |
Yan Zhang; Yao Feng; Alpár Cseke; Nitin Saini; Nathan Bajandas; Nicolas Heron; Michael J. Black; |
| 463 | SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, these methods perform unlearning at a coarse granularity, often leading to excessive forgetting and reduced model utility. To address this issue, we introduce SAUCE, a novel method that leverages sparse autoencoders (SAEs) for fine-grained and selective concept unlearning in VLMs. |
Jiahui Geng; Qing Li; |
| 464 | Integrating Task-Specific and Universal Adapters for Pre-Trained Model-based Class-Incremental Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the aforementioned challenges, we propose integrating Task-Specific and Universal Adapters (TUNA) in this paper. |
Yan Wang; Da-Wei Zhou; Han-Jia Ye; |
| 465 | VIPerson: Flexibly Generating Virtual Identity for Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel pedestrian generation pipeline, VIPerson, to generate camera-realistic pedestrian images with flexible Virtual Identities for the Person ReID task. |
Xiao-Wen Zhang; Delong Zhang; Yi-Xing Peng; Zhi Ouyang; Jingke Meng; Wei-Shi Zheng; |
| 466 | PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. |
Xiaoyang Hao; Han Li; |
| 467 | Radiant Foam: Real-Time Differentiable Ray Tracing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This has yielded a significant improvement in rendering speeds due to the efficiency of rasterization algorithms and hardware, but has come at a cost: the approximations that make rasterization efficient also make implementation of light transport phenomena like reflection and refraction much more difficult. We propose a novel scene representation which avoids these approximations, but keeps the efficiency and reconstruction quality of splatting by leveraging a decades-old efficient volumetric mesh ray tracing algorithm which has been largely overlooked in recent computer vision research. |
Shrisudhan Govindarajan; Daniel Rebain; Kwang Moo Yi; Andrea Tagliasacchi; |
| 468 | Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: They therefore have limited capability in recognizing diverse abnormality details that deviate from these general abnormal patterns in various ways. To address this limitation, we propose FAPrompt, a novel framework designed to learn Fine-grained Abnormality Prompts for accurate ZSAD. |
Jiawen Zhu; Yew-Soon Ong; Chunhua Shen; Guansong Pang; |
| 469 | 3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present the first large-scale 3D real car dataset, termed 3DRealCar, which offers three key features: (1) High-Volume: 2,500 cars meticulously scanned using smartphones to capture RGB images and point clouds with real-world dimensions; (2) High-Quality: Each car is represented by an average of 200 dense, high-resolution 360-degree RGB-D views, enabling high-fidelity 3D reconstruction; (3) High-Diversity: The dataset encompasses a diverse collection of cars from over 100 brands, captured under three distinct lighting conditions (reflective, standard, and dark). |
Xiaobiao Du; Yida Wang; Haiyang Sun; Zhuojie Wu; Hongwei Sheng; Shuyun Wang; Jiaying Ying; Ming Lu; Tianqing Zhu; Kun Zhan; Xin Yu; |
| 470 | RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects. To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. |
Yifei Feng; Mingxin Yang; Shuhui Yang; Sheng Zhang; Jiaao Yu; Zibo Zhao; Yuhong Liu; Jie Jiang; Chunchao Guo; |
| 471 | D3QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose to leverage Discrete Distribution Discrepancy-aware Quantization Error (D^3QE) for autoregressive-generated image detection, exploiting the distinctive patterns and the frequency distribution bias of the codebook present in real and fake images. |
Yanran Zhang; Bingyao Yu; Yu Zheng; Wenzhao Zheng; Yueqi Duan; Lei Chen; Jie Zhou; Jiwen Lu; |
| 472 | Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. |
Divyansh Srivastava; Xiang Zhang; He Wen; Chenru Wen; Zhuowen Tu; |
| 473 | GAS: Generative Avatar Synthesis from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a unified and generalizable framework for synthesizing view-consistent and temporally coherent avatars from a single image, addressing the challenging task of single-image avatar generation. |
Yixing Lu; Junting Dong; Youngjoong Kwon; Qin Zhao; Bo Dai; Fernando De la Torre; |
| 474 | DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our work aims to unify both driving model simulation and trajectory planning into a single sequence modeling problem. |
Yuntao Chen; Yuqi Wang; Zhaoxiang Zhang; |
| 475 | Make Your Training Flexible: Towards Deployment-Efficient Video Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We hence introduce a novel paradigm for lossless adaptation across scenarios, enabling models to maintain optimal performance under high-resource conditions while seamlessly transferring to low-resource environments. |
Chenting Wang; Kunchang Li; Tianxiang Jiang; Xiangyu Zeng; Yi Wang; Limin Wang; |
| 476 | Multi-identity Human Image Animation with Structural Video Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose conditions and to model the distribution of 3D-aware dynamics. To address these limitations, we present Structural Video Diffusion, a novel framework designed for generating realistic multi-human videos. |
Zhenzhi Wang; Yixuan Li; Yanhong Zeng; Yuwei Guo; Dahua Lin; Tianfan Xue; Bo Dai; |
| 477 | RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. |
Kaidong Zhang; Rongtao Xu; Pengzhen Ren; Junfan Lin; Hefeng Wu; Liang Lin; Xiaodan Liang; |
| 478 | Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned based on CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. |
Ying Ba; Tianyu Zhang; Yalong Bai; Wenyi Mo; Tao Liang; Bing Su; Ji-Rong Wen; |
| 479 | Griffon V2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This limitation further restricts the model’s potential to achieve nuanced visual and language referring in domains such as GUI Agents, counting, etc. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. |
Yufei Zhan; Shurong Zheng; Yousong Zhu; Hongyin Zhao; Fan Yang; Ming Tang; Jinqiao Wang; |
| 480 | Diffusion-Based Imaginative Coordination for Bimanual Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While video prediction has been recently studied for representation learning and control, leveraging its ability to capture rich dynamic and behavioral information, its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. |
Huilin Xu; Jian Ding; Jiakun Xu; Ruixiang Wang; Jun Chen; Jinjie Mai; Yanwei Fu; Bernard Ghanem; Feng Xu; Mohamed Elhoseiny; |
| 481 | FaceXFormer: A Unified Transformer for Facial Analysis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce FaceXFormer, an end-to-end unified transformer model capable of performing ten facial analysis tasks within a single framework. |
Kartik Narayan; Vibashan VS; Rama Chellappa; Vishal M. Patel; |
| 482 | SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, they generally exhibit worse accuracy than encoder-decoder-based methods (EDTRs) because they struggle with text irregularity and missing linguistic context. To address these challenges, we propose SVTRv2, a CTC model endowed with the ability to handle text irregularities and model linguistic context. |
Yongkun Du; Zhineng Chen; Hongtao Xie; Caiyan Jia; Yu-Gang Jiang; |
| 483 | Inference-Time Diffusion Model Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these models still exhibit a performance gap compared to their pre-trained diffusion model counterparts, exacerbated by distribution shifts and accumulated errors during multi-step sampling. To address this, we introduce Distillation++, a novel inference-time distillation framework that reduces this gap by incorporating teacher-guided refinement during sampling. |
Geon Yeong Park; Sang Wan Lee; Jong Chul Ye; |
| 484 | Scaling Omni-modal Pretraining with Multimodal Context: Advancing Universal Representation Learning Across Modalities Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work introduces Multimodal Context (MiCo), a scalable pretraining framework designed to advance omni-modal intelligence: an AI system capable of understanding and learning from multiple modalities to achieve universal representation learning. |
Yiyuan Zhang; Handong Li; Jing Liu; Xiangyu Yue; |
| 485 | Learning Beyond Still Frames: Scaling Vision-Language Models with Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, while such datasets improve static image-text understanding, they fail to develop the temporal and motion comprehension needed for video understanding. To address these gaps, we propose incorporating video pretraining into VLMs to improve the model’s ability to capture temporal dynamics and general visual perception, which requires reconciling spatial redundancy with strict temporal causality. |
Yiyuan Zhang; Handong Li; Jing Liu; Xiangyu Yue; |
| 486 | LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation, which features (i) comprehensive tasks, encompassing 2,100 extensive prompts across 20 fine-grained task dimensions, and (ii) large-scale human-preference annotations, including 100K mean-opinion scores (MOSs) and 50K question-answering (QA) pairs annotated on 50,400 images generated from 24 T2I models. Based on EvalMi-50K, we propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation along multiple dimensions, including perceptual quality, text-image correspondence, and task-specific accuracy. Extensive experimental results show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K and exhibits strong generalization to other AI-generated image evaluation benchmarks, manifesting the generality of both the EvalMi-50K dataset and the LMM4LMM metric. |
Jiarui Wang; Huiyu Duan; Yu Zhao; Juntong Wang; Guangtao Zhai; Xiongkuo Min; |
| 487 | Uncover Treasures in DCT: Advancing JPEG Quality Enhancement By Exploiting Latent Correlations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this challenge, we identify two critical types of correlations within the DCT coefficients of JPEG images. Building on this insight, we propose an Advanced DCT-domain JPEG Quality Enhancement (AJQE) method that fully exploits these correlations. |
Jing Yang; Qunliang Xing; Mai Xu; Minglang Qiao; |
| 488 | Unraveling The Effects of Synthetic Data on End-to-End Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Additionally, recent simulators designed for closed-loop evaluation provide limited interaction with other vehicles, failing to simulate complex real-world traffic dynamics. To address these issues, we introduce SceneCrafter, a realistic, interactive, and efficient AD simulator based on 3D Gaussian Splatting (3DGS). |
Junhao Ge; Zuhong Liu; Longteng Fan; Yifan Jiang; Jiaqi Su; Yiming Li; Zhejun Zhang; Siheng Chen; |
| 489 | VPO: Aligning Text-to-Video Generation Models with Prompt Optimization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. |
Jiale Cheng; Ruiliang Lyu; Xiaotao Gu; Xiao Liu; Jiazheng Xu; Yida Lu; Jiayan Teng; Zhuoyi Yang; Yuxiao Dong; Jie Tang; Hongning Wang; Minlie Huang; |
| 490 | GFPack++: Attention-Driven Gradient Fields for Optimizing 2D Irregular Packing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose GFPack++, a deeply investigated framework that adopts attention-based geometry and relation encoding, enabling more comprehensive modeling of complex packing relationships. |
Tianyang Xue; Lin Lu; Yang Liu; Mingdong Wu; Hao Dong; Yanbin Zhang; Renmin Han; Baoquan Chen; |
| 491 | Proxy-Bridged Game Transformer for Interactive Extreme Motion Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these approaches fall short in modeling extreme motions like lindy-hop dances, as they require a more comprehensive understanding of cross-person dependencies. To bridge this gap, we introduce Proxy-bridged Game Transformer (PGformer), a Transformer-based foundation model that captures the interactions driving extreme multi-person motions. |
Yanwen Fang; Wenqi Jia; Xu Cao; Peng-Tao Jiang; Guodong Li; Jintai Chen; |
| 492 | Cross-Category Subjectivity Generalization for Style-Adaptive Sketch Re-ID Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose Adaptive Incremental Prompt-tuning (AIP), the first approach that explores cross-category subjective style generalization for sketch re-ID. |
Zechao Hu; Zhengwei Yang; Hao Li; Zheng Wang; Yixiong Zou; |
| 493 | Enhancing Image Restoration Transformer Via Adaptive Translation Equivariance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Attention mechanisms in modern restoration transformers undermine this property, adversely impacting both training convergence and generalization. To alleviate this issue, we propose two key strategies for incorporating translation equivariance: slide indexing and component stacking. |
JiaKui Hu; Zhengjian Yao; Lujia Jin; Hangzhou He; Yanye Lu; |
| 494 | Auto-Regressively Generating Multi-View Consistent Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the Multi-View AutoRegressive (MV-AR) method, which leverages an autoregressive model to progressively generate consistent multiview images from arbitrary prompts. |
JiaKui Hu; Yuxiao Yang; Jialun Liu; Jinbo Wu; Chen Zhao; Yanye Lu; |
| 495 | Preacher: Paper-to-Video Agentic System Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. |
Jingwei Liu; Ling Yang; Hao Luo; Fan Wang; Hongyan Li; Mengdi Wang; |
| 496 | Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: II) weak pose robustness (mask generators fail under articulated poses and miss rare regions like the waist, while human parsers remain limited by predefined categories). To address these gaps, we propose Pose-Star, a framework that dynamically recomposes body structures (e.g., neck, chest, etc.) into anatomy-aware masks (e.g., chest-length) for user-defined edits. |
Yuran Dong; Mang Ye; |
| 497 | Principles of Visual Tokens for Efficient Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While most methods succeed in reducing the cost of the model and maintaining accuracy, an interesting pattern arises: most methods do not outperform the baseline of randomly discarding tokens. In this paper, we take a closer look at this phenomenon and identify five principles of the nature of visual tokens. |
Xinyue Hao; Gen Li; Shreyank N Gowda; Robert B. Fisher; Jonathan Huang; Anurag Arnab; Laura Sevilla-Lara; |
| 498 | Progressive Test Time Energy Adaptation for Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a model-agnostic, progressive test-time energy adaptation approach for medical image segmentation. |
Xiaoran Zhang; Byung-Woo Hong; Hyoungseob Park; Daniel H. Pak; Anne-Marie Rickmann; Lawrence H. Staib; James S. Duncan; Alex Wong; |
| 499 | ImHead: A Large-scale Implicit Morphable Model for Localized Head Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, we retain a single compact identity space and introduce an intermediate region-specific latent representation to enable local edits. |
Rolandos Alexandros Potamias; Stathis Galanakis; Jiankang Deng; Athanasios Papaioannou; Stefanos Zafeiriou; |
| 500 | D-Attn: Decomposed Attention for Large Vision-and-Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Decomposed Attention (D-Attn), a more flexible attention architecture for LVLMs, which enables modification of visual token operations without affecting textual-to-textual attention. |
Chia-Wen Kuo; Sijie Zhu; Fan Chen; Xiaohui Shen; Longyin Wen; |
This table only includes 500 papers selected by our daily digest algorithm. To continue with the full list (~2,700 papers), please visit Paper Digest: ICCV-2025 (Full List).